% toolsoftrade/toolsoftrade.tex

\QuickQuizChapter{chp:Tools of the Trade}{Tools of the Trade}
%
\Epigraph{You are only as good as your tools, and your tools are only
as good as you are.}{\emph{Unknown}}

This chapter provides a brief introduction to some basic tools of the
parallel-programming trade, focusing mainly on those available to
user applications running on operating systems similar to Linux.
Section~\ref{sec:toolsoftrade:Scripting Languages} begins with
scripting languages,
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}
describes the multi-process parallelism supported by the POSIX API and
touches on POSIX threads,
Section~\ref{sec:toolsoftrade:Alternatives to POSIX Operations}
presents analogous operations in other environments, and finally,
Section~\ref{sec:toolsoftrade:The Right Tool for the Job: How to Choose?}
helps to choose the tool that will get the job done.

\QuickQuiz{}
	You call these tools???
	They look more like low-level synchronization primitives to me!
\QuickQuizAnswer{
	They look that way because they are in fact low-level
	synchronization primitives.
	But as such, they are also the fundamental tools for building
	low-level concurrent software.
} \QuickQuizEnd

Please note that this chapter provides but a brief introduction.
More detail is available from the references cited (and especially
from the Internet), and more information
on how best to use these tools will be provided in later chapters.

\section{Scripting Languages}
\label{sec:toolsoftrade:Scripting Languages}

The Linux shell scripting languages provide simple but effective ways
of managing parallelism.
For example, suppose that you had a program \co{compute_it}
that you needed to run twice with two different sets of arguments.
This can be accomplished using UNIX shell scripting as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out
\end{verbatim}
\end{minipage}
\vspace{5pt}

\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{toolsoftrade/shellparallel}}
\caption{Execution Diagram for Parallel Shell Execution}
\label{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}
\end{figure}

Lines~1 and 2 launch two instances of this program, redirecting their
output to two separate files, with the \co{&} character directing the
shell to run the two instances of the program in the background.
Line~3 waits for both instances to complete, and lines~4 and 5
display their output.
The resulting execution is as shown in
Figure~\ref{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}:
the two instances of \co{compute_it} execute in parallel,
\co{wait} completes after both of them do, and then the two instances
of \co{cat} execute sequentially.
% @@@ Maui scheduler, load balancing, BOINC, and so on.
% @@@ Diagram showing parallel execution.

\QuickQuiz{}
	But this silly shell script isn't a \emph{real} parallel program!
	Why bother with such trivia???
\QuickQuizAnswer{
	Because you should \emph{never} forget the simple stuff!

	Please keep in mind that the title of this book is
	``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
	One of the most effective things you can do about it is to
	avoid forgetting the simple stuff!
	After all, if you choose to do parallel programming the hard
	way, you have no one but yourself to blame.
} \QuickQuizEnd

\QuickQuiz{}
	Is there a simpler way to create a parallel shell script?
	If so, how?  If not, why not?
\QuickQuizAnswer{
	One straightforward approach is the shell pipeline:

\begin{minipage}[t]{\columnwidth}
\vspace{0.1ex}
\small
\begin{verbatim}
grep $pattern1 | sed -e 's/a/b/' | sort
\end{verbatim}
\vspace{0.1ex}
\end{minipage}

	For a sufficiently large input file,
	\co{grep} will pattern-match in parallel with \co{sed}
	editing and with the input processing of \co{sort}.
	See the file \co{parallel.sh} for a demonstration of
	shell-script parallelism and pipelining.
} \QuickQuizEnd

For another example, the \co{make} software-build scripting language
provides a \co{-j} option that specifies how much parallelism should be
introduced into the build process.
For example, typing \co{make -j4} when building a Linux kernel
specifies that up to four parallel compiles be carried out concurrently.

It is hoped that these simple examples convince you that parallel
programming need not always be complex or difficult.

\QuickQuiz{}
	But if script-based parallel programming is so easy, why
	bother with anything else?
\QuickQuizAnswer{
	In fact, it is quite likely that a very large fraction of
	parallel programs in use today are script-based.
	However, script-based parallelism does have its limitations:
	\begin{enumerate}
	\item	Creation of new processes is usually quite heavyweight,
		involving the expensive \co{fork()} and \co{exec()}
		system calls.
	\item	Sharing of data, including pipelining, typically involves
		expensive file I/O.
	\item	The reliable synchronization primitives available to
		scripts also typically involve expensive file I/O.
	\item	Scripting languages are often too slow, but are often
		quite useful when coordinating execution of long-running
		programs written in lower-level programming languages.
	\end{enumerate}
	These limitations require that script-based parallelism use
	coarse-grained parallelism, with each unit of work having
	execution time of at least tens of milliseconds, and preferably
	much longer.

	Those requiring finer-grained parallelism are well advised to
	think hard about their problem to see if it can be expressed
	in a coarse-grained form.
	If not, they should consider using other parallel-programming
	environments, such as those discussed in
	Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
} \QuickQuizEnd

\section{POSIX Multiprocessing}
\label{sec:toolsoftrade:POSIX Multiprocessing}

This section scratches the surface of the
POSIX environment, including pthreads~\cite{OpenGroup1997pthreads},
as this environment is readily available and widely implemented.
Section~\ref{sec:toolsoftrade:POSIX Process Creation and Destruction}
provides a glimpse of the POSIX \co{fork()} and related primitives,
Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}
touches on thread creation and destruction,
Section~\ref{sec:toolsoftrade:POSIX Locking} gives a brief overview
of POSIX locking, and, finally,
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking} describes a
specific kind of lock that can be used for data that is read by many
threads and only occasionally updated.

\subsection{POSIX Process Creation and Destruction}
\label{sec:toolsoftrade:POSIX Process Creation and Destruction}

Processes are created using the \co{fork()} primitive, may be
destroyed using the \co{kill()} primitive, and may destroy themselves
using the \co{exit()} primitive.
A process executing a \co{fork()} primitive is said to be the ``parent''
of the newly created process.
A parent may wait on its children using the \co{wait()} primitive.

Please note that the examples in this section are quite simple.
Real-world applications using these primitives might need to manipulate
signals, file descriptors, shared memory segments, and any number of
other resources.
In addition, some applications need to take specific actions if a given
child terminates, and might also need to be concerned with the reason
that the child terminated.
These concerns can of course add substantial complexity to the code.
For more information, see any of a number of textbooks on the
subject~\cite{WRichardStevens1992,StewartWeiss2013UNIX}.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 pid = fork();
  2 if (pid == 0) {
  3   /* child */
  4 } else if (pid < 0) {
  5   /* parent, upon error */
  6   perror("fork");
  7   exit(-1);
  8 } else {
  9   /* parent, pid == child ID */
 10 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Using the \tco{fork()} Primitive}
\label{fig:toolsoftrade:Using the fork() Primitive}
\end{figure}

If \co{fork()} succeeds, it returns twice, once in the parent
and once in the child.
The value returned from \co{fork()} allows the caller to tell
the difference, as shown in
Figure~\ref{fig:toolsoftrade:Using the fork() Primitive}
(\path{forkjoin.c}).
Line~1 executes the \co{fork()} primitive, and saves its return value
in local variable \co{pid}.
Line~2 checks to see if \co{pid} is zero, in which case, this is the
child, which continues on to execute line~3.
As noted earlier, the child may terminate via the \co{exit()} primitive.
Otherwise, this is the parent, which checks for an error return from
the \co{fork()} primitive on line~4, and prints an error and exits
on lines~5-7 if so.
Otherwise, the \co{fork()} has executed successfully, and the parent
therefore executes line~9 with the variable \co{pid} containing the
process ID of the child.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 void waitall(void)
  2 {
  3   int pid;
  4   int status;
  5
  6   for (;;) {
  7     pid = wait(&status);
  8     if (pid == -1) {
  9       if (errno == ECHILD)
 10         break;
 11       perror("wait");
 12       exit(-1);
 13     }
 14   }
 15 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Using the \tco{wait()} Primitive}
\label{fig:toolsoftrade:Using the wait() Primitive}
\end{figure}

The parent process may use the \co{wait()} primitive to wait for its children
to complete.
However, use of this primitive is a bit more complicated than its shell-script
counterpart, as each invocation of \co{wait()} waits for but one child
process.
It is therefore customary to wrap \co{wait()} into a function similar
to the \co{waitall()} function shown in
Figure~\ref{fig:toolsoftrade:Using the wait() Primitive}
(\path{api-pthread.h}),
with this \co{waitall()} function having semantics similar to the
shell-script \co{wait} command.
Each pass through the loop spanning lines~6-14 waits on one child process.
Line~7 invokes the \co{wait()} primitive, which blocks until a child process
exits, and returns that child's process ID.
If the process ID is instead $-1$, this indicates that the \co{wait()}
primitive was unable to wait on a child.
If so, line~9 checks for the \co{ECHILD} errno, which indicates that there
are no more child processes, so that line~10 exits the loop.
Otherwise, lines~11 and 12 print an error and exit.

\QuickQuiz{}
	Why does this \co{wait()} primitive need to be so complicated?
	Why not just make it work like the shell-script \co{wait} does?
\QuickQuizAnswer{
	Some parallel applications need to take special action when
	specific children exit, and therefore need to wait for each
	child individually.
	In addition, some parallel applications need to detect the
	reason that the child died.
	As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
	it is not hard to build a \co{waitall()} function out of
	the \co{wait()} function, but it would be impossible to
	do the reverse.
	Once the information about a specific child is lost, it is lost.
} \QuickQuizEnd

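As the Quick Quiz answer above suggests, a parallel application sometimes
needs to know which child exited and why, and the POSIX \co{waitpid()}
primitive provides exactly that information.
The following sketch is not from the book's example code; the
\co{run_child_and_get_status()} helper is invented for illustration.
It forks a child that exits with a given status, reaps that specific
child, and extracts its exit status:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with "code",
 * then wait for that specific child and return its exit status. */
static int run_child_and_get_status(int code)
{
	int status;
	pid_t pid = fork();

	if (pid == 0)			/* Child. */
		exit(code);
	if (pid < 0) {			/* Parent, upon error. */
		perror("fork");
		exit(-1);
	}
	if (waitpid(pid, &status, 0) == -1) {	/* Parent: reap child. */
		perror("waitpid");
		exit(-1);
	}
	if (WIFEXITED(status))		/* Child invoked exit(). */
		return WEXITSTATUS(status);
	return -1;			/* Stopped or killed by a signal. */
}
```

The \co{WIFEXITED()} macro distinguishes normal exit from death by
signal, which the related \co{WIFSIGNALED()} and \co{WTERMSIG()} macros
can then report in more detail.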
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 int x = 0;
  2 int pid;
  3
  4 pid = fork();
  5 if (pid == 0) { /* child */
  6   x = 1;
  7   printf("Child process set x=1\n");
  8   exit(0);
  9 }
 10 if (pid < 0) { /* parent, upon error */
 11   perror("fork");
 12   exit(-1);
 13 }
 14 waitall();
 15 printf("Parent process sees x=%d\n", x);
\end{verbbox}
}
\centering
\theverbbox
\caption{Processes Created Via \tco{fork()} Do Not Share Memory}
\label{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
\end{figure}

It is critically important to note that the parent and child do \emph{not}
share memory.
This is illustrated by the program shown in
Figure~\ref{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
(\path{forkjoinvar.c}),
in which the child sets a global variable \co{x} to 1 on line~6,
prints a message on line~7, and exits on line~8.
The parent continues at line~14, where it waits on the child,
and on line~15 finds that its copy of the variable \co{x} is still zero.
The output is thus as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Child process set x=1
Parent process sees x=0
\end{verbatim}
\end{minipage}
\vspace{5pt}

\QuickQuiz{}
	Isn't there a lot more to \co{fork()} and \co{wait()}
	than discussed here?
\QuickQuizAnswer{
	Indeed there is, and
	it is quite possible that this section will be expanded in
	future versions to include messaging features (such as UNIX
	pipes, TCP/IP, and shared file I/O) and memory mapping
	(such as \co{mmap()} and \co{shmget()}).
	In the meantime, there are any number of textbooks that cover
	these primitives in great detail,
	and the truly motivated can read manpages, existing parallel
	applications using these primitives, as well as the
	source code of the Linux-kernel implementations themselves.

	It is important to note that the parent process in
	Figure~\ref{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
	waits until after the child terminates to do its \co{printf()}.
	Using \co{printf()}'s buffered I/O concurrently to the same file
	from multiple processes is non-trivial, and is best avoided.
	If you really need to do concurrent buffered I/O,
	consult the documentation for your OS.
	For UNIX/Linux systems, Stewart Weiss's lecture notes provide
	a good introduction with informative
	examples~\cite{StewartWeiss2013UNIX}.
} \QuickQuizEnd

The finest-grained parallelism requires shared memory, and
this is covered in
Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}.
That said, shared-memory parallelism can be significantly more complex
than fork-join parallelism.

\subsection{POSIX Thread Creation and Destruction}
\label{sec:toolsoftrade:POSIX Thread Creation and Destruction}

To create a thread within an existing process, invoke the
\co{pthread_create()} primitive, for example, as shown on lines~15
and~16 of
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
(\path{pcreate.c}).
The first argument is a pointer to a \co{pthread_t} in which to store the
ID of the thread to be created, the second \co{NULL} argument is a pointer
to an optional \co{pthread_attr_t}, the third argument is the function
(in this case, \co{mythread()})
that is to be invoked by the new thread, and the last \co{NULL} argument
is the argument that will be passed to \co{mythread()}.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 int x = 0;
  2
  3 void *mythread(void *arg)
  4 {
  5   x = 1;
  6   printf("Child process set x=1\n");
  7   return NULL;
  8 }
  9
 10 int main(int argc, char *argv[])
 11 {
 12   pthread_t tid;
 13   void *vp;
 14
 15   if (pthread_create(&tid, NULL,
 16       mythread, NULL) != 0) {
 17     perror("pthread_create");
 18     exit(-1);
 19   }
 20   if (pthread_join(tid, &vp) != 0) {
 21     perror("pthread_join");
 22     exit(-1);
 23   }
 24   printf("Parent process sees x=%d\n", x);
 25   return 0;
 26 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Threads Created Via \tco{pthread_create()} Share Memory}
\label{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
\end{figure}

In this example, \co{mythread()} simply returns, but it could instead
call \co{pthread_exit()}.

\QuickQuiz{}
	If the \co{mythread()} function in
	Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
	can simply return, why bother with \co{pthread_exit()}?
\QuickQuizAnswer{
	In this simple example, there is no reason whatsoever.
	However, imagine a more complex example, where \co{mythread()}
	invokes other functions, possibly separately compiled.
	In such a case, \co{pthread_exit()} allows these other functions
	to end the thread's execution without having to pass some sort
	of error return all the way back up to \co{mythread()}.
} \QuickQuizEnd

The \co{pthread_join()} primitive, shown on line~20, is analogous to
the fork-join \co{wait()} primitive.
It blocks until the thread specified by the \co{tid} variable completes
execution, either by invoking \co{pthread_exit()} or by returning from
the thread's top-level function.
The thread's exit value, either the value passed to \co{pthread_exit()}
or the value returned by the thread's top-level function, depending on
how the thread in question exits, will be stored through the pointer
passed as the second argument to \co{pthread_join()}.

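For example, the following sketch (not from the book's example code;
both function names are invented for illustration) passes a small
integer back through the thread's pointer-sized exit value by way of
\co{intptr_t}:

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical thread function reporting a status value
 * through the thread's exit value. */
static void *report_status(void *arg)
{
	(void)arg;			/* Argument unused. */
	pthread_exit((void *)(intptr_t)42);
}

/* Create the thread, join it, and return its exit value. */
static intptr_t join_and_get_status(void)
{
	pthread_t tid;
	void *vp;

	if (pthread_create(&tid, NULL, report_status, NULL) != 0)
		return -1;
	if (pthread_join(tid, &vp) != 0)
		return -1;
	return (intptr_t)vp;	/* The value passed to pthread_exit(). */
}
```

Returning \co{(void *)(intptr_t)42} from \co{report_status()} would
have exactly the same effect, per the discussion above.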
The program shown in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
produces output as follows, demonstrating that memory is in fact
shared between the two threads:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Child process set x=1
Parent process sees x=1
\end{verbatim}
\end{minipage}
\vspace{5pt}

Note that this program carefully makes sure that only one of the threads
stores a value to variable \co{x} at a time.
Any situation in which one thread might be storing a value to a given
variable while some other thread either loads from or stores to that
same variable is termed a ``data race''.
Because the C language makes no guarantee that the results of a data race
will be in any way reasonable, we need some way of safely accessing
and modifying data concurrently, such as the locking primitives discussed
in the following section.

\QuickQuiz{}
	If the C language makes no guarantees in the presence of a data
	race, then why does the Linux kernel have so many data races?
	Are you trying to tell me that the Linux kernel is completely
	broken???
\QuickQuizAnswer{
	Ah, but the Linux kernel is written in a carefully selected
	superset of the C language that includes special gcc
	extensions, such as asms, that permit safe execution even
	in the presence of data races.
	In addition, the Linux kernel does not run on a number of
	platforms where data races would be especially problematic.
	For an example, consider embedded systems with 32-bit pointers
	and 16-bit busses.
	On such a system, a data race involving a store to and a load
	from a given pointer might well result in the load returning the
	low-order 16 bits of the old value of the pointer concatenated
	with the high-order 16 bits of the new value of the pointer.

	Nevertheless, even in the Linux kernel, data races can be
	quite dangerous and should be avoided where
	feasible~\cite{JonCorbet2012ACCESS:ONCE}.
} \QuickQuizEnd

| \subsection{POSIX Locking} |
| \label{sec:toolsoftrade:POSIX Locking} |
| |
| The POSIX standard allows the programmer to avoid data races via |
| ``POSIX locking''. |
| POSIX locking features a number of primitives, the most fundamental |
| of which are \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()}. |
| These primitives operate on locks, which are of type \co{pthread_mutex_t}. |
| These locks may be declared statically and initialized with |
| \co{PTHREAD_MUTEX_INITIALIZER}, or they may be allocated dynamically |
| and initialized using the \co{pthread_mutex_init()} primitive. |
| The demonstration code in this section will take the former course. |
| |
| The \co{pthread_mutex_lock()} primitive ``acquires'' the specified lock, |
| and the \co{pthread_mutex_unlock()} ``releases'' the specified lock. |
| Because these are ``exclusive'' locking primitives, |
| only one thread at a time may ``hold'' a given lock at a given time. |
| For example, if a pair of threads attempt to acquire the same lock |
| concurrently, one of the pair will be ``granted'' the lock first, and |
| the other will wait until the first thread releases the lock. |
| A simple and reasonably useful programming model permits a given data item |
| to be accessed only while holding the corresponding |
| lock~\cite{Hoare74}. |
| |
\QuickQuiz{}
	What if I want several threads to hold the same lock at the
	same time?
\QuickQuizAnswer{
	The first thing you should do is to ask yourself why you would
	want to do such a thing.
	If the answer is ``because I have a lot of data that is read
	by many threads, and only occasionally updated'', then
	POSIX reader-writer locks might be what you are looking for.
	These are introduced in
	Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.

	Another way to get the effect of multiple threads holding
	the same lock is for one thread to acquire the lock, and
	then use \co{pthread_create()} to create the other threads.
	The question of why this would ever be a good idea is left
	to the reader.
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
  2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
  3 int x = 0;
  4
  5 void *lock_reader(void *arg)
  6 {
  7   int i;
  8   int newx = -1;
  9   int oldx = -1;
 10   pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
 11
 12   if (pthread_mutex_lock(pmlp) != 0) {
 13     perror("lock_reader:pthread_mutex_lock");
 14     exit(-1);
 15   }
 16   for (i = 0; i < 100; i++) {
 17     newx = READ_ONCE(x);
 18     if (newx != oldx) {
 19       printf("lock_reader(): x = %d\n", newx);
 20     }
 21     oldx = newx;
 22     poll(NULL, 0, 1);
 23   }
 24   if (pthread_mutex_unlock(pmlp) != 0) {
 25     perror("lock_reader:pthread_mutex_unlock");
 26     exit(-1);
 27   }
 28   return NULL;
 29 }
 30
 31 void *lock_writer(void *arg)
 32 {
 33   int i;
 34   pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
 35
 36   if (pthread_mutex_lock(pmlp) != 0) {
 37     perror("lock_writer:pthread_mutex_lock");
 38     exit(-1);
 39   }
 40   for (i = 0; i < 3; i++) {
 41     WRITE_ONCE(x, READ_ONCE(x) + 1);
 42     poll(NULL, 0, 5);
 43   }
 44   if (pthread_mutex_unlock(pmlp) != 0) {
 45     perror("lock_writer:pthread_mutex_unlock");
 46     exit(-1);
 47   }
 48   return NULL;
 49 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Exclusive Locks}
\label{fig:toolsoftrade:Demonstration of Exclusive Locks}
\end{figure}

This exclusive-locking property is demonstrated using the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
(\path{lock.c}).
Line~1 defines and initializes a POSIX lock named \co{lock_a}, while
line~2 similarly defines and initializes a lock named \co{lock_b}.
Line~3 defines and initializes a shared variable \co{x}.

Lines~5-28 define a function \co{lock_reader()} that repeatedly
reads the shared variable \co{x} while holding
the lock specified by \co{arg}.
Line~10 casts \co{arg} to a pointer to a \co{pthread_mutex_t}, as
required by the \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()}
primitives.

\QuickQuiz{}
	Why not simply make the argument to \co{lock_reader()}
	on line~5 of
	Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
	be a pointer to a \co{pthread_mutex_t}?
\QuickQuizAnswer{
	Because we will need to pass \co{lock_reader()} to
	\co{pthread_create()}.
	Although we could cast the function when passing it to
	\co{pthread_create()}, function casts are quite a bit
	uglier and harder to get right than are simple pointer casts.
} \QuickQuizEnd

Lines~12-15 acquire the specified \co{pthread_mutex_t}, checking
for errors and exiting the program if any occur.
Lines~16-23 repeatedly check the value of \co{x}, printing the new value
each time that it changes.
Line~22 sleeps for one millisecond, which allows this demonstration
to run nicely on a uniprocessor machine.
Lines~24-27 release the \co{pthread_mutex_t}, again checking for
errors and exiting the program if any occur.
Finally, line~28 returns \co{NULL}, again to match the function type
required by \co{pthread_create()}.

\QuickQuiz{}
	Writing four lines of code for each acquisition and release
	of a \co{pthread_mutex_t} sure seems painful!
	Isn't there a better way?
\QuickQuizAnswer{
	Indeed!
	And for that reason, the \co{pthread_mutex_lock()} and
	\co{pthread_mutex_unlock()} primitives are normally wrapped
	in functions that do this error checking.
	Later on, we will wrap them with the Linux kernel
	\co{spin_lock()} and \co{spin_unlock()} APIs.
} \QuickQuizEnd

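A sketch of such error-checking wrappers appears below.
These are not the book's wrappers; the \co{acquire_lock()},
\co{release_lock()}, and \co{locked_region()} names are invented for
illustration:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_count;

/* Hypothetical wrappers: abort on failure, so that callers
 * need not repeat the four-line error check. */
static void acquire_lock(pthread_mutex_t *lp)
{
	if (pthread_mutex_lock(lp) != 0) {
		perror("pthread_mutex_lock");
		exit(-1);
	}
}

static void release_lock(pthread_mutex_t *lp)
{
	if (pthread_mutex_unlock(lp) != 0) {
		perror("pthread_mutex_unlock");
		exit(-1);
	}
}

/* Increment a shared counter under the lock: one line each
 * for acquisition and release instead of four. */
static int locked_region(void)
{
	int ret;

	acquire_lock(&demo_lock);
	ret = ++demo_count;
	release_lock(&demo_lock);
	return ret;
}
```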
Lines~31-49 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
show \co{lock_writer()}, which
periodically updates the shared variable \co{x} while holding the
specified \co{pthread_mutex_t}.
As with \co{lock_reader()}, line~34 casts \co{arg} to a pointer
to \co{pthread_mutex_t}, and lines~36-39 acquire the specified lock.
While holding the lock, lines~40-43 increment the shared variable \co{x},
sleeping for five milliseconds between each increment.
Finally, lines~44-47 release the lock.

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 printf("Creating two threads using same lock:\n");
  2 if (pthread_create(&tid1, NULL,
  3     lock_reader, &lock_a) != 0) {
  4   perror("pthread_create");
  5   exit(-1);
  6 }
  7 if (pthread_create(&tid2, NULL,
  8     lock_writer, &lock_a) != 0) {
  9   perror("pthread_create");
 10   exit(-1);
 11 }
 12 if (pthread_join(tid1, &vp) != 0) {
 13   perror("pthread_join");
 14   exit(-1);
 15 }
 16 if (pthread_join(tid2, &vp) != 0) {
 17   perror("pthread_join");
 18   exit(-1);
 19 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Same Exclusive Lock}
\label{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
\end{figure}

Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
shows a code fragment that runs \co{lock_reader()} and
\co{lock_writer()} as threads using the same lock, namely, \co{lock_a}.
Lines~2-6 create a thread running \co{lock_reader()}, and then
lines~7-11 create a thread running \co{lock_writer()}.
Lines~12-19 wait for both threads to complete.
The output of this code fragment is as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Creating two threads using same lock:
lock_reader(): x = 0
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because both threads are using the same lock, the \co{lock_reader()}
thread cannot see any of the intermediate values of \co{x} produced
by \co{lock_writer()} while holding the lock.

\QuickQuiz{}
	Is ``x = 0'' the only possible output from the code fragment
	shown in
	Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
	If so, why?
	If not, what other output could appear, and why?
\QuickQuizAnswer{
	No.
	The reason that ``x = 0'' was output was that \co{lock_reader()}
	acquired the lock first.
	Had \co{lock_writer()} instead acquired the lock first, then
	the output would have been ``x = 3''.
	However, because the code fragment started \co{lock_reader()} first
	and because this run was performed on a multiprocessor,
	one would normally expect \co{lock_reader()} to acquire the
	lock first.
	However, there are no guarantees, especially on a busy system.
} \QuickQuizEnd

\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
  1 printf("Creating two threads w/different locks:\n");
  2 x = 0;
  3 if (pthread_create(&tid1, NULL,
  4     lock_reader, &lock_a) != 0) {
  5   perror("pthread_create");
  6   exit(-1);
  7 }
  8 if (pthread_create(&tid2, NULL,
  9     lock_writer, &lock_b) != 0) {
 10   perror("pthread_create");
 11   exit(-1);
 12 }
 13 if (pthread_join(tid1, &vp) != 0) {
 14   perror("pthread_join");
 15   exit(-1);
 16 }
 17 if (pthread_join(tid2, &vp) != 0) {
 18   perror("pthread_join");
 19   exit(-1);
 20 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Different Exclusive Locks}
\label{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
\end{figure}

Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
shows a similar code fragment, but this time using different locks:
\co{lock_a} for \co{lock_reader()} and \co{lock_b} for
\co{lock_writer()}.
The output of this code fragment is as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Creating two threads w/different locks:
lock_reader(): x = 0
lock_reader(): x = 1
lock_reader(): x = 2
lock_reader(): x = 3
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because the two threads are using different locks, they do not exclude
each other, and can run concurrently.
The \co{lock_reader()} function can therefore see the intermediate
values of \co{x} stored by \co{lock_writer()}.

\QuickQuiz{}
	Using different locks could cause quite a bit of confusion,
	what with threads seeing each other's intermediate states.
	So should well-written parallel programs restrict themselves
	to using a single lock in order to avoid this kind of confusion?
\QuickQuizAnswer{
	Although it is sometimes possible to write a program using a
	single global lock that both performs and scales well, such
	programs are exceptions to the rule.
	You will normally need to use multiple locks to attain good
	performance and scalability.

	One possible exception to this rule is ``transactional memory'',
	which is currently a research topic.
	Transactional\-/memory semantics can be loosely thought of as those
	of a single global lock with optimizations permitted and
	with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
} \QuickQuizEnd

\QuickQuiz{}
	In the code shown in
	Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
	is \co{lock_reader()} guaranteed to see all the values produced
	by \co{lock_writer()}?
	Why or why not?
\QuickQuizAnswer{
	No.
	On a busy system, \co{lock_reader()} might be preempted
	for the entire duration of \co{lock_writer()}'s execution,
	in which case it would not see \emph{any} of \co{lock_writer()}'s
	intermediate states for \co{x}.
} \QuickQuizEnd

\QuickQuiz{}
	Wait a minute here!!!
	Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
	didn't initialize shared variable \co{x},
	so why does it need to be initialized in
	Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
\QuickQuizAnswer{
	See line~3 of
	Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
	Because the code in
	Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
	ran first, it could rely on the compile-time initialization of
	\co{x}.
	The code in
	Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
	ran next, so it had to re-initialize \co{x}.
} \QuickQuizEnd

| Although there is quite a bit more to POSIX exclusive locking, these |
| primitives provide a good start and are in fact sufficient in a great |
| many situations. |
| The next section takes a brief look at POSIX reader-writer locking. |
| |
| \subsection{POSIX Reader-Writer Locking} |
| \label{sec:toolsoftrade:POSIX Reader-Writer Locking} |
| |
| The POSIX API provides a reader-writer lock, which is represented by |
| a \co{pthread_rwlock_t}. |
| As with \co{pthread_mutex_t}, \co{pthread_rwlock_t} may be statically |
| initialized via \co{PTHREAD_RWLOCK_INITIALIZER} or dynamically |
| initialized via the \co{pthread_rwlock_init()} primitive. |
| The \co{pthread_rwlock_rdlock()} primitive read-acquires the |
| specified \co{pthread_rwlock_t}, the \co{pthread_rwlock_wrlock()} |
| primitive write-acquires it, and the \co{pthread_rwlock_unlock()} |
| primitive releases it. |
| Only a single thread may write-hold a given \co{pthread_rwlock_t} |
| at any given time, but multiple threads may read-hold a given |
| \co{pthread_rwlock_t}, at least while there is no thread |
| currently write-holding it. |
| |
| As you might expect, reader-writer locks are designed for read-mostly |
| situations. |
| In these situations, a reader-writer lock can provide greater scalability |
| than can an exclusive lock because the exclusive lock is by definition |
| limited to a single thread holding the lock at any given time, while |
| the reader-writer lock permits |
| an arbitrarily large number of readers to concurrently hold the lock. |
| However, in practice, we need to know how much additional scalability is |
| provided by reader-writer locks. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER; |
| 2 int holdtime = 0; |
| 3 int thinktime = 0; |
| 4 long long *readcounts; |
| 5 int nreadersrunning = 0; |
| 6 |
| 7 #define GOFLAG_INIT 0 |
| 8 #define GOFLAG_RUN 1 |
| 9 #define GOFLAG_STOP 2 |
| 10 char goflag = GOFLAG_INIT; |
| 11 |
| 12 void *reader(void *arg) |
| 13 { |
| 14 int i; |
| 15 long long loopcnt = 0; |
| 16 long me = (long)arg; |
| 17 |
| 18 __sync_fetch_and_add(&nreadersrunning, 1); |
| 19 while (READ_ONCE(goflag) == GOFLAG_INIT) { |
| 20 continue; |
| 21 } |
| 22 while (READ_ONCE(goflag) == GOFLAG_RUN) { |
| 23 if (pthread_rwlock_rdlock(&rwl) != 0) { |
| 24 perror("pthread_rwlock_rdlock"); |
| 25 exit(-1); |
| 26 } |
| 27 for (i = 1; i < holdtime; i++) { |
| 28 barrier(); |
| 29 } |
| 30 if (pthread_rwlock_unlock(&rwl) != 0) { |
| 31 perror("pthread_rwlock_unlock"); |
| 32 exit(-1); |
| 33 } |
| 34 for (i = 1; i < thinktime; i++) { |
| 35 barrier(); |
| 36 } |
| 37 loopcnt++; |
| 38 } |
| 39 readcounts[me] = loopcnt; |
| 40 return NULL; |
| 41 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Measuring Reader-Writer Lock Scalability} |
| \label{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability} |
| \end{figure} |
| |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability} |
| (\path{rwlockscale.c}) |
| shows one way of measuring reader-writer lock scalability. |
| Line~1 shows the definition and initialization of the reader-writer |
| lock, line~2 shows the \co{holdtime} argument controlling the |
| time each thread holds the reader-writer lock, |
| line~3 shows the \co{thinktime} argument controlling the time between |
| the release of the reader-writer lock and the next acquisition, |
| line~4 defines the \co{readcounts} array into which each reader thread |
| places the number of times it acquired the lock, and |
| line~5 defines the \co{nreadersrunning} variable, which |
| determines when all reader threads have started running. |
| |
| Lines~7-10 define \co{goflag}, which synchronizes the start and the |
| end of the test. |
| This variable is initially set to \co{GOFLAG_INIT}, then set to |
| \co{GOFLAG_RUN} after all the reader threads have started, and finally |
| set to \co{GOFLAG_STOP} to terminate the test run. |
| |
| Lines~12-41 define \co{reader()}, which is the reader thread. |
| Line~18 atomically increments the \co{nreadersrunning} variable |
| to indicate that this thread is now running, and |
| lines~19-21 wait for the test to start. |
| The \co{READ_ONCE()} primitive forces the compiler to fetch \co{goflag} |
| on each pass through the loop---the compiler would otherwise be within its |
| rights to assume that the value of \co{goflag} would never change. |
| |
| \QuickQuiz{} |
| Instead of using \co{READ_ONCE()} everywhere, why not just |
| declare \co{goflag} as \co{volatile} on line~10 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}? |
| \QuickQuizAnswer{ |
| A \co{volatile} declaration is in fact a reasonable alternative in |
| this particular case. |
| However, use of \co{READ_ONCE()} has the benefit of clearly |
| flagging to the reader that \co{goflag} is subject to concurrent |
| reads and updates. |
| 	Note that \co{READ_ONCE()} is especially useful in cases where |
| most of the accesses are protected by a lock (and thus \emph{not} |
| subject to change), but where a few of the accesses are made outside |
| of the lock. |
| Using a volatile declaration in this case would make it harder |
| for the reader to note the special accesses outside of the lock, |
| and would also make it harder for the compiler to generate good |
| code under the lock. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| \co{READ_ONCE()} only affects the compiler, not the CPU. |
| Don't we also need memory barriers to make sure |
| that the change in \co{goflag}'s value propagates to the |
| CPU in a timely fashion in |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}? |
| \QuickQuizAnswer{ |
| No, memory barriers are not needed and won't help here. |
| Memory barriers only enforce ordering among multiple memory |
| references: They absolutely do not guarantee to expedite the |
| propagation of data from one part of the system to another.\footnote{ |
| There have been persistent rumors of hardware in which |
| memory barriers actually do expedite propagation of data, |
| but no confirmed sightings.} |
| This leads to a quick rule of thumb: You do not need |
| memory barriers unless you are using more than one |
| variable to communicate between multiple threads. |
| |
| But what about \co{nreadersrunning}? |
| Isn't that a second variable used for communication? |
| Indeed it is, and there really are the needed memory-barrier |
| instructions buried in \co{__sync_fetch_and_add()}, |
| which make sure that the thread proclaims its presence |
| before checking to see if it should start. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Would it ever be necessary to use \co{READ_ONCE()} when accessing |
| a per-thread variable, for example, a variable declared using |
| the \co{gcc} \co{__thread} storage class? |
| \QuickQuizAnswer{ |
| It depends. |
| If the per-thread variable was accessed only from its thread, |
| and never from a signal handler, then no. |
| Otherwise, it is quite possible that \co{READ_ONCE()} is needed. |
| We will see examples of both situations in |
| Section~\ref{sec:count:Signal-Theft Limit Counter Implementation}. |
| |
| This leads to the question of how one thread can gain access to |
| another thread's \co{__thread} variable, and the answer is that |
| the second thread must store a pointer to its \co{__thread} |
| pointer somewhere that the first thread has access to. |
| One common approach is to maintain a linked list with one |
| element per thread, and to store the address of each thread's |
| \co{__thread} variable in the corresponding element. |
| } \QuickQuizEnd |
| |
| The loop spanning lines~22-38 carries out the performance test. |
| Lines~23-26 acquire the lock, lines~27-29 hold the lock for the specified |
| duration (and the \co{barrier()} directive prevents the compiler from |
| optimizing the loop out of existence), lines~30-33 release the lock, |
| and lines~34-36 wait for the specified duration before re-acquiring the |
| lock. |
| Line~37 counts this lock acquisition. |
| |
| Line~39 moves the lock-acquisition count to this thread's element of the |
| \co{readcounts[]} array, and line~40 returns, terminating this thread. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{CodeSamples/toolsoftrade/rwlockscale}} |
| \caption{Reader-Writer Lock Scalability} |
| \label{fig:toolsoftrade:Reader-Writer Lock Scalability} |
| \end{figure} |
| |
| Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability} |
| shows the results of running this test on a 64-core Power-5 system |
| with two hardware threads per core for a total of 128 software-visible |
| CPUs. |
| The \co{thinktime} parameter was zero for all these tests, and the |
| \co{holdtime} parameter was set to values ranging from one thousand (``1K'' |
| on the graph) to 100 million (``100M'' on the graph). |
| The actual value plotted is: |
| \begin{equation} |
| \frac{L_N}{N L_1} |
| \end{equation} |
| where $N$ is the number of threads, |
| $L_N$ is the number of lock acquisitions by $N$ threads, and |
| $L_1$ is the number of lock acquisitions by a single thread. |
| Given ideal hardware and software scalability, this value will always |
| be 1.0. |
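For example, under the purely hypothetical assumption that a single thread acquires the lock $10^7$ times in the measurement interval while 64 threads together acquire it $1.6 \times 10^8$ times, the plotted value would be
\begin{equation}
\frac{L_{64}}{64 L_1} = \frac{1.6 \times 10^8}{64 \times 10^7} = 0.25
\end{equation}
that is, each thread would be achieving only one quarter of the ideal per-thread throughput.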
| |
| As can be seen in the figure, reader-writer locking scalability is |
| decidedly non-ideal, especially for smaller sizes of critical |
| sections. |
| To see why read-acquisition can be so slow, consider |
| that all the acquiring threads must update the \co{pthread_rwlock_t} |
| data structure. |
| Therefore, if all 128 executing threads attempt to |
| read-acquire the reader-writer lock concurrently, they must update |
| this underlying \co{pthread_rwlock_t} one at a time. |
| One lucky thread might do so almost immediately, but the least-lucky |
| thread must wait for all the other 127 threads to do their updates. |
| This situation will only get worse as you add CPUs. |
| |
| \QuickQuiz{} |
| Isn't comparing against single-CPU throughput a bit harsh? |
| \QuickQuizAnswer{ |
| Not at all. |
| In fact, this comparison was, if anything, overly lenient. |
| A more balanced comparison would be against single-CPU |
| throughput with the locking primitives commented out. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But 1,000 instructions is not a particularly small size for |
| a critical section. |
| What do I do if I need a much smaller critical section, for |
| example, one containing only a few tens of instructions? |
| \QuickQuizAnswer{ |
| If the data being read \emph{never} changes, then you do not |
| need to hold any locks while accessing it. |
| If the data changes sufficiently infrequently, you might be |
| able to checkpoint execution, terminate all threads, change |
| the data, then restart at the checkpoint. |
| |
| Another approach is to keep a single exclusive lock per |
| thread, so that a thread read-acquires the larger aggregate |
| reader-writer lock by acquiring its own lock, and write-acquires |
| by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}. |
| This can work quite well for readers, but causes writers |
| to incur increasingly large overheads as the number of threads |
| increases. |
| |
| Some other ways of handling very small critical sections are |
| described in Chapter~\ref{chp:Deferred Processing}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| In |
| Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability}, |
| all of the traces other than the 100M trace deviate gently |
| from the ideal line. |
| In contrast, the 100M trace breaks sharply from the ideal |
| line at 64 CPUs. |
| In addition, the spacing between the 100M trace and the 10M |
| trace is much smaller than that between the 10M trace and the |
| 1M trace. |
| Why does the 100M trace behave so much differently than the |
| other traces? |
| \QuickQuizAnswer{ |
| Your first clue is that 64 CPUs is exactly half of the 128 |
| CPUs on the machine. |
| The difference is an artifact of hardware threading. |
| This system has 64 cores with two hardware threads per core. |
| As long as fewer than 64 threads are running, each can run |
| in its own core. |
| But as soon as there are more than 64 threads, some of the threads |
| must share cores. |
| Because the pair of threads in any given core share some hardware |
| resources, the throughput of two threads sharing a core is not |
| quite as high as that of two threads each in their own core. |
| So the performance of the 100M trace is limited not by the |
| reader-writer lock, but rather by the sharing of hardware resources |
| between hardware threads in a single core. |
| |
| This can also be seen in the 10M trace, which deviates gently from |
| the ideal line up to 64 threads, then breaks sharply down, parallel |
| to the 100M trace. |
| Up to 64 threads, the 10M trace is limited primarily by reader-writer |
| lock scalability, and beyond that, also by sharing of hardware |
| resources between hardware threads in a single core. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Power-5 is several years old, and new hardware should |
| be faster. |
| So why should anyone worry about reader-writer locks being slow? |
| \QuickQuizAnswer{ |
| In general, newer hardware is improving. |
| However, it will need to improve more than two orders of magnitude |
| 	to permit reader-writer locking to achieve ideal performance on |
| 128 CPUs. |
| Worse yet, the greater the number of CPUs, the larger the |
| required performance improvement. |
| The performance problems of reader-writer locking are therefore |
| very likely to be with us for quite some time to come. |
| } \QuickQuizEnd |
| |
| Despite these limitations, reader-writer locking is quite useful in many |
| cases, for example when the readers must do high-latency file or network I/O. |
| There are alternatives, some of which will be presented in |
| Chapters~\ref{chp:Counting} and \ref{chp:Deferred Processing}. |
| |
| \subsection{Atomic Operations (gcc Classic)} |
| \label{sec:toolsoftrade:Atomic Operations (gcc Classic)} |
| |
| Given that |
| Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability} |
| shows that the overhead of reader-writer locking is most severe for the |
| smallest critical sections, it would be nice to have some other way |
| to protect the tiniest of critical sections. |
| One such way is to use atomic operations. |
| We have already seen one atomic operation, in the form of the |
| \co{__sync_fetch_and_add()} primitive on line~18 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}. |
| This primitive atomically adds the value of its second argument to |
| the value referenced by its first argument, returning the old value |
| (which was ignored in this case). |
| If a pair of threads concurrently execute \co{__sync_fetch_and_add()} on |
| the same variable, the resulting value of the variable will include |
| the result of both additions. |
| |
| The {\sf gcc} compiler offers a number of additional atomic operations, |
| including \co{__sync_fetch_and_sub()}, |
| \co{__sync_fetch_and_or()}, |
| \co{__sync_fetch_and_and()}, |
| \co{__sync_fetch_and_xor()}, and |
| \co{__sync_fetch_and_nand()}, all of which return the old value. |
| If you instead need the new value, you can use the |
| \co{__sync_add_and_fetch()}, |
| \co{__sync_sub_and_fetch()}, |
| \co{__sync_or_and_fetch()}, |
| \co{__sync_and_and_fetch()}, |
| \co{__sync_xor_and_fetch()}, and |
| \co{__sync_nand_and_fetch()} primitives. |
| |
| \QuickQuiz{} |
| Is it really necessary to have both sets of primitives? |
| \QuickQuizAnswer{ |
| Strictly speaking, no. |
| One could implement any member of the second set using the |
| corresponding member of the first set. |
| For example, one could implement \co{__sync_nand_and_fetch()} |
| in terms of \co{__sync_fetch_and_nand()} as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| tmp = v; |
| ret = __sync_fetch_and_nand(p, tmp); |
| ret = ~ret & tmp; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| It is similarly possible to implement \co{__sync_fetch_and_add()}, |
| \co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()} |
| in terms of their post-value counterparts. |
| |
| However, the alternative forms can be quite convenient, both |
| for the programmer and for the compiler/library implementor. |
| } \QuickQuizEnd |
| |
| The classic compare-and-swap operation is provided by a pair of |
| primitives, \co{__sync_bool_compare_and_swap()} and |
| \co{__sync_val_compare_and_swap()}. |
| Both of these primitives atomically update a location to a new value, |
| but only if its prior value was equal to the specified old value. |
| The first variant returns 1 if the operation succeeded and 0 if it |
| failed, for example, if the prior value was not equal to the specified |
| old value. |
| The second variant returns the prior value of the location, which, if |
| equal to the specified old value, indicates that the operation succeeded. |
| Either of these compare-and-swap operations is ``universal'' in the sense |
| that any atomic operation on a single location can be implemented in |
| terms of compare-and-swap, though the earlier operations are often |
| more efficient where they apply. |
| The compare-and-swap operation is also capable of serving as the basis |
| for a wider set of atomic operations, though the more elaborate of |
| these often suffer from complexity, scalability, and performance |
| problems~\cite{MauriceHerlihy90a}. |
| |
| The \co{__sync_synchronize()} primitive issues a ``memory barrier'', |
| which constrains both the compiler's and the CPU's ability to reorder |
| operations, as discussed in Section~\ref{sec:advsync:Memory Barriers}. |
| In some cases, it is sufficient to constrain the compiler's ability |
| to reorder operations, while allowing the CPU free rein, in which |
| case the \co{barrier()} primitive may be used, as it in fact was |
| on line~28 of |
| Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}. |
| In some cases, it is only necessary to ensure that the compiler |
| avoids optimizing away a given memory read, in which case the |
| \co{READ_ONCE()} primitive may be used, as it was on line~17 of |
| Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}. |
| Similarly, the \co{WRITE_ONCE()} primitive may be used to prevent the |
| compiler from optimizing away a given memory write. |
| These last three primitives are not provided directly by gcc, |
| but may be implemented straightforwardly as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) |
| #define READ_ONCE(x) \ |
| ({ typeof(x) ___x = ACCESS_ONCE(x); ___x; }) |
| #define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); }) |
| #define barrier() __asm__ __volatile__("": : :"memory") |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| Given that these atomic operations will often be able to |
| generate single atomic instructions that are directly |
| supported by the underlying instruction set, shouldn't |
| they be the fastest possible way to get things done? |
| \QuickQuizAnswer{ |
| Unfortunately, no. |
| See Chapter~\ref{chp:Counting} for some stark counterexamples. |
| } \QuickQuizEnd |
| |
| \subsection{Atomic Operations (C11)} |
| \label{sec:toolsoftrade:Atomic Operations (C11)} |
| |
| The C11 standard added atomic operations, |
| including loads (\co{atomic_load()}), |
| stores (\co{atomic_store()}), |
| memory barriers (\co{atomic_thread_fence()} and |
| \co{atomic_signal_fence()}), and read-modify-write atomics. |
| The read-modify-write atomics include |
| \co{atomic_fetch_add()}, |
| \co{atomic_fetch_sub()}, |
| \co{atomic_fetch_and()}, |
| \co{atomic_fetch_xor()}, |
| \co{atomic_exchange()}, |
| \co{atomic_compare_exchange_strong()}, |
| and |
| \co{atomic_compare_exchange_weak()}. |
| These operate in a manner similar to those described in |
| Section~\ref{sec:toolsoftrade:Atomic Operations (gcc Classic)}, |
| but with the addition of memory-order arguments to \co{_explicit} |
| variants of all of the operations. |
| Without memory-order arguments, all the atomic operations are |
| fully ordered, and the arguments permit weaker orderings. |
| For example, ``\co{atomic_load_explicit(&a, memory_order_relaxed)}'' |
| is vaguely similar to the Linux kernel's ``\co{READ_ONCE()}''.\footnote{ |
| Memory ordering is described in more detail in |
| Section~\ref{sec:advsync:Memory Barriers} and |
| Appendix~\ref{chp:app:whymb:Why Memory Barriers?}.} |
| |
| One restriction of the C11 atomics is that they apply only to special |
| atomic types, which can be restrictive. |
| The gcc compiler therefore provides |
| \co{__atomic_load()}, |
| \co{__atomic_load_n()}, |
| \co{__atomic_store()}, |
| \co{__atomic_store_n()}, etc. |
| These primitives offer the same semantics as their C11 counterparts, |
| but may be used on plain non-atomic objects. |
| |
| \subsection{Per-Thread Variables} |
| \label{sec:toolsoftrade:Per-Thread Variables} |
| |
| Per-thread variables, also called thread-specific data, thread-local |
| storage, and other less-polite names, are used extremely |
| heavily in concurrent code, as will be explored in |
| Chapters~\ref{chp:Counting} and~\ref{chp:Data Ownership}. |
| POSIX supplies the \co{pthread_key_create()} function to create a |
| per-thread variable (and return the corresponding key), |
| \co{pthread_key_delete()} to delete the per-thread variable corresponding |
| to key, |
| \co{pthread_setspecific()} to set the value of the current thread's |
| variable corresponding to the specified key, |
| and \co{pthread_getspecific()} to return that value. |
| |
| A number of compilers (including gcc) provide a \co{__thread} specifier |
| that may be used in a variable definition to designate that variable |
| as being per-thread. |
| The name of the variable may then be used normally to access the |
| value of the current thread's instance of that variable. |
| Of course, \co{__thread} is much easier to use than the POSIX |
| thread-specific data, and so \co{__thread} is usually preferred for |
| code that is to be built only with gcc or other compilers supporting |
| \co{__thread}. |
| |
| Fortunately, the C11 standard introduced a \co{_Thread_local} keyword |
| that can be used in place of \co{__thread}. |
| In the fullness of time, this new keyword should combine the ease of use |
| of \co{__thread} with the portability of POSIX thread-specific data. |
| |
| \section{Alternatives to POSIX Operations} |
| \label{sec:toolsoftrade:Alternatives to POSIX Operations} |
| |
| Unfortunately, threading operations, locking primitives, and atomic |
| operations were in reasonably wide use long before the various standards |
| committees got around to them. |
| As a result, there is considerable variation in how these operations |
| are supported. |
| It is still quite common to find these operations implemented in |
| assembly language, either for historical reasons or to obtain better |
| performance in specialized circumstances. |
| For example, the gcc \co{__sync_} family of primitives all provide full |
| memory-ordering semantics, which in the past motivated many developers |
| to create their own implementations for situations where the full memory |
| ordering semantics are not required. |
| The following sections show some alternatives from the Linux kernel |
| and some historical primitives used by this book's sample code. |
| |
| \subsection{Organization and Initialization} |
| \label{sec:toolsoftrade:Organization and Initialization} |
| |
| Although many environments do not require any special initialization |
| code, the code samples in this book start with a call to \co{smp_init()}, |
| which initializes a mapping from \co{pthread_t} to consecutive integers. |
| The userspace RCU library similarly requires a call to \co{rcu_init()}. |
| Although these calls can be hidden in environments (such as that of |
| gcc) that support constructors, |
| most of the RCU flavors supported by the userspace RCU library |
| also require that each thread invoke \co{rcu_register_thread()} upon thread |
| creation and \co{rcu_unregister_thread()} before thread exit. |
| |
| In the case of the Linux kernel, it is a philosophical question as to |
| whether the kernel does not require calls to special initialization |
| code or whether the kernel's boot-time code is in fact the required |
| initialization code. |
| |
| \subsection{Thread Creation, Destruction, and Control} |
| \label{sec:toolsoftrade:Thread Creation, Destruction, and Control} |
| |
| The Linux kernel uses |
| \co{struct task_struct} pointers to track kthreads, |
| \co{kthread_create()} to create them, |
| \co{kthread_should_stop()} to externally suggest that they stop |
| (which has no POSIX equivalent), |
| \co{kthread_stop()} to wait for them to stop, and |
| \co{schedule_timeout_interruptible()} for a timed wait. |
| There are quite a few additional kthread-management APIs, but this |
| provides a good start, as well as good search terms. |
| |
| The CodeSamples API focuses on ``threads'', which are a locus of |
| control.\footnote{ |
| There are many other names for similar software constructs, including |
| ``process'', ``task'', ``fiber'', ``event'', and so on. |
| Similar design principles apply to all of them.} |
| Each such thread has an identifier of type \co{thread_id_t}, |
| and no two threads running at a given time will have the same |
| identifier. |
| Threads share everything except for per-thread local state,\footnote{ |
| How is that for a circular definition?} |
| which includes program counter and stack. |
| |
| The thread API is shown in |
| Figure~\ref{fig:toolsoftrade:Thread API}, and members are described in the |
| following sections. |
| |
| \begin{figure*}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| int smp_thread_id(void) |
| thread_id_t create_thread(void *(*func)(void *), void *arg) |
| for_each_thread(t) |
| for_each_running_thread(t) |
| void *wait_thread(thread_id_t tid) |
| void wait_all_threads(void) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Thread API} |
| \label{fig:toolsoftrade:Thread API} |
| \end{figure*} |
| |
| \subsubsection{\tco{create_thread()}} |
| |
| The \co{create_thread()} primitive creates a new thread, |
| starting the new thread's execution |
| at the function \co{func} specified by \co{create_thread()}'s |
| first argument, and passing it the argument specified by |
| \co{create_thread()}'s second argument. |
| This newly created thread will terminate when it returns from the |
| starting function specified by \co{func}. |
| The \co{create_thread()} primitive returns the \co{thread_id_t} |
| corresponding to the newly created child thread. |
| |
| This primitive will abort the program if more than \co{NR_THREADS} |
| threads are created, counting the one implicitly created by running |
| the program. |
| \co{NR_THREADS} is a compile-time constant that may be modified, |
| though some systems may have an upper bound for the allowable number |
| of threads. |
| |
| \subsubsection{\tco{smp_thread_id()}} |
| |
| Because the \co{thread_id_t} returned from \co{create_thread()} is |
| system-dependent, the \co{smp_thread_id()} primitive returns a thread |
| index corresponding to the thread making the request. |
| This index is guaranteed to be less than the maximum number of threads |
| that have been in existence since the program started, |
| and is therefore useful for bitmasks, array indices, and |
| the like. |
| |
| \subsubsection{\tco{for_each_thread()}} |
| |
| The \co{for_each_thread()} macro loops through all threads that exist, |
| including all threads that \emph{would} exist if created. |
| This macro is useful for handling per-thread variables as will be |
| seen in Section~\ref{sec:toolsoftrade:Per-Thread Variables}. |
| |
| \subsubsection{\tco{for_each_running_thread()}} |
| |
| The \co{for_each_running_thread()} |
| macro loops through only those threads that currently exist. |
| It is the caller's responsibility to synchronize with thread |
| creation and deletion if required. |
| |
| \subsubsection{\tco{wait_thread()}} |
| |
| The \co{wait_thread()} primitive waits for completion of the thread |
| specified by the \co{thread_id_t} passed to it. |
| This in no way interferes with the execution of the specified thread; |
| instead, it merely waits for it. |
| Note that \co{wait_thread()} returns the value that was returned by |
| the corresponding thread. |
| |
| \subsubsection{\tco{wait_all_threads()}} |
| |
| The \co{wait_all_threads()} |
| primitive waits for completion of all currently running threads. |
| It is the caller's responsibility to synchronize with thread creation |
| and deletion if required. |
| However, this primitive is normally used to clean up at the end of |
| a run, so such synchronization is normally not needed. |
| |
| \subsubsection{Example Usage} |
| |
| Figure~\ref{fig:toolsoftrade:Example Child Thread} |
| shows an example hello-world-like child thread. |
| As noted earlier, each thread is allocated its own stack, so |
| each thread has its own private \co{arg} argument and \co{myarg} variable. |
| Each child simply prints its argument and its \co{smp_thread_id()} |
| before exiting. |
| Note that the \co{return} statement on line~7 terminates the thread, |
| returning a \co{NULL} to whoever invokes \co{wait_thread()} on this |
| thread. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 void *thread_test(void *arg) |
| 2 { |
| 3 int myarg = (int)arg; |
| 4 |
| 5 printf("child thread %d: smp_thread_id() = %d\n", |
| 6 myarg, smp_thread_id()); |
| 7 return NULL; |
| 8 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Example Child Thread} |
| \label{fig:toolsoftrade:Example Child Thread} |
| \end{figure} |
| |
| The parent program is shown in |
| Figure~\ref{fig:toolsoftrade:Example Parent Thread}. |
| It invokes \co{smp_init()} to initialize the threading system on |
| line~6, |
| parses arguments on lines~7-14, and announces its presence on line~15. |
| It creates the specified number of child threads on lines~16-17, |
| and waits for them to complete on line~18. |
| Note that \co{wait_all_threads()} discards the threads' return values, |
| as in this case they are all \co{NULL}, which is not very interesting. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 int main(int argc, char *argv[]) |
| 2 { |
| 3 int i; |
| 4 int nkids = 1; |
| 5 |
| 6 smp_init(); |
| 7 if (argc > 1) { |
| 8 nkids = strtoul(argv[1], NULL, 0); |
| 9 if (nkids > NR_THREADS) { |
| 10 fprintf(stderr, "nkids=%d too big, max=%d\n", |
| 11 nkids, NR_THREADS); |
| 12 usage(argv[0]); |
| 13 } |
| 14 } |
| 15 printf("Parent spawning %d threads.\n", nkids); |
| 16 for (i = 0; i < nkids; i++) |
| 17 create_thread(thread_test, (void *)i); |
| 18 wait_all_threads(); |
| 19   printf("All threads completed.\n"); |
| 20 exit(0); |
| 21 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Example Parent Thread} |
| \label{fig:toolsoftrade:Example Parent Thread} |
| \end{figure} |
| |
| \QuickQuiz{} |
| What happened to the Linux-kernel equivalents to \co{fork()} |
| and \co{wait()}? |
| \QuickQuizAnswer{ |
| They don't really exist. |
| All tasks executing within the Linux kernel share memory, |
| at least unless you want to do a huge amount of memory-mapping |
| work by hand. |
| } \QuickQuizEnd |
| |
| \subsection{Locking} |
| \label{sec:toolsoftrade:Locking} |
| |
| A good starting subset of the Linux kernel's locking API is shown in |
| Figure~\ref{fig:toolsoftrade:Locking API}, |
| each API element being described in the following sections. |
| This book's CodeSamples locking API closely follows that of the Linux kernel. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| void spin_lock_init(spinlock_t *sp); |
| void spin_lock(spinlock_t *sp); |
| int spin_trylock(spinlock_t *sp); |
| void spin_unlock(spinlock_t *sp); |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Locking API} |
| \label{fig:toolsoftrade:Locking API} |
| \end{figure} |
| |
| \subsubsection{\tco{spin_lock_init()}} |
| |
| The \co{spin_lock_init()} primitive initializes the specified |
| \co{spinlock_t} variable, and must be invoked before |
| this variable is passed to any other spinlock primitive. |
| |
| \subsubsection{\tco{spin_lock()}} |
| |
The \co{spin_lock()} primitive acquires the specified spinlock,
waiting if necessary until the spinlock becomes available.
In some environments, such as the Linux kernel, this waiting will involve
``spinning'', while
in others, such as the pthreads implementation underlying this book's
CodeSamples, it will involve blocking.
| |
| The key point is that only one thread may hold a spinlock at any |
| given time. |
| |
| \subsubsection{\tco{spin_trylock()}} |
| |
| The \co{spin_trylock()} primitive acquires the specified spinlock, |
| but only if it is immediately available. |
| It returns \co{true} if it was able to acquire the spinlock and |
| \co{false} otherwise. |
| |
| \subsubsection{\tco{spin_unlock()}} |
| |
| The \co{spin_unlock()} primitive releases the specified spinlock, |
| allowing other threads to acquire it. |
| |
| % \emph{@@@ likely need to add reader-writer locking.} |
| |
| \subsubsection{Example Usage} |
| |
| A spinlock named \co{mutex} may be used to protect a variable |
| \co{counter} as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| spin_lock(&mutex); |
| counter++; |
| spin_unlock(&mutex); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| \QuickQuiz{} |
| What problems could occur if the variable {\tt counter} were |
| incremented without the protection of {\tt mutex}? |
| \QuickQuizAnswer{ |
| On CPUs with load-store architectures, incrementing {\tt counter} |
| might compile into something like the following: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| LOAD counter,r0 |
| INC r0 |
| STORE r0,counter |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| On such machines, two threads might simultaneously load the |
| value of {\tt counter}, each increment it, and each store the |
| result. |
| The new value of {\tt counter} will then only be one greater |
| than before, despite two threads each incrementing it. |
| } \QuickQuizEnd |
| |
| However, the \co{spin_lock()} and \co{spin_unlock()} primitives |
| do have performance consequences, as will be seen in |
| Section~\ref{sec:toolsoftrade:Performance}. |
| |
| \subsection{Atomic Operations} |
| \label{sec:toolsoftrade:Atomic Operations} |
| |
| The Linux kernel provides a wide variety of atomic operations, but |
| those defined on type \co{atomic_t} provide a good start. |
| Normal non-tearing reads and stores are provided by |
| \co{atomic_read()} and \co{atomic_set()}, respectively. |
| Acquire load is provided by \co{smp_load_acquire()} and release |
| store by \co{smp_store_release()}. |
| |
| Non-value-returning fetch-and-add operations are provided by |
| \co{atomic_add()}, \co{atomic_sub()}, \co{atomic_inc()}, and |
| \co{atomic_dec()}, among others. |
| An atomic decrement that returns a reached-zero indication is provided |
| by both \co{atomic_dec_and_test()} and \co{atomic_sub_and_test()}. |
| An atomic add that returns the new value is provided by |
| \co{atomic_add_return()}. |
Both \co{atomic_add_unless()} and \co{atomic_inc_not_zero()} provide
conditional atomic operations, where nothing happens unless the
original value of the atomic variable differs from the value
specified (these are very handy for managing reference counters, for
example).
| |
| An atomic exchange operation is provided by \co{atomic_xchg()}, and |
| the celebrated compare-and-swap (CAS) operation is provided by |
| \co{atomic_cmpxchg()}. |
| Both of these return the old value. |
Many additional atomic RMW primitives are available in the Linux kernel;
see the \co{Documentation/atomic_ops.txt} file in the Linux-kernel
source tree.
| |
| This book's CodeSamples API closely follows that of the Linux kernel. |
| |
| \subsection{Per-CPU Variables} |
| \label{sec:toolsoftrade:Per-CPU Variables} |
| |
| The Linux kernel uses \co{DEFINE_PER_CPU()} to define a per-CPU variable, |
| \co{this_cpu_ptr()} to form a reference to this CPU's instance of a |
| given per-CPU variable, \co{per_cpu()} to access a specified CPU's |
| instance of a given per-CPU variable, along with many other special-purpose |
| per-CPU operations. |
| |
| Figure~\ref{fig:toolsoftrade:Per-Thread-Variable API} |
| shows this book's per-thread-variable API, which is patterned |
| after the Linux kernel's per-CPU-variable API. |
| This API provides the per-thread equivalent of global variables. |
| Although this API is, strictly speaking, not necessary\footnote{ |
| You could instead use \co{__thread} or \co{_Thread_local}.}, |
| it can provide a good userspace analogy to Linux kernel code. |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbbox} |
| DEFINE_PER_THREAD(type, name) |
| DECLARE_PER_THREAD(type, name) |
| per_thread(name, thread) |
| __get_thread_var(name) |
| init_per_thread(name, v) |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Per-Thread-Variable API} |
| \label{fig:toolsoftrade:Per-Thread-Variable API} |
| \end{figure} |
| |
| \QuickQuiz{} |
| How could you work around the lack of a per-thread-variable |
| API on systems that do not provide it? |
| \QuickQuizAnswer{ |
| One approach would be to create an array indexed by |
| \co{smp_thread_id()}, and another would be to use a hash |
| table to map from \co{smp_thread_id()} to an array |
| index---which is in fact what this |
| set of APIs does in pthread environments. |
| |
| Another approach would be for the parent to allocate a structure |
| containing fields for each desired per-thread variable, then |
| pass this to the child during thread creation. |
| However, this approach can impose large software-engineering |
| costs in large systems. |
| To see this, imagine if all global variables in a large system |
| had to be declared in a single file, regardless of whether or |
| not they were C static variables! |
| } \QuickQuizEnd |
| |
| \subsubsection{\tco{DEFINE_PER_THREAD()}} |
| |
| The \co{DEFINE_PER_THREAD()} primitive defines a per-thread variable. |
| Unfortunately, it is not possible to provide an initializer in the way |
| permitted by the Linux kernel's \co{DEFINE_PER_THREAD()} primitive, |
| but there is an \co{init_per_thread()} primitive that permits easy |
| runtime initialization. |
| |
| \subsubsection{\tco{DECLARE_PER_THREAD()}} |
| |
| The \co{DECLARE_PER_THREAD()} primitive is a declaration in the C sense, |
| as opposed to a definition. |
| Thus, a \co{DECLARE_PER_THREAD()} primitive may be used to access |
| a per-thread variable defined in some other file. |
| |
| \subsubsection{\tco{per_thread()}} |
| |
| The \co{per_thread()} primitive accesses the specified thread's variable. |
| |
| \subsubsection{\tco{__get_thread_var()}} |
| |
| The \co{__get_thread_var()} primitive accesses the current thread's variable. |
| |
| \subsubsection{\tco{init_per_thread()}} |
| |
| The \co{init_per_thread()} primitive sets all threads' instances of |
| the specified variable to the specified value. |
The Linux kernel accomplishes this via normal C initialization,
relying on clever use of linker scripts and code executed during
the CPU-online process.
| |
| \subsubsection{Usage Example} |
| |
| Suppose that we have a counter that is incremented very frequently |
| but read out quite rarely. |
| As will become clear in |
| Section~\ref{sec:toolsoftrade:Performance}, |
| it is helpful to implement such a counter using a per-thread variable. |
| Such a variable can be defined as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| DEFINE_PER_THREAD(int, counter); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The counter must be initialized as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| init_per_thread(counter, 0); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| A thread can increment its instance of this counter as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| __get_thread_var(counter)++; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The value of the counter is then the sum of its instances. |
| A snapshot of the value of the counter can thus be collected |
| as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| for_each_thread(i) |
| sum += per_thread(counter, i); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
Again, it is possible to obtain a similar effect using other mechanisms,
but per-thread variables combine convenience and high performance.
| |
| \subsection{Performance} |
| \label{sec:toolsoftrade:Performance} |
| |
| It is instructive to compare the performance of the locked increment |
| shown in |
| Section~\ref{sec:toolsoftrade:Atomic Operations} |
| to that of per-CPU (or per-thread) variables |
| (see Section~\ref{sec:toolsoftrade:Per-CPU Variables}), |
| as well as to conventional increment (as in ``\co{counter++}''). |
| |
| % \emph{@@@ need parable on cache thrashing.} |
| |
| % \emph{@@@ more here using performance results from a modest multiprocessor.} |
| |
| % \emph{@@@ Also work in something about critical-section size? Or put later?} |
| |
| The difference in performance is quite large, to put it mildly. |
| The purpose of this book is to help you write SMP programs, |
| perhaps with realtime response, while avoiding such performance |
| pitfalls. |
| Chapter~\ref{chp:Counting} |
| starts this process by describing a few parallel counting algorithms. |
| |
| \section{The Right Tool for the Job: How to Choose?} |
| \label{sec:toolsoftrade:The Right Tool for the Job: How to Choose?} |
| |
| As a rough rule of thumb, use the simplest tool that will get the job done. |
| If you can, simply program sequentially. |
| If that is insufficient, try using a shell script to mediate parallelism. |
| If the resulting shell-script \co{fork()}/\co{exec()} overhead |
| (about 480 microseconds for a minimal C program on an Intel Core Duo |
| laptop) is too |
| large, try using the C-language \co{fork()} and \co{wait()} primitives. |
| If the overhead of these primitives (about 80 microseconds for a minimal |
| child process) is still too large, then you |
| might need to use the POSIX threading primitives, choosing the appropriate |
| locking and/or atomic-operation primitives. |
| If the overhead of the POSIX threading primitives (typically sub-microsecond) |
| is too great, then the primitives introduced in |
| Chapter~\ref{chp:Deferred Processing} may be required. |
| Always remember that inter-process communication and message-passing |
| can be good alternatives to shared-memory multithreaded execution. |
| |
| \QuickQuiz{} |
| Wouldn't the shell normally use \co{vfork()} rather than |
| \co{fork()}? |
| \QuickQuizAnswer{ |
It might well do so; however, checking is left as an exercise
for the reader.
But in the meantime, I hope that we can agree that \co{vfork()}
is a variant of \co{fork()}, so that we can use \co{fork()}
as a generic term covering both.
| } \QuickQuizEnd |
| |
| Of course, the actual overheads will depend not only on your hardware, |
| but most critically on the manner in which you use the primitives. |
| In particular, randomly hacking multi-threaded code is a spectacularly |
| bad idea, especially given that shared-memory parallel systems use |
| your own intelligence against you: |
| The smarter you are, the deeper a hole you will dig for yourself before |
| you realize that you are in trouble~\cite{DeadlockEmpire2016}. |
| Therefore, it is necessary to make the right design choices as well as |
| the correct choice of individual primitives, |
| as is discussed at length in subsequent chapters. |