% toolsoftrade/toolsoftrade.tex
\QuickQuizChapter{chp:Tools of the Trade}{Tools of the Trade}
%
\Epigraph{You are only as good as your tools, and your tools are only
as good as you are.}{\emph{Unknown}}
This chapter provides a brief introduction to some basic tools of the
parallel-programming trade, focusing mainly on those available to
user applications running on operating systems similar to Linux.
Section~\ref{sec:toolsoftrade:Scripting Languages} begins with
scripting languages,
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}
describes the multi-process parallelism supported by the POSIX API and
touches on POSIX threads,
Section~\ref{sec:toolsoftrade:Alternatives to POSIX Operations}
presents analogous operations in other environments, and finally,
Section~\ref{sec:toolsoftrade:The Right Tool for the Job: How to Choose?}
helps to choose the tool that will get the job done.
\QuickQuiz{}
You call these tools???
They look more like low-level synchronization primitives to me!
\QuickQuizAnswer{
They look that way because they are in fact low-level synchronization
primitives.
But as such, they are in fact the fundamental tools for building
low-level concurrent software.
} \QuickQuizEnd
Please note that this chapter provides but a brief introduction.
More detail is available from the references cited (and especially
from the Internet), and more information
on how best to use these tools will be provided in later chapters.
\section{Scripting Languages}
\label{sec:toolsoftrade:Scripting Languages}
The Linux shell scripting languages provide simple but effective ways
of managing parallelism.
For example, suppose that you had a program \co{compute_it}
that you needed to run twice with two different sets of arguments.
This can be accomplished using UNIX shell scripting as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 compute_it 1 > compute_it.1.out &
2 compute_it 2 > compute_it.2.out &
3 wait
4 cat compute_it.1.out
5 cat compute_it.2.out
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{toolsoftrade/shellparallel}}
\caption{Execution Diagram for Parallel Shell Execution}
\label{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}
\end{figure}
Lines~1 and 2 launch two instances of this program, redirecting their
output to two separate files, with the \co{&} character directing the
shell to run the two instances of the program in the background.
Line~3 waits for both instances to complete, and lines~4 and 5
display their output.
The resulting execution is as shown in
Figure~\ref{fig:toolsoftrade:Execution Diagram for Parallel Shell Execution}:
the two instances of \co{compute_it} execute in parallel,
\co{wait} completes after both of them do, and then the two instances
of \co{cat} execute sequentially.
% @@@ Maui scheduler, load balancing, BOINC, and so on.
% @@@ Diagram showing parallel execution.
\QuickQuiz{}
But this silly shell script isn't a \emph{real} parallel program!
Why bother with such trivia???
\QuickQuizAnswer{
Because you should \emph{never} forget the simple stuff!
Please keep in mind that the title of this book is
``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
One of the most effective things you can do about it is to
avoid forgetting the simple stuff!
After all, if you choose to do parallel programming the hard
way, you have no one but yourself to blame.
} \QuickQuizEnd
\QuickQuiz{}
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
\QuickQuizAnswer{
One straightforward approach is the shell pipeline:
\begin{minipage}[t]{\columnwidth}
\vspace{0.1ex}
\small
\begin{verbatim}
grep $pattern1 | sed -e 's/a/b/' | sort
\end{verbatim}
\vspace{0.1ex}
\end{minipage}
For a sufficiently large input file,
\co{grep} will pattern-match in parallel with \co{sed}
editing and with the input processing of \co{sort}.
See the file \co{parallel.sh} for a demonstration of
shell-script parallelism and pipelining.
} \QuickQuizEnd
For another example, the \co{make} software-build scripting language
provides a \co{-j} option that specifies how much parallelism should be
introduced into the build process.
For example, typing \co{make -j4} when building a Linux kernel
specifies that up to four parallel compiles be carried out concurrently.
It is hoped that these simple examples convince you that parallel
programming need not always be complex or difficult.
\QuickQuiz{}
But if script-based parallel programming is so easy, why
bother with anything else?
\QuickQuizAnswer{
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations:
\begin{enumerate}
\item Creation of new processes is usually quite heavyweight,
involving the expensive \co{fork()} and \co{exec()}
system calls.
\item Sharing of data, including pipelining, typically involves
expensive file I/O.
\item The reliable synchronization primitives available to
scripts also typically involve expensive file I/O.
\item Scripting languages are often too slow, but can nevertheless
be quite useful when coordinating execution of long-running
programs written in lower-level programming languages.
\end{enumerate}
These limitations require that script-based parallelism use
coarse-grained parallelism, with each unit of work having
execution time of at least tens of milliseconds, and preferably
much longer.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
} \QuickQuizEnd
\section{POSIX Multiprocessing}
\label{sec:toolsoftrade:POSIX Multiprocessing}
This section scratches the surface of the
POSIX environment, including pthreads~\cite{OpenGroup1997pthreads},
as this environment is readily available and widely implemented.
Section~\ref{sec:toolsoftrade:POSIX Process Creation and Destruction}
provides a glimpse of the POSIX \co{fork()} and related primitives,
Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}
touches on thread creation and destruction,
Section~\ref{sec:toolsoftrade:POSIX Locking} gives a brief overview
of POSIX locking, and, finally,
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking} describes a
special kind of lock that can be used for data that is read by many threads and only
occasionally updated.
\subsection{POSIX Process Creation and Destruction}
\label{sec:toolsoftrade:POSIX Process Creation and Destruction}
Processes are created using the \co{fork()} primitive, may
be destroyed using the \co{kill()} primitive, and may destroy
themselves using the \co{exit()} primitive.
A process executing a \co{fork()} primitive is said to be the ``parent''
of the newly created process.
A parent may wait on its children using the \co{wait()} primitive.
Please note that the examples in this section are quite simple.
Real-world applications using these primitives might need to manipulate
signals, file descriptors, shared memory segments, and any number of
other resources.
In addition, some applications need to take specific actions if a given
child terminates, and might also need to be concerned with the reason
that the child terminated.
These concerns can of course add substantial complexity to the code.
For more information, see any of a number of textbooks on the
subject~\cite{WRichardStevens1992,StewartWeiss2013UNIX}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 pid = fork();
2 if (pid == 0) {
3 /* child */
4 } else if (pid < 0) {
5 /* parent, upon error */
6 perror("fork");
7 exit(-1);
8 } else {
9 /* parent, pid == child ID */
10 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Using the \tco{fork()} Primitive}
\label{fig:toolsoftrade:Using the fork() Primitive}
\end{figure}
If \co{fork()} succeeds, it returns twice, once for the parent
and again for the child.
The value returned from \co{fork()} allows the caller to tell
the difference, as shown in
Figure~\ref{fig:toolsoftrade:Using the fork() Primitive}
(\path{forkjoin.c}).
Line~1 executes the \co{fork()} primitive, and saves its return value
in local variable \co{pid}.
Line~2 checks to see if \co{pid} is zero, in which case, this is the
child, which continues on to execute line~3.
As noted earlier, the child may terminate via the \co{exit()} primitive.
Otherwise, this is the parent, which checks for an error return from
the \co{fork()} primitive on line~4, and prints an error and exits
on lines~5-7 if so.
Otherwise, the \co{fork()} has executed successfully, and the parent
therefore executes line~9 with the variable \co{pid} containing the
process ID of the child.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void waitall(void)
2 {
3 int pid;
4 int status;
5
6 for (;;) {
7 pid = wait(&status);
8 if (pid == -1) {
9 if (errno == ECHILD)
10 break;
11 perror("wait");
12 exit(-1);
13 }
14 }
15 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Using the \tco{wait()} Primitive}
\label{fig:toolsoftrade:Using the wait() Primitive}
\end{figure}
The parent process may use the \co{wait()} primitive to wait for its children
to complete.
However, use of this primitive is a bit more complicated than its shell-script
counterpart, as each invocation of \co{wait()} waits for but one child
process.
It is therefore customary to wrap \co{wait()} into a function similar
to the \co{waitall()} function shown in
Figure~\ref{fig:toolsoftrade:Using the wait() Primitive}
(\path{api-pthread.h}),
with this \co{waitall()} function having semantics similar to the
shell-script \co{wait} command.
Each pass through the loop spanning lines~6-15 waits on one child process.
Line~7 invokes the \co{wait()} primitive, which blocks until a child process
exits, and returns that child's process ID.
If the process ID is instead $-1$, this indicates that the \co{wait()}
primitive was unable to wait on a child.
If so, line~9 checks for the \co{ECHILD} errno, which indicates that there
are no more child processes, so that line~10 exits the loop.
Otherwise, lines~11 and 12 print an error and exit.
\QuickQuiz{}
Why does this \co{wait()} primitive need to be so complicated?
Why not just make it work like the shell-script \co{wait} does?
\QuickQuizAnswer{
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
it is not hard to build a \co{waitall()} function out of
the \co{wait()} function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int x = 0;
2 int pid;
3
4 pid = fork();
5 if (pid == 0) { /* child */
6 x = 1;
7 printf("Child process set x=1\n");
8 exit(0);
9 }
10 if (pid < 0) { /* parent, upon error */
11 perror("fork");
12 exit(-1);
13 }
14 waitall();
15 printf("Parent process sees x=%d\n", x);
\end{verbbox}
}
\centering
\theverbbox
\caption{Processes Created Via \tco{fork()} Do Not Share Memory}
\label{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
\end{figure}
It is critically important to note that the parent and child do \emph{not}
share memory.
This is illustrated by the program shown in
Figure~\ref{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
(\path{forkjoinvar.c}),
in which the child sets a global variable \co{x} to 1 on line~6,
prints a message on line~7, and exits on line~8.
The parent continues at line~14, where it waits on the child,
and on line~15 finds that its copy of the variable \co{x} is still zero.
The output is thus as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Child process set x=1
Parent process sees x=0
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickQuiz{}
Isn't there a lot more to \co{fork()} and \co{wait()}
than discussed here?
\QuickQuizAnswer{
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as \co{mmap()} and \co{shmget()}).
In the meantime, there are any number of textbooks that cover
these primitives in great detail,
and the truly motivated can read manpages, existing parallel
applications using these primitives, as well as the
source code of the Linux-kernel implementations themselves.
It is important to note that the parent process in
Figure~\ref{fig:toolsoftrade:Processes Created Via fork() Do Not Share Memory}
waits until after the child terminates to do its \co{printf()}.
Using \co{printf()}'s buffered I/O concurrently to the same file
from multiple processes is non-trivial, and is best avoided.
If you really need to do concurrent buffered I/O,
consult the documentation for your OS.
For UNIX/Linux systems, Stewart Weiss's lecture notes provide
a good introduction with informative
examples~\cite{StewartWeiss2013UNIX}.
} \QuickQuizEnd
The finest-grained parallelism requires shared memory, and
this is covered in
Section~\ref{sec:toolsoftrade:POSIX Thread Creation and Destruction}.
That said, shared-memory parallelism can be significantly more complex
than fork-join parallelism.
\subsection{POSIX Thread Creation and Destruction}
\label{sec:toolsoftrade:POSIX Thread Creation and Destruction}
To create a thread within an existing process, invoke the
\co{pthread_create()} primitive, for example, as shown on lines~15
and~16 of
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
(\path{pcreate.c}).
The first argument is a pointer to a \co{pthread_t} in which to store the
ID of the thread to be created, the second \co{NULL} argument is a pointer
to an optional \co{pthread_attr_t}, the third argument is the function
(in this case, \co{mythread()})
that is to be invoked by the new thread, and the last \co{NULL} argument
is the argument that will be passed to \co{mythread()}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int x = 0;
2
3 void *mythread(void *arg)
4 {
5 x = 1;
6 printf("Child process set x=1\n");
7 return NULL;
8 }
9
10 int main(int argc, char *argv[])
11 {
12 pthread_t tid;
13 void *vp;
14
15 if (pthread_create(&tid, NULL,
16 mythread, NULL) != 0) {
17 perror("pthread_create");
18 exit(-1);
19 }
20 if (pthread_join(tid, &vp) != 0) {
21 perror("pthread_join");
22 exit(-1);
23 }
24 printf("Parent process sees x=%d\n", x);
25 return 0;
26 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Threads Created Via \tco{pthread_create()} Share Memory}
\label{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
\end{figure}
In this example, \co{mythread()} simply returns, but it could instead
call \co{pthread_exit()}.
\QuickQuiz{}
If the \co{mythread()} function in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
can simply return, why bother with \co{pthread_exit()}?
\QuickQuizAnswer{
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where \co{mythread()}
invokes other functions, possibly separately compiled.
In such a case, \co{pthread_exit()} allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to \co{mythread()}.
} \QuickQuizEnd
The \co{pthread_join()} primitive, shown on line~20, is analogous to
the fork-join \co{wait()} primitive.
It blocks until the thread specified by the \co{tid} variable completes
execution, either by invoking \co{pthread_exit()} or by returning from
the thread's top-level function.
The thread's exit value will be stored through the pointer passed as the
second argument to \co{pthread_join()}.
The thread's exit value is either the value passed to \co{pthread_exit()}
or the value returned by the thread's top-level function, depending on
how the thread in question exits.
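For example, a trivial sketch (with a hypothetical \co{myworker()}
thread function invented for this example) might hand a small integer
status back through \co{pthread_join()} as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void *myworker(void *arg)
{
  pthread_exit((void *)42); /* Or: return (void *)42; */
}

/* ... meanwhile, in the parent thread ... */
void *vp;

if (pthread_join(tid, &vp) != 0) {
  perror("pthread_join");
  exit(-1);
}
printf("myworker() returned %ld\n", (long)vp);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Smuggling an integer through a \co{void *} in this way is a common
convenience, but a pointer to a heap-allocated result works just as well.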
The program shown in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
produces output as follows, demonstrating that memory is in fact
shared between the two threads:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Child process set x=1
Parent process sees x=1
\end{verbatim}
\end{minipage}
\vspace{5pt}
Note that this program carefully makes sure that only one of the threads
stores a value to variable \co{x} at a time.
Any situation in which one thread might be storing a value to a given
variable while some other thread either loads from or stores to that
same variable is termed a ``data race''.
Because the C language makes no guarantee that the results of a data race
will be in any way reasonable, we need some way of safely accessing
and modifying data concurrently, such as the locking primitives discussed
in the following section.
\QuickQuiz{}
If the C language makes no guarantees in the presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken???
\QuickQuizAnswer{
Ah, but the Linux kernel is written in a carefully selected
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in the presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
Nevertheless, even in the Linux kernel, data races can be
quite dangerous and should be avoided where
feasible~\cite{JonCorbet2012ACCESS:ONCE}.
} \QuickQuizEnd
\subsection{POSIX Locking}
\label{sec:toolsoftrade:POSIX Locking}
The POSIX standard allows the programmer to avoid data races via
``POSIX locking''.
POSIX locking features a number of primitives, the most fundamental
of which are \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()}.
These primitives operate on locks, which are of type \co{pthread_mutex_t}.
These locks may be declared statically and initialized with
\co{PTHREAD_MUTEX_INITIALIZER}, or they may be allocated dynamically
and initialized using the \co{pthread_mutex_init()} primitive.
The demonstration code in this section will take the former course.
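For concreteness, here is a minimal sketch of both initialization
styles (the dynamically allocated case is invented for this example,
with error checking following the conventions used elsewhere in this
chapter):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Static allocation and initialization. */
pthread_mutex_t static_lock = PTHREAD_MUTEX_INITIALIZER;

/* Dynamic allocation and initialization. */
pthread_mutex_t *dlp = malloc(sizeof(*dlp));

if (dlp == NULL || pthread_mutex_init(dlp, NULL) != 0) {
  perror("pthread_mutex_init");
  exit(-1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}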
The \co{pthread_mutex_lock()} primitive ``acquires'' the specified lock,
and the \co{pthread_mutex_unlock()} ``releases'' the specified lock.
Because these are ``exclusive'' locking primitives,
only one thread may ``hold'' a given lock at any given time.
For example, if a pair of threads attempt to acquire the same lock
concurrently, one of the pair will be ``granted'' the lock first, and
the other will wait until the first thread releases the lock.
A simple and reasonably useful programming model permits a given data item
to be accessed only while holding the corresponding
lock~\cite{Hoare74}.
\QuickQuiz{}
What if I want several threads to hold the same lock at the
same time?
\QuickQuizAnswer{
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use \co{pthread_create()} to create the other threads.
The question of why this would ever be a good idea is left
to the reader.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
2 pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
3 int x = 0;
4
5 void *lock_reader(void *arg)
6 {
7 int i;
8 int newx = -1;
9 int oldx = -1;
10 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
11
12 if (pthread_mutex_lock(pmlp) != 0) {
13 perror("lock_reader:pthread_mutex_lock");
14 exit(-1);
15 }
16 for (i = 0; i < 100; i++) {
17 newx = READ_ONCE(x);
18 if (newx != oldx) {
19 printf("lock_reader(): x = %d\n", newx);
20 }
21 oldx = newx;
22 poll(NULL, 0, 1);
23 }
24 if (pthread_mutex_unlock(pmlp) != 0) {
25 perror("lock_reader:pthread_mutex_unlock");
26 exit(-1);
27 }
28 return NULL;
29 }
30
31 void *lock_writer(void *arg)
32 {
33 int i;
34 pthread_mutex_t *pmlp = (pthread_mutex_t *)arg;
35
36 if (pthread_mutex_lock(pmlp) != 0) {
37 perror("lock_writer:pthread_mutex_lock");
38 exit(-1);
39 }
40 for (i = 0; i < 3; i++) {
41 WRITE_ONCE(x, READ_ONCE(x) + 1);
42 poll(NULL, 0, 5);
43 }
44 if (pthread_mutex_unlock(pmlp) != 0) {
45 perror("lock_writer:pthread_mutex_unlock");
46 exit(-1);
47 }
48 return NULL;
49 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Exclusive Locks}
\label{fig:toolsoftrade:Demonstration of Exclusive Locks}
\end{figure}
This exclusive-locking property is demonstrated using the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
(\path{lock.c}).
Line~1 defines and initializes a POSIX lock named \co{lock_a}, while
line~2 similarly defines and initializes a lock named \co{lock_b}.
Line~3 defines and initializes a shared variable \co{x}.
Lines~5-28 define the function \co{lock_reader()}, which repeatedly
reads the shared variable \co{x} while holding
the lock specified by \co{arg}.
Line~10 casts \co{arg} to a pointer to a \co{pthread_mutex_t}, as
required by the \co{pthread_mutex_lock()} and \co{pthread_mutex_unlock()}
primitives.
\QuickQuiz{}
Why not simply make the argument to \co{lock_reader()}
on line~5 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
be a pointer to a \co{pthread_mutex_t}?
\QuickQuizAnswer{
Because we will need to pass \co{lock_reader()} to
\co{pthread_create()}.
Although we could cast the function when passing it to
\co{pthread_create()}, function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
} \QuickQuizEnd
Lines~12-15 acquire the specified \co{pthread_mutex_t}, checking
for errors and exiting the program if any occur.
Lines~16-23 repeatedly check the value of \co{x}, printing the new value
each time that it changes.
Line~22 sleeps for one millisecond, which allows this demonstration
to run nicely on a uniprocessor machine.
Lines~24-27 release the \co{pthread_mutex_t}, again checking for
errors and exiting the program if any occur.
Finally, line~28 returns \co{NULL}, again to match the function type
required by \co{pthread_create()}.
\QuickQuiz{}
Writing four lines of code for each acquisition and release
of a \co{pthread_mutex_t} sure seems painful!
Isn't there a better way?
\QuickQuizAnswer{
Indeed!
And for that reason, the \co{pthread_mutex_lock()} and
\co{pthread_mutex_unlock()} primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them with the Linux kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
} \QuickQuizEnd
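In practice, the error-checking boilerplate is usually hidden behind
wrapper functions.
A minimal sketch follows, with the names \co{xlock()} and \co{xunlock()}
invented for this example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void xlock(pthread_mutex_t *mp)
{
  if (pthread_mutex_lock(mp) != 0) {
    perror("pthread_mutex_lock");
    exit(-1);
  }
}

void xunlock(pthread_mutex_t *mp)
{
  if (pthread_mutex_unlock(mp) != 0) {
    perror("pthread_mutex_unlock");
    exit(-1);
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}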
Lines~31-49 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
show \co{lock_writer()}, which
periodically updates the shared variable \co{x} while holding the
specified \co{pthread_mutex_t}.
As with \co{lock_reader()}, line~34 casts \co{arg} to a pointer
to \co{pthread_mutex_t}, lines~36-39 acquire the specified lock,
and lines~44-47 release it.
While holding the lock, lines~40-43 increment the shared variable \co{x},
sleeping for five milliseconds between each increment.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 printf("Creating two threads using same lock:\n");
2 if (pthread_create(&tid1, NULL,
3 lock_reader, &lock_a) != 0) {
4 perror("pthread_create");
5 exit(-1);
6 }
7 if (pthread_create(&tid2, NULL,
8 lock_writer, &lock_a) != 0) {
9 perror("pthread_create");
10 exit(-1);
11 }
12 if (pthread_join(tid1, &vp) != 0) {
13 perror("pthread_join");
14 exit(-1);
15 }
16 if (pthread_join(tid2, &vp) != 0) {
17 perror("pthread_join");
18 exit(-1);
19 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Same Exclusive Lock}
\label{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
\end{figure}
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
shows a code fragment that runs \co{lock_reader()} and
\co{lock_writer()} as threads using the same lock, namely, \co{lock_a}.
Lines~2-6 create a thread running \co{lock_reader()}, and then
lines~7-11 create a thread running \co{lock_writer()}.
Lines~12-19 wait for both threads to complete.
The output of this code fragment is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Creating two threads using same lock:
lock_reader(): x = 0
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because both threads are using the same lock, the \co{lock_reader()}
thread cannot see any of the intermediate values of \co{x} that
\co{lock_writer()} produces while holding the lock.
\QuickQuiz{}
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
If so, why?
If not, what other output could appear, and why?
\QuickQuizAnswer{
No.
The reason that ``x = 0'' was output was that \co{lock_reader()}
acquired the lock first.
Had \co{lock_writer()} instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started \co{lock_reader()} first
and because this run was performed on a multiprocessor,
one would normally expect \co{lock_reader()} to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 printf("Creating two threads w/different locks:\n");
2 x = 0;
3 if (pthread_create(&tid1, NULL,
4 lock_reader, &lock_a) != 0) {
5 perror("pthread_create");
6 exit(-1);
7 }
8 if (pthread_create(&tid2, NULL,
9 lock_writer, &lock_b) != 0) {
10 perror("pthread_create");
11 exit(-1);
12 }
13 if (pthread_join(tid1, &vp) != 0) {
14 perror("pthread_join");
15 exit(-1);
16 }
17 if (pthread_join(tid2, &vp) != 0) {
18 perror("pthread_join");
19 exit(-1);
20 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Demonstration of Different Exclusive Locks}
\label{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
\end{figure}
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
shows a similar code fragment, but this time using different locks:
\co{lock_a} for \co{lock_reader()} and \co{lock_b} for
\co{lock_writer()}.
The output of this code fragment is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
Creating two threads w/different locks:
lock_reader(): x = 0
lock_reader(): x = 1
lock_reader(): x = 2
lock_reader(): x = 3
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because the two threads are using different locks, they do not exclude
each other, and can run concurrently.
The \co{lock_reader()} function can therefore see the intermediate
values of \co{x} stored by \co{lock_writer()}.
\QuickQuiz{}
Using different locks could cause quite a bit of confusion,
what with threads seeing each other's intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
\QuickQuizAnswer{
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'',
which is currently a research topic.
Transactional\-/memory semantics can be loosely thought of as those
of a single global lock with optimizations permitted and
with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
} \QuickQuizEnd
\QuickQuiz{}
In the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
is \co{lock_reader()} guaranteed to see all the values produced
by \co{lock_writer()}?
Why or why not?
\QuickQuizAnswer{
No.
On a busy system, \co{lock_reader()} might be preempted
for the entire duration of \co{lock_writer()}'s execution,
in which case it would not see \emph{any} of \co{lock_writer()}'s
intermediate states for \co{x}.
} \QuickQuizEnd
\QuickQuiz{}
Wait a minute here!!!
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
didn't initialize shared variable \co{x},
so why does it need to be initialized in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
\QuickQuizAnswer{
See line~3 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Because the code in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
ran first, it could rely on the compile-time initialization of
\co{x}.
The code in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
ran next, so it had to re-initialize \co{x}.
} \QuickQuizEnd
Although there is quite a bit more to POSIX exclusive locking, these
primitives provide a good start and are in fact sufficient in a great
many situations.
The next section takes a brief look at POSIX reader-writer locking.
\subsection{POSIX Reader-Writer Locking}
\label{sec:toolsoftrade:POSIX Reader-Writer Locking}
The POSIX API provides a reader-writer lock, which is represented by
a \co{pthread_rwlock_t}.
As with \co{pthread_mutex_t}, \co{pthread_rwlock_t} may be statically
initialized via \co{PTHREAD_RWLOCK_INITIALIZER} or dynamically
initialized via the \co{pthread_rwlock_init()} primitive.
The \co{pthread_rwlock_rdlock()} primitive read-acquires the
specified \co{pthread_rwlock_t}, the \co{pthread_rwlock_wrlock()}
primitive write-acquires it, and the \co{pthread_rwlock_unlock()}
primitive releases it.
Only a single thread may write-hold a given \co{pthread_rwlock_t}
at any given time, but multiple threads may read-hold a given
\co{pthread_rwlock_t}, at least while there is no thread
currently write-holding it.
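A minimal usage sketch follows, with \co{data_lock} and the
critical-section comments invented for this example and error checking
omitted for brevity:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
pthread_rwlock_t data_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Reader: may run concurrently with other readers. */
pthread_rwlock_rdlock(&data_lock);
/* ... read the data ... */
pthread_rwlock_unlock(&data_lock);

/* Writer: excludes both readers and other writers. */
pthread_rwlock_wrlock(&data_lock);
/* ... update the data ... */
pthread_rwlock_unlock(&data_lock);
\end{verbatim}
\end{minipage}
\vspace{5pt}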
As you might expect, reader-writer locks are designed for read-mostly
situations.
In these situations, a reader-writer lock can provide greater scalability
than can an exclusive lock because the exclusive lock is by definition
limited to a single thread holding the lock at any given time, while
the reader-writer lock permits
an arbitrarily large number of readers to concurrently hold the lock.
However, in practice, we need to know how much additional scalability is
provided by reader-writer locks.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
2 int holdtime = 0;
3 int thinktime = 0;
4 long long *readcounts;
5 int nreadersrunning = 0;
6
7 #define GOFLAG_INIT 0
8 #define GOFLAG_RUN 1
9 #define GOFLAG_STOP 2
10 char goflag = GOFLAG_INIT;
11
12 void *reader(void *arg)
13 {
14 int i;
15 long long loopcnt = 0;
16 long me = (long)arg;
17
18 __sync_fetch_and_add(&nreadersrunning, 1);
19 while (READ_ONCE(goflag) == GOFLAG_INIT) {
20 continue;
21 }
22 while (READ_ONCE(goflag) == GOFLAG_RUN) {
23 if (pthread_rwlock_rdlock(&rwl) != 0) {
24 perror("pthread_rwlock_rdlock");
25 exit(-1);
26 }
27 for (i = 1; i < holdtime; i++) {
28 barrier();
29 }
30 if (pthread_rwlock_unlock(&rwl) != 0) {
31 perror("pthread_rwlock_unlock");
32 exit(-1);
33 }
34 for (i = 1; i < thinktime; i++) {
35 barrier();
36 }
37 loopcnt++;
38 }
39 readcounts[me] = loopcnt;
40 return NULL;
41 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Measuring Reader-Writer Lock Scalability}
\label{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}
\end{figure}
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}
(\path{rwlockscale.c})
shows one way of measuring reader-writer lock scalability.
Line~1 shows the definition and initialization of the reader-writer
lock, line~2 shows the \co{holdtime} argument controlling the
time each thread holds the reader-writer lock,
line~3 shows the \co{thinktime} argument controlling the time between
the release of the reader-writer lock and the next acquisition,
line~4 defines the \co{readcounts} array into which each reader thread
places the number of times it acquired the lock, and
line~5 defines the \co{nreadersrunning} variable, which
determines when all reader threads have started running.
Lines~7-10 define \co{goflag}, which synchronizes the start and the
end of the test.
This variable is initially set to \co{GOFLAG_INIT}, then set to
\co{GOFLAG_RUN} after all the reader threads have started, and finally
set to \co{GOFLAG_STOP} to terminate the test run.
Lines~12-41 define \co{reader()}, which is the reader thread.
Line~18 atomically increments the \co{nreadersrunning} variable
to indicate that this thread is now running, and
lines~19-21 wait for the test to start.
The \co{READ_ONCE()} primitive forces the compiler to fetch \co{goflag}
on each pass through the loop---the compiler would otherwise be within its
rights to assume that the value of \co{goflag} would never change.
\QuickQuiz{}
Instead of using \co{READ_ONCE()} everywhere, why not just
declare \co{goflag} as \co{volatile} on line~10 of
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickQuizAnswer{
A \co{volatile} declaration is in fact a reasonable alternative in
this particular case.
However, use of \co{READ_ONCE()} has the benefit of clearly
flagging to the reader that \co{goflag} is subject to concurrent
reads and updates.
In addition, \co{READ_ONCE()} is especially useful in cases where
most of the accesses are protected by a lock (and thus \emph{not}
subject to change), but where a few of the accesses are made outside
of the lock.
Using a volatile declaration in this case would make it harder
for the reader to note the special accesses outside of the lock,
and would also make it harder for the compiler to generate good
code under the lock.
} \QuickQuizEnd
\QuickQuiz{}
\co{READ_ONCE()} only affects the compiler, not the CPU.
Don't we also need memory barriers to make sure
that the change in \co{goflag}'s value propagates to the
CPU in a timely fashion in
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickQuizAnswer{
No, memory barriers are not needed and won't help here.
Memory barriers only enforce ordering among multiple
memory references: They do absolutely nothing to expedite
the propagation of data from one part of the system to
another.
This leads to a quick rule of thumb: You do not need
memory barriers unless you are using more than one
variable to communicate between multiple threads.
But what about \co{nreadersrunning}?
Isn't that a second variable used for communication?
Indeed it is, and there really are the needed memory-barrier
instructions buried in \co{__sync_fetch_and_add()},
which make sure that the thread proclaims its presence
before checking to see if it should start.
} \QuickQuizEnd
\QuickQuiz{}
Would it ever be necessary to use \co{READ_ONCE()} when accessing
a per-thread variable, for example, a variable declared using
the \co{gcc} \co{__thread} storage class?
\QuickQuizAnswer{
It depends.
If the per-thread variable was accessed only from its thread,
and never from a signal handler, then no.
Otherwise, it is quite possible that \co{READ_ONCE()} is needed.
We will see examples of both situations in
Section~\ref{sec:count:Signal-Theft Limit Counter Implementation}.
This leads to the question of how one thread can gain access to
another thread's \co{__thread} variable, and the answer is that
the second thread must store a pointer to its \co{__thread}
pointer somewhere that the first thread has access to.
One common approach is to maintain a linked list with one
element per thread, and to store the address of each thread's
\co{__thread} variable in the corresponding element.
} \QuickQuizEnd
The loop spanning lines~22-38 carries out the performance test.
Lines~23-26 acquire the lock, lines~27-29 hold the lock for the specified
duration (and the \co{barrier()} directive prevents the compiler from
optimizing the loop out of existence), lines~30-33 release the lock,
and lines~34-36 wait for the specified duration before re-acquiring the
lock.
Line~37 counts this lock acquisition.
Line~39 moves the lock-acquisition count to this thread's element of the
\co{readcounts[]} array, and line~40 returns, terminating this thread.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{CodeSamples/toolsoftrade/rwlockscale}}
\caption{Reader-Writer Lock Scalability}
\label{fig:toolsoftrade:Reader-Writer Lock Scalability}
\end{figure}
Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability}
shows the results of running this test on a 64-core Power-5 system
with two hardware threads per core for a total of 128 software-visible
CPUs.
The \co{thinktime} parameter was zero for all these tests, and the
\co{holdtime} parameter was set to values ranging from one thousand (``1K''
on the graph) to 100 million (``100M'' on the graph).
The actual value plotted is:
\begin{equation}
\frac{L_N}{N L_1}
\end{equation}
where $N$ is the number of threads,
$L_N$ is the number of lock acquisitions by $N$ threads, and
$L_1$ is the number of lock acquisitions by a single thread.
Given ideal hardware and software scalability, this value will always
be 1.0.
As can be seen in the figure, reader-writer locking scalability is
decidedly non-ideal, especially for smaller sizes of critical
sections.
To see why read-acquisition can be so slow, consider
that all the acquiring threads must update the \co{pthread_rwlock_t}
data structure.
Therefore, if all 128 executing threads attempt to
read-acquire the reader-writer lock concurrently, they must update
this underlying \co{pthread_rwlock_t} one at a time.
One lucky thread might do so almost immediately, but the least-lucky
thread must wait for the other 127 threads to do their updates.
This situation will only get worse as you add CPUs.
\QuickQuiz{}
Isn't comparing against single-CPU throughput a bit harsh?
\QuickQuizAnswer{
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
} \QuickQuizEnd
\QuickQuiz{}
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
\QuickQuizAnswer{
If the data being read \emph{never} changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per
thread, so that a thread read-acquires the larger aggregate
reader-writer lock by acquiring its own lock, and write-acquires
by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}.
This can work quite well for readers, but causes writers
to incur increasingly large overheads as the number of threads
increases.
Some other ways of handling very small critical sections are
described in Chapter~\ref{chp:Deferred Processing}.
} \QuickQuizEnd
\QuickQuiz{}
In
Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability},
all of the traces other than the 100M trace deviate gently
from the ideal line.
In contrast, the 100M trace breaks sharply from the ideal
line at 64 CPUs.
In addition, the spacing between the 100M trace and the 10M
trace is much smaller than that between the 10M trace and the
1M trace.
Why does the 100M trace behave so differently from the
other traces?
\QuickQuizAnswer{
Your first clue is that 64 CPUs is exactly half of the 128
CPUs on the machine.
The difference is an artifact of hardware threading.
This system has 64 cores with two hardware threads per core.
As long as fewer than 64 threads are running, each can run
in its own core.
But as soon as there are more than 64 threads, some of the threads
must share cores.
Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not
quite as high as that of two threads each in their own core.
So the performance of the 100M trace is limited not by the
reader-writer lock, but rather by the sharing of hardware resources
between hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from
the ideal line up to 64 threads, then breaks sharply down, parallel
to the 100M trace.
Up to 64 threads, the 10M trace is limited primarily by reader-writer
lock scalability, and beyond that, also by sharing of hardware
resources between hardware threads in a single core.
} \QuickQuizEnd
\QuickQuiz{}
Power-5 is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
\QuickQuizAnswer{
In general, newer hardware is improving.
However, it will need to improve by more than two orders of magnitude
to permit reader-writer locks to achieve ideal performance on
128 CPUs.
Worse yet, the greater the number of CPUs, the larger the
required performance improvement.
The performance problems of reader-writer locking are therefore
very likely to be with us for quite some time to come.
} \QuickQuizEnd
Despite these limitations, reader-writer locking is quite useful in many
cases, for example when the readers must do high-latency file or network I/O.
There are alternatives, some of which will be presented in
Chapters~\ref{chp:Counting} and \ref{chp:Deferred Processing}.
\subsection{Atomic Operations (gcc Classic)}
\label{sec:toolsoftrade:Atomic Operations (gcc Classic)}
Given that
Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability}
shows that the overhead of reader-writer locking is most severe for the
smallest critical sections, it would be nice to have some other way
to protect the tiniest of critical sections.
One such way is to use atomic operations.
We have already seen one atomic operation, in the form of the
\co{__sync_fetch_and_add()} primitive on line~18 of
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}.
This primitive atomically adds the value of its second argument to
the value referenced by its first argument, returning the old value
(which was ignored in this case).
If a pair of threads concurrently execute \co{__sync_fetch_and_add()} on
the same variable, the resulting value of the variable will include
the result of both additions.
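For example, in the following sketch (invented for this example),
running a pair of \co{count_thread()} threads reliably leaves
\co{counter} equal to 2,000,000, which is decidedly not guaranteed
when using a plain non-atomic \co{counter++}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
unsigned long counter = 0;

void *count_thread(void *arg)
{
  int i;

  for (i = 0; i < 1000000; i++)
    __sync_fetch_and_add(&counter, 1);
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}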
The {\sf gcc} compiler offers a number of additional atomic operations,
including \co{__sync_fetch_and_sub()},
\co{__sync_fetch_and_or()},
\co{__sync_fetch_and_and()},
\co{__sync_fetch_and_xor()}, and
\co{__sync_fetch_and_nand()}, all of which return the old value.
If you instead need the new value, you can instead use the
\co{__sync_add_and_fetch()},
\co{__sync_sub_and_fetch()},
\co{__sync_or_and_fetch()},
\co{__sync_and_and_fetch()},
\co{__sync_xor_and_fetch()}, and
\co{__sync_nand_and_fetch()} primitives.
\QuickQuiz{}
Is it really necessary to have both sets of primitives?
\QuickQuizAnswer{
Strictly speaking, no.
One could implement any member of the second set using the
corresponding member of the first set.
For example, one could implement \co{__sync_nand_and_fetch()}
in terms of \co{__sync_fetch_and_nand()} as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp);
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is similarly possible to implement \co{__sync_fetch_and_add()},
\co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()}
in terms of their post-value counterparts.
However, the alternative forms can be quite convenient, both
for the programmer and for the compiler/library implementor.
} \QuickQuizEnd
The classic compare-and-swap operation is provided by a pair of
primitives, \co{__sync_bool_compare_and_swap()} and
\co{__sync_val_compare_and_swap()}.
Both of these primitives atomically update a location to a new value,
but only if its prior value was equal to the specified old value.
The first variant returns 1 if the operation succeeded and 0 if it
failed, for example, if the prior value was not equal to the specified
old value.
The second variant returns the prior value of the location, which, if
equal to the specified old value, indicates that the operation succeeded.
Either of these compare-and-swap operations is ``universal'' in the sense
that any atomic operation on a single location can be implemented in
terms of compare-and-swap, though the earlier operations are often
more efficient where they apply.
The compare-and-swap operation is also capable of serving as the basis
for a wider set of atomic operations, though the more elaborate of
these often suffer from complexity, scalability, and performance
problems~\cite{MauriceHerlihy90a}.
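For example, an atomic-maximum operation, which gcc does not supply
directly, can be built from a compare-and-swap loop.
The following minimal sketch (with the name \co{atomic_max()} invented
for this example) retries until its update is applied to an unchanged
value:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void atomic_max(unsigned long *p, unsigned long v)
{
  unsigned long old;

  do {
    old = READ_ONCE(*p);
    if (old >= v)
      return; /* Already at least v. */
  } while (!__sync_bool_compare_and_swap(p, old, v));
}
\end{verbatim}
\end{minipage}
\vspace{5pt}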
The \co{__sync_synchronize()} primitive issues a ``memory barrier'',
which constrains both the compiler's and the CPU's ability to reorder
operations, as discussed in Section~\ref{sec:advsync:Memory Barriers}.
In some cases, it is sufficient to constrain the compiler's ability
to reorder operations, while allowing the CPU free rein, in which
case the \co{barrier()} primitive may be used, as it in fact was
on line~28 of
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}.
In some cases, it is only necessary to ensure that the compiler
avoids optimizing away a given memory read, in which case the
\co{READ_ONCE()} primitive may be used, as it was on line~17 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Similarly, the \co{WRITE_ONCE()} primitive may be used to prevent the
compiler from optimizing away a given memory write.
These last three primitives are not provided directly by gcc,
but may be implemented straightforwardly as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#define READ_ONCE(x) ACCESS_ONCE(x)
#define WRITE_ONCE(x, val) ({ ACCESS_ONCE(x) = (val); })
#define barrier() __asm__ __volatile__("": : :"memory")
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickQuiz{}
Given that these atomic operations will often be able to
generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn't
they be the fastest possible way to get things done?
\QuickQuizAnswer{
Unfortunately, no.
See Chapter~\ref{chp:Counting} for some stark counterexamples.
} \QuickQuizEnd
\subsection{Atomic Operations (C11)}
\label{sec:toolsoftrade:Atomic Operations (C11)}
The C11 standard added atomic operations,
including loads (\co{atomic_load()}),
stores (\co{atomic_store()}),
memory barriers (\co{atomic_thread_fence()} and
\co{atomic_signal_fence()}), and read-modify-write atomics.
The read-modify-write atomics include
\co{atomic_fetch_add()},
\co{atomic_fetch_sub()},
\co{atomic_fetch_and()},
\co{atomic_fetch_xor()},
\co{atomic_exchange()},
\co{atomic_compare_exchange_strong()},
and
\co{atomic_compare_exchange_weak()}.
These operate in a manner similar to those described in
Section~\ref{sec:toolsoftrade:Atomic Operations (gcc Classic)},
but with the addition of memory-order arguments to \co{_explicit}
variants of all of the operations.
Without memory-order arguments, all the atomic operations are
fully ordered, and the arguments permit weaker orderings.
For example, ``\co{atomic_load_explicit(&a, memory_order_relaxed)}''
is vaguely similar to the Linux kernel's ``\co{READ_ONCE()}''.\footnote{
Memory ordering is described in more detail in
Section~\ref{sec:advsync:Memory Barriers} and
Appendix~\ref{chp:app:whymb:Why Memory Barriers?}.}
One restriction of the C11 atomics is that they apply only to special
atomic types, which can be restrictive.
The gcc compiler therefore provides
\co{__atomic_load()},
\co{__atomic_load_n()},
\co{__atomic_store()},
\co{__atomic_store_n()}, etc.
These primitives offer the same semantics as their C11 counterparts,
but may be used on plain non-atomic objects.
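For example, the following sketch (with variable names invented for
this example) uses these built-ins on a plain \co{int} flag:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
int ready; /* Plain int, not _Atomic. */

/* Producer: publish the flag with release ordering. */
__atomic_store_n(&ready, 1, __ATOMIC_RELEASE);

/* Consumer: spin with acquire ordering. */
while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
  continue;
\end{verbatim}
\end{minipage}
\vspace{5pt}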
\subsection{Per-Thread Variables}
\label{sec:toolsoftrade:Per-Thread Variables}
Per-thread variables, also called thread-specific data, thread-local
storage, and other less-polite names, are used extremely
heavily in concurrent code, as will be explored in
Chapters~\ref{chp:Counting} and~\ref{chp:Data Ownership}.
POSIX supplies the \co{pthread_key_create()} function to create a
per-thread variable (and return the corresponding key),
\co{pthread_key_delete()} to delete the per-thread variable corresponding
to key,
\co{pthread_setspecific()} to set the value of the current thread's
variable corresponding to the specified key,
and \co{pthread_getspecific()} to return that value.
A number of compilers (including gcc) provide a \co{__thread} specifier
that may be used in a variable definition to designate that variable
as being per-thread.
The name of the variable may then be used normally to access the
value of the current thread's instance of that variable.
Of course, \co{__thread} is much easier to use than the POSIX
thread-specific data, and so \co{__thread} is usually preferred for
code that is to be built only with gcc or other compilers supporting
\co{__thread}.
Fortunately, the C11 standard introduced a \co{_Thread_local} keyword
that can be used in place of \co{__thread}.
In the fullness of time, this new keyword should combine the ease of use
of \co{__thread} with the portability of POSIX thread-specific data.
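The following fragments sketch both approaches (the variable and key
names are invented for this example, and error checking is omitted
for brevity):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* gcc __thread: use like an ordinary variable. */
__thread unsigned long counter;

counter++; /* This thread's instance only. */

/* POSIX thread-specific data. */
pthread_key_t key;

pthread_key_create(&key, NULL);       /* Once. */
pthread_setspecific(key, (void *)42); /* Per thread. */
void *vp = pthread_getspecific(key);  /* Per thread. */
\end{verbatim}
\end{minipage}
\vspace{5pt}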
\section{Alternatives to POSIX Operations}
\label{sec:toolsoftrade:Alternatives to POSIX Operations}
Unfortunately, threading operations, locking primitives, and atomic
operations were in reasonably wide use long before the various standards
committees got around to them.
As a result, there is considerable variation in how these operations
are supported.
It is still quite common to find these operations implemented in
assembly language, either for historical reasons or to obtain better
performance in specialized circumstances.
For example, the gcc \co{__sync_} family of primitives all provide full
memory-ordering semantics, which in the past motivated many developers
to create their own implementations for situations where the full memory
ordering semantics are not required.
The following sections show some alternatives from the Linux kernel
and some historical primitives used by this book's sample code.
\subsection{Organization and Initialization}
\label{sec:toolsoftrade:Organization and Initialization}
Although many environments do not require any special initialization
code, the code samples in this book start with a call to \co{smp_init()},
which initializes a mapping from \co{pthread_t} to consecutive integers.
The userspace RCU library similarly requires a call to \co{rcu_init()}.
Although these calls can be hidden in environments (such as that of
gcc) that support constructors,
most of the RCU flavors supported by the userspace RCU library
also require that each thread invoke \co{rcu_register_thread()} upon thread
creation and \co{rcu_unregister_thread()} before thread exit.
In the case of the Linux kernel, it is a philosophical question as to
whether the kernel does not require calls to special initialization
code or whether the kernel's boot-time code is in fact the required
initialization code.
\subsection{Thread Creation, Destruction, and Control}
\label{sec:toolsoftrade:Thread Creation, Destruction, and Control}
The Linux kernel uses
\co{struct task_struct} pointers to track kthreads,
\co{kthread_create()} to create them,
\co{kthread_should_stop()} to externally suggest that they stop
(which has no POSIX equivalent),
\co{kthread_stop()} to wait for them to stop, and
\co{schedule_timeout_interruptible()} for a timed wait.
There are quite a few additional kthread-management APIs, but this
provides a good start, as well as good search terms.
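A minimal kthread sketch, with all names invented for this example,
might therefore be structured as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
static struct task_struct *my_task;

static int my_kthread(void *arg)
{
  while (!kthread_should_stop()) {
    /* Do one unit of work. */
    schedule_timeout_interruptible(HZ); /* ~1 second. */
  }
  return 0;
}

/* Creation: kthread_run() is create plus wakeup. */
my_task = kthread_run(my_kthread, NULL, "my_kthread");

/* Shutdown: request stop, then wait for the exit. */
kthread_stop(my_task);
\end{verbatim}
\end{minipage}
\vspace{5pt}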
The CodeSamples API focuses on ``threads'', which are a locus of
control.\footnote{
There are many other names for similar software constructs, including
``process'', ``task'', ``fiber'', ``event'', and so on.
Similar design principles apply to all of them.}
Each such thread has an identifier of type \co{thread_id_t},
and no two threads running at a given time will have the same
identifier.
Threads share everything except for per-thread local state,\footnote{
How is that for a circular definition?}
which includes program counter and stack.
The thread API is shown in
Figure~\ref{fig:toolsoftrade:Thread API}, and members are described in the
following sections.
\begin{figure*}[tbp]
{ \scriptsize
\begin{verbbox}
int smp_thread_id(void)
thread_id_t create_thread(void *(*func)(void *), void *arg)
for_each_thread(t)
for_each_running_thread(t)
void *wait_thread(thread_id_t tid)
void wait_all_threads(void)
\end{verbbox}
}
\centering
\theverbbox
\caption{Thread API}
\label{fig:toolsoftrade:Thread API}
\end{figure*}
\subsubsection{\tco{create_thread()}}
The \co{create_thread()} primitive creates a new thread,
starting the new thread's execution
at the function \co{func} specified by \co{create_thread()}'s
first argument, and passing it the argument specified by
\co{create_thread()}'s second argument.
This newly created thread will terminate when it returns from the
starting function specified by \co{func}.
The \co{create_thread()} primitive returns the \co{thread_id_t}
corresponding to the newly created child thread.
This primitive will abort the program if more than \co{NR_THREADS}
threads are created, counting the one implicitly created by running
the program.
\co{NR_THREADS} is a compile-time constant that may be modified,
though some systems may have an upper bound for the allowable number
of threads.
\subsubsection{\tco{smp_thread_id()}}
Because the \co{thread_id_t} returned from \co{create_thread()} is
system-dependent, the \co{smp_thread_id()} primitive returns a thread
index corresponding to the thread making the request.
This index is guaranteed to be less than the maximum number of threads
that have been in existence since the program started,
and is therefore useful for bitmasks, array indices, and
the like.
\subsubsection{\tco{for_each_thread()}}
The \co{for_each_thread()} macro loops through all threads that exist,
including all threads that \emph{would} exist if created.
This macro is useful for handling per-thread variables as will be
seen in Section~\ref{sec:toolsoftrade:Per-Thread Variables}.
\subsubsection{\tco{for_each_running_thread()}}
The \co{for_each_running_thread()}
macro loops through only those threads that currently exist.
It is the caller's responsibility to synchronize with thread
creation and deletion if required.
\subsubsection{\tco{wait_thread()}}
The \co{wait_thread()} primitive waits for completion of the thread
specified by the \co{thread_id_t} passed to it.
This in no way interferes with the execution of the specified thread;
instead, it merely waits for it.
Note that \co{wait_thread()} returns the value that was returned by
the corresponding thread.
\subsubsection{\tco{wait_all_threads()}}
The \co{wait_all_threads()}
primitive waits for completion of all currently running threads.
It is the caller's responsibility to synchronize with thread creation
and deletion if required.
However, this primitive is normally used to clean up at the end of
a run, so such synchronization is normally not needed.
\subsubsection{Example Usage}
Figure~\ref{fig:toolsoftrade:Example Child Thread}
shows an example hello-world-like child thread.
As noted earlier, each thread is allocated its own stack, so
each thread has its own private \co{arg} argument and \co{myarg} variable.
Each child simply prints its argument and its \co{smp_thread_id()}
before exiting.
Note that the \co{return} statement on line~7 terminates the thread,
returning a \co{NULL} to whoever invokes \co{wait_thread()} on this
thread.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void *thread_test(void *arg)
2 {
3 int myarg = (int)arg;
4
5 printf("child thread %d: smp_thread_id() = %d\n",
6 myarg, smp_thread_id());
7 return NULL;
8 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Example Child Thread}
\label{fig:toolsoftrade:Example Child Thread}
\end{figure}
The parent program is shown in
Figure~\ref{fig:toolsoftrade:Example Parent Thread}.
It invokes \co{smp_init()} to initialize the threading system on
line~6,
parses arguments on lines~7-14, and announces its presence on line~15.
It creates the specified number of child threads on lines~16-17,
and waits for them to complete on line~18.
Note that \co{wait_all_threads()} discards the threads' return values,
as in this case they are all \co{NULL}, which is not very interesting.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int main(int argc, char *argv[])
2 {
3 int i;
4 int nkids = 1;
5
6 smp_init();
7 if (argc > 1) {
8 nkids = strtoul(argv[1], NULL, 0);
9 if (nkids > NR_THREADS) {
10 fprintf(stderr, "nkids=%d too big, max=%d\n",
11 nkids, NR_THREADS);
12 usage(argv[0]);
13 }
14 }
15 printf("Parent spawning %d threads.\n", nkids);
16 for (i = 0; i < nkids; i++)
17 create_thread(thread_test, (void *)(intptr_t)i);
18 wait_all_threads();
19 printf("All threads completed.\n", nkids);
20 exit(0);
21 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Example Parent Thread}
\label{fig:toolsoftrade:Example Parent Thread}
\end{figure}
\QuickQuiz{}
What happened to the Linux-kernel equivalents to \co{fork()}
and \co{wait()}?
\QuickQuizAnswer{
They don't really exist.
All tasks executing within the Linux kernel share memory,
at least unless you want to do a huge amount of memory-mapping
work by hand.
} \QuickQuizEnd
\subsection{Locking}
\label{sec:toolsoftrade:Locking}
A good starting subset of the Linux kernel's locking API is shown in
Figure~\ref{fig:toolsoftrade:Locking API},
each API element being described in the following sections.
This book's CodeSamples locking API closely follows that of the Linux kernel.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
void spin_lock_init(spinlock_t *sp);
void spin_lock(spinlock_t *sp);
int spin_trylock(spinlock_t *sp);
void spin_unlock(spinlock_t *sp);
\end{verbbox}
}
\centering
\theverbbox
\caption{Locking API}
\label{fig:toolsoftrade:Locking API}
\end{figure}
\subsubsection{\tco{spin_lock_init()}}
The \co{spin_lock_init()} primitive initializes the specified
\co{spinlock_t} variable, and must be invoked before
this variable is passed to any other spinlock primitive.
\subsubsection{\tco{spin_lock()}}
The \co{spin_lock()} primitive acquires the specified spinlock,
if necessary, waiting until the spinlock becomes available.
In some environments, such as the Linux kernel, this waiting
involves ``spinning'', that is, repeatedly polling the lock in a tight
loop, while in others, such as userspace implementations layered atop
\co{pthread_mutex_lock()}, it instead involves blocking.
The key point is that only one thread may hold a spinlock at any
given time.
\subsubsection{\tco{spin_trylock()}}
The \co{spin_trylock()} primitive acquires the specified spinlock,
but only if it is immediately available.
It returns \co{true} if it was able to acquire the spinlock and
\co{false} otherwise.
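For example, a thread that has other work available might poll the
lock rather than waiting for it, as in the following sketch, which
assumes a spinlock \co{mylock} along with hypothetical
\co{do_other_work()} and \co{do_locked_work()} functions:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
while (!spin_trylock(&mylock))
  do_other_work(); /* make progress meanwhile */
do_locked_work();  /* critical section */
spin_unlock(&mylock);
\end{verbatim}
\end{minipage}
\vspace{5pt}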
\subsubsection{\tco{spin_unlock()}}
The \co{spin_unlock()} primitive releases the specified spinlock,
allowing other threads to acquire it.
% \emph{@@@ likely need to add reader-writer locking.}
\subsubsection{Example Usage}
A spinlock named \co{mutex} may be used to protect a variable
\co{counter} as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
spin_lock(&mutex);
counter++;
spin_unlock(&mutex);
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickQuiz{}
What problems could occur if the variable \co{counter} were
incremented without the protection of \co{mutex}?
\QuickQuizAnswer{
On CPUs with load-store architectures, incrementing \co{counter}
might compile into something like the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
LOAD counter,r0
INC r0
STORE r0,counter
\end{verbatim}
\end{minipage}
\vspace{5pt}
On such machines, two threads might simultaneously load the
value of \co{counter}, each increment it, and each store the
result.
The new value of \co{counter} will then be only one greater
than before, despite two threads each having incremented it.
} \QuickQuizEnd
However, the \co{spin_lock()} and \co{spin_unlock()} primitives
do have performance consequences, as will be seen in
Section~\ref{sec:toolsoftrade:Performance}.
\subsection{Atomic Operations}
\label{sec:toolsoftrade:Atomic Operations}
The Linux kernel provides a wide variety of atomic operations, but
those defined on type \co{atomic_t} provide a good start.
Normal non-tearing reads and stores are provided by
\co{atomic_read()} and \co{atomic_set()}, respectively.
Acquire load is provided by \co{smp_load_acquire()} and release
store by \co{smp_store_release()}.
Non-value-returning fetch-and-add operations are provided by
\co{atomic_add()}, \co{atomic_sub()}, \co{atomic_inc()}, and
\co{atomic_dec()}, among others.
An atomic decrement that returns a reached-zero indication is provided
by both \co{atomic_dec_and_test()} and \co{atomic_sub_and_test()}.
An atomic add that returns the new value is provided by
\co{atomic_add_return()}.
Both \co{atomic_add_unless()} and \co{atomic_inc_not_zero()} provide
conditional atomic operations, where nothing happens unless the
original value of the atomic variable is different from the value
specified (these are very handy for managing reference counters, for
example).
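For example, a reference counter that must not be incremented once it
has reached zero might be managed as in the following sketch, which
assumes an \co{atomic_t refcnt} initialized to 1 and a hypothetical
\co{release()} function:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
if (!atomic_inc_not_zero(&refcnt))
  return NULL;  /* object already being freed */

/* ... use the object ... */

if (atomic_dec_and_test(&refcnt))
  release();    /* that was the last reference */
\end{verbatim}
\end{minipage}
\vspace{5pt}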
An atomic exchange operation is provided by \co{atomic_xchg()}, and
the celebrated compare-and-swap (CAS) operation is provided by
\co{atomic_cmpxchg()}.
Both of these return the old value.
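For example, \co{atomic_cmpxchg()} can be used to construct ad-hoc
atomic operations, as in the following sketch, which atomically
doubles an \co{atomic_t} variable \co{v} defined elsewhere:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
int old, new;

do {
  old = atomic_read(&v);
  new = old * 2;
} while (atomic_cmpxchg(&v, old, new) != old);
\end{verbatim}
\end{minipage}
\vspace{5pt}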
Many additional atomic RMW primitives are available in the Linux kernel;
see the \co{Documentation/atomic_ops.txt} file in the Linux-kernel
source tree.
This book's CodeSamples API closely follows that of the Linux kernel.
\subsection{Per-CPU Variables}
\label{sec:toolsoftrade:Per-CPU Variables}
The Linux kernel uses \co{DEFINE_PER_CPU()} to define a per-CPU variable,
\co{this_cpu_ptr()} to form a reference to this CPU's instance of a
given per-CPU variable, \co{per_cpu()} to access a specified CPU's
instance of a given per-CPU variable, along with many other special-purpose
per-CPU operations.
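For example, a per-CPU event counter might be defined and used as in
the following kernel-style sketch, in which the function names are
hypothetical and \co{this_cpu_inc()} is one of the special-purpose
operations just mentioned:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
DEFINE_PER_CPU(unsigned long, n_events);

void record_event(void)
{
  this_cpu_inc(n_events); /* this CPU's instance */
}

unsigned long events_on(int cpu)
{
  return per_cpu(n_events, cpu);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}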
Figure~\ref{fig:toolsoftrade:Per-Thread-Variable API}
shows this book's per-thread-variable API, which is patterned
after the Linux kernel's per-CPU-variable API.
This API provides the per-thread equivalent of global variables.
Although this API is, strictly speaking, not necessary\footnote{
You could instead use \co{__thread} or \co{_Thread_local}.},
it can provide a good userspace analogy to Linux kernel code.
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
DEFINE_PER_THREAD(type, name)
DECLARE_PER_THREAD(type, name)
per_thread(name, thread)
__get_thread_var(name)
init_per_thread(name, v)
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread-Variable API}
\label{fig:toolsoftrade:Per-Thread-Variable API}
\end{figure}
\QuickQuiz{}
How could you work around the lack of a per-thread-variable
API on systems that do not provide it?
\QuickQuizAnswer{
One approach would be to create an array indexed by
\co{smp_thread_id()}, and another would be to use a hash
table to map from \co{smp_thread_id()} to an array
index---which is in fact what this
set of APIs does in pthread environments.
Another approach would be for the parent to allocate a structure
containing fields for each desired per-thread variable, then
pass this to the child during thread creation.
However, this approach can impose large software-engineering
costs in large systems.
To see this, imagine if all global variables in a large system
had to be declared in a single file, regardless of whether or
not they were C static variables!
} \QuickQuizEnd
\subsubsection{\tco{DEFINE_PER_THREAD()}}
The \co{DEFINE_PER_THREAD()} primitive defines a per-thread variable.
Unfortunately, it is not possible to provide an initializer in the way
permitted by the Linux kernel's \co{DEFINE_PER_CPU()} primitive,
but there is an \co{init_per_thread()} primitive that permits easy
runtime initialization.
\subsubsection{\tco{DECLARE_PER_THREAD()}}
The \co{DECLARE_PER_THREAD()} primitive is a declaration in the C sense,
as opposed to a definition.
Thus, a \co{DECLARE_PER_THREAD()} primitive may be used to access
a per-thread variable defined in some other file.
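For example, given the following (hypothetical) pair of files, any
file including \co{counter.h} may access the per-thread \co{counter}
variable defined in \co{counter.c}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* counter.c */
DEFINE_PER_THREAD(int, counter);

/* counter.h */
DECLARE_PER_THREAD(int, counter);
\end{verbatim}
\end{minipage}
\vspace{5pt}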
\subsubsection{\tco{per_thread()}}
The \co{per_thread()} primitive accesses the specified thread's variable.
\subsubsection{\tco{__get_thread_var()}}
The \co{__get_thread_var()} primitive accesses the current thread's variable.
\subsubsection{\tco{init_per_thread()}}
The \co{init_per_thread()} primitive sets all threads' instances of
the specified variable to the specified value.
The Linux kernel accomplishes this via normal C initialization,
relying on clever use of linker scripts and code executed during
the CPU-online process.
\subsubsection{Usage Example}
Suppose that we have a counter that is incremented very frequently
but read out quite rarely.
As will become clear in
Section~\ref{sec:toolsoftrade:Performance},
it is helpful to implement such a counter using a per-thread variable.
Such a variable can be defined as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
DEFINE_PER_THREAD(int, counter);
\end{verbatim}
\end{minipage}
\vspace{5pt}
The counter must be initialized as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
init_per_thread(counter, 0);
\end{verbatim}
\end{minipage}
\vspace{5pt}
A thread can increment its instance of this counter as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
__get_thread_var(counter)++;
\end{verbatim}
\end{minipage}
\vspace{5pt}
The value of the counter is then the sum of its instances.
A snapshot of the value of the counter can thus be collected
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
for_each_thread(i)
sum += per_thread(counter, i);
\end{verbatim}
\end{minipage}
\vspace{5pt}
Again, it is possible to gain a similar effect using other mechanisms,
but per-thread variables combine convenience and high performance.
\subsection{Performance}
\label{sec:toolsoftrade:Performance}
It is instructive to compare the performance of the locked increment
shown in
Section~\ref{sec:toolsoftrade:Atomic Operations}
to that of per-CPU (or per-thread) variables
(see Section~\ref{sec:toolsoftrade:Per-CPU Variables}),
as well as to conventional increment (as in ``\co{counter++}'').
% \emph{@@@ need parable on cache thrashing.}
% \emph{@@@ more here using performance results from a modest multiprocessor.}
% \emph{@@@ Also work in something about critical-section size? Or put later?}
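In a micro-benchmark's inner loop, the three variants might look as
follows, where \co{global_count} is a hypothetical global variable and
\co{counter} is the per-thread variable defined above, whose instances
must later be summed:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
global_count++;              /* unprotected: unsafe */

spin_lock(&mutex);           /* locked: safe, costly */
global_count++;
spin_unlock(&mutex);

__get_thread_var(counter)++; /* safe and fast */
\end{verbatim}
\end{minipage}
\vspace{5pt}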
The difference in performance is quite large, to put it mildly.
The purpose of this book is to help you write SMP programs,
perhaps with realtime response, while avoiding such performance
pitfalls.
Chapter~\ref{chp:Counting}
starts this process by describing a few parallel counting algorithms.
\section{The Right Tool for the Job: How to Choose?}
\label{sec:toolsoftrade:The Right Tool for the Job: How to Choose?}
As a rough rule of thumb, use the simplest tool that will get the job done.
If you can, simply program sequentially.
If that is insufficient, try using a shell script to mediate parallelism.
If the resulting shell-script \co{fork()}/\co{exec()} overhead
(about 480 microseconds for a minimal C program on an Intel Core Duo
laptop) is too
large, try using the C-language \co{fork()} and \co{wait()} primitives.
If the overhead of these primitives (about 80 microseconds for a minimal
child process) is still too large, then you
might need to use the POSIX threading primitives, choosing the appropriate
locking and/or atomic-operation primitives.
If the overhead of the POSIX threading primitives (typically sub-microsecond)
is too great, then the primitives introduced in
Chapter~\ref{chp:Deferred Processing} may be required.
Always remember that inter-process communication and message-passing
can be good alternatives to shared-memory multithreaded execution.
\QuickQuiz{}
Wouldn't the shell normally use \co{vfork()} rather than
\co{fork()}?
\QuickQuizAnswer{
It might well do that; however, checking is left as an exercise
for the reader.
But in the meantime, I hope that we can agree that \co{vfork()}
is a variant of \co{fork()}, so that we can use \co{fork()}
as a generic term covering both.
} \QuickQuizEnd
Of course, the actual overheads will depend not only on your hardware,
but most critically on the manner in which you use the primitives.
In particular, randomly hacking multi-threaded code is a spectacularly
bad idea, especially given that shared-memory parallel systems use
your own intelligence against you:
The smarter you are, the deeper a hole you will dig for yourself before
you realize that you are in trouble~\cite{DeadlockEmpire2016}.
Therefore, it is necessary to make the right design choices as well as
the correct choice of individual primitives,
as is discussed at length in subsequent chapters.