% owned/owned.tex
\QuickQuizChapter{chp:Data Ownership}{Data Ownership}
%
\Epigraph{It is mine, I tell you. My own. My precious. Yes, my precious.}
{\emph{Gollum in ``The Fellowship of the Ring'', J.R.R.~Tolkien}}
One of the simplest ways to avoid the synchronization overhead that
comes with locking is to parcel the data out among the threads (or,
in the case of kernels, CPUs)
so that a given piece of data is accessed and modified by only one
of the threads.
Interestingly enough, data ownership covers each of the ``big three''
parallel design techniques:
It partitions over threads (or CPUs, as the case may be),
it batches all local operations,
and its complete elimination of synchronization operations can be
thought of as weakening carried to its logical extreme.
It should therefore be no surprise that data ownership is used extremely
heavily; it is one usage pattern that even novices apply almost
instinctively.
Indeed, it is used so heavily that this chapter will not introduce
any new examples, but will instead reference examples from previous
chapters.
\QuickQuiz{}
What form of data ownership is extremely difficult
to avoid when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
\QuickQuizAnswer{
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
} \QuickQuizEnd
There are a number of approaches to data ownership.
Section~\ref{sec:owned:Multiple Processes} presents the logical extreme
in data ownership, where each thread has its own private address space.
Section~\ref{sec:owned:Partial Data Ownership and pthreads} looks at
the opposite extreme, where the data is shared, but different threads
own different access rights to the data.
Section~\ref{sec:owned:Function Shipping} describes function shipping,
which is a way of allowing other threads to have indirect access to
data owned by a particular thread.
Section~\ref{sec:owned:Designated Thread} describes how designated
threads can be assigned ownership of a specified function and the
related data.
Section~\ref{sec:owned:Privatization} discusses improving performance
by transforming algorithms with shared data to instead use data ownership.
Finally, Section~\ref{sec:owned:Other Uses of Data Ownership} lists
a few software environments that feature data ownership as a
first-class citizen.
\section{Multiple Processes}
\label{sec:owned:Multiple Processes}
Section~\ref{sec:toolsoftrade:Scripting Languages}
introduced the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
compute_it 1 > compute_it.1.out &
compute_it 2 > compute_it.2.out &
wait
cat compute_it.1.out
cat compute_it.2.out
\end{verbatim}
\end{minipage}
\vspace{5pt}
This example runs two instances of the \co{compute_it} program in
parallel, as separate processes that do not share memory.
Therefore, all data within a given process is owned by that process,
so that almost all of the data in the above example is owned.
This approach almost entirely eliminates synchronization overhead.
The resulting combination of extreme simplicity and optimal performance
is obviously quite attractive.
\QuickQuiz{}
What synchronization remains in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickQuizAnswer{
The creation of the processes via the \co{sh} \co{&} operator
and the joining of the processes via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
accessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate}
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
} \QuickQuizEnd
\QuickQuiz{}
Is there any shared data in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickQuizAnswer{
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent, scoring points
against hapless classmates or colleagues,
and (especially!) avoiding getting anything useful done.
} \QuickQuizEnd
This same pattern can be written in C as well as in \co{sh}, as illustrated by
Figures~\ref{fig:toolsoftrade:Using the fork() Primitive}
and~\ref{fig:toolsoftrade:Using the wait() Primitive}.
The next section discusses use of data ownership in shared-memory
parallel programs.
\section{Partial Data Ownership and pthreads}
\label{sec:owned:Partial Data Ownership and pthreads}
Chapter~\ref{chp:Counting} makes heavy use of data ownership,
but adds a twist.
Threads are not allowed to modify data owned by other threads,
but they are permitted to read it.
In short, the use of shared memory allows more nuanced notions
of ownership and access rights.
For example, consider the per-thread statistical counter implementation
shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
page~\pageref{fig:count:Per-Thread Statistical Counters}.
Here, \co{inc_count()} updates only the corresponding thread's
instance of \co{counter},
while \co{read_count()} accesses, but does not modify, all
threads' instances of \co{counter}.
\QuickQuiz{}
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
\QuickQuizAnswer{
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for the
reader, as is the task of identifying other algorithms with
similar ownership patterns.
} \QuickQuizEnd
Pure data ownership is also both common and useful, for example, the
per-thread memory-allocator caches discussed in
Section~\ref{sec:SMPdesign:Resource Allocator Caches}
starting on
page~\pageref{sec:SMPdesign:Resource Allocator Caches}.
In this algorithm, each thread's cache is completely private to that
thread.
\section{Function Shipping}
\label{sec:owned:Function Shipping}
The previous section described a weak form of data ownership where
threads reached out to other threads' data.
This can be thought of as bringing the data to the functions that
need it.
An alternative approach is to send the functions to the data.
Such an approach is illustrated in
Section~\ref{sec:count:Signal-Theft Limit Counter Design}
beginning on
page~\pageref{sec:count:Signal-Theft Limit Counter Design},
in particular the \co{flush_local_count_sig()} and
\co{flush_local_count()} functions in
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}.
The \co{flush_local_count_sig()} function is a signal handler that
acts as the shipped function.
The \co{pthread_kill()} function in \co{flush_local_count()}
sends the signal---shipping the function---and then waits until
the shipped function executes.
This shipped function has the not-unusual added complication of
needing to interact with any concurrently executing \co{add_count()}
or \co{sub_count()} functions (see
Figure~\ref{fig:count:Signal-Theft Limit Counter Add Function}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Add Function} and
Figure~\ref{fig:count:Signal-Theft Limit Counter Subtract Function}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Subtract Function}).
\QuickQuiz{}
What mechanisms other than POSIX signals may be used for function
shipping?
\QuickQuizAnswer{
There is a very large number of such mechanisms, including:
\begin{enumerate}
\item System V message queues.
\item Shared-memory dequeue (see
Section~\ref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
\end{enumerate}
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
} \QuickQuizEnd
\section{Designated Thread}
\label{sec:owned:Designated Thread}
The earlier sections describe ways of allowing each thread to keep its
own copy or its own portion of the data.
In contrast, this section describes a functional-decomposition approach,
where a special designated thread owns the rights to the data
that is required to do its job.
The eventually consistent counter implementation described in
Section~\ref{sec:count:Eventually Consistent Implementation} provides an example.
This implementation has a designated thread that runs the
\co{eventual()} function shown on lines~15--32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}.
This \co{eventual()} thread periodically pulls the per-thread counts
into the global counter, so that accesses to the global counter will,
as the name says, eventually converge on the actual value.
\QuickQuiz{}
But none of the data in the \co{eventual()} function shown on
lines~15--32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
\QuickQuizAnswer{
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on line~1
of the figure.
This situation is similar to that described in
Section~\ref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
lines~17 and~18 of the figure.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
} \QuickQuizEnd
\section{Privatization}
\label{sec:owned:Privatization}
One way of improving the performance and scalability of a shared-memory
parallel program is to transform it so as to convert shared data to
private data that is owned by a particular thread.
An excellent example of this is shown in the answer to one of the
Quick Quizzes in
Section~\ref{sec:SMPdesign:Dining Philosophers Problem},
which uses privatization to produce a solution to the
Dining Philosophers problem with much better performance and scalability
than that of the standard textbook solution.
The original problem has five philosophers sitting around the table
with one fork between each adjacent pair of philosophers, which permits
at most two philosophers to eat concurrently.
We can trivially privatize this problem by providing an additional five
forks, so that each philosopher has his or her own private pair of forks.
This allows all five philosophers to eat concurrently, and also offers
a considerable reduction in the spread of certain types of disease.
In other cases, privatization imposes costs.
For example, consider the simple limit counter shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read} on
page~\pageref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
This is an example of an algorithm where threads can read each other's
data, but are only permitted to update their own data.
A quick review of the algorithm shows that the only cross-thread
accesses are in the summation loop in \co{read_count()}.
If this loop is eliminated, we move to the more-efficient pure
data ownership, but at the cost of a less-accurate result
from \co{read_count()}.
\QuickQuiz{}
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
\QuickQuizAnswer{
Yes.
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but provides
only a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache-line bouncing for the common case of updating counters.
} \QuickQuizEnd
In short, privatization is a powerful tool in the parallel programmer's
toolbox, but it must nevertheless be used with care.
Just like every other synchronization primitive, it has the potential
to increase complexity while decreasing performance and scalability.
\section{Other Uses of Data Ownership}
\label{sec:owned:Other Uses of Data Ownership}
Data ownership works best when the data can be partitioned so that there
is little or no need for cross-thread access or update.
Fortunately, this situation is reasonably common, appearing in a wide
variety of parallel-programming environments.
Examples of data ownership include:
\begin{enumerate}
\item All message-passing environments, such as MPI~\cite{MPIForum2008}
and BOINC~\cite{BOINC2008}.
\item Map-reduce~\cite{MapReduce2008MIT}.
\item Client-server systems, including RPC, web services, and
pretty much any system with a back-end database server.
\item Shared-nothing database systems.
\item Fork-join systems with separate per-process address spaces.
\item Process-based parallelism, such as the Erlang language.
\item Private variables, for example, C-language on-stack auto variables,
in threaded environments.
\item Many parallel linear-algebra algorithms, especially those
well-suited for GPGPUs.\footnote{
But note that a great many other classes of applications
have been ported to
GPGPUs~\cite{NormMatloff2013ParProcBook,AMD2017OpenCL,NVidia2017GPGPU,NVidia2017GPGPU-university}.}
\end{enumerate}
Data ownership is perhaps the most underappreciated synchronization
mechanism in existence.
When used properly, it delivers unrivaled simplicity, performance,
and scalability.
Perhaps its simplicity costs it the respect that it deserves.
Hopefully a greater appreciation for the subtlety and power of data
ownership will lead to a greater level of respect, to say nothing of
greater performance and scalability coupled with reduced complexity.
% populate with problems showing benefits of coupling data ownership with
% other approaches. For example, work-stealing schedulers. Perhaps also
% move memory allocation here, though its current location is quite good.