blob: b5d9654b1571f0947ce18056900511b783d7a2c1 [file] [log] [blame]
% owned/owned.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Data Ownership}{Data Ownership}{qqzowned}
%
\Epigraph{It is mine, I tell you.
My own.
My precious.
Yes, my precious.}
{Gollum in \emph{The Fellowship of the Ring}, J.R.R.~Tolkien}
One of the simplest ways to avoid the synchronization overhead that
comes with locking is to parcel the data out among the threads (or,
in the case of kernels, CPUs)
so that a given piece of data is accessed and modified by only one
of the threads.
Interestingly enough, data ownership covers each of the ``big three''
parallel design techniques:
It partitions over threads (or CPUs, as the case may be),
it batches all local operations,
and its elimination of synchronization operations is weakening
carried to its logical extreme.
It should therefore be no surprise that data ownership is heavily used:
Even novices use it almost instinctively.
In fact, it is so heavily used that this chapter will not introduce any
new examples, but will instead refer back to those of previous chapters.
\QuickQuiz{
What form of data ownership is extremely difficult
to avoid when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
}\QuickQuizAnswer{
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
}\QuickQuizEnd
There are a number of approaches to data ownership.
\Cref{sec:owned:Multiple Processes} presents the logical extreme
in data ownership, where each thread has its own private address space.
\Cref{sec:owned:Partial Data Ownership and pthreads} looks at
the opposite extreme, where the data is shared, but different threads
own different access rights to the data.
\Cref{sec:owned:Function Shipping} describes function shipping,
which is a way of allowing other threads to have indirect access to
data owned by a particular thread.
\Cref{sec:owned:Designated Thread} describes how designated
threads can be assigned ownership of a specified function and the
related data.
\Cref{sec:owned:Privatization} discusses improving performance
by transforming algorithms with shared data to instead use data ownership.
Finally, \cref{sec:owned:Other Uses of Data Ownership} lists
a few software environments that feature data ownership as a
first-class citizen.
\section{Multiple Processes}
\label{sec:owned:Multiple Processes}
%
\epigraph{A man's home is his castle}{Ancient Laws of England}
\Cref{sec:toolsoftrade:Scripting Languages}
introduced the following example:
\begin{VerbatimN}[samepage=true]
compute_it 1 > compute_it.1.out &
compute_it 2 > compute_it.2.out &
wait
cat compute_it.1.out
cat compute_it.2.out
\end{VerbatimN}
This example runs two instances of the \co{compute_it} program in
parallel, as separate processes that do not share memory.
Therefore, all data in a given process is owned by that process,
so that almost the entirety of data in the above example is owned.
This approach almost entirely eliminates synchronization overhead.
The resulting combination of extreme simplicity and optimal performance
is obviously quite attractive.
\QuickQuizSeries{%
\QuickQuizB{
What synchronization remains in the example shown in
\cref{sec:owned:Multiple Processes}?
}\QuickQuizAnswerB{
The creation of the threads via the \co{sh} \co{&} operator
and the joining of thread via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
acccessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate}
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
}\QuickQuizEndB
%
\QuickQuizE{
Is there any shared data in the example shown in
\cref{sec:owned:Multiple Processes}?
}\QuickQuizAnswerE{
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitute
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent and scoring points
against hapless classmates or colleagues, but are mostly a way
of avoiding getting anything useful done.
}\QuickQuizEndE
}
This same pattern can be written in C as well as in \co{sh}, as illustrated by
\cref{lst:toolsoftrade:Using the fork() Primitive,%
lst:toolsoftrade:Using the wait() Primitive}.
It bears repeating that these trivial forms of parallelism are not in
any way cheating or ducking responsibility, but are rather simple and
elegant ways to make your code run faster.
It is fast, scales well, is easy to program, easy to maintain, and
gets the job done.
In addition, taking this approach (where applicable) allows the developer
more time to focus on other things whether these things might involve
applying sophisticated single-threaded optimizations to \co{compute_it}
on the one hand, or applying sophisticated parallel-programming patterns
to portions of the code where this approach is inapplicable.
What is not to like?
The next section discusses the use of data ownership in shared-memory
parallel programs.
\section{Partial Data Ownership and pthreads}
\label{sec:owned:Partial Data Ownership and pthreads}
%
\epigraph{Give thy mind more to what thou hast than to what thou hast not.}
{Marcus Aurelius Antoninus}
Concurrent counting (see \cref{chp:Counting}) uses data ownership heavily,
but adds a twist.
Threads are not allowed to modify data owned by other threads,
but they are permitted to read it.
In short, the use of shared memory allows more nuanced notions
of ownership and access rights.
For example, consider the per-thread statistical counter implementation
shown in
\cref{lst:count:Per-Thread Statistical Counters} on
\cpageref{lst:count:Per-Thread Statistical Counters}.
Here, \co{inc_count()} updates only the corresponding thread's
instance of \co{counter},
while \co{read_count()} accesses, but does not modify, all
threads' instances of \co{counter}.
\QuickQuiz{
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
}\QuickQuizAnswer{
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for
the reader, as is identifying other algorithms with similar
ownership patterns.
}\QuickQuizEnd
Partial data ownership is also common within the Linux kernel.
For example, a given CPU might be permitted to read a given set of its
own per-CPU variables only with interrupts disabled, another CPU might
be permitted to read that same set of the first CPU's per-CPU variables
only when holding the corresponding per-CPU lock.
Then that given CPU would be permitted to update this set of its own
per-CPU variables if it both has interrupts disabled and holds its
per-CPU lock.
This arrangement can be thought of as a reader-writer lock that allows
each CPU very low-overhead access to its own set of per-CPU variables.
There are a great many variations on this theme.
For its own part, pure data ownership is also both common and useful,
for example, the per-thread memory-allocator caches discussed in
\cref{sec:SMPdesign:Resource Allocator Caches}
starting on
\cpageref{sec:SMPdesign:Resource Allocator Caches}.
In this algorithm, each thread's cache is completely private to that
thread.
\section{Function Shipping}
\label{sec:owned:Function Shipping}
%
\epigraph{If the mountain will not come to Muhammad, then Muhammad must
go to the mountain.}
{\emph{Essays}, Francis Bacon}
The previous section described a weak form of data ownership where
threads reached out to other threads' data.
This can be thought of as bringing the data to the functions that
need it.
An alternative approach is to send the functions to the data.
Such an approach is illustrated in
\cref{sec:count:Signal-Theft Limit Counter Design}
beginning on
\cpageref{sec:count:Signal-Theft Limit Counter Design},
in particular the \co{flush_local_count_sig()} and
\co{flush_local_count()} functions in
\cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}
on
\cpageref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}.
The \co{flush_local_count_sig()} function is a signal handler that
acts as the shipped function.
The \co{pthread_kill()} function in \co{flush_local_count()}
sends the signal---shipping the function---and then waits until
the shipped function executes.
This shipped function has the not-unusual added complication of
needing to interact with any concurrently executing \co{add_count()}
or \co{sub_count()} functions (see
\cref{lst:count:Signal-Theft Limit Counter Add Function}
on
\cpageref{lst:count:Signal-Theft Limit Counter Add Function} and
\cref{lst:count:Signal-Theft Limit Counter Subtract Function}
on
\cpageref{lst:count:Signal-Theft Limit Counter Subtract Function}).
\QuickQuiz{
What mechanisms other than POSIX signals may be used for function
shipping?
}\QuickQuizAnswer{
There is a very large number of such mechanisms, including:
\begin{enumerate}
\item System V message queues.
\item Shared-memory dequeue (see
\cref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
\end{enumerate}
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
}\QuickQuizEnd
\section{Designated Thread}
\label{sec:owned:Designated Thread}
%
\epigraph{Let a man practice the profession which he best knows.}
{Cicero}
The earlier sections describe ways of allowing each thread to keep its
own copy or its own portion of the data.
In contrast, this section describes a functional-decomposition approach,
where a special designated thread owns the rights to the data
that is required to do its job.
\begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
The eventually consistent counter implementation described in
\cref{sec:count:Eventually Consistent Implementation} provides an example.
This implementation has a designated thread that runs the
\co{eventual()} function shown on \clnrefrange{b}{e} of
\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}.
This \co{eventual()} thread periodically pulls the per-thread counts
into the global counter, so that accesses to the global counter will,
as the name says, eventually converge on the actual value.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
But none of the data in the \co{eventual()} function shown on
\clnrefrange{b}{e} of
\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:count:count_stat_eventual:whole]
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on \clnref{per_thr_cnt}
of the listing.
This situation is similar to that described in
\cref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
\clnref{t,sum} of the listing.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
\end{fcvref}
}\QuickQuizEnd
\section{Privatization}
\label{sec:owned:Privatization}
%
\epigraph{There is, of course, a difference between what a man seizes
and what he really possesses.}
{Pearl S.~Buck}
One way of improving the performance and scalability of a shared-memory
parallel program is to transform it so as to convert shared data to
private data that is owned by a particular thread.
An excellent example of this is shown in the answer to one of the
Quick Quizzes in
\cref{sec:SMPdesign:Dining Philosophers Problem},
which uses privatization to produce a solution to the
Dining Philosophers problem with much better performance and scalability
than that of the standard textbook solution.
The original problem has five philosophers sitting around the table
with one fork between each adjacent pair of philosophers, which permits
at most two philosophers to eat concurrently.
We can trivially privatize this problem by providing an additional five
forks, so that each philosopher has his or her own private pair of forks.
This allows all five philosophers to eat concurrently, and also offers
a considerable reduction in the spread of certain types of disease.
In other cases, privatization imposes costs.
For example, consider the simple limit counter shown in
\cref{lst:count:Simple Limit Counter Add; Subtract; and Read} on
\cpageref{lst:count:Simple Limit Counter Add; Subtract; and Read}.
This is an example of an algorithm where threads can read each others'
data, but are only permitted to update their own data.
A quick review of the algorithm shows that the only cross-thread
accesses are in the summation loop in \co{read_count()}.
If this loop is eliminated, we move to the more-efficient pure
data ownership, but at the cost of a less-accurate result
from \co{read_count()}.
\QuickQuiz{
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
}\QuickQuizAnswer{
Yes.
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but only
a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache thrashing for the common case of updating counters.
}\QuickQuizEnd
Partial privatization is also possible, with some synchronization
requirements, but less than in the fully shared case.
Some partial-privatization possibilities were explored in
\cref{sec:toolsoftrade:Avoiding Data Races}.
\Cref{chp:Deferred Processing} will introduce a temporal component
to data ownership by providing ways of safely taking public data
structures private.
In short, privatization is a powerful tool in the parallel programmer's
toolbox, but it must nevertheless be used with care.
Just like every other synchronization primitive, it has the potential
to increase complexity while decreasing performance and scalability.
\section{Other Uses of Data Ownership}
\label{sec:owned:Other Uses of Data Ownership}
%
\epigraph{Everything comes to us that belongs to us if we create the
capacity to receive it.}
{Rabindranath Tagore}
Data ownership works best when the data can be partitioned so that there
is little or no need for cross thread access or update.
Fortunately, this situation is reasonably common, and in a wide variety
of parallel-programming environments.
Examples of data ownership include:
\begin{enumerate}
\item All message-passing environments, such as MPI~\cite{MPIForum2008}
and BOINC~\cite{BOINC2008}.
\item Map-reduce~\cite{MapReduce2008MIT}.
\item Client-server systems, including RPC, web services, and
pretty much any system with a back-end database server.
\item Shared-nothing database systems.
\item Fork-join systems with separate per-process address spaces.
\item Process-based parallelism, such as the Erlang language.
\item Private variables, for example, C-language on-stack auto variables,
in threaded environments.
\item Many parallel linear-algebra algorithms, especially those
well-suited for GPGPUs.\footnote{
But note that a great many other classes of applications
have also been ported to
GPGPUs~\cite{NormMatloff2017ParProcBook,AMD2020ROCm,NVidia2017GPGPU,NVidia2017GPGPU-university}.}
\item Operating-system kernels adapted for networking, where each connection
(also called \emph{flow}~\cite{Shenker89,ZhangPhD,McKenney90})
is assigned to a specific thread.
One recent example of this approach is the IX operating
system~\cite{Belay:2016:IOS:3014162.2997641}.
IX does have some shared data structures, which use synchronization
mechanisms to be described in
\cref{sec:defer:Read-Copy Update (RCU)}.
\end{enumerate}
Data ownership is perhaps the most underappreciated synchronization
mechanism in existence.
When used properly, it delivers unrivaled simplicity, performance,
and scalability.
Perhaps its simplicity costs it the respect that it deserves.
Hopefully a greater appreciation for the subtlety and power of data ownership
will lead to greater level of respect, to say nothing of leading to
greater performance and scalability coupled with reduced complexity.
% populate with problems showing benefits of coupling data ownership with
% other approaches. For example, work-stealing schedulers. Perhaps also
% move memory allocation here, though its current location is quite good.
\QuickQuizAnswersChp{qqzowned}