owned/owned.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % owned/owned.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \QuickQuizChapter{chp:Data Ownership}{Data Ownership}{qqzowned}
 %
 \Epigraph{It is mine, I tell you.
 	  My own.
 	  My precious.
 	  Yes, my precious.}
 	 {Gollum in \emph{The Fellowship of the Ring}, J.R.R.~Tolkien}

 One of the simplest ways to avoid the synchronization overhead that
 comes with locking is to parcel the data out among the threads (or,
 in the case of kernels, CPUs)
 so that a given piece of data is accessed and modified by only one
 of the threads.
 Interestingly enough, data ownership covers each of the ``big three''
 parallel design techniques:
 It partitions over threads (or CPUs, as the case may be),
 it batches all local operations,
 and its elimination of synchronization operations is weakening
 carried to its logical extreme.
 It should therefore be no surprise that data ownership is heavily used:
 Even novices use it almost instinctively.
 In fact, it is so heavily used that this chapter will not introduce any
 new examples, but will instead refer back to those of previous chapters.

 \QuickQuiz{
 	What form of data ownership is extremely difficult
 	to avoid when creating shared-memory parallel programs
 	(for example, using pthreads) in C or C++?
 }\QuickQuizAnswer{
 	Use of auto variables in functions.
 	By default, these are private to the thread executing the
 	current function.
 }\QuickQuizEnd

 There are a number of approaches to data ownership.
 \Cref{sec:owned:Multiple Processes} presents the logical extreme
 in data ownership, where each thread has its own private address space.
 \Cref{sec:owned:Partial Data Ownership and pthreads} looks at
 the opposite extreme, where the data is shared, but different threads
 own different access rights to the data.
 \Cref{sec:owned:Function Shipping} describes function shipping,
 which is a way of allowing other threads to have indirect access to
 data owned by a particular thread.
 \Cref{sec:owned:Designated Thread} describes how designated
 threads can be assigned ownership of a specified function and the
 related data.
 \Cref{sec:owned:Privatization} discusses improving performance
 by transforming algorithms with shared data to instead use data ownership.
 Finally, \cref{sec:owned:Other Uses of Data Ownership} lists
 a few software environments that feature data ownership as a
 first-class citizen.

 \section{Multiple Processes}
 \label{sec:owned:Multiple Processes}
 %
 \epigraph{A man's home is his castle}{Ancient Laws of England}

 \Cref{sec:toolsoftrade:Scripting Languages}
 introduced the following example:

 \begin{VerbatimN}[samepage=true]
 compute_it 1 > compute_it.1.out &
 compute_it 2 > compute_it.2.out &
 wait
 cat compute_it.1.out
 cat compute_it.2.out
 \end{VerbatimN}

 This example runs two instances of the \co{compute_it} program in
 parallel, as separate processes that do not share memory.
 Therefore, all data in a given process is owned by that process,
 so that almost the entirety of data in the above example is owned.
 This approach almost entirely eliminates synchronization overhead.
 The resulting combination of extreme simplicity and optimal performance
 is obviously quite attractive.

 \QuickQuizSeries{%
 \QuickQuizB{
 	What synchronization remains in the example shown in
 	\cref{sec:owned:Multiple Processes}?
 }\QuickQuizAnswerB{
 	The creation of the threads via the \co{sh} \co{&} operator
 	and the joining of thread via the \co{sh} \co{wait}
 	command.

 	Of course, if the processes explicitly share memory, for
 	example, using the \co{shmget()} or \co{mmap()} system
 	calls, explicit synchronization might well be needed when
 	acccessing or updating the shared memory.
 	The processes might also synchronize using any of the following
 	interprocess communications mechanisms:
 	\begin{enumerate}
 	\item	System V semaphores.
 	\item	System V message queues.
 	\item	UNIX-domain sockets.
 	\item	Networking protocols, including TCP/IP, UDP, and a whole
 		host of others.
 	\item	File locking.
 	\item	Use of the \co{open()} system call with the
 		\co{O_CREAT} and \co{O_EXCL} flags.
 	\item	Use of the \co{rename()} system call.
 	\end{enumerate}
 	A complete list of possible synchronization mechanisms is left
 	as an exercise to the reader, who is warned that it will be
 	an extremely long list.
 	A surprising number of unassuming system calls can be pressed
 	into service as synchronization mechanisms.
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	Is there any shared data in the example shown in
 	\cref{sec:owned:Multiple Processes}?
 }\QuickQuizAnswerE{
 	That is a philosophical question.

 	Those wishing the answer ``no'' might argue that processes by
 	definition do not share memory.

 	Those wishing to answer ``yes'' might list a large number of
 	synchronization mechanisms that do not require shared memory,
 	note that the kernel will have some shared state, and perhaps
 	even argue that the assignment of process IDs (PIDs) constitute
 	shared data.

 	Such arguments are excellent intellectual exercise, and are
 	also a wonderful way of feeling intelligent and scoring points
 	against hapless classmates or colleagues, but are mostly a way
 	of avoiding getting anything useful done.
 }\QuickQuizEndE
 }

 This same pattern can be written in C as well as in \co{sh}, as illustrated by
 \cref{lst:toolsoftrade:Using the fork() Primitive,%
 lst:toolsoftrade:Using the wait() Primitive}.

 It bears repeating that these trivial forms of parallelism are not in
 any way cheating or ducking responsibility, but are rather simple and
 elegant ways to make your code run faster.
 It is fast, scales well, is easy to program, easy to maintain, and
 gets the job done.
 In addition, taking this approach (where applicable) allows the developer
 more time to focus on other things whether these things might involve
 applying sophisticated single-threaded optimizations to \co{compute_it}
 on the one hand, or applying sophisticated parallel-programming patterns
 to portions of the code where this approach is inapplicable.
 What is not to like?

 The next section discusses the use of data ownership in shared-memory
 parallel programs.

 \section{Partial Data Ownership and pthreads}
 \label{sec:owned:Partial Data Ownership and pthreads}
 %
 \epigraph{Give thy mind more to what thou hast than to what thou hast not.}
 	 {Marcus Aurelius Antoninus}

 Concurrent counting (see \cref{chp:Counting}) uses data ownership heavily,
 but adds a twist.
 Threads are not allowed to modify data owned by other threads,
 but they are permitted to read it.
 In short, the use of shared memory allows more nuanced notions
 of ownership and access rights.

 For example, consider the per-thread statistical counter implementation
 shown in
 \cref{lst:count:Per-Thread Statistical Counters} on
 \cpageref{lst:count:Per-Thread Statistical Counters}.
 Here, \co{inc_count()} updates only the corresponding thread's
 instance of \co{counter},
 while \co{read_count()} accesses, but does not modify, all
 threads' instances of \co{counter}.

 \QuickQuiz{
 	Does it ever make sense to have partial data ownership where
 	each thread reads only its own instance of a per-thread variable,
 	but writes to other threads' instances?
 }\QuickQuizAnswer{
 	Amazingly enough, yes.
 	One example is a simple message-passing system where threads post
 	messages to other threads' mailboxes, and where each thread
 	is responsible for removing any message it sent once that message
 	has been acted on.
 	Implementation of such an algorithm is left as an exercise for
 	the reader, as is identifying other algorithms with similar
 	ownership patterns.
 }\QuickQuizEnd

 Partial data ownership is also common within the Linux kernel.
 For example, a given CPU might be permitted to read a given set of its
 own per-CPU variables only with interrupts disabled, another CPU might
 be permitted to read that same set of the first CPU's per-CPU variables
 only when holding the corresponding per-CPU lock.
 Then that given CPU would be permitted to update this set of its own
 per-CPU variables if it both has interrupts disabled and holds its
 per-CPU lock.
 This arrangement can be thought of as a reader-writer lock that allows
 each CPU very low-overhead access to its own set of per-CPU variables.
 There are a great many variations on this theme.

 For its own part, pure data ownership is also both common and useful,
 for example, the per-thread memory-allocator caches discussed in
 \cref{sec:SMPdesign:Resource Allocator Caches}
 starting on
 \cpageref{sec:SMPdesign:Resource Allocator Caches}.
 In this algorithm, each thread's cache is completely private to that
 thread.

 \section{Function Shipping}
 \label{sec:owned:Function Shipping}
 %
 \epigraph{If the mountain will not come to Muhammad, then Muhammad must
 	  go to the mountain.}
 	 {\emph{Essays}, Francis Bacon}

 The previous section described a weak form of data ownership where
 threads reached out to other threads' data.
 This can be thought of as bringing the data to the functions that
 need it.
 An alternative approach is to send the functions to the data.

 Such an approach is illustrated in
 \cref{sec:count:Signal-Theft Limit Counter Design}
 beginning on
 \cpageref{sec:count:Signal-Theft Limit Counter Design},
 in particular the \co{flush_local_count_sig()} and
 \co{flush_local_count()} functions in
 \cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}
 on
 \cpageref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}.

 The \co{flush_local_count_sig()} function is a signal handler that
 acts as the shipped function.
 The \co{pthread_kill()} function in \co{flush_local_count()}
 sends the signal---shipping the function---and then waits until
 the shipped function executes.
 This shipped function has the not-unusual added complication of
 needing to interact with any concurrently executing \co{add_count()}
 or \co{sub_count()} functions (see
 \cref{lst:count:Signal-Theft Limit Counter Add Function}
 on
 \cpageref{lst:count:Signal-Theft Limit Counter Add Function} and
 \cref{lst:count:Signal-Theft Limit Counter Subtract Function}
 on
 \cpageref{lst:count:Signal-Theft Limit Counter Subtract Function}).

 \QuickQuiz{
 	What mechanisms other than POSIX signals may be used for function
 	shipping?
 }\QuickQuizAnswer{
 	There is a very large number of such mechanisms, including:
 	\begin{enumerate}
 	\item	System V message queues.
 	\item	Shared-memory dequeue (see
 		\cref{sec:SMPdesign:Double-Ended Queue}).
 	\item	Shared-memory mailboxes.
 	\item	UNIX-domain sockets.
 	\item	TCP/IP or UDP, possibly augmented by any number of
 		higher-level protocols, including RPC, HTTP,
 		XML, SOAP, and so on.
 	\end{enumerate}
 	Compilation of a complete list is left as an exercise to
 	sufficiently single-minded readers, who are warned that the
 	list will be extremely long.
 }\QuickQuizEnd

 \section{Designated Thread}
 \label{sec:owned:Designated Thread}
 %
 \epigraph{Let a man practice the profession which he best knows.}
 	 {Cicero}

 The earlier sections describe ways of allowing each thread to keep its
 own copy or its own portion of the data.
 In contrast, this section describes a functional-decomposition approach,
 where a special designated thread owns the rights to the data
 that is required to do its job.
 \begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
 The eventually consistent counter implementation described in
 \cref{sec:count:Eventually Consistent Implementation} provides an example.
 This implementation has a designated thread that runs the
 \co{eventual()} function shown on \clnrefrange{b}{e} of
 \cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}.
 This \co{eventual()} thread periodically pulls the per-thread counts
 into the global counter, so that accesses to the global counter will,
 as the name says, eventually converge on the actual value.
 \end{fcvref}

 \QuickQuiz{
 	\begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
 	But none of the data in the \co{eventual()} function shown on
 	\clnrefrange{b}{e} of
 	\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}
 	is actually owned by the \co{eventual()} thread!
 	In just what way is this data ownership???
 	\end{fcvref}
 }\QuickQuizAnswer{
 	\begin{fcvref}[ln:count:count_stat_eventual:whole]
 	The key phrase is ``owns the rights to the data''.
 	In this case, the rights in question are the rights to access
 	the per-thread \co{counter} variable defined on \clnref{per_thr_cnt}
 	of the listing.
 	This situation is similar to that described in
 	\cref{sec:owned:Partial Data Ownership and pthreads}.

 	However, there really is data that is owned by the \co{eventual()}
 	thread, namely the \co{t} and \co{sum} variables defined on
 	\clnref{t,sum} of the listing.

 	For other examples of designated threads, look at the kernel
 	threads in the Linux kernel, for example, those created by
 	\co{kthread_create()} and \co{kthread_run()}.
 	\end{fcvref}
 }\QuickQuizEnd

 \section{Privatization}
 \label{sec:owned:Privatization}
 %
 \epigraph{There is, of course, a difference between what a man seizes
 	  and what he really possesses.}
 	 {Pearl S.~Buck}

 One way of improving the performance and scalability of a shared-memory
 parallel program is to transform it so as to convert shared data to
 private data that is owned by a particular thread.

 An excellent example of this is shown in the answer to one of the
 Quick Quizzes in
 \cref{sec:SMPdesign:Dining Philosophers Problem},
 which uses privatization to produce a solution to the
 Dining Philosophers problem with much better performance and scalability
 than that of the standard textbook solution.
 The original problem has five philosophers sitting around the table
 with one fork between each adjacent pair of philosophers, which permits
 at most two philosophers to eat concurrently.

 We can trivially privatize this problem by providing an additional five
 forks, so that each philosopher has his or her own private pair of forks.
 This allows all five philosophers to eat concurrently, and also offers
 a considerable reduction in the spread of certain types of disease.

 In other cases, privatization imposes costs.
 For example, consider the simple limit counter shown in
 \cref{lst:count:Simple Limit Counter Add; Subtract; and Read} on
 \cpageref{lst:count:Simple Limit Counter Add; Subtract; and Read}.
 This is an example of an algorithm where threads can read each others'
 data, but are only permitted to update their own data.
 A quick review of the algorithm shows that the only cross-thread
 accesses are in the summation loop in \co{read_count()}.
 If this loop is eliminated, we move to the more-efficient pure
 data ownership, but at the cost of a less-accurate result
 from \co{read_count()}.

 \QuickQuiz{
 	Is it possible to obtain greater accuracy while still
 	maintaining full privacy of the per-thread data?
 }\QuickQuizAnswer{
 	Yes.
 	One approach is for \co{read_count()} to add the value
 	of its own per-thread variable.
 	This maintains full ownership and performance, but only
 	a slight improvement in accuracy, particularly on systems
 	with very large numbers of threads.

 	Another approach is for \co{read_count()} to use function
 	shipping, for example, in the form of per-thread signals.
 	This greatly improves accuracy, but at a significant performance
 	cost for \co{read_count()}.

 	However, both of these methods have the advantage of eliminating
 	cache thrashing for the common case of updating counters.
 }\QuickQuizEnd

 Partial privatization is also possible, with some synchronization
 requirements, but less than in the fully shared case.
 Some partial-privatization possibilities were explored in
 \cref{sec:toolsoftrade:Avoiding Data Races}.
 \Cref{chp:Deferred Processing} will introduce a temporal component
 to data ownership by providing ways of safely taking public data
 structures private.

 In short, privatization is a powerful tool in the parallel programmer's
 toolbox, but it must nevertheless be used with care.
 Just like every other synchronization primitive, it has the potential
 to increase complexity while decreasing performance and scalability.

 \section{Other Uses of Data Ownership}
 \label{sec:owned:Other Uses of Data Ownership}
 %
 \epigraph{Everything comes to us that belongs to us if we create the
 	  capacity to receive it.}
 	 {Rabindranath Tagore}

 Data ownership works best when the data can be partitioned so that there
 is little or no need for cross thread access or update.
 Fortunately, this situation is reasonably common, and in a wide variety
 of parallel-programming environments.

 Examples of data ownership include:

 \begin{enumerate}
 \item	All message-passing environments, such as MPI~\cite{MPIForum2008}
 	and BOINC~\cite{BOINC2008}.
 \item	Map-reduce~\cite{MapReduce2008MIT}.
 \item	Client-server systems, including RPC, web services, and
 	pretty much any system with a back-end database server.
 \item	Shared-nothing database systems.
 \item	Fork-join systems with separate per-process address spaces.
 \item	Process-based parallelism, such as the Erlang language.
 \item	Private variables, for example, C-language on-stack auto variables,
 	in threaded environments.
 \item	Many parallel linear-algebra algorithms, especially those
 	well-suited for GPGPUs.\footnote{
 		But note that a great many other classes of applications
 		have also been ported to
 		GPGPUs~\cite{NormMatloff2017ParProcBook,AMD2020ROCm,NVidia2017GPGPU,NVidia2017GPGPU-university}.}
 \item	Operating-system kernels adapted for networking, where each connection
 	(also called \emph{flow}~\cite{Shenker89,ZhangPhD,McKenney90})
 	is assigned to a specific thread.
 	One recent example of this approach is the IX operating
 	system~\cite{Belay:2016:IOS:3014162.2997641}.
 	IX does have some shared data structures, which use synchronization
 	mechanisms to be described in
 	\cref{sec:defer:Read-Copy Update (RCU)}.
 \end{enumerate}

 Data ownership is perhaps the most underappreciated synchronization
 mechanism in existence.
 When used properly, it delivers unrivaled simplicity, performance,
 and scalability.
 Perhaps its simplicity costs it the respect that it deserves.
 Hopefully a greater appreciation for the subtlety and power of data ownership
 will lead to greater level of respect, to say nothing of leading to
 greater performance and scalability coupled with reduced complexity.

 % populate with problems showing benefits of coupling data ownership with
 % other approaches. For example, work-stealing schedulers. Perhaps also
 % move memory allocation here, though its current location is quite good.

 \QuickQuizAnswersChp{qqzowned}
	% owned/owned.tex
	% mainfile: ../perfbook.tex
	% SPDX-License-Identifier: CC-BY-SA-3.0

	\QuickQuizChapter{chp:Data Ownership}{Data Ownership}{qqzowned}
	%
	\Epigraph{It is mine, I tell you.
	My own.
	My precious.
	Yes, my precious.}
	{Gollum in \emph{The Fellowship of the Ring}, J.R.R.~Tolkien}

	One of the simplest ways to avoid the synchronization overhead that
	comes with locking is to parcel the data out among the threads (or,
	in the case of kernels, CPUs)
	so that a given piece of data is accessed and modified by only one
	of the threads.
	Interestingly enough, data ownership covers each of the ``big three''
	parallel design techniques:
	It partitions over threads (or CPUs, as the case may be),
	it batches all local operations,
	and its elimination of synchronization operations is weakening
	carried to its logical extreme.
	It should therefore be no surprise that data ownership is heavily used:
	Even novices use it almost instinctively.
	In fact, it is so heavily used that this chapter will not introduce any
	new examples, but will instead refer back to those of previous chapters.

	\QuickQuiz{
	What form of data ownership is extremely difficult
	to avoid when creating shared-memory parallel programs
	(for example, using pthreads) in C or C++?
	}\QuickQuizAnswer{
	Use of auto variables in functions.
	By default, these are private to the thread executing the
	current function.
	}\QuickQuizEnd

	There are a number of approaches to data ownership.
	\Cref{sec:owned:Multiple Processes} presents the logical extreme
	in data ownership, where each thread has its own private address space.
	\Cref{sec:owned:Partial Data Ownership and pthreads} looks at
	the opposite extreme, where the data is shared, but different threads
	own different access rights to the data.
	\Cref{sec:owned:Function Shipping} describes function shipping,
	which is a way of allowing other threads to have indirect access to
	data owned by a particular thread.
	\Cref{sec:owned:Designated Thread} describes how designated
	threads can be assigned ownership of a specified function and the
	related data.
	\Cref{sec:owned:Privatization} discusses improving performance
	by transforming algorithms with shared data to instead use data ownership.
	Finally, \cref{sec:owned:Other Uses of Data Ownership} lists
	a few software environments that feature data ownership as a
	first-class citizen.

	\section{Multiple Processes}
	\label{sec:owned:Multiple Processes}
	%
	\epigraph{A man's home is his castle}{Ancient Laws of England}

	\Cref{sec:toolsoftrade:Scripting Languages}
	introduced the following example:

	\begin{VerbatimN}[samepage=true]
	compute_it 1 > compute_it.1.out &
	compute_it 2 > compute_it.2.out &
	wait
	cat compute_it.1.out
	cat compute_it.2.out
	\end{VerbatimN}

	This example runs two instances of the \co{compute_it} program in
	parallel, as separate processes that do not share memory.
	Therefore, all data in a given process is owned by that process,
	so that almost the entirety of data in the above example is owned.
	This approach almost entirely eliminates synchronization overhead.
	The resulting combination of extreme simplicity and optimal performance
	is obviously quite attractive.

	\QuickQuizSeries{%
	\QuickQuizB{
	What synchronization remains in the example shown in
	\cref{sec:owned:Multiple Processes}?
	}\QuickQuizAnswerB{
	The creation of the threads via the \co{sh} \co{&} operator
	and the joining of thread via the \co{sh} \co{wait}
	command.

	Of course, if the processes explicitly share memory, for
	example, using the \co{shmget()} or \co{mmap()} system
	calls, explicit synchronization might well be needed when
	acccessing or updating the shared memory.
	The processes might also synchronize using any of the following
	interprocess communications mechanisms:
	\begin{enumerate}
	\item System V semaphores.
	\item System V message queues.
	\item UNIX-domain sockets.
	\item Networking protocols, including TCP/IP, UDP, and a whole
	host of others.
	\item File locking.
	\item Use of the \co{open()} system call with the
	\co{O_CREAT} and \co{O_EXCL} flags.
	\item Use of the \co{rename()} system call.
	\end{enumerate}
	A complete list of possible synchronization mechanisms is left
	as an exercise to the reader, who is warned that it will be
	an extremely long list.
	A surprising number of unassuming system calls can be pressed
	into service as synchronization mechanisms.
	}\QuickQuizEndB
	%
	\QuickQuizE{
	Is there any shared data in the example shown in
	\cref{sec:owned:Multiple Processes}?
	}\QuickQuizAnswerE{
	That is a philosophical question.

	Those wishing the answer ``no'' might argue that processes by
	definition do not share memory.

	Those wishing to answer ``yes'' might list a large number of
	synchronization mechanisms that do not require shared memory,
	note that the kernel will have some shared state, and perhaps
	even argue that the assignment of process IDs (PIDs) constitute
	shared data.

	Such arguments are excellent intellectual exercise, and are
	also a wonderful way of feeling intelligent and scoring points
	against hapless classmates or colleagues, but are mostly a way
	of avoiding getting anything useful done.
	}\QuickQuizEndE
	}

	This same pattern can be written in C as well as in \co{sh}, as illustrated by
	\cref{lst:toolsoftrade:Using the fork() Primitive,%
	lst:toolsoftrade:Using the wait() Primitive}.

	It bears repeating that these trivial forms of parallelism are not in
	any way cheating or ducking responsibility, but are rather simple and
	elegant ways to make your code run faster.
	It is fast, scales well, is easy to program, easy to maintain, and
	gets the job done.
	In addition, taking this approach (where applicable) allows the developer
	more time to focus on other things whether these things might involve
	applying sophisticated single-threaded optimizations to \co{compute_it}
	on the one hand, or applying sophisticated parallel-programming patterns
	to portions of the code where this approach is inapplicable.
	What is not to like?

	The next section discusses the use of data ownership in shared-memory
	parallel programs.

	\section{Partial Data Ownership and pthreads}
	\label{sec:owned:Partial Data Ownership and pthreads}
	%
	\epigraph{Give thy mind more to what thou hast than to what thou hast not.}
	{Marcus Aurelius Antoninus}

	Concurrent counting (see \cref{chp:Counting}) uses data ownership heavily,
	but adds a twist.
	Threads are not allowed to modify data owned by other threads,
	but they are permitted to read it.
	In short, the use of shared memory allows more nuanced notions
	of ownership and access rights.

	For example, consider the per-thread statistical counter implementation
	shown in
	\cref{lst:count:Per-Thread Statistical Counters} on
	\cpageref{lst:count:Per-Thread Statistical Counters}.
	Here, \co{inc_count()} updates only the corresponding thread's
	instance of \co{counter},
	while \co{read_count()} accesses, but does not modify, all
	threads' instances of \co{counter}.

	\QuickQuiz{
	Does it ever make sense to have partial data ownership where
	each thread reads only its own instance of a per-thread variable,
	but writes to other threads' instances?
	}\QuickQuizAnswer{
	Amazingly enough, yes.
	One example is a simple message-passing system where threads post
	messages to other threads' mailboxes, and where each thread
	is responsible for removing any message it sent once that message
	has been acted on.
	Implementation of such an algorithm is left as an exercise for
	the reader, as is identifying other algorithms with similar
	ownership patterns.
	}\QuickQuizEnd

	Partial data ownership is also common within the Linux kernel.
	For example, a given CPU might be permitted to read a given set of its
	own per-CPU variables only with interrupts disabled, another CPU might
	be permitted to read that same set of the first CPU's per-CPU variables
	only when holding the corresponding per-CPU lock.
	Then that given CPU would be permitted to update this set of its own
	per-CPU variables if it both has interrupts disabled and holds its
	per-CPU lock.
	This arrangement can be thought of as a reader-writer lock that allows
	each CPU very low-overhead access to its own set of per-CPU variables.
	There are a great many variations on this theme.

	For its own part, pure data ownership is also both common and useful,
	for example, the per-thread memory-allocator caches discussed in
	\cref{sec:SMPdesign:Resource Allocator Caches}
	starting on
	\cpageref{sec:SMPdesign:Resource Allocator Caches}.
	In this algorithm, each thread's cache is completely private to that
	thread.

	\section{Function Shipping}
	\label{sec:owned:Function Shipping}
	%
	\epigraph{If the mountain will not come to Muhammad, then Muhammad must
	go to the mountain.}
	{\emph{Essays}, Francis Bacon}

	The previous section described a weak form of data ownership where
	threads reached out to other threads' data.
	This can be thought of as bringing the data to the functions that
	need it.
	An alternative approach is to send the functions to the data.

	Such an approach is illustrated in
	\cref{sec:count:Signal-Theft Limit Counter Design}
	beginning on
	\cpageref{sec:count:Signal-Theft Limit Counter Design},
	in particular the \co{flush_local_count_sig()} and
	\co{flush_local_count()} functions in
	\cref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}
	on
	\cpageref{lst:count:Signal-Theft Limit Counter Value-Migration Functions}.

	The \co{flush_local_count_sig()} function is a signal handler that
	acts as the shipped function.
	The \co{pthread_kill()} function in \co{flush_local_count()}
	sends the signal---shipping the function---and then waits until
	the shipped function executes.
	This shipped function has the not-unusual added complication of
	needing to interact with any concurrently executing \co{add_count()}
	or \co{sub_count()} functions (see
	\cref{lst:count:Signal-Theft Limit Counter Add Function}
	on
	\cpageref{lst:count:Signal-Theft Limit Counter Add Function} and
	\cref{lst:count:Signal-Theft Limit Counter Subtract Function}
	on
	\cpageref{lst:count:Signal-Theft Limit Counter Subtract Function}).

	\QuickQuiz{
	What mechanisms other than POSIX signals may be used for function
	shipping?
	}\QuickQuizAnswer{
	There is a very large number of such mechanisms, including:
	\begin{enumerate}
	\item System V message queues.
	\item Shared-memory dequeue (see
	\cref{sec:SMPdesign:Double-Ended Queue}).
	\item Shared-memory mailboxes.
	\item UNIX-domain sockets.
	\item TCP/IP or UDP, possibly augmented by any number of
	higher-level protocols, including RPC, HTTP,
	XML, SOAP, and so on.
	\end{enumerate}
	Compilation of a complete list is left as an exercise to
	sufficiently single-minded readers, who are warned that the
	list will be extremely long.
	}\QuickQuizEnd

	\section{Designated Thread}
	\label{sec:owned:Designated Thread}
	%
	\epigraph{Let a man practice the profession which he best knows.}
	{Cicero}

	The earlier sections describe ways of allowing each thread to keep its
	own copy or its own portion of the data.
	In contrast, this section describes a functional-decomposition approach,
	where a special designated thread owns the rights to the data
	that is required to do its job.
	\begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
	The eventually consistent counter implementation described in
	\cref{sec:count:Eventually Consistent Implementation} provides an example.
	This implementation has a designated thread that runs the
	\co{eventual()} function shown on \clnrefrange{b}{e} of
	\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}.
	This \co{eventual()} thread periodically pulls the per-thread counts
	into the global counter, so that accesses to the global counter will,
	as the name says, eventually converge on the actual value.
	\end{fcvref}

	\QuickQuiz{
	\begin{fcvref}[ln:count:count_stat_eventual:whole:eventual]
	But none of the data in the \co{eventual()} function shown on
	\clnrefrange{b}{e} of
	\cref{lst:count:Array-Based Per-Thread Eventually Consistent Counters}
	is actually owned by the \co{eventual()} thread!
	In just what way is this data ownership???
	\end{fcvref}
	}\QuickQuizAnswer{
	\begin{fcvref}[ln:count:count_stat_eventual:whole]
	The key phrase is ``owns the rights to the data''.
	In this case, the rights in question are the rights to access
	the per-thread \co{counter} variable defined on \clnref{per_thr_cnt}
	of the listing.
	This situation is similar to that described in
	\cref{sec:owned:Partial Data Ownership and pthreads}.

	However, there really is data that is owned by the \co{eventual()}
	thread, namely the \co{t} and \co{sum} variables defined on
	\clnref{t,sum} of the listing.

	For other examples of designated threads, look at the kernel
	threads in the Linux kernel, for example, those created by
	\co{kthread_create()} and \co{kthread_run()}.
	\end{fcvref}
	}\QuickQuizEnd

	\section{Privatization}
	\label{sec:owned:Privatization}
	%
	\epigraph{There is, of course, a difference between what a man seizes
	and what he really possesses.}
	{Pearl S.~Buck}

	One way of improving the performance and scalability of a shared-memory
	parallel program is to transform it so as to convert shared data to
	private data that is owned by a particular thread.

	An excellent example of this is shown in the answer to one of the
	Quick Quizzes in
	\cref{sec:SMPdesign:Dining Philosophers Problem},
	which uses privatization to produce a solution to the
	Dining Philosophers problem with much better performance and scalability
	than that of the standard textbook solution.
	The original problem has five philosophers sitting around the table
	with one fork between each adjacent pair of philosophers, which permits
	at most two philosophers to eat concurrently.

	We can trivially privatize this problem by providing an additional five
	forks, so that each philosopher has his or her own private pair of forks.
	This allows all five philosophers to eat concurrently, and also offers
	a considerable reduction in the spread of certain types of disease.

	In other cases, privatization imposes costs.
	For example, consider the simple limit counter shown in
	\cref{lst:count:Simple Limit Counter Add; Subtract; and Read} on
	\cpageref{lst:count:Simple Limit Counter Add; Subtract; and Read}.
	This is an example of an algorithm where threads can read each others'
	data, but are only permitted to update their own data.
	A quick review of the algorithm shows that the only cross-thread
	accesses are in the summation loop in \co{read_count()}.
	If this loop is eliminated, we move to the more-efficient pure
	data ownership, but at the cost of a less-accurate result
	from \co{read_count()}.

	\QuickQuiz{
	Is it possible to obtain greater accuracy while still
	maintaining full privacy of the per-thread data?
	}\QuickQuizAnswer{
	Yes.
	One approach is for \co{read_count()} to add the value
	of its own per-thread variable.
	This maintains full ownership and performance, but only
	a slight improvement in accuracy, particularly on systems
	with very large numbers of threads.

	Another approach is for \co{read_count()} to use function
	shipping, for example, in the form of per-thread signals.
	This greatly improves accuracy, but at a significant performance
	cost for \co{read_count()}.

	However, both of these methods have the advantage of eliminating
	cache thrashing for the common case of updating counters.
	}\QuickQuizEnd

	Partial privatization is also possible, with some synchronization
	requirements, but less than in the fully shared case.
	Some partial-privatization possibilities were explored in
	\cref{sec:toolsoftrade:Avoiding Data Races}.
	\Cref{chp:Deferred Processing} will introduce a temporal component
	to data ownership by providing ways of safely taking public data
	structures private.

	In short, privatization is a powerful tool in the parallel programmer's
	toolbox, but it must nevertheless be used with care.
	Just like every other synchronization primitive, it has the potential
	to increase complexity while decreasing performance and scalability.

	\section{Other Uses of Data Ownership}
	\label{sec:owned:Other Uses of Data Ownership}
	%
	\epigraph{Everything comes to us that belongs to us if we create the
	capacity to receive it.}
	{Rabindranath Tagore}

	Data ownership works best when the data can be partitioned so that there
	is little or no need for cross thread access or update.
	Fortunately, this situation is reasonably common, and in a wide variety
	of parallel-programming environments.

	Examples of data ownership include:

	\begin{enumerate}
	\item All message-passing environments, such as MPI~\cite{MPIForum2008}
	and BOINC~\cite{BOINC2008}.
	\item Map-reduce~\cite{MapReduce2008MIT}.
	\item Client-server systems, including RPC, web services, and
	pretty much any system with a back-end database server.
	\item Shared-nothing database systems.
	\item Fork-join systems with separate per-process address spaces.
	\item Process-based parallelism, such as the Erlang language.
	\item Private variables, for example, C-language on-stack auto variables,
	in threaded environments.
	\item Many parallel linear-algebra algorithms, especially those
	well-suited for GPGPUs.\footnote{
	But note that a great many other classes of applications
	have also been ported to
	GPGPUs~\cite{NormMatloff2017ParProcBook,AMD2020ROCm,NVidia2017GPGPU,NVidia2017GPGPU-university}.}
	\item Operating-system kernels adapted for networking, where each connection
	(also called \emph{flow}~\cite{Shenker89,ZhangPhD,McKenney90})
	is assigned to a specific thread.
	One recent example of this approach is the IX operating
	system~\cite{Belay:2016:IOS:3014162.2997641}.
	IX does have some shared data structures, which use synchronization
	mechanisms to be described in
	\cref{sec:defer:Read-Copy Update (RCU)}.
	\end{enumerate}

	Data ownership is perhaps the most underappreciated synchronization
	mechanism in existence.
	When used properly, it delivers unrivaled simplicity, performance,
	and scalability.
	Perhaps its simplicity costs it the respect that it deserves.
	Hopefully a greater appreciation for the subtlety and power of data ownership
	will lead to greater level of respect, to say nothing of leading to
	greater performance and scalability coupled with reduced complexity.

	% populate with problems showing benefits of coupling data ownership with
	% other approaches. For example, work-stealing schedulers. Perhaps also
	% move memory allocation here, though its current location is quite good.

	\QuickQuizAnswersChp{qqzowned}