% owned/owned.tex
\QuickQuizChapter{chp:Data Ownership}{Data Ownership}
%
\Epigraph{It is mine, I tell you. My own. My precious. Yes, my precious.}
{\emph{Gollum in ``The Fellowship of the Ring'', J.R.R.~Tolkien}}
One of the simplest ways to avoid the synchronization overhead that
comes with locking is to parcel the data out among the threads (or,
in the case of kernels, CPUs)
so that a given piece of data is accessed and modified by only one
of the threads.
Interestingly enough, data ownership covers each of the ``big three''
parallel design techniques:
It partitions over threads (or CPUs, as the case may be),
it batches all local operations,
and its complete elimination of synchronization operations can be
thought of as weakening carried to its logical extreme.
It should therefore be no surprise that data ownership is used extremely
heavily; it is one usage pattern that even novices apply almost
instinctively.
Indeed, it is used so heavily that this chapter will not introduce
any new examples, but will instead reference examples from previous
chapters.
\QuickQuiz{}
What form of data ownership is extremely difficult
to avoid when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
\QuickQuizAnswer{
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
} \QuickQuizEnd
There are a number of approaches to data ownership.
Section~\ref{sec:owned:Multiple Processes} presents the logical extreme
in data ownership, where each thread has its own private address space.
Section~\ref{sec:owned:Partial Data Ownership and pthreads} looks at
the opposite extreme, where the data is shared, but different threads
own different access rights to the data.
Section~\ref{sec:owned:Function Shipping} describes function shipping,
which is a way of allowing other threads to have indirect access to
data owned by a particular thread.
Section~\ref{sec:owned:Designated Thread} describes how designated
threads can be assigned ownership of a specified function and the
related data.
Section~\ref{sec:owned:Privatization} discusses improving performance
by transforming algorithms with shared data to instead use data ownership.
Finally, Section~\ref{sec:owned:Other Uses of Data Ownership} lists
a few software environments that feature data ownership as a
first-class citizen.
\section{Multiple Processes}
\label{sec:owned:Multiple Processes}
Section~\ref{sec:toolsoftrade:Scripting Languages}
introduced the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
compute_it 1 > compute_it.1.out &
compute_it 2 > compute_it.2.out &
wait
cat compute_it.1.out
cat compute_it.2.out
\end{verbatim}
\end{minipage}
\vspace{5pt}
This example runs two instances of the \co{compute_it} program in
parallel, as separate processes that do not share memory.
Therefore, all data within a given process is owned by that process,
so that almost all of the data in the above example is owned.
This approach almost entirely eliminates synchronization overhead.
The resulting combination of extreme simplicity and optimal performance
is obviously quite attractive.
\QuickQuiz{}
What synchronization remains in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickQuizAnswer{
The creation of the processes via the \co{sh} \co{&} operator
and the joining of the processes via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
accessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate}
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
} \QuickQuizEnd
\QuickQuiz{}
Is there any shared data in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickQuizAnswer{
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent, scoring points
against hapless classmates or colleagues,
and (especially!) avoiding getting anything useful done.
} \QuickQuizEnd
This same pattern can be written in C as well as in \co{sh}, as illustrated by
Figures~\ref{fig:toolsoftrade:Using the fork() Primitive}
and~\ref{fig:toolsoftrade:Using the wait() Primitive}.
The next section discusses use of data ownership in shared-memory
parallel programs.
\section{Partial Data Ownership and pthreads}
\label{sec:owned:Partial Data Ownership and pthreads}
Chapter~\ref{chp:Counting} makes heavy use of data ownership,
but adds a twist.
Threads are not allowed to modify data owned by other threads,
but they are permitted to read it.
In short, the use of shared memory allows more nuanced notions
of ownership and access rights.
For example, consider the per-thread statistical counter implementation
shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
page~\pageref{fig:count:Per-Thread Statistical Counters}.
Here, \co{inc_count()} updates only the corresponding thread's
instance of \co{counter},
while \co{read_count()} accesses, but does not modify, all
threads' instances of \co{counter}.
\QuickQuiz{}
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
\QuickQuizAnswer{
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for the
reader, as is the task of identifying other algorithms with
similar ownership patterns.
} \QuickQuizEnd
Pure data ownership is also both common and useful, for example, the
per-thread memory-allocator caches discussed in
Section~\ref{sec:SMPdesign:Resource Allocator Caches}
starting on
page~\pageref{sec:SMPdesign:Resource Allocator Caches}.
In this algorithm, each thread's cache is completely private to that
thread.
\section{Function Shipping}
\label{sec:owned:Function Shipping}
The previous section described a weak form of data ownership where
threads reached out to other threads' data.
This can be thought of as bringing the data to the functions that
need it.
An alternative approach is to send the functions to the data.
Such an approach is illustrated in
Section~\ref{sec:count:Signal-Theft Limit Counter Design}
beginning on
page~\pageref{sec:count:Signal-Theft Limit Counter Design},
in particular the \co{flush_local_count_sig()} and
\co{flush_local_count()} functions in
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}.
The \co{flush_local_count_sig()} function is a signal handler that
acts as the shipped function.
The \co{pthread_kill()} function in \co{flush_local_count()}
sends the signal---shipping the function---and then waits until
the shipped function executes.
This shipped function has the not-unusual added complication of
needing to interact with any concurrently executing \co{add_count()}
or \co{sub_count()} functions (see
Figure~\ref{fig:count:Signal-Theft Limit Counter Add Function}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Add Function} and
Figure~\ref{fig:count:Signal-Theft Limit Counter Subtract Function}
on
page~\pageref{fig:count:Signal-Theft Limit Counter Subtract Function}).
\QuickQuiz{}
What mechanisms other than POSIX signals may be used for function
shipping?
\QuickQuizAnswer{
There is a very large number of such mechanisms, including:
\begin{enumerate}
\item System V message queues.
\item Shared-memory dequeue (see
Section~\ref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
\end{enumerate}
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
} \QuickQuizEnd
\section{Designated Thread}
\label{sec:owned:Designated Thread}
The earlier sections describe ways of allowing each thread to keep its
own copy or its own portion of the data.
In contrast, this section describes a functional-decomposition approach,
where a special designated thread owns the rights to the data
that is required to do its job.
The eventually consistent counter implementation described in
Section~\ref{sec:count:Eventually Consistent Implementation} provides an example.
This implementation has a designated thread that runs the
\co{eventual()} function shown on lines~15--32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}.
This \co{eventual()} thread periodically pulls the per-thread counts
into the global counter, so that accesses to the global counter will,
as the name says, eventually converge on the actual value.
\QuickQuiz{}
But none of the data in the \co{eventual()} function shown on
lines~15--32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
\QuickQuizAnswer{
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on line~1
of the figure.
This situation is similar to that described in
Section~\ref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
lines~17 and~18 of the figure.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
} \QuickQuizEnd
\section{Privatization}
\label{sec:owned:Privatization}
One way of improving the performance and scalability of a shared-memory
parallel program is to transform it so as to convert shared data to
private data that is owned by a particular thread.
An excellent example of this is shown in the answer to one of the
Quick Quizzes in
Section~\ref{sec:SMPdesign:Dining Philosophers Problem},
which uses privatization to produce a solution to the
Dining Philosophers problem with much better performance and scalability
than that of the standard textbook solution.
The original problem has five philosophers sitting around the table
with one fork between each adjacent pair of philosophers, which permits
at most two philosophers to eat concurrently.
We can trivially privatize this problem by providing an additional five
forks, so that each philosopher has his or her own private pair of forks.
This allows all five philosophers to eat concurrently, and also offers
a considerable reduction in the spread of certain types of disease.
In other cases, privatization imposes costs.
For example, consider the simple limit counter shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read} on
page~\pageref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
This is an example of an algorithm where threads can read each other's
data, but are only permitted to update their own data.
A quick review of the algorithm shows that the only cross-thread
accesses are in the summation loop in \co{read_count()}.
If this loop is eliminated, we move to the more-efficient pure
data ownership, but at the cost of a less-accurate result
from \co{read_count()}.
\QuickQuiz{}
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
\QuickQuizAnswer{
Yes.
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but provides
only a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache-line bouncing for the common case of updating counters.
} \QuickQuizEnd
In short, privatization is a powerful tool in the parallel programmer's
toolbox, but it must nevertheless be used with care.
Just like every other synchronization primitive, it has the potential
to increase complexity while decreasing performance and scalability.
\section{Other Uses of Data Ownership}
\label{sec:owned:Other Uses of Data Ownership}
Data ownership works best when the data can be partitioned so that there
is little or no need for cross-thread access or update.
Fortunately, this situation is reasonably common, appearing in a wide
variety of parallel-programming environments.
Examples of data ownership include:
\begin{enumerate}
\item All message-passing environments, such as MPI~\cite{MPIForum2008}
and BOINC~\cite{BOINC2008}.
\item Map-reduce~\cite{MapReduce2008MIT}.
\item Client-server systems, including RPC, web services, and
pretty much any system with a back-end database server.
\item Shared-nothing database systems.
\item Fork-join systems with separate per-process address spaces.
\item Process-based parallelism, such as the Erlang language.
\item Private variables, for example, C-language on-stack auto variables,
in threaded environments.
\item Many parallel linear-algebra algorithms, especially those
well-suited for GPGPUs.\footnote{
But note that a great many other classes of applications
have been ported to
GPGPUs~\cite{NormMatloff2013ParProcBook,AMD2017OpenCL,NVidia2017GPGPU,NVidia2017GPGPU-university}.}
\end{enumerate}
Data ownership is perhaps the most underappreciated synchronization
mechanism in existence.
When used properly, it delivers unrivaled simplicity, performance,
and scalability.
Perhaps its simplicity costs it the respect that it deserves.
Hopefully a greater appreciation for the subtlety and power of data
ownership will lead to a greater level of respect, to say nothing of
greater performance and scalability coupled with reduced complexity.
% populate with problems showing benefits of coupling data ownership with
% other approaches. For example, work-stealing schedulers. Perhaps also
% move memory allocation here, though its current location is quite good.