% count/count.tex
\QuickQuizChapter{chp:Counting}{Counting}
%
\Epigraph{As easy as 1, 2, 3!}{\emph{Unknown}}
Counting is perhaps the simplest and most natural thing a computer can do.
However, counting efficiently and scalably on a large
shared-memory multiprocessor can be quite challenging.
Furthermore, the simplicity of the underlying concept of counting
allows us to explore the fundamental issues of concurrency without
the distractions
of elaborate data structures or complex synchronization primitives.
Counting therefore provides an excellent introduction to
parallel programming.
This chapter covers a number of special cases for which there are simple,
fast, and scalable counting algorithms.
But first, let us find out how much you already know about concurrent
counting.
\QuickQuiz{}
Why on earth should efficient and scalable counting be hard?
After all, computers have special hardware for the sole purpose
of doing counting,
addition, subtraction, and lots more besides, don't they???
\QuickQuizAnswer{
Because the straightforward counting algorithms, for example,
atomic operations on a shared counter, either are slow and scale
badly, or are inaccurate, as will be seen in
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
} \QuickQuizEnd
\QuickQuiz{}
{ \bfseries Network-packet counting problem. }
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
\QuickQuizAnswer{
Hint: The act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate---after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1\% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section~\ref{sec:count:Statistical Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQstatcnt}
\QuickQuiz{}
{ \bfseries Approximate structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
\QuickQuizAnswer{
Hint: The act of updating the counter must again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it must distinguish approximately
between values below the limit and values greater than or
equal to the limit.
See Section~\ref{sec:count:Approximate Limit Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQapproxcnt}
\QuickQuiz{}
{ \bfseries Exact structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(again, say 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that there is almost
always at least one structure in use.
Suppose further still that it is necessary to know exactly when
this counter reaches zero, for example, in order to free up some
memory that is not required unless there is at least one structure
in use.
\QuickQuizAnswer{
Hint: The act of updating the counter must once again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section~\ref{sec:count:Exact Limit Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQexactcnt}
\QuickQuiz{}
{ \bfseries Removable I/O device access-count problem. }
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to remove the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
\QuickQuizAnswer{
Hint: Yet again, the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when
the device is in the process of being removed.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQIOcnt}
The remainder of this chapter will develop answers to these questions.
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
asks why counting on multicore systems isn't trivial, and
Section~\ref{sec:count:Statistical Counters}
looks into ways of solving the network-packet counting problem.
Section~\ref{sec:count:Approximate Limit Counters}
investigates the approximate structure-allocation limit problem, while
Section~\ref{sec:count:Exact Limit Counters}
takes on the exact structure-allocation limit problem.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses how to use the various specialized parallel counters
introduced in the preceding sections.
Finally, Section~\ref{sec:count:Parallel Counting Discussion}
concludes the chapter with performance measurements.
Sections~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
and~\ref{sec:count:Statistical Counters}
contain introductory material, while the remaining sections
are more appropriate for advanced students.
\section{Why Isn't Concurrent Counting Trivial?}
\label{sec:count:Why Isn't Concurrent Counting Trivial?}
\begin{figure}[bp]
{ \scriptsize
\begin{verbbox}
1 long counter = 0;
2
3 void inc_count(void)
4 {
5 counter++;
6 }
7
8 long read_count(void)
9 {
10 return counter;
11 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Just Count!}
\label{fig:count:Just Count!}
\end{figure}
Let's start with something simple, for example, the straightforward
use of arithmetic shown in
Figure~\ref{fig:count:Just Count!} (\path{count_nonatomic.c}).
Here, we have a counter on line~1, we increment it on line~5, and we
read out its value on line~10.
What could be simpler?
This approach has the additional advantage of being blazingly fast if
you are doing lots of reading and almost no incrementing, and on small
systems, the performance is excellent.
There is just one large fly in the ointment: this approach can lose
counts.
On my dual-core laptop, a short run invoked \co{inc_count()}
100,014,000 times, but the final value of the counter was only
52,909,118.
Although approximate values do have their place in computing,
accuracies far greater than 50\% are almost always necessary.
\QuickQuiz{}
But doesn't the \co{++} operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
\QuickQuizAnswer{
Although the \co{++} operator \emph{could} be atomic, there
is no requirement that it be so.
And indeed, \co{gcc} often
chooses to load the value to a register, increment
the register, then store the value to memory, which is
decidedly non-atomic.
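For example, although the generated code varies with compiler
and flags, the non-atomic expansion of \co{counter++} typically
behaves like the following sketch:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
long tmp;

tmp = counter;  /* load */
tmp = tmp + 1;  /* increment */
counter = tmp;  /* store */
\end{verbatim}
\end{minipage}
\vspace{5pt}

If two threads each load the same value of \co{counter} before
either stores, one of the two increments is lost, which accounts
for the missing counts noted earlier.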
} \QuickQuizEnd
\QuickQuiz{}
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
\QuickQuizAnswer{
Not only are there very few
trivial parallel programs, and most days I am
not so sure that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested
it, it does not work.
And even if you have tested it, Murphy's Law says that there will
be at least a few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their
place, they never will replace testing, including the
\path{counttorture.h} test setup used here.
After all, proofs are only as good as the assumptions that they
are based on.
Furthermore, proofs can have bugs just as easily as programs can!
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic_t counter = ATOMIC_INIT(0);
2
3 void inc_count(void)
4 {
5 atomic_inc(&counter);
6 }
7
8 long read_count(void)
9 {
10 return atomic_read(&counter);
11 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Just Count Atomically!}
\label{fig:count:Just Count Atomically!}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/count/atomic}}
\caption{Atomic Increment Scalability on Nehalem}
\label{fig:count:Atomic Increment Scalability on Nehalem}
\end{figure}
The straightforward way to count accurately is to use atomic operations,
as shown in
Figure~\ref{fig:count:Just Count Atomically!} (\path{count_atomic.c}).
Line~1 defines an atomic variable, line~5 atomically increments it, and
line~10 reads it out.
Because this is atomic, it keeps perfect count.
However, it is slower: on an Intel Core Duo laptop, it is about
six times slower than non-atomic increment
when a single thread is incrementing, and more than \emph{ten times}
slower if two threads are incrementing.\footnote{
Interestingly enough, a pair of threads non-atomically incrementing
a counter will cause the counter to increase more quickly than
a pair of threads atomically incrementing the counter.
Of course, if your only goal is to make the counter increase
quickly, an easier approach is to simply assign a large value
to the counter.
Nevertheless, there is likely to be a role for algorithms that
use carefully relaxed notions of correctness in order to gain
greater performance and
scalability~\cite{Andrews91textbook,Arcangeli03,DavidUngar2011unsync}.}
This poor performance should not be a surprise, given the discussion in
Chapter~\ref{chp:Hardware and its Habits},
nor should it be a surprise that the performance of atomic increment
gets slower as the number of CPUs and threads increases, as shown in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}.
In this figure, the horizontal dashed line resting on the x~axis
is the ideal performance that would be achieved
by a perfectly scalable algorithm: with such an algorithm, a given
increment would incur the same overhead that it would in a single-threaded
program.
Atomic increment of a single global variable is clearly
non-ideal, and gets worse as you add CPUs.
\QuickQuiz{}
Why doesn't the dashed line on the x~axis meet the
diagonal line at $x=1$?
\QuickQuizAnswer{
Because of the overhead of the atomic operation.
The dashed line on the x~axis represents the overhead of
a single \emph{non-atomic} increment.
After all, an \emph{ideal} algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good
enough for Linus Torvalds, it is good enough for you.
} \QuickQuizEnd
\QuickQuiz{}
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me; after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
\QuickQuizAnswer{
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessors.
In addition, as noted at the beginning of this chapter,
counting provides an excellent view of the
issues encountered in shared-memory parallel programs.
} \QuickQuizEnd
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/GlobalInc}}
\caption{Data Flow For Global Atomic Increment}
\label{fig:count:Data Flow For Global Atomic Increment}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-One-one-thousand}}
\caption{Waiting to Count}
\ContributedBy{Figure}{fig:count:Waiting to Count}{Melissa Broussard}
\end{figure}
For another perspective on global atomic increment, consider
Figure~\ref{fig:count:Data Flow For Global Atomic Increment}.
In order for each CPU to get a chance to increment a given
global variable, the cache line containing that variable must
circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in
the poor performance seen in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
which might be thought of as shown in
Figure~\ref{fig:count:Waiting to Count}.
The following sections discuss high-performance counting, which
avoids the delays inherent in such circulation.
\QuickQuiz{}
But why can't CPU designers simply ship the addition operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
\QuickQuizAnswer{
It might well be possible to do this in some cases.
However, there are a few complications:
\begin{enumerate}
\item If the value of the variable is required, then the
thread will be forced to wait for the operation
to be shipped to the data, and then for the result
to be shipped back.
\item If the atomic increment must be ordered with respect
to prior and/or subsequent operations, then the thread
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation
completed to be shipped back.
\item Shipping operations among CPUs will likely require
more lines in the system interconnect, which will consume
more die area and more electrical power.
\end{enumerate}
But what if neither of the first two conditions holds?
Then you should think carefully about the algorithms discussed
in Section~\ref{sec:count:Statistical Counters}, which achieve
near-ideal performance on commodity hardware.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}}
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
\end{figure}
If either or both of the first two conditions hold, there is
\emph{some} hope for improved hardware.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as $O(\log N)$,
where $N$ is the number of CPUs, as shown in
Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
And CPUs with this sort of hardware optimization are starting to
appear as of 2011.
This is a great improvement over the $O(N)$ performance
of current hardware shown in
Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
and it is possible that hardware latencies might decrease
further if innovations such as three-dimensional fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do \emph{much} better.
} \QuickQuizEnd
\section{Statistical Counters}
\label{sec:count:Statistical Counters}
This section covers the common special case of statistical counters, where
the count is updated extremely frequently and the value is read out
rarely, if ever.
These will be used to solve the network-packet counting problem
posed in \QuickQuizRef{\QcountQstatcnt}.
\subsection{Design}
Statistical counting is typically handled by providing a counter per
thread (or CPU, when running in the kernel), so that each thread
updates its own counter.
The aggregate value of the counters is read out by simply summing up
all of the threads' counters,
relying on the commutative and associative properties of addition.
This is an example of the Data Ownership pattern that will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}.
\QuickQuiz{}
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
\QuickQuizAnswer{
No, because modulo addition is still commutative and associative.
At least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior, never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days.
Unfortunately, compilers frequently carry out optimizations that
assume that signed integers will not overflow, so if your code
allows signed integers to overflow, you can run into trouble
even on twos-complement hardware.
That said, one potential source of additional complexity arises
when attempting to gather (say) a 64-bit sum from 32-bit
per-thread counters.
Dealing with this added complexity is left as
an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful.
} \QuickQuizEnd
\subsection{Array-Based Implementation}
\label{sec:count:Array-Based Implementation}
One way to provide per-thread variables is to allocate an array with
one element per
thread (presumably cache aligned and padded to avoid false sharing).
\QuickQuiz{}
An array???
But doesn't that limit the number of threads?
\QuickQuizAnswer{
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads,
for example, using the \co{gcc} \co{__thread} facility,
as shown in
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_PER_THREAD(long, counter);
2
3 void inc_count(void)
4 {
5 __get_thread_var(counter)++;
6 }
7
8 long read_count(void)
9 {
10 int t;
11 long sum = 0;
12
13 for_each_thread(t)
14 sum += per_thread(counter, t);
15 return sum;
16 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Array-Based Per-Thread Statistical Counters}
\label{fig:count:Array-Based Per-Thread Statistical Counters}
\end{figure}
Such an array can be wrapped into per-thread primitives, as shown in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
(\path{count_stat.c}).
Line~1 defines an array containing a set of per-thread counters of
type \co{long} named, creatively enough, \co{counter}.
Lines~3-6 show a function that increments the counters, using the
\co{__get_thread_var()} primitive to locate the currently running
thread's element of the \co{counter} array.
Because this element is modified only by the corresponding thread,
non-atomic increment suffices.
Lines~8-16 show a function that reads out the aggregate value of the counter,
using the \co{for_each_thread()} primitive to iterate over the list of
currently running threads, and using the \co{per_thread()} primitive
to fetch the specified thread's counter.
Because the hardware can fetch and store a properly aligned \co{long}
atomically, and because gcc is kind enough to make use of this capability,
normal loads suffice, and no special atomic instructions are required.
\QuickQuiz{}
What other choice does gcc have, anyway???
\QuickQuizAnswer{
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a \co{long}.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
Alternatively, use of volatile accesses such as those provided
by \co{ACCESS_ONCE()}~\cite{JonCorbet2012ACCESS:ONCE}
can help constrain the compiler, at least
in cases where the hardware is capable of accessing the value
with a single memory-reference instruction.
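For reference, the Linux kernel's classic definition of
\co{ACCESS_ONCE()} is simply a volatile cast:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ACCESS_ONCE(x) \
        (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}

This forces the compiler to emit exactly one load or store for
each access, though it does nothing to constrain the hardware.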
} \QuickQuizEnd
\QuickQuiz{}
How does the per-thread \co{counter} variable in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
get initialized?
\QuickQuizAnswer{
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of \co{counter}
will be zero.
Furthermore, in the common case where the user is interested only
in differences between consecutive reads
from statistical counters, the initial value is irrelevant.
} \QuickQuizEnd
\QuickQuiz{}
How is the code in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
supposed to permit more than one counter?
\QuickQuizAnswer{
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
} \QuickQuizEnd
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/PerThreadInc}}
\caption{Data Flow For Per-Thread Increment}
\label{fig:count:Data Flow For Per-Thread Increment}
\end{figure}
This approach scales linearly with increasing number of updater threads
invoking \co{inc_count()}.
As is shown by the green arrows on each CPU in
Figure~\ref{fig:count:Data Flow For Per-Thread Increment},
the reason for this is that each CPU can make rapid progress incrementing
its thread's variable, without any expensive cross-system communication.
As such, this section solves the network-packet counting problem presented
at the beginning of this chapter.
\QuickQuiz{}
The read operation takes time to sum up the per-thread values,
and during that time, the counter could well be changing.
This means that the value returned by
\co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
will not necessarily be exact.
Assume that the counter is being incremented at rate
$r$ counts per unit time, and that \co{read_count()}'s
execution consumes $\Delta$ units of time.
What is the expected error in the return value?
\QuickQuizAnswer{
Let's do worst-case analysis first, followed by a less
conservative analysis.
In the worst case, the read operation completes immediately,
but is then delayed for $\Delta$ time units before returning,
in which case the worst-case error is simply $r \Delta$.
This worst-case behavior is rather unlikely, so let us instead
consider the case where the reads from each of the $N$
counters are spaced equally over the time period $\Delta$,
and where the aggregate increment rate $r$ is spread evenly
over the $N$ counters, so that each counter is incremented
at rate $r/N$.
There will be $N+1$ intervals of duration $\frac{\Delta}{N+1}$
between the $N$ reads.
The error due to the delay after the read from the last thread's
counter will be given by $\frac{r \Delta}{N \left( N + 1 \right)}$,
the second-to-last thread's counter by
$\frac{2 r \Delta}{N \left( N + 1 \right)}$,
the third-to-last by
$\frac{3 r \Delta}{N \left( N + 1 \right)}$,
and so on.
The total error is given by the sum of the errors due to the
reads from each thread's counter, which is:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\sum_{i = 1}^N i
\end{equation}
Expressing the summation in closed form yields:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\frac{N \left( N + 1 \right)}{2}
\end{equation}
Cancelling yields the intuitively expected result:
\begin{equation}
\frac{r \Delta}{2}
\label{eq:count:CounterErrorAverage}
\end{equation}
It is important to remember that error continues accumulating
as the caller executes code making use of the count returned
by the read operation.
For example, if the caller spends time $t$ executing some
computation based on the result of the returned count, the
worst-case error will have increased to $r \left(\Delta + t\right)$.
The expected error will have similarly increased to:
\begin{equation}
r \left( \frac{\Delta}{2} + t \right)
\end{equation}
Of course, it is sometimes unacceptable for the counter to
continue incrementing during the read operation.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses a way to handle this situation.
Thus far, we have been considering a counter that is only
increased, never decreased.
If the counter value is being changed by $r$ counts per unit
time, but in either direction, we should expect the error
to reduce.
However, the worst case is unchanged because although the
counter \emph{could} move in either direction, the worst
case is when the read operation completes immediately,
but then is delayed for $\Delta$ time units, during which
time all the changes in the counter's value move it in
the same direction, again giving us an absolute error
of $r \Delta$.
There are a number of ways to compute the average error,
based on a variety of assumptions about the patterns of
increments and decrements.
For simplicity, let's assume that a fraction $f$ of
the operations are decrements, and that the error of interest
is the deviation from the counter's long-term trend line.
Under this assumption, if $f$ is less than or equal to 0.5,
each decrement will be cancelled by an increment, so that
$2f$ of the operations will cancel each other, leaving
$1-2f$ of the operations being uncancelled increments.
On the other hand, if $f$ is greater than 0.5, $1-f$ of
the decrements are cancelled by increments, so that the
counter moves in the negative direction by $-1+2\left(1-f\right)$,
which simplifies to $1-2f$, so that the counter moves an average
of $1-2f$ per operation in either case.
Therefore, the long-term
movement of the counter is given by $\left( 1-2f \right) r$.
Plugging this into
Equation~\ref{eq:count:CounterErrorAverage} yields:
\begin{equation}
\frac{\left( 1 - 2 f \right) r \Delta}{2}
\end{equation}
All that aside, in most uses of statistical counters, the
error in the value returned by \co{read_count()} is
irrelevant.
This irrelevance is due to the fact that the time required
for \co{read_count()} to execute is normally extremely
small compared to the time interval between successive
calls to \co{read_count()}.
} \QuickQuizEnd
However, this excellent update-side scalability comes at great read-side
expense for large numbers of threads.
The next section shows one way to reduce read-side expense while
still retaining the update-side scalability.
\subsection{Eventually Consistent Implementation}
\label{sec:count:Eventually Consistent Implementation}
One way to retain update-side scalability while greatly improving
read-side performance is to weaken consistency requirements.
The counting algorithm in the previous section is guaranteed to
return a value between the value that an ideal counter would have
taken on near the beginning of \co{read_count()}'s execution and
that near the end of \co{read_count()}'s execution.
\emph{Eventual consistency}~\cite{WernerVogels:2009:EventuallyConsistent}
provides a weaker
guarantee: in absence of calls to \co{inc_count()}, calls to
\co{read_count()} will eventually return an accurate count.
We exploit eventual consistency by maintaining a global counter.
However, updaters only manipulate their per-thread counters.
A separate thread is provided to transfer counts from the per-thread
counters to the global counter.
Readers simply access the value of the global counter.
If updaters are active, the value used by the readers will be out of
date.
However, once updates cease, the global counter will eventually
converge on the true value---hence this approach qualifies as
eventually consistent.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_PER_THREAD(unsigned long, counter);
2 unsigned long global_count;
3 int stopflag;
4
5 void inc_count(void)
6 {
7 ACCESS_ONCE(__get_thread_var(counter))++;
8 }
9
10 unsigned long read_count(void)
11 {
12 return ACCESS_ONCE(global_count);
13 }
14
15 void *eventual(void *arg)
16 {
17 int t;
18 unsigned long sum;
19
20 while (stopflag < 3) {
21 sum = 0;
22 for_each_thread(t)
23 sum += ACCESS_ONCE(per_thread(counter, t));
24 ACCESS_ONCE(global_count) = sum;
25 poll(NULL, 0, 1);
26 if (stopflag) {
27 smp_mb();
28 stopflag++;
29 }
30 }
31 return NULL;
32 }
33
34 void count_init(void)
35 {
36 thread_id_t tid;
37
38 if (pthread_create(&tid, NULL, eventual, NULL)) {
39 perror("count_init:pthread_create");
40 exit(-1);
41 }
42 }
43
44 void count_cleanup(void)
45 {
46 stopflag = 1;
47 while (stopflag < 3)
48 poll(NULL, 0, 1);
49 smp_mb();
50 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Array-Based Per-Thread Eventually Consistent Counters}
\label{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
\end{figure}
The implementation is shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
(\path{count_stat_eventual.c}).
Lines~1-2 show the per-thread variable and the global variable that
track the counter's value, and line~3 shows \co{stopflag},
which is used to coordinate termination (for the case where we want
to terminate the program with an accurate counter value).
The \co{inc_count()} function shown on lines~5-8 is similar to its
counterpart in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}.
The \co{read_count()} function shown on lines~10-13 simply returns the
value of the \co{global_count} variable.
However, the \co{count_init()} function on lines~34-42
creates the \co{eventual()} thread shown on lines~15-32, which
cycles through all the threads,
summing the per-thread local \co{counter} and storing the
sum to the \co{global_count} variable.
The \co{eventual()} thread waits an arbitrarily chosen one millisecond
between passes.
The \co{count_cleanup()} function on lines~44-50 coordinates termination.
This approach gives extremely fast counter read-out while still
supporting linear counter-update performance.
However, this excellent read-side performance and update-side scalability
comes at the cost of the additional thread running \co{eventual()}.
\QuickQuiz{}
Why doesn't \co{inc_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
need to use atomic instructions?
After all, we now have multiple threads accessing the per-thread
counters!
\QuickQuizAnswer{
Because one of the two threads only reads, and because the
variable is aligned and machine-sized, non-atomic instructions
suffice.
That said, the \co{ACCESS_ONCE()} macro is used to prevent
compiler optimizations that might otherwise prevent the
counter updates from becoming visible to
\co{eventual()}~\cite{JonCorbet2012ACCESS:ONCE}.
An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing out that
they are in fact unnecessary.
That said, atomic instructions would be needed in cases where
the per-thread \co{counter} variables were smaller than the
global \co{global_count}.
However, note that on a 32-bit system,
the per-thread \co{counter} variables
might need to be limited to 32 bits in order to sum them accurately,
while the \co{global_count} variable must be 64 bits in order
to avoid overflow.
In this case, it is necessary to zero the per-thread
\co{counter} variables periodically in order to avoid overflow.
It is extremely important to note that this zeroing cannot
be delayed too long or overflow of the smaller per-thread
variables will result.
This approach therefore imposes real-time requirements on the
underlying system, and in turn must be used with extreme care.
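One way to implement the periodic zeroing is for \co{eventual()}
to atomically exchange each per-thread counter with zero,
accumulating the old values into the 64-bit global.
The following sketch assumes a Linux-kernel-style
\co{atomic_xchg()} primitive and per-thread \co{atomic_t}
counters; it is an illustration, not the book's code:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
DEFINE_PER_THREAD(atomic_t, counter);
unsigned long long global_count;

void inc_count(void)
{
  atomic_inc(&__get_thread_var(counter));
}

/* One pass of eventual(): must run often
 * enough that no 32-bit counter can wrap
 * between passes. */
static void eventual_pass(void)
{
  int t;
  unsigned long long sum = 0;

  for_each_thread(t)
    sum += atomic_xchg(&per_thread(counter, t), 0);
  global_count += sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Note also that readers of the 64-bit \co{global_count} would need
additional care on a 32-bit system, where a 64-bit load is not
necessarily atomic.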
In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum
will be modulo the word size.
} \QuickQuizEnd
\QuickQuiz{}
Won't the single global thread in the function \co{eventual()} of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
be just as severe a bottleneck as a global lock would be?
\QuickQuizAnswer{
In this case, no.
What will happen instead is that as the number of threads increases,
the estimate of the counter
value returned by \co{read_count()} will become more inaccurate.
} \QuickQuizEnd
\QuickQuiz{}
Won't the estimate returned by \co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
become increasingly
inaccurate as the number of threads rises?
\QuickQuizAnswer{
Yes.
If this proves problematic, one fix is to provide multiple
\co{eventual()} threads, each covering its own subset of
the other threads.
In more extreme cases, a tree-like hierarchy of
\co{eventual()} threads might be required.
} \QuickQuizEnd
\QuickQuiz{}
Given that in the eventually\-/consistent algorithm shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
both reads and updates have extremely low overhead
and are extremely scalable, why would anyone bother with the
implementation described in
Section~\ref{sec:count:Array-Based Implementation},
given its costly read-side code?
\QuickQuizAnswer{
The thread executing \co{eventual()} consumes CPU time.
As more of these eventually\-/consistent counters are added,
the resulting \co{eventual()} threads will eventually
consume all available CPUs.
This implementation therefore suffers a different sort of
scalability limitation, with the scalability limit being in
terms of the number of eventually consistent counters rather
than in terms of the number of threads or CPUs.
Of course, it is possible to make other tradeoffs.
For example, a single thread could be created to handle all
eventually\-/consistent counters, which would limit the
overhead to a single CPU, but would result in increasing
update-to-read latencies as the number of counters increased.
Alternatively, that single thread could track the update rates
of the counters, visiting the frequently\-/updated counters
more frequently.
In addition, the number of threads handling the counters could
be set to some fraction of the total number of CPUs, and
perhaps also adjusted at runtime.
Finally, each counter could specify its latency, and
deadline\-/scheduling techniques could be used to provide
the required latencies to each counter.
There are no doubt many other tradeoffs that could be made.
} \QuickQuizEnd
\subsection{Per-Thread-Variable-Based Implementation}
\label{sec:count:Per-Thread-Variable-Based Implementation}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum;
15
16 spin_lock(&final_mutex);
17 sum = finalcount;
18 for_each_thread(t)
19 if (counterp[t] != NULL)
20 sum += *counterp[t];
21 spin_unlock(&final_mutex);
22 return sum;
23 }
24
25 void count_register_thread(void)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&final_mutex);
30 counterp[idx] = &counter;
31 spin_unlock(&final_mutex);
32 }
33
34 void count_unregister_thread(int nthreadsexpected)
35 {
36 int idx = smp_thread_id();
37
38 spin_lock(&final_mutex);
39 finalcount += counter;
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex);
42 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread Statistical Counters}
\label{fig:count:Per-Thread Statistical Counters}
\end{figure}
Fortunately, gcc provides an \co{__thread} storage class that offers
per-thread storage.
This can be used as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} (\path{count_end.c})
to implement
a statistical counter that not only scales, but that also incurs little
or no performance penalty to incrementers compared to simple non-atomic
increment.
Lines~1-4 define needed variables: \co{counter} is the per-thread counter
variable, the \co{counterp[]} array allows threads to access each others'
counters, \co{finalcount} accumulates the total as individual threads exit,
and \co{final_mutex} coordinates between threads accumulating the total
value of the counter and exiting threads.
\QuickQuiz{}
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a \co{per_thread()} interface, similar
to the Linux kernel's \co{per_cpu()} primitive, to allow
threads to more easily access each others' per-thread variables?
\QuickQuizAnswer{
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel
gets to ignore.
When a user-level thread exits, its per-thread variables all
disappear, which complicates the problem of per-thread-variable
access, particularly before the advent of user-level RCU
(see Section~\ref{sec:defer:Read-Copy Update (RCU)}).
In contrast, in the Linux kernel, when a CPU goes offline,
that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its
per-thread variables suddenly come into existence.
In contrast, in the Linux kernel, all per-CPU variables are
mapped and initialized at boot time, regardless of whether
the corresponding CPU exists yet, or indeed, whether the
corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time
maximum bound on the number of CPUs, namely, \co{CONFIG_NR_CPUS},
along with a typically tighter boot-time bound of \co{nr_cpu_ids}.
In contrast, in user space, there is no hard-coded upper limit
on the number of threads.
Of course, both environments must handle dynamically loaded
code (dynamic libraries in user space, kernel modules in the
Linux kernel), which increases the complexity of per-thread
variables.
These complications make it significantly harder for user-space
environments to provide access to other threads' per-thread
variables.
Nevertheless, such access is highly useful, and it is hoped
that it will someday appear.
} \QuickQuizEnd
The \co{inc_count()} function used by updaters is quite simple, as can
be seen on lines~6-9.
The \co{read_count()} function used by readers is a bit more complex.
Line~16 acquires a lock to exclude exiting threads, and line~21 releases
it.
Line~17 initializes the sum to the count accumulated by those threads that
have already exited, and lines~18-20 sum the counts being accumulated
by threads currently running.
Finally, line~22 returns the sum.
\QuickQuiz{}
Doesn't the check for \co{NULL} on line~19 of
Figure~\ref{fig:count:Per-Thread Statistical Counters}
add extra branch mispredictions?
Why not have a variable set permanently to zero, and point
unused counter-pointers to that variable rather than setting
them to \co{NULL}?
\QuickQuizAnswer{
This is a reasonable strategy.
Checking for the performance difference is left as an exercise
for the reader.
However, please keep in mind that the fastpath is not
\co{read_count()}, but rather \co{inc_count()}.
} \QuickQuizEnd
\QuickQuiz{}
Why on earth do we need something as heavyweight as a \emph{lock}
guarding the summation in the function \co{read_count()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
\QuickQuizAnswer{
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter~\ref{chp:Deferred Processing} will introduce even
lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread
variable, which, as Alexey Roytman notes, would eliminate
the tests against \co{NULL}.
However, array accesses are often slower than accesses to
per-thread variables, and use of an array would imply a
fixed upper bound on the number of threads.
Also, note that neither tests nor locks are needed on the
\co{inc_count()} fastpath.
} \QuickQuizEnd
Lines~25-32 show the \co{count_register_thread()} function, which
must be called by each thread before its first use of this counter.
This function simply sets up this thread's element of the \co{counterp[]}
array to point to its per-thread \co{counter} variable.
\QuickQuiz{}
Why on earth do we need to acquire the lock in
\co{count_register_thread()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
\QuickQuizAnswer{
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, there is no need to get fancy.
} \QuickQuizEnd
Lines~34-42 show the \co{count_unregister_thread()} function, which
must be called prior to exit by each thread that previously called
\co{count_register_thread()}.
Line~38 acquires the lock, and line~41 releases it, thus excluding any
calls to \co{read_count()} as well as other calls to
\co{count_unregister_thread()}.
Line~39 adds this thread's \co{counter} to the global \co{finalcount},
and then line~40 \co{NULL}s out its \co{counterp[]} array entry.
A subsequent call to \co{read_count()} will see the exiting thread's
count in the global \co{finalcount}, and will skip the exiting thread
when sequencing through the \co{counterp[]} array, thus obtaining
the correct total.
This approach gives updaters almost exactly the same performance as
a non-atomic add, and also scales linearly.
On the other hand, concurrent reads contend for a single global lock,
and therefore perform poorly and scale abysmally.
However, this is not a problem for statistical counters, where incrementing
happens often and readout happens almost never.
Of course, this approach is considerably more complex than the
array-based scheme, due to the fact that a given thread's per-thread
variables vanish when that thread exits.
\QuickQuiz{}
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
\QuickQuizAnswer{
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline---even
if the corresponding CPU never existed and never will exist.
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 int finalthreadcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum = 0;
15
16 for_each_thread(t)
17 if (counterp[t] != NULL)
18 sum += *counterp[t];
19 return sum;
20 }
21
22 void count_init(void)
23 {
24 }
25
26 void count_register_thread(void)
27 {
28 counterp[smp_thread_id()] = &counter;
29 }
30
31 void count_unregister_thread(int nthreadsexpected)
32 {
33 spin_lock(&final_mutex);
34 finalthreadcount++;
35 spin_unlock(&final_mutex);
36 while (finalthreadcount < nthreadsexpected)
37 poll(NULL, 0, 1);
38 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
\end{figure}
One workaround is to ensure that each thread continues to exist
until all threads are finished, as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}
(\path{count_tstat.c}).
Analysis of this code is left as an exercise to the reader;
however, please note that it does not fit well into the
\path{counttorture.h} counter-evaluation scheme.
(Why not?)
Chapter~\ref{chp:Deferred Processing} will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
} \QuickQuizEnd
\subsection{Discussion}
These three implementations show that it is possible to obtain uniprocessor
performance for statistical counters, despite running on a parallel
machine.
\QuickQuiz{}
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
\QuickQuizAnswer{
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter?
Because in the increment-by-one case, the value returned will
be exact in the sense that the counter must necessarily have
taken on that value at some point in time, even if it is impossible
to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might
return values that are inconsistent with any global ordering
of operations.
To see this, suppose that thread~0 adds the value three to its
counter, thread~1 adds the value five to its counter, and
threads~2 and 3 sum the counters.
If the system is ``weakly ordered'' or if the compiler
uses aggressive optimizations, thread~2 might find the
sum to be three and thread~3 might find the sum to be five.
The only possible global orders of the sequence of values
of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone.
Michael Scott used this question to stump Paul E.~McKenney
during Paul's Ph.D. defense.
} \QuickQuizEnd
\QuickQuiz{}
Given that the reader must sum all the threads' counters,
this summation could take a long time when there are large
numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
\QuickQuizAnswer{
One approach would be to maintain a global approximation
to the value.
Updaters would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
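A minimal sketch of this approach, using a hypothetical flush
threshold and the atomic primitives seen earlier in this chapter:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define FLUSH_LIM 1000 /* hypothetical */

long __thread counter = 0;
atomic_t global_counter = ATOMIC_INIT(0);

void inc_count(void)
{
  if (++counter >= FLUSH_LIM) {
    atomic_add(counter, &global_counter);
    counter = 0;
  }
}

long read_count(void)
{
  return atomic_read(&global_counter);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Larger values of \co{FLUSH_LIM} reduce the frequency of the
atomic operation, but increase the worst-case error in
\co{read_count()}'s return value.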
The reader is encouraged to think up and try out other approaches,
for example, using a combining tree.
} \QuickQuizEnd
Given what has been presented in this section, you should now be able
to answer the Quick Quiz about statistical counters for networking
near the beginning of this chapter.
\section{Approximate Limit Counters}
\label{sec:count:Approximate Limit Counters}
Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-allocation limit
problem in \QuickQuizRef{\QcountQapproxcnt},
suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that this
limit is rarely exceeded, and that this limit is approximate in
that it is OK to exceed it sometimes by some bounded amount
(see Section~\ref{sec:count:Exact Limit Counters}
if you instead need the limit to be exact).
\subsection{Design}
One possible design for limit counters is to divide the limit of 10,000
by the number of threads, and give each thread a fixed pool of structures.
For example, given 100 threads, each thread would manage its own pool
of 100 structures.
This approach is simple, and in some cases works well, but it does not
handle the common case where a given structure is allocated by one
thread and freed by another~\cite{McKenney93}.
On the one hand, if a given thread takes credit for any structures it
frees, then the thread doing most of the allocating runs out
of structures, while the threads doing most of the freeing have lots
of credits that they cannot use.
On the other hand, if freed structures are credited to the CPU that
allocated them, it will be necessary for CPUs to manipulate each
others' counters, which will require expensive atomic instructions
or other means of communicating between threads.\footnote{
That said, if each structure will always be freed
by the same CPU (or thread) that allocated it, then
this simple partitioning approach works extremely well.}
In short, for many important workloads, we cannot fully partition the counter.
Given that partitioning the counters was what brought the excellent
update-side performance for the three schemes discussed in
Section~\ref{sec:count:Statistical Counters}, this might be grounds
for some pessimism.
However, the eventually consistent algorithm presented in
Section~\ref{sec:count:Eventually Consistent Implementation}
provides an interesting hint.
Recall that this algorithm kept two sets of books, a
per-thread \co{counter} variable for updaters and a \co{global_count}
variable for readers, with an \co{eventual()} thread that periodically
updated \co{global_count} to be eventually consistent with the values
of the per-thread \co{counter}.
The per-thread \co{counter} perfectly partitioned the counter value, while
\co{global_count} kept the full value.
For limit counters, we can use a variation on this theme, in that
we \emph{partially partition} the counter.
For example, each of four threads could have a per-thread
\co{counter}, but each could also have a per-thread maximum value
(call it \co{countermax}).
But then what happens if a given thread needs to increment its
\co{counter}, but \co{counter} is equal to its \co{countermax}?
The trick here is to move half of that thread's \co{counter} value
to a \co{globalcount}, then increment \co{counter}.
For example, if a given thread's \co{counter} and \co{countermax}
variables were both equal to 10, we do the following:
\begin{enumerate}
\item Acquire a global lock.
\item Add five to \co{globalcount}.
\item To balance out the addition, subtract five from this
thread's \co{counter}.
\item Release the global lock.
\item Increment this thread's \co{counter}, resulting in a value of six.
\end{enumerate}
Although this procedure still requires a global lock, that lock need only be
acquired once for every five increment operations, greatly reducing
that lock's level of contention.
We can reduce this contention as low as we wish by increasing the
value of \co{countermax}.
However, the corresponding penalty for increasing the value of
\co{countermax} is reduced accuracy of \co{globalcount}.
To see this, note that on a four-CPU system, if \co{countermax}
is equal to ten, \co{globalcount} will be in error by at most
40 counts.
In contrast, if \co{countermax} is increased to 100, \co{globalcount}
might be in error by as much as 400 counts.
This raises the question of just how much we care about \co{globalcount}'s
deviation from the aggregate value of the counter, where this aggregate value
is the sum of \co{globalcount} and each thread's \co{counter} variable.
The answer to this question depends on how far the aggregate value is
from the counter's limit (call it \co{globalcountmax}).
The larger the difference between these two values, the larger \co{countermax}
can be without risk of exceeding the \co{globalcountmax} limit.
This means that the
value of a given thread's \co{countermax} variable can be set
based on this difference.
When far from the limit, the \co{countermax} per-thread variables are
set to large values to optimize for performance and scalability, while
when close to the limit, these same variables are set to small values
to minimize the error in the checks against the \co{globalcountmax} limit.
This design is an example of \emph{parallel fastpath}, which is an important
design pattern in which the common case executes with no expensive
instructions and no interactions between threads, but where occasional
use is also made of a more conservatively designed
(and higher overhead) global algorithm.
This design pattern is covered in more detail in
Section~\ref{sec:SMPdesign:Parallel Fastpath}.
\subsection{Simple Limit Counter Implementation}
\label{sec:count:Simple Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Variables}
\label{fig:count:Simple Limit Counter Variables}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/count_lim}}
\caption{Simple Limit Counter Variable Relationships}
\label{fig:count:Simple Limit Counter Variable Relationships}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Variables}
shows both the per-thread and global variables used by this
implementation.
The per-thread \co{counter} and \co{countermax} variables are the
corresponding thread's local counter and the upper bound on that
counter, respectively.
The \co{globalcountmax} variable on line~3 contains the upper
bound for the aggregate counter, and the \co{globalcount} variable
on line~4 is the global counter.
The sum of \co{globalcount} and each thread's \co{counter} gives
the aggregate value of the overall counter.
The \co{globalreserve} variable on line~5 is the sum of all of the
per-thread \co{countermax} variables.
The relationship among these variables is shown by
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}:
\begin{enumerate}
\item The sum of \co{globalcount} and \co{globalreserve} must
be less than or equal to \co{globalcountmax}.
\item The sum of all threads' \co{countermax} values must be
less than or equal to \co{globalreserve}.
\item Each thread's \co{counter} must be less than or equal to
that thread's \co{countermax}.
\end{enumerate}
Each element of the \co{counterp[]} array references the corresponding
thread's \co{counter} variable, and, finally, the \co{gblcnt_mutex}
spinlock guards all of the global variables, in other words, no thread
is permitted to access or modify any of the global variables unless it
has acquired \co{gblcnt_mutex}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 if (countermax - counter >= delta) {
4 counter += delta;
5 return 1;
6 }
7 spin_lock(&gblcnt_mutex);
8 globalize_count();
9 if (globalcountmax -
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
23 counter -= delta;
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
27 globalize_count();
28 if (globalcount < delta) {
29 spin_unlock(&gblcnt_mutex);
30 return 0;
31 }
32 globalcount -= delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }
37
38 unsigned long read_count(void)
39 {
40 int t;
41 unsigned long sum;
42
43 spin_lock(&gblcnt_mutex);
44 sum = globalcount;
45 for_each_thread(t)
46 if (counterp[t] != NULL)
47 sum += *counterp[t];
48 spin_unlock(&gblcnt_mutex);
49 return sum;
50 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Add, Subtract, and Read}
\label{fig:count:Simple Limit Counter Add, Subtract, and Read}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
shows the \co{add_count()}, \co{sub_count()}, and \co{read_count()}
functions (\path{count_lim.c}).
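Before delving into the implementation, a usage sketch might help.
The \co{foo} structure and its allocation wrappers below are
hypothetical, but show how the return values of \co{add_count()}
and \co{sub_count()} are intended to be used:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
struct foo *foo_alloc(void)
{
  struct foo *p;

  if (!add_count(1))
    return NULL;  /* Limit exceeded. */
  p = malloc(sizeof(*p));
  if (p == NULL)
    sub_count(1); /* malloc() failed. */
  return p;
}

void foo_free(struct foo *p)
{
  free(p);
  sub_count(1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}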
\QuickQuiz{}
Why does
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
provide \co{add_count()} and \co{sub_count()} instead of the
	\co{inc_count()} and \co{dec_count()} interfaces shown in
Section~\ref{sec:count:Statistical Counters}?
\QuickQuizAnswer{
Because structures come in different sizes.
Of course, a limit counter corresponding to a specific size
of structure might still be able to use
\co{inc_count()} and \co{dec_count()}.
} \QuickQuizEnd
Lines~1-18 show \co{add_count()}, which adds the specified value \co{delta}
to the counter.
Line~3 checks to see if there is room for \co{delta} on this thread's
\co{counter}, and, if so, line~4 adds it and line~5 returns success.
This is the \co{add_count()} fastpath, and it does no atomic operations,
references only per-thread variables, and should not incur any cache misses.
\QuickQuiz{}
What is with the strange form of the condition on line~3 of
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not the following more intuitive form of the fastpath?
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  3   if (counter + delta <= countermax) {
4 counter += delta;
5 return 1;
6 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickQuizAnswer{
Two words.
``Integer overflow.''
Try the above formulation with \co{counter} equal to 10 and
\co{delta} equal to \co{ULONG_MAX}.
Then try it again with the code shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
A good understanding of integer overflow will be required for
the rest of this example, so if you have never dealt with
integer overflow before, please try several examples to get
the hang of it.
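	For example, the following toy program (a minimal sketch, not
	part of \path{count_lim.c}) demonstrates the difference:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <limits.h>
#include <stdio.h>

int main(void)
{
  unsigned long counter = 10;
  unsigned long countermax = 100;
  unsigned long delta = ULONG_MAX;

  /* Wraps around to 9, wrongly passing: */
  printf("%d\n", counter + delta <= countermax);
  /* 90 >= ULONG_MAX is false, correctly failing: */
  printf("%d\n", countermax - counter >= delta);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}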
Integer overflow can sometimes be more difficult to get right
than parallel algorithms!
} \QuickQuizEnd
If the test on line~3 fails, we must access global variables, and thus
must acquire \co{gblcnt_mutex} on line~7, which we release on line~11
in the failure case or on line~16 in the success case.
Line~8 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which clears the thread-local variables, adjusting the global variables
as needed, thus simplifying global processing.
(But don't take \emph{my} word for it: try coding it yourself!)
Lines~9 and 10 check to see if addition of \co{delta} can be accommodated,
with the meaning of the expression preceding the less-than sign shown in
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
as the difference in height of the two red (leftmost) bars.
If the addition of \co{delta} cannot be accommodated, then
line~11 (as noted earlier) releases \co{gblcnt_mutex} and line~12
returns indicating failure.
Otherwise, line~14 adds \co{delta} to \co{globalcount}, and then
line~15 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both the global and the per-thread variables.
This call to \co{balance_count()}
will usually set this thread's \co{countermax} to re-enable the fastpath.
Line~16 then releases
\co{gblcnt_mutex} (again, as noted earlier), and, finally,
line~17 returns indicating success.
\QuickQuiz{}
Why does \co{globalize_count()} zero the per-thread variables,
only to later call \co{balance_count()} to refill them in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not just leave the per-thread variables non-zero?
\QuickQuizAnswer{
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
} \QuickQuizEnd
Lines~20-36 show \co{sub_count()}, which subtracts the specified
\co{delta} from the counter.
Line~22 checks to see if the per-thread counter can accommodate
this subtraction, and, if so, line~23 does the subtraction and
line~24 returns success.
These lines form \co{sub_count()}'s fastpath, and, as with
\co{add_count()}, this fastpath executes no costly operations.
If the fastpath cannot accommodate subtraction of \co{delta},
execution proceeds to the slowpath on lines~26-35.
Because the slowpath must access global state, line~26
acquires \co{gblcnt_mutex}, which is released either by line~29
(in case of failure) or by line~34 (in case of success).
Line~27 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which again clears the thread-local variables, adjusting the global variables
as needed.
Line~28 checks to see if the counter can accommodate subtracting
\co{delta}, and, if not, line~29 releases \co{gblcnt_mutex}
(as noted earlier) and line~30 returns failure.
\QuickQuiz{}
Given that \co{globalreserve} counted against us in \co{add_count()},
why doesn't it count for us in \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
\QuickQuizAnswer{
The \co{globalreserve} variable tracks the sum of all threads'
\co{countermax} variables.
The sum of these threads' \co{counter} variables might be anywhere
from zero to \co{globalreserve}.
We must therefore take a conservative approach, assuming that
all threads' \co{counter} variables are full in \co{add_count()}
and that they are all empty in \co{sub_count()}.
But remember this question, as we will come back to it later.
} \QuickQuizEnd
\QuickQuiz{}
Suppose that one thread invokes \co{add_count()} shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read},
and then another thread invokes \co{sub_count()}.
Won't \co{sub_count()} return failure even though the value of
the counter is non-zero?
\QuickQuizAnswer{
Indeed it will!
In many cases, this will be a problem, as discussed in
Section~\ref{sec:count:Simple Limit Counter Discussion}, and
in those cases the algorithms from
Section~\ref{sec:count:Exact Limit Counters}
will likely be preferable.
} \QuickQuizEnd
If, on the other hand, line~28 finds that the counter \emph{can}
accommodate subtracting \co{delta}, we complete the slowpath.
Line~32 does the subtraction and then
line~33 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both global and per-thread variables
(hopefully re-enabling the fastpath).
Then line~34 releases \co{gblcnt_mutex}, and line~35 returns success.
\QuickQuiz{}
Why have both \co{add_count()} and \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not simply pass a negative number to \co{add_count()}?
\QuickQuizAnswer{
Given that \co{add_count()} takes an \co{unsigned} \co{long}
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
} \QuickQuizEnd
Lines~38-50 show \co{read_count()}, which returns the aggregate value
of the counter.
It acquires \co{gblcnt_mutex} on line~43 and releases it on line~48,
excluding global operations from \co{add_count()} and \co{sub_count()},
and, as we will see, also excluding thread creation and exit.
Line~44 initializes local variable \co{sum} to the value of
\co{globalcount}, and then the loop spanning lines~45-47 sums the
per-thread \co{counter} variables.
Line~49 then returns the sum.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void balance_count(void)
10 {
11 countermax = globalcountmax -
12 globalcount - globalreserve;
13 countermax /= num_online_threads();
14 globalreserve += countermax;
15 counter = countermax / 2;
16 if (counter > globalcount)
17 counter = globalcount;
18 globalcount -= counter;
19 }
20
21 void count_register_thread(void)
22 {
23 int idx = smp_thread_id();
24
25 spin_lock(&gblcnt_mutex);
26 counterp[idx] = &counter;
27 spin_unlock(&gblcnt_mutex);
28 }
29
30 void count_unregister_thread(int nthreadsexpected)
31 {
32 int idx = smp_thread_id();
33
34 spin_lock(&gblcnt_mutex);
35 globalize_count();
36 counterp[idx] = NULL;
37 spin_unlock(&gblcnt_mutex);
38 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Utility Functions}
\label{fig:count:Simple Limit Counter Utility Functions}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}
shows a number of utility functions used by the \co{add_count()},
\co{sub_count()}, and \co{read_count()} primitives shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
Lines~1-7 show \co{globalize_count()}, which zeros the current thread's
per-thread counters, adjusting the global variables appropriately.
It is important to note that this function does not change the aggregate
value of the counter, but instead changes how the counter's current value
is represented.
Line~3 adds the thread's \co{counter} variable to \co{globalcount},
and line~4 zeroes \co{counter}.
Similarly, line~5 subtracts the per-thread \co{countermax} from
\co{globalreserve}, and line~6 zeroes \co{countermax}.
It is helpful to refer to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
when reading both this function and \co{balance_count()}, which is next.
Lines~9-19 show \co{balance_count()}, which is roughly speaking
the inverse of \co{globalize_count()}.
This function's job is to set the current thread's
\co{countermax} variable to the largest value that avoids the risk
of the counter exceeding the \co{globalcountmax} limit.
Changing the current thread's \co{countermax} variable of course
requires corresponding adjustments to \co{counter}, \co{globalcount}
and \co{globalreserve}, as can be seen by referring back to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
By doing this, \co{balance_count()} maximizes use of
\co{add_count()}'s and \co{sub_count()}'s low-overhead fastpaths.
As with \co{globalize_count()}, \co{balance_count()} is not permitted
to change the aggregate value of the counter.
Lines~11-13 compute this thread's share of that portion of
\co{globalcountmax} that is not already covered by either
\co{globalcount} or \co{globalreserve}, and assign the
computed quantity to this thread's \co{countermax}.
Line~14 makes the corresponding adjustment to \co{globalreserve}.
Line~15 sets this thread's \co{counter} to the middle of the range
from zero to \co{countermax}.
Line~16 checks to see whether \co{globalcount} can in fact accommodate
this value of \co{counter}, and, if not, line~17 decreases \co{counter}
accordingly.
Finally, in either case, line~18 makes the corresponding adjustment to
\co{globalcount}.
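To make this concrete, consider a purely hypothetical example in which
\co{globalcountmax} is 10,000, \co{globalcount} is 2,000,
\co{globalreserve} is 2,000, and there are four threads.
Lines~11-13 then set this thread's \co{countermax} to
(10,000 - 2,000 - 2,000) / 4 = 1,500,
line~14 increases \co{globalreserve} to 3,500,
line~15 sets \co{counter} to 750,
the test on line~16 does not trigger, and
line~18 decreases \co{globalcount} to 1,250.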
\QuickQuiz{}
Why set \co{counter} to \co{countermax / 2} in line~15 of
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}?
Wouldn't it be simpler to just take \co{countermax} counts?
\QuickQuizAnswer{
First, it really is reserving \co{countermax} counts
(see line~14), however,
it adjusts so that only half of these are actually in use
by the thread at the moment.
This allows the thread to carry out at least \co{countermax / 2}
increments or decrements before having to refer back to
\co{globalcount} again.
Note that the accounting in \co{globalcount} remains accurate,
thanks to the adjustment in line~18.
} \QuickQuizEnd
\begin{figure*}[tb]
\centering
\resizebox{5in}{!}{\includegraphics{count/globbal}}
\caption{Schematic of Globalization and Balancing}
\label{fig:count:Schematic of Globalization and Balancing}
\end{figure*}
It is helpful to look at a schematic depicting how the relationship
of the counters changes with the execution of first
\co{globalize_count()} and then \co{balance_count()}, as shown in
Figure~\ref{fig:count:Schematic of Globalization and Balancing}.
Time advances from left to right, with the leftmost configuration
roughly that of
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
The center configuration shows the relationship of these same counters
after \co{globalize_count()} is executed by thread~0.
As can be seen from the figure, thread~0's \co{counter} (``c~0'' in
the figure) is added to \co{globalcount}, while the value of
\co{globalreserve} is reduced by this same amount.
Both thread~0's \co{counter} and its \co{countermax}
(``cm~0'' in the figure) are reduced to zero.
The other three threads' counters are unchanged.
Note that this change did not affect the overall value of the counter,
as indicated by the bottommost dotted line connecting the leftmost
and center configurations.
In other words, the sum of \co{globalcount} and the four threads'
\co{counter} variables is the same in both configurations.
Similarly, this change did not affect the sum of \co{globalcount} and
\co{globalreserve}, as indicated by the upper dotted line.
The rightmost configuration shows the relationship of these counters
after \co{balance_count()} is executed, again by thread~0.
One-quarter of the remaining count, denoted by the vertical line extending
up from all three configurations, is added to thread~0's
\co{countermax} and half of that to thread~0's \co{counter}.
The amount added to thread~0's \co{counter} is also subtracted from
\co{globalcount} in order to avoid changing the overall value of the
counter (which is again the sum of \co{globalcount} and the four
threads' \co{counter} variables), again as indicated by the lowermost
of the two dotted lines connecting the center and rightmost configurations.
The \co{globalreserve} variable is also adjusted so that this variable
remains equal to the sum of the four threads' \co{countermax}
variables.
Because thread~0's \co{counter} is less than its \co{countermax},
thread~0 can once again increment the counter locally.
\QuickQuiz{}
In Figure~\ref{fig:count:Schematic of Globalization and Balancing},
even though a quarter of the remaining count up to the limit is
assigned to thread~0, only an eighth of the remaining count is
consumed, as indicated by the uppermost dotted line connecting
the center and the rightmost configurations.
Why is that?
\QuickQuizAnswer{
The reason this happened is that thread~0's \co{counter} was
set to half of its \co{countermax}.
Thus, of the quarter assigned to thread~0, half of that quarter
(one eighth) came from \co{globalcount}, leaving the other half
(again, one eighth) to come from the remaining count.
There are two purposes for taking this approach:
(1)~To allow thread~0 to use the fastpath for decrements as
well as increments, and
(2)~To reduce the inaccuracies if all threads are monotonically
incrementing up towards the limit.
To see this last point, step through the algorithm and watch
what it does.
} \QuickQuizEnd
Lines~21-28 show \co{count_register_thread()}, which sets up state for
newly created threads.
This function simply installs
a pointer to the newly created thread's \co{counter} variable into
the corresponding entry of the \co{counterp[]} array under the protection
of \co{gblcnt_mutex}.
Finally, lines~30-38 show \co{count_unregister_thread()}, which tears down
state for a soon-to-be-exiting thread.
Line~34 acquires \co{gblcnt_mutex} and line~37 releases it.
Line~35 invokes \co{globalize_count()} to clear out this thread's
counter state, and line~36 clears this thread's entry in the
\co{counterp[]} array.
\subsection{Simple Limit Counter Discussion}
\label{sec:count:Simple Limit Counter Discussion}
This type of counter is quite fast when aggregate values are near zero,
with some overhead due to the comparison and branch in both
\co{add_count()}'s and \co{sub_count()}'s fastpaths.
However, the use of a per-thread \co{countermax} reserve means that
\co{add_count()} can fail even when
the aggregate value of the counter is nowhere near \co{globalcountmax}.
Similarly, \co{sub_count()} can fail
even when the aggregate value of the counter is nowhere near zero.
In many cases, this is unacceptable.
Even if the \co{globalcountmax} is intended to be an approximate limit,
there is usually a limit to exactly how much approximation can be tolerated.
One way to limit the degree of approximation is to impose an upper limit
on the value of the per-thread \co{countermax} instances.
This task is undertaken in the next section.
\subsection{Approximate Limit Counter Implementation}
\label{sec:count:Approximate Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
8 #define MAX_COUNTERMAX 100
\end{verbbox}
}
\centering
\theverbbox
\caption{Approximate Limit Counter Variables}
\label{fig:count:Approximate Limit Counter Variables}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void balance_count(void)
2 {
3 countermax = globalcountmax -
4 globalcount - globalreserve;
5 countermax /= num_online_threads();
6 if (countermax > MAX_COUNTERMAX)
7 countermax = MAX_COUNTERMAX;
8 globalreserve += countermax;
9 counter = countermax / 2;
10 if (counter > globalcount)
11 counter = globalcount;
12 globalcount -= counter;
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Approximate Limit Counter Balancing}
\label{fig:count:Approximate Limit Counter Balancing}
\end{figure}
Because this implementation (\path{count_lim_app.c}) is quite similar to
that in the previous section
(Figures~\ref{fig:count:Simple Limit Counter Variables},
\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}, and
\ref{fig:count:Simple Limit Counter Utility Functions}),
only the changes are shown here.
Figure~\ref{fig:count:Approximate Limit Counter Variables}
is identical to
Figure~\ref{fig:count:Simple Limit Counter Variables},
with the addition of \co{MAX_COUNTERMAX}, which sets the maximum
permissible value of the per-thread \co{countermax} variable.
Similarly,
Figure~\ref{fig:count:Approximate Limit Counter Balancing}
is identical to the \co{balance_count()} function in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
with the addition of lines~6 and 7, which enforce the
\co{MAX_COUNTERMAX} limit on the per-thread \co{countermax} variable.
\subsection{Approximate Limit Counter Discussion}
These changes greatly reduce the limit inaccuracy seen in the previous version,
but present another problem: any given value of \co{MAX_COUNTERMAX}
will cause a workload-dependent fraction of accesses to fall off the
fastpath.
As the number of threads increases, non-fastpath execution will become both
a performance and a scalability problem.
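For example, if \co{MAX_COUNTERMAX} is 100, then a thread that has just
executed \co{balance_count()} holds at most 100 reserved counts, half of
which are initially in use, so that thread can carry out at most about
50 increments (or 50 decrements) before falling off the fastpath.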
However, we will defer this problem and turn instead to counters
with exact limits.
\section{Exact Limit Counters}
\label{sec:count:Exact Limit Counters}
To solve the exact structure-allocation limit problem noted in
\QuickQuizRef{\QcountQexactcnt},
we need a limit counter that can tell exactly when its limits are
exceeded.
One way of implementing such a limit counter is to
cause threads that have reserved counts to give them up.
One way to do this is to use atomic instructions.
Of course, atomic instructions will slow down the fastpath, but on the
other hand, it would be silly not to at least give them a try.
\subsection{Atomic Limit Counter Implementation}
\label{sec:count:Atomic Limit Counter Implementation}
Unfortunately,
if one thread is to safely remove counts from another thread,
both threads will need to atomically manipulate that thread's
\co{counter} and \co{countermax} variables.
The usual way to do this is to combine these two variables into a
single variable,
for example, given a 32-bit variable, using the high-order 16 bits to
represent \co{counter} and the low-order 16 bits to represent
\co{countermax}.
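For example, given 16 bits per field, a \co{counter} value of 3 and a
\co{countermax} value of 5 would be represented by \co{0x00030005}.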
\QuickQuiz{}
Why is it necessary to atomically manipulate the thread's
\co{counter} and \co{countermax} variables as a unit?
Wouldn't it be good enough to atomically manipulate them
individually?
\QuickQuizAnswer{
This might well be possible, but great care is required.
Note that removing \co{counter} without first zeroing
\co{countermax} could result in the corresponding thread
increasing \co{counter} immediately after it was zeroed,
completely negating the effect of zeroing the counter.
The opposite ordering, namely zeroing \co{countermax} and then
removing \co{counter}, can also result in a non-zero
\co{counter}.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~A fetches its \co{countermax}, and finds that
it is non-zero.
\item Thread~B zeroes Thread~A's \co{countermax}.
\item Thread~B removes Thread~A's \co{counter}.
\item Thread~A, having found that its \co{countermax}
is non-zero, proceeds to add to its \co{counter},
resulting in a non-zero value for \co{counter}.
\end{enumerate}
Again, it might well be possible to atomically manipulate
\co{countermax} and \co{counter} as separate variables,
but it is clear that great care is required.
It is also quite likely that doing so will slow down the
fastpath.
	Exploring these possibilities is left as an exercise for
	the reader.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic_t __thread ctrandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 10000;
3 unsigned long globalcount = 0;
4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
6 DEFINE_SPINLOCK(gblcnt_mutex);
7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
9
10 static void
11 split_ctrandmax_int(int cami, int *c, int *cm)
12 {
13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static void
18 split_ctrandmax(atomic_t *cam, int *old,
19 int *c, int *cm)
20 {
21 unsigned int cami = atomic_read(cam);
22
23 *old = cami;
24 split_ctrandmax_int(cami, c, cm);
25 }
26
27 static int merge_ctrandmax(int c, int cm)
28 {
29 unsigned int cami;
30
31 cami = (c << CM_BITS) | cm;
32 return ((int)cami);
33 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Variables and Access Functions}
\label{fig:count:Atomic Limit Counter Variables and Access Functions}
\end{figure}
The variables and access functions for a simple atomic limit counter
are shown in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
(\path{count_lim_atomic.c}).
The \co{counter} and \co{countermax} variables in earlier algorithms
are combined into the single variable \co{ctrandmax} shown on
line~1, with \co{counter} in the upper half and \co{countermax} in
the lower half.
This variable is of type \co{atomic_t}, which has an underlying
representation of \co{int}.
Lines~2-6 show the definitions for \co{globalcountmax}, \co{globalcount},
\co{globalreserve}, \co{counterp}, and \co{gblcnt_mutex}, all of which
take on roles similar to their counterparts in
Figure~\ref{fig:count:Approximate Limit Counter Variables}.
Line~7 defines \co{CM_BITS}, which gives the number of bits in each half
of \co{ctrandmax}, and line~8 defines \co{MAX_COUNTERMAX}, which
gives the maximum value that may be held in either half of
\co{ctrandmax}.
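For example, on a typical system where \co{sizeof(atomic_t)} is four,
\co{CM_BITS} is 16 and \co{MAX_COUNTERMAX} is 65,535.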
\QuickQuiz{}
In what way does line~7 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
violate the C standard?
\QuickQuizAnswer{
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
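	One approach (a sketch only) would be to define \co{CM_BITS} as
	\co{sizeof(atomic_t) * CHAR_BIT / 2}, with \co{CHAR_BIT} taken
	from \co{limits.h}, though this still assumes that \co{atomic_t}
	contains no padding bits.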
} \QuickQuizEnd
Lines~10-15 show the \co{split_ctrandmax_int()} function, which,
when given the underlying \co{int} from the \co{atomic_t
ctrandmax} variable, splits it into its \co{counter} (\co{c})
and \co{countermax} (\co{cm}) components.
Line~13 isolates the most-significant half of this \co{int},
placing the result as specified by argument \co{c},
and line~14 isolates the least-significant half of this \co{int},
placing the result as specified by argument \co{cm}.
Lines~17-25 show the \co{split_ctrandmax()} function, which
picks up the underlying \co{int} from the specified variable
on line~21, stores it as specified by the \co{old} argument on
line~23, and then invokes \co{split_ctrandmax_int()} to split
it on line~24.
\QuickQuiz{}
Given that there is only one \co{ctrandmax} variable,
why bother passing in a pointer to it on line~18 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
\QuickQuizAnswer{
There is only one \co{ctrandmax} variable \emph{per thread}.
Later, we will see code that needs to pass other threads'
\co{ctrandmax} variables to \co{split_ctrandmax()}.
} \QuickQuizEnd
Lines~27-33 show the \co{merge_ctrandmax()} function, which
can be thought of as the inverse of \co{split_ctrandmax()}.
Line~31 merges the \co{counter} and \co{countermax}
values passed in \co{c} and \co{cm}, respectively, and returns
the result.
\QuickQuiz{}
Why does \co{merge_ctrandmax()} in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
return an \co{int} rather than storing directly into an
\co{atomic_t}?
\QuickQuizAnswer{
Later, we will see that we need the \co{int} return to pass
to the \co{atomic_cmpxchg()} primitive.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 int c;
4 int cm;
5 int old;
6 int new;
7
8 do {
9 split_ctrandmax(&ctrandmax, &old, &c, &cm);
10 if (delta > MAX_COUNTERMAX || c + delta > cm)
11 goto slowpath;
12 new = merge_ctrandmax(c + delta, cm);
13 } while (atomic_cmpxchg(&ctrandmax,
14 old, new) != old);
15 return 1;
16 slowpath:
17 spin_lock(&gblcnt_mutex);
18 globalize_count();
19 if (globalcountmax - globalcount -
20 globalreserve < delta) {
21 flush_local_count();
22 if (globalcountmax - globalcount -
23 globalreserve < delta) {
24 spin_unlock(&gblcnt_mutex);
25 return 0;
26 }
27 }
28 globalcount += delta;
29 balance_count();
30 spin_unlock(&gblcnt_mutex);
31 return 1;
32 }
33
34 int sub_count(unsigned long delta)
35 {
36 int c;
37 int cm;
38 int old;
39 int new;
40
41 do {
42 split_ctrandmax(&ctrandmax, &old, &c, &cm);
43 if (delta > c)
44 goto slowpath;
45 new = merge_ctrandmax(c - delta, cm);
46 } while (atomic_cmpxchg(&ctrandmax,
47 old, new) != old);
48 return 1;
49 slowpath:
50 spin_lock(&gblcnt_mutex);
51 globalize_count();
52 if (globalcount < delta) {
53 flush_local_count();
54 if (globalcount < delta) {
55 spin_unlock(&gblcnt_mutex);
56 return 0;
57 }
58 }
59 globalcount -= delta;
60 balance_count();
61 spin_unlock(&gblcnt_mutex);
62 return 1;
63 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Add and Subtract}
\label{fig:count:Atomic Limit Counter Add and Subtract}
\end{figure}
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
shows the \co{add_count()} and \co{sub_count()} functions.
Lines~1-32 show \co{add_count()}, whose fastpath spans lines~8-15,
with the remainder of the function being the slowpath.
Lines~8-14 of the fastpath form a compare-and-swap (CAS) loop, with
the \co{atomic_cmpxchg()} primitives on lines~13-14 performing the
actual CAS.
Line~9 splits the current thread's \co{ctrandmax} variable into its
\co{counter} (in \co{c}) and \co{countermax} (in \co{cm}) components,
while placing the underlying \co{int} into \co{old}.
Line~10 checks whether the amount \co{delta} can be accommodated
locally (taking care to avoid integer overflow), and if not,
line~11 transfers to the slowpath.
Otherwise, line~12 combines an updated \co{counter} value with the
original \co{countermax} value into \co{new}.
The \co{atomic_cmpxchg()} primitive on lines~13-14 then atomically
compares this thread's \co{ctrandmax} variable to \co{old},
updating its value to \co{new} if the comparison succeeds.
If the comparison succeeds, line~15 returns success; otherwise,
execution continues in the loop at line~9.
\QuickQuiz{}
Yecch!
Why the ugly \co{goto} on line~11 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
Haven't you heard of the \co{break} statement???
\QuickQuizAnswer{
Replacing the \co{goto} with a \co{break} would require keeping
a flag to determine whether or not line~15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the \co{goto} that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
} \QuickQuizEnd
\QuickQuiz{}
Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
ever fail?
After all, we picked up its old value on line~9 and have not
changed it!
\QuickQuizAnswer{
Later, we will see how the \co{flush_local_count()} function in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
might update this thread's \co{ctrandmax} variable concurrently
with the execution of the fastpath on lines~8-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
} \QuickQuizEnd
Lines~16-31 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{add_count()}'s slowpath, which is protected by \co{gblcnt_mutex},
acquired on line~17 and released on lines~24 and 30.
Line~18 invokes \co{globalize_count()}, which moves this thread's
state to the global counters.
Lines~19-20 check whether the \co{delta} value can be accommodated by
the current global state, and, if not, line~21 invokes
\co{flush_local_count()} to flush all threads' local state to the
global counters, and then lines~22-23 recheck whether \co{delta} can
be accommodated.
If, after all that, the addition of \co{delta} still cannot be accommodated,
then line~24 releases \co{gblcnt_mutex} (as noted earlier), and
then line~25 returns failure.
Otherwise, line~28 adds \co{delta} to the global counter, line~29
spreads counts to the local state if appropriate, line~30 releases
\co{gblcnt_mutex} (again, as noted earlier), and finally, line~31
returns success.
Lines~34-63 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{sub_count()}, which is structured similarly to
\co{add_count()}, having a fastpath on lines~41-48 and a slowpath on
lines~49-62.
A line-by-line analysis of this function is left as an exercise to
the reader.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long read_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6 int t;
7 unsigned long sum;
8
9 spin_lock(&gblcnt_mutex);
10 sum = globalcount;
11 for_each_thread(t)
12 if (counterp[t] != NULL) {
13 split_ctrandmax(counterp[t], &old, &c, &cm);
14 sum += c;
15 }
16 spin_unlock(&gblcnt_mutex);
17 return sum;
18 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Read}
\label{fig:count:Atomic Limit Counter Read}
\end{figure}
Figure~\ref{fig:count:Atomic Limit Counter Read} shows \co{read_count()}.
Line~9 acquires \co{gblcnt_mutex} and line~16 releases it.
Line~10 initializes local variable \co{sum} to the value of
\co{globalcount}, and the loop spanning lines~11-15 adds the
per-thread counters to this sum, isolating each per-thread counter
using \co{split_ctrandmax()} on line~13.
Finally, line~17 returns the sum.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6
7 split_ctrandmax(&ctrandmax, &old, &c, &cm);
8 globalcount += c;
9 globalreserve -= cm;
10 old = merge_ctrandmax(0, 0);
11 atomic_set(&ctrandmax, old);
12 }
13
14 static void flush_local_count(void)
15 {
16 int c;
17 int cm;
18 int old;
19 int t;
20 int zero;
21
22 if (globalreserve == 0)
23 return;
24 zero = merge_ctrandmax(0, 0);
25 for_each_thread(t)
26 if (counterp[t] != NULL) {
27 old = atomic_xchg(counterp[t], zero);
28 split_ctrandmax_int(old, &c, &cm);
29 globalcount += c;
30 globalreserve -= cm;
31 }
32 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Utility Functions 1}
\label{fig:count:Atomic Limit Counter Utility Functions 1}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 static void balance_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6 unsigned long limit;
7
8 limit = globalcountmax - globalcount -
9 globalreserve;
10 limit /= num_online_threads();
11 if (limit > MAX_COUNTERMAX)
12 cm = MAX_COUNTERMAX;
13 else
14 cm = limit;
15 globalreserve += cm;
16 c = cm / 2;
17 if (c > globalcount)
18 c = globalcount;
19 globalcount -= c;
20 old = merge_ctrandmax(c, cm);
21 atomic_set(&ctrandmax, old);
22 }
23
24 void count_register_thread(void)
25 {
26 int idx = smp_thread_id();
27
28 spin_lock(&gblcnt_mutex);
29 counterp[idx] = &ctrandmax;
30 spin_unlock(&gblcnt_mutex);
31 }
32
33 void count_unregister_thread(int nthreadsexpected)
34 {
35 int idx = smp_thread_id();
36
37 spin_lock(&gblcnt_mutex);
38 globalize_count();
39 counterp[idx] = NULL;
40 spin_unlock(&gblcnt_mutex);
41 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Utility Functions 2}
\label{fig:count:Atomic Limit Counter Utility Functions 2}
\end{figure}
Figures~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
and~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
show the utility functions
\co{globalize_count()},
\co{flush_local_count()},
\co{balance_count()},
\co{count_register_thread()}, and
\co{count_unregister_thread()}.
The code for \co{globalize_count()} is shown on lines~1-12
of Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1} and
is similar to that of previous algorithms, with the addition of
line~7, which is now required to split out \co{counter} and
\co{countermax} from \co{ctrandmax}.
The code for \co{flush_local_count()}, which moves all threads' local
counter state to the global counter, is shown on lines~14-32.
Line~22 checks to see if the value of \co{globalreserve} permits
any per-thread counts, and, if not, line~23 returns.
Otherwise, line~24 initializes local variable \co{zero} to a combined
zeroed \co{counter} and \co{countermax}.
The loop spanning lines~25-31 sequences through each thread.
Line~26 checks to see if the current thread has counter state,
and, if so, lines~27-30 move that state to the global counters.
Line~27 atomically fetches the current thread's state
while replacing it with zero.
Line~28 splits this state into its \co{counter} (in local variable \co{c})
and \co{countermax} (in local variable \co{cm}) components.
Line~29 adds this thread's \co{counter} to \co{globalcount}, while
line~30 subtracts this thread's \co{countermax} from \co{globalreserve}.
\QuickQuiz{}
What stops a thread from simply refilling its
\co{ctrandmax} variable immediately after
\co{flush_local_count()} on line~14 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
empties it?
\QuickQuizAnswer{
This other thread cannot refill its \co{ctrandmax}
until the caller of \co{flush_local_count()} releases the
\co{gblcnt_mutex}.
By that time, the caller of \co{flush_local_count()} will have
finished making use of the counts, so there will be no problem
with this other thread refilling---assuming that the value
of \co{globalcount} is large enough to permit a refill.
} \QuickQuizEnd
\QuickQuiz{}
What prevents concurrent execution of the fastpath of either
\co{add_count()} or \co{sub_count()} from interfering with
the \co{ctrandmax} variable while
\co{flush_local_count()} is accessing it on line~27 of
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}?
\QuickQuizAnswer{
Nothing.
Consider the following three cases:
\begin{enumerate}
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
before the \co{split_ctrandmax()} of either fastpath,
then the fastpath will see a zero \co{counter} and
\co{countermax}, and will thus transfer to the slowpath
(unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{split_ctrandmax()} of either fastpath,
but before that fastpath's \co{atomic_cmpxchg()},
then the \co{atomic_cmpxchg()} will fail, causing the
fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{atomic_cmpxchg()} of either fastpath,
then the fastpath will (most likely) complete successfully
before \co{flush_local_count()} zeroes the thread's
\co{ctrandmax} variable.
\end{enumerate}
Either way, the race is resolved correctly.
} \QuickQuizEnd
Lines~1-22 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
show the code for \co{balance_count()}, which refills
the calling thread's local \co{ctrandmax} variable.
This function is quite similar to that of the preceding algorithms,
with changes required to handle the merged \co{ctrandmax} variable.
Detailed analysis of the code is left as an exercise for the reader,
as it is with the \co{count_register_thread()} function starting on
line~24 and the \co{count_unregister_thread()} function starting on
line~33.
\QuickQuiz{}
Given that the \co{atomic_set()} primitive does a simple
store to the specified \co{atomic_t}, how can line~21 of
\co{balance_count()} in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
work correctly in face of concurrent \co{flush_local_count()}
updates to this variable?
\QuickQuizAnswer{
	The callers of both \co{balance_count()} and
	\co{flush_local_count()} hold \co{gblcnt_mutex}, so
	only one of the two may be executing at a given time.
} \QuickQuizEnd
The next section qualitatively evaluates this design.
\subsection{Atomic Limit Counter Discussion}
This is the first implementation that actually allows the counter to
be run all the way to either of its limits, but it does so at the
expense of adding atomic operations to the fastpaths, which slow down
the fastpaths significantly on some systems.
Although some workloads might tolerate this slowdown, it is worthwhile
looking for algorithms with better read-side performance.
One such algorithm uses a signal handler to steal counts from other
threads.
Because signal handlers run in the context of the signaled thread,
atomic operations are not necessary, as shown in the next section.
\QuickQuiz{}
But signal handlers can be migrated to some other
CPU while running.
	Doesn't this possibility mean that atomic instructions
	and memory barriers are required to reliably communicate
between a thread and a signal handler that interrupts that
thread?
\QuickQuizAnswer{
No.
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
} \QuickQuizEnd
\subsection{Signal-Theft Limit Counter Design}
\label{sec:count:Signal-Theft Limit Counter Design}
\begin{figure}[tb]
\centering
\resizebox{2in}{!}{\includegraphics{count/sig-theft}}
\caption{Signal-Theft State Machine}
\label{fig:count:Signal-Theft State Machine}
\end{figure}
Even though per-thread state will now be manipulated only by the
corresponding thread, there will still need to be synchronization
with the signal handlers.
This synchronization is provided by the state machine shown in
Figure~\ref{fig:count:Signal-Theft State Machine}.
The state machine starts out in the IDLE state, and when \co{add_count()}
or \co{sub_count()} find that the combination of the local thread's count
and the global count cannot accommodate the request, the corresponding
slowpath sets each thread's \co{theft} state to REQ (unless that thread
has no count, in which case it transitions directly to READY).
Only the slowpath, which holds the \co{gblcnt_mutex} lock, is permitted to
transition from the IDLE state, as indicated by the green color.\footnote{
For those with black-and-white versions of this book,
IDLE and READY are green, REQ is red, and ACK is blue.}
The slowpath then sends a signal to each thread, and the corresponding
signal handler checks the corresponding thread's \co{theft} and
\co{counting} variables.
If the \co{theft} state is not REQ, then the signal handler is not
permitted to change the state, and therefore simply returns.
Otherwise, if the \co{counting} variable is set, indicating that
the current thread's fastpath is in progress, the signal handler
sets the \co{theft} state to ACK, otherwise to READY.
If the \co{theft} state is ACK,
only the fastpath is permitted to change
the \co{theft} state, as indicated by the blue color.
When the fastpath completes, it sets the \co{theft} state to READY.
Once the slowpath sees a thread's \co{theft} state is READY, the
slowpath is permitted to steal that thread's count.
The slowpath then sets that thread's \co{theft} state to IDLE.
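The following summary (distilled from the above description) lists the
permitted transitions along with the code that may carry each one out:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
IDLE  -> REQ    slowpath (thread has count)
IDLE  -> READY  slowpath (thread has no count)
REQ   -> ACK    signal handler (fastpath running)
REQ   -> READY  signal handler (no fastpath)
ACK   -> READY  fastpath, upon completion
READY -> IDLE   slowpath, after stealing the count
\end{verbatim}
\end{minipage}
\vspace{5pt}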
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
the REQ \co{theft} state colored red?
\QuickQuizAnswer{
	To indicate that only the signal handler is permitted to change the
\co{theft} state, and that if the thread remains in this
state for too long, the thread running the slowpath will
resend the POSIX signal.
} \QuickQuizEnd
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
the point of having separate REQ and ACK \co{theft} states?
Why not simplify the state machine by collapsing
them into a single REQACK state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
\QuickQuizAnswer{
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
\begin{enumerate}
\item The slowpath uses the REQ and ACK states to determine
whether the signal should be retransmitted.
If the states were collapsed, the slowpath would have
no choice but to send redundant signals, which would
have the unhelpful effect of needlessly slowing down
the fastpath.
\item The following race would result:
\begin{enumerate}
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and
notes the REQACK state.
\item The thread receives the signal, which also notes
the REQACK state, and, because there is no fastpath
in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the
count, and sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling
further fastpath execution for this thread.
\end{enumerate}
The basic problem here is that the combined REQACK state
can be referenced by both the signal handler and the
fastpath.
The clear separation maintained by the four-state
setup ensures orderly state transitions.
\end{enumerate}
That said, you might well be able to make a three-state setup
work correctly.
If you do succeed, compare carefully to the four-state setup.
Is the three-state solution really preferable, and why or why not?
} \QuickQuizEnd
\subsection{Signal-Theft Limit Counter Implementation}
\label{sec:count:Signal-Theft Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
3 #define THEFT_ACK 2
4 #define THEFT_READY 3
5
6 int __thread theft = THEFT_IDLE;
7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
9 unsigned long __thread countermax = 0;
10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
12 unsigned long globalreserve = 0;
13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
15 int *theftp[NR_THREADS] = { NULL };
16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Data}
\label{fig:count:Signal-Theft Limit Counter Data}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Data}
(\path{count_lim_sig.c})
shows the data structures used by the signal-theft based counter
implementation.
Lines~1-7 define the states and values for the per-thread theft state machine
described in the preceding section.
Lines~8-17 are similar to earlier implementations, with the addition of
lines~14 and 15 to allow remote access to a thread's \co{countermax}
and \co{theft} variables, respectively.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void flush_local_count_sig(int unused)
10 {
11 if (ACCESS_ONCE(theft) != THEFT_REQ)
12 return;
13 smp_mb();
14 ACCESS_ONCE(theft) = THEFT_ACK;
15 if (!counting) {
16 ACCESS_ONCE(theft) = THEFT_READY;
17 }
18 smp_mb();
19 }
20
21 static void flush_local_count(void)
22 {
23 int t;
24 thread_id_t tid;
25
26 for_each_tid(t, tid)
27 if (theftp[t] != NULL) {
28 if (*countermaxp[t] == 0) {
29 ACCESS_ONCE(*theftp[t]) = THEFT_READY;
30 continue;
31 }
32 ACCESS_ONCE(*theftp[t]) = THEFT_REQ;
33 pthread_kill(tid, SIGUSR1);
34 }
35 for_each_tid(t, tid) {
36 if (theftp[t] == NULL)
37 continue;
38 while (ACCESS_ONCE(*theftp[t]) != THEFT_READY) {
39 poll(NULL, 0, 1);
40 if (ACCESS_ONCE(*theftp[t]) == THEFT_REQ)
41 pthread_kill(tid, SIGUSR1);
42 }
43 globalcount += *counterp[t];
44 *counterp[t] = 0;
45 globalreserve -= *countermaxp[t];
46 *countermaxp[t] = 0;
47 ACCESS_ONCE(*theftp[t]) = THEFT_IDLE;
48 }
49 }
50
51 static void balance_count(void)
52 {
53 countermax = globalcountmax -
54 globalcount - globalreserve;
55 countermax /= num_online_threads();
56 if (countermax > MAX_COUNTERMAX)
57 countermax = MAX_COUNTERMAX;
58 globalreserve += countermax;
59 counter = countermax / 2;
60 if (counter > globalcount)
61 counter = globalcount;
62 globalcount -= counter;
63 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Value-Migration Functions}
\label{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
shows the functions responsible for migrating counts between per-thread
variables and the global variables.
Lines~1-7 show \co{globalize_count()}, which is identical to earlier
implementations.
Lines~9-19 show \co{flush_local_count_sig()}, which is the signal
handler used in the theft process.
Lines~11 and 12 check to see if the \co{theft} state is REQ, and, if not,
line~12 returns without change.
Line~13 executes a memory barrier to ensure that the sampling of the
\co{theft} variable happens before any change to that variable.
Line~14 sets the \co{theft} state to ACK, and, if line~15 sees that
this thread's fastpaths are not running, line~16 sets the \co{theft}
state to READY.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
function \co{flush_local_count_sig()}, why are there
\co{ACCESS_ONCE()} wrappers around the uses of the
\co{theft} per-thread variable?
\QuickQuizAnswer{
The first one (on line~11) can be argued to be unnecessary.
The last two (lines~14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines~14-17 as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This would be fatal, as the slowpath might see the transient
value of \co{THEFT_READY}, and start stealing before the
corresponding thread was ready.
} \QuickQuizEnd
Lines~21-49 show \co{flush_local_count()}, which is called from the
slowpath to flush all threads' local counts.
The loop spanning lines~26-34 advances the \co{theft} state for each
thread that has local count, and also sends that thread a signal.
Line~27 skips any non-existent threads.
Otherwise, line~28 checks to see if the current thread holds any local
count, and, if not, line~29 sets the thread's \co{theft} state to READY
and line~30 skips to the next thread.
Otherwise, line~32 sets the thread's \co{theft} state to REQ and
line~33 sends the thread a signal.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why is it safe for line~28 to directly access the other thread's
\co{countermax} variable?
\QuickQuizAnswer{
Because the other thread is not permitted to change the value
of its \co{countermax} variable unless it holds the
\co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its \co{countermax} variable.
We can therefore safely access it---but not change it.
} \QuickQuizEnd
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why doesn't line~33 check for the current thread sending itself
a signal?
\QuickQuizAnswer{
There is no need for an additional check.
The caller of \co{flush_local_count()} has already invoked
\co{globalize_count()}, so the check on line~28 will have
succeeded, skipping the later \co{pthread_kill()}.
} \QuickQuizEnd
\QuickQuiz{}
The code in
	Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
\QuickQuizAnswer{
The \co{theft} variable must be of type \co{sig_atomic_t}
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
} \QuickQuizEnd
The loop spanning lines~35-48 waits until each thread reaches READY state,
then steals that thread's count.
Lines~36-37 skip any non-existent threads, and the loop spanning
lines~38-42 waits until the current thread's \co{theft} state becomes READY.
Line~39 blocks for a millisecond to avoid priority-inversion problems,
and if line~40 determines that the thread's signal has not yet arrived,
line~41 resends the signal.
Execution reaches line~43 when the thread's \co{theft} state becomes
READY, so lines~43-46 do the thieving.
Line~47 then sets the thread's \co{theft} state back to IDLE.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
\QuickQuizAnswer{
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but
irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
\emph{Your} user application hanging!
} \QuickQuizEnd
Lines~51-63 show \co{balance_count()}, which is similar to that of
earlier examples.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 int fastpath = 0;
4
5 counting = 1;
6 barrier();
7 if (countermax - counter >= delta &&
8 ACCESS_ONCE(theft) <= THEFT_REQ) {
9 counter += delta;
10 fastpath = 1;
11 }
12 barrier();
13 counting = 0;
14 barrier();
15 if (ACCESS_ONCE(theft) == THEFT_ACK) {
16 smp_mb();
17 ACCESS_ONCE(theft) = THEFT_READY;
18 }
19 if (fastpath)
20 return 1;
21 spin_lock(&gblcnt_mutex);
22 globalize_count();
23 if (globalcountmax - globalcount -
24 globalreserve < delta) {
25 flush_local_count();
26 if (globalcountmax - globalcount -
27 globalreserve < delta) {
28 spin_unlock(&gblcnt_mutex);
29 return 0;
30 }
31 }
32 globalcount += delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Add Function}
\label{fig:count:Signal-Theft Limit Counter Add Function}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
38 int sub_count(unsigned long delta)
39 {
40 int fastpath = 0;
41
42 counting = 1;
43 barrier();
44 if (counter >= delta &&
45 ACCESS_ONCE(theft) <= THEFT_REQ) {
46 counter -= delta;
47 fastpath = 1;
48 }
49 barrier();
50 counting = 0;
51 barrier();
52 if (ACCESS_ONCE(theft) == THEFT_ACK) {
53 smp_mb();
54 ACCESS_ONCE(theft) = THEFT_READY;
55 }
56 if (fastpath)
57 return 1;
58 spin_lock(&gblcnt_mutex);
59 globalize_count();
60 if (globalcount < delta) {
61 flush_local_count();
62 if (globalcount < delta) {
63 spin_unlock(&gblcnt_mutex);
64 return 0;
65 }
66 }
67 globalcount -= delta;
68 balance_count();
69 spin_unlock(&gblcnt_mutex);
70 return 1;
71 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Subtract Function}
\label{fig:count:Signal-Theft Limit Counter Subtract Function}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Add Function}
shows the \co{add_count()} function.
The fastpath spans lines~5-20, and the slowpath lines~21-35.
Line~5 sets the per-thread \co{counting} variable to 1 so that
any subsequent signal handlers interrupting this thread will
set the \co{theft} state to ACK rather than READY, allowing this
fastpath to complete properly.
Line~6 prevents the compiler from reordering any of the fastpath body
to precede the setting of \co{counting}.
Lines~7 and 8 check to see if the per-thread data can accommodate
the \co{add_count()} and that there is no theft in progress,
and, if so, line~9 does the fastpath addition and line~10 notes that
the fastpath was taken.
In either case, line~12 prevents the compiler from reordering the
fastpath body to follow line~13, which permits any subsequent signal
handlers to undertake theft.
Line~14 again disables compiler reordering, and then line~15
checks to see if the signal handler deferred the \co{theft}
state-change to READY, and, if so, line~16 executes a memory
barrier to ensure that any CPU that sees line~17 setting state to
READY also sees the effects of line~9.
If the fastpath addition at line~9 was executed, then line~20 returns
success.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long read_count(void)
2 {
3 int t;
4 unsigned long sum;
5
6 spin_lock(&gblcnt_mutex);
7 sum = globalcount;
8 for_each_thread(t)
9 if (counterp[t] != NULL)
10 sum += *counterp[t];
11 spin_unlock(&gblcnt_mutex);
12 return sum;
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Read Function}
\label{fig:count:Signal-Theft Limit Counter Read Function}
\end{figure}
Otherwise, we fall through to the slowpath starting at line~21.
The structure of the slowpath is similar to those of earlier examples,
so its analysis is left as an exercise to the reader.
Similarly, the structure of \co{sub_count()} in
Figure~\ref{fig:count:Signal-Theft Limit Counter Subtract Function}
is the same
as that of \co{add_count()}, so the analysis of \co{sub_count()} is also
left as an exercise for the reader, as is the analysis of
\co{read_count()} in
Figure~\ref{fig:count:Signal-Theft Limit Counter Read Function}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void count_init(void)
2 {
3 struct sigaction sa;
4
5 sa.sa_handler = flush_local_count_sig;
6 sigemptyset(&sa.sa_mask);
7 sa.sa_flags = 0;
8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
9 perror("sigaction");
10 exit(-1);
11 }
12 }
13
14 void count_register_thread(void)
15 {
16 int idx = smp_thread_id();
17
18 spin_lock(&gblcnt_mutex);
19 counterp[idx] = &counter;
20 countermaxp[idx] = &countermax;
21 theftp[idx] = &theft;
22 spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&gblcnt_mutex);
30 globalize_count();
31 counterp[idx] = NULL;
32 countermaxp[idx] = NULL;
33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Initialization Functions}
\label{fig:count:Signal-Theft Limit Counter Initialization Functions}
\end{figure}
Lines~1-12 of
Figure~\ref{fig:count:Signal-Theft Limit Counter Initialization Functions}
show \co{count_init()}, which sets up \co{flush_local_count_sig()}
as the signal handler for \co{SIGUSR1},
enabling the \co{pthread_kill()} calls in \co{flush_local_count()}
to invoke \co{flush_local_count_sig()}.
The code for thread registry and unregistry is similar to that of
earlier examples, so its analysis is left as an exercise for the
reader.
\subsection{Signal-Theft Limit Counter Discussion}
The signal-theft implementation runs more than twice as fast as the
atomic implementation on my Intel Core Duo laptop.
Is it always preferable?
The signal-theft implementation would be vastly preferable on Pentium-4
systems, given their slow atomic instructions, but the old 80386-based
Sequent Symmetry systems would do much better with the shorter path
length of the atomic implementation.
However, this increased update-side performance comes at the
price of higher read-side overhead: Those POSIX signals are not free.
If ultimate performance is of the essence, you will need to measure
both implementations on the system on which your application will be deployed.
\QuickQuiz{}
Not only are POSIX signals slow, sending one to each thread
simply does not scale.
What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
\QuickQuizAnswer{
One approach is to use the techniques shown in
Section~\ref{sec:count:Eventually Consistent Implementation},
summarizing an approximation to the overall counter value in
a single variable.
Another approach would be to use multiple threads to carry
out the reads, with each such thread interacting with a
specific subset of the updating threads.
} \QuickQuizEnd
This is but one reason why high-quality APIs are so important:
they permit implementations to be changed as required by ever-changing
hardware performance characteristics.
\QuickQuiz{}
What if you want an exact limit counter to be exact only for
its lower limit, but to allow the upper limit to be inexact?
\QuickQuizAnswer{
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
} \QuickQuizEnd
\section{Applying Specialized Parallel Counters}
\label{sec:count:Applying Specialized Parallel Counters}
Although the exact limit counter implementations in
Section~\ref{sec:count:Exact Limit Counters}
can be very useful, they are not much help if the counter's value
remains near zero at all times, as it might when counting the number
of outstanding accesses to an I/O device.
The high overhead of such near-zero counting is especially painful
given that we normally don't care how many references there are.
As noted in the removable I/O device access-count problem posed by
\QuickQuizRef{\QcountQIOcnt},
the number of accesses is irrelevant except in those rare cases when
someone is actually trying to remove the device.
One simple solution to this problem is to add a large ``bias''
(for example, one billion) to the
counter in order to ensure that the value is far enough from zero that
the counter can operate efficiently.
When someone wants to remove the device, this bias is subtracted from
the counter value.
Counting the last few accesses will be quite inefficient,
but the important point is that the many prior accesses will have been
counted at full speed.
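For example, here is a minimal sketch of this bias technique,
reusing this chapter's \co{add_count()} and \co{sub_count()}
primitives.
The \co{ACCESS_BIAS} constant and both function names are
hypothetical:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ACCESS_BIAS 1000000000UL /* hypothetical */

void bias_apply(void) /* call before any I/O */
{
  if (!add_count(ACCESS_BIAS)) {
    fprintf(stderr, "limit too small for bias\n");
    exit(-1);
  }
}

void bias_remove(void) /* call at device removal */
{
  sub_count(ACCESS_BIAS);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}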
\QuickQuiz{}
What else had you better have done when using a biased counter?
\QuickQuizAnswer{
You had better have set the upper limit to be large enough to
accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
} \QuickQuizEnd
Although a biased counter can be quite helpful, it is only a
partial solution to the removable I/O device access-count problem
called out on
page~\pageref{chp:Counting}.
When attempting to remove a device, we must not only know the precise
number of current I/O accesses, we also need to prevent any future
accesses from starting.
One way to accomplish this is to read-acquire a reader-writer lock
when updating the counter, and to write-acquire that same reader-writer
lock when checking the counter.
Code for doing I/O might be as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 read_lock(&mylock);
2 if (removing) {
3 read_unlock(&mylock);
4 cancel_io();
5 } else {
6 add_count(1);
7 read_unlock(&mylock);
8 do_io();
9 sub_count(1);
10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~1 read-acquires the lock, and either line~3 or 7 releases it.
Line~2 checks to see if the device is being removed, and, if so,
line~3 releases the lock and line~4 cancels the I/O, or takes whatever
action is appropriate given that the device is to be removed.
Otherwise, line~6 increments the access count, line~7 releases the
lock, line~8 performs the I/O, and line~9 decrements the access count.
\QuickQuiz{}
This is ridiculous!
We are \emph{read}-acquiring a reader-writer lock to
\emph{update} the counter?
What are you playing at???
\QuickQuizAnswer{
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
} \QuickQuizEnd
The code to remove the device might be as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0) {
6 poll(NULL, 0, 1);
7 }
8 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~1 write-acquires the lock and line~4 releases it.
Line~2 notes that the device is being removed, and the loop spanning
lines~5-7 waits for any I/O operations to complete.
Finally, line~8 does any additional processing needed to prepare for
device removal.
\QuickQuiz{}
What other issues would need to be accounted for in a real system?
\QuickQuizAnswer{
A huge number!
Here are a few to start with:
\begin{enumerate}
\item There could be any number of devices, so that the
global variables are inappropriate, as are the
lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems.
In many cases, it is far better to have the last
completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely
need a return value.
\item If the device fails, the last I/O might never complete.
In such cases, there might need to be some sort of
timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can
fail, but their return values are not checked.
\item Reader-writer locks do not scale well.
One way of avoiding the high read-acquisition costs
of reader-writer locks is presented in
Chapters~\ref{chp:Locking}
and~\ref{chp:Deferred Processing}.
\item The polling loops result in poor energy efficiency.
An event-driven design is preferable.
\end{enumerate}
} \QuickQuizEnd
\section{Parallel Counting Discussion}
\label{sec:count:Parallel Counting Discussion}
This chapter has presented the reliability, performance, and
scalability problems with traditional counting primitives.
The C-language \co{++} operator is not guaranteed to function reliably in
multithreaded code, and atomic operations on a single variable neither
perform nor scale well.
This chapter therefore presented a number of counting algorithms that
perform and scale extremely well in certain special cases.
It is well worth reviewing the lessons from these counting algorithms.
To that end,
Section~\ref{sec:count:Parallel Counting Performance}
summarizes performance and scalability,
Section~\ref{sec:count:Parallel Counting Specializations}
discusses the need for specialization,
and finally,
Section~\ref{sec:count:Parallel Counting Lessons}
enumerates lessons learned and calls attention to later chapters that
will expand on these lessons.
\subsection{Parallel Counting Performance}
\label{sec:count:Parallel Counting Performance}
\begin{table*}
\small
\centering
\begin{tabular}{l|r|r|r|r}
& & & \multicolumn{2}{c}{Reads} \\
\cline{4-5}
Algorithm & Section & Updates & 1 Core & 32 Cores \\
\hline
\hline
\path{count_stat.c} & \ref{sec:count:Array-Based Implementation} &
11.5 ns & 408 ns & 409 ns \\
\path{count_stat_eventual.c} & \ref{sec:count:Eventually Consistent Implementation} &
11.6 ns & 1 ns & 1 ns \\
\path{count_end.c} & \ref{sec:count:Per-Thread-Variable-Based Implementation} &
6.3 ns & 389 ns & 51,200 ns \\
\path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} &
5.7 ns & 354 ns & 501 ns \\
\end{tabular}
\caption{Statistical Counter Performance on Power-6}
\label{tab:count:Statistical Counter Performance on Power-6}
\end{table*}
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
shows the performance of the four parallel statistical counting
algorithms.
All four algorithms provide near-perfect linear scalability for updates.
The per-thread-variable implementation (\path{count_end.c})
is significantly faster on
updates than the array-based implementation
(\path{count_stat.c}), but is slower at reads on large numbers of cores,
and suffers severe lock contention when there are many parallel readers.
This contention can be addressed using the deferred-processing
techniques introduced in
Chapter~\ref{chp:Deferred Processing},
as shown on the \path{count_end_rcu.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}.
Deferred processing also shines on the \path{count_stat_eventual.c} row,
courtesy of eventual consistency.
\QuickQuiz{}
On the \path{count_stat.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
we see that the read-side scales linearly with the number of
threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
\QuickQuizAnswer{
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding \co{__thread} variable.
} \QuickQuizEnd
\QuickQuiz{}
Even on the last row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
\QuickQuizAnswer{
``Use the right tool for the job.''
As can be seen from
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, an eventually consistent design
featuring a single atomically incremented
variable that can be read out using a single load,
similar to the approach used in
Section~\ref{sec:count:Eventually Consistent Implementation}.
} \QuickQuizEnd
\begin{table*}
\small
\centering
\begin{tabular}{l|r|c|r|r|r}
& & & & \multicolumn{2}{c}{Reads} \\
\cline{5-6}
Algorithm & Section & Exact? & Updates & 1 Core & 64 Cores \\
\hline
\hline
\path{count_lim.c} & \ref{sec:count:Simple Limit Counter Implementation} &
N & 3.6 ns & 375 ns & 50,700 ns \\
\path{count_lim_app.c} & \ref{sec:count:Approximate Limit Counter Implementation} &
N & 11.7 ns & 369 ns & 51,000 ns \\
\path{count_lim_atomic.c} & \ref{sec:count:Atomic Limit Counter Implementation} &
Y & 51.4 ns & 427 ns & 49,400 ns \\
\path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} &
Y & 10.2 ns & 370 ns & 54,000 ns \\
\end{tabular}
\caption{Limit Counter Performance on Power-6}
\label{tab:count:Limit Counter Performance on Power-6}
\end{table*}
Table~\ref{tab:count:Limit Counter Performance on Power-6}
shows the performance of the parallel limit-counting algorithms.
Exact enforcement of the limits incurs a substantial performance
penalty, although on this 4.7GHz Power-6 system that penalty can be reduced
by substituting signals for atomic operations.
All of these implementations suffer from read-side lock contention
in the face of concurrent readers.
\QuickQuiz{}
Given the performance data shown in
Table~\ref{tab:count:Limit Counter Performance on Power-6},
we should always prefer signals over atomic operations, right?
\QuickQuizAnswer{
That depends on the workload.
Note that on a 64-core system, you need more than
one hundred non-atomic operations (with roughly
a 40-nanosecond performance gain) to make up for even one
signal (with almost a 5-\emph{microsecond} performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular
workload.
In addition, although memory barriers have historically been
expensive compared to ordinary instructions, you should
check this on the specific hardware you will be running.
The properties of computer hardware do change over time,
and algorithms must change accordingly.
} \QuickQuizEnd
\QuickQuiz{}
Can advanced techniques be applied to address the lock
contention for readers seen in
Table~\ref{tab:count:Limit Counter Performance on Power-6}?
\QuickQuizAnswer{
One approach is to give up some update-side performance, as is
done with scalable non-zero indicators
(SNZI)~\cite{FaithEllen:2007:SNZI}.
There are a number of other ways one might go about this, and these
are left as exercises for the reader.
Any number of approaches that apply hierarchy, replacing frequent
global-lock acquisitions with local lock acquisitions at lower levels
of the hierarchy, should work quite well.
} \QuickQuizEnd
In short, this chapter has demonstrated a number of counting algorithms
that perform and scale extremely well in a number of special cases.
But must our parallel counting be confined to special cases?
Wouldn't it be better to have a general algorithm that operated
efficiently in all cases?
The next section looks at these questions.
\subsection{Parallel Counting Specializations}
\label{sec:count:Parallel Counting Specializations}
The fact that these algorithms only work well in their respective special
cases might be considered a major problem with parallel programming in
general.
After all, the C-language \co{++} operator works just fine in single-threaded
code, and not just for special cases, but in general, right?
This line of reasoning does contain a grain of truth, but is in essence
misguided.
The problem is not parallelism as such, but rather scalability.
To understand this, first consider the C-language \co{++} operator.
The fact is that it does \emph{not} work in general, only for a restricted
range of numbers.
If you need to deal with 1,000-digit decimal numbers, the C-language \co{++}
operator will not work for you.
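To see this, note that even a modest fixed-width counter silently
wraps.
The following trivial fragment is hypothetical, and relies on the
well-defined wrapping behavior of C's unsigned types:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void overflow_demo(void)
{
  unsigned char c = 255; /* largest 8-bit value */

  c++; /* wraps to zero: ++ "works" only
        * within the type's range */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}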
\QuickQuiz{}
The \co{++} operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
\QuickQuizAnswer{
In the C++ language, you might well be able to use \co{++}
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
} \QuickQuizEnd
This problem is not specific to arithmetic.
Suppose you need to store and query data.
Should you use an ASCII file?
XML?
A relational database?
A linked list?
A dense array?
A B-tree?
A radix tree?
Or one of the plethora of other data
structures and environments that permit data to be stored and queried?
It depends on what you need to do, how fast you need it done, and how
large your data set is---even on sequential systems.
Similarly, if you need to count, your solution will depend on how large
a number you need to work with, how many CPUs need to be manipulating
a given number concurrently, how the number is to be used, and what
level of performance and scalability you will need.
Nor is this problem specific to software.
The design for a bridge meant to allow people to walk across a small brook
might be as simple as a single wooden plank.
But you would probably not use a plank to span the kilometers-wide mouth of
the Columbia River, nor would such a design be advisable for bridges
carrying concrete trucks.
In short, just as bridge design must change with increasing span and load,
so must software design change as the number of CPUs increases.
That said, it would be good to automate this process, so that the
software adapts to changes in hardware configuration and in workload.
There has in fact been some research into this sort of
automation~\cite{Appavoo03a,Soules03a}, and the Linux kernel does some
boot-time reconfiguration, including limited binary rewriting.
This sort of adaptation will become increasingly important as the
number of CPUs on mainstream systems continues to increase.
In short, as discussed in
Chapter~\ref{chp:Hardware and its Habits},
the laws of physics constrain parallel software just as surely as they
constrain mechanical artifacts such as bridges.
These constraints force specialization, though in the case of software
it might be possible to automate the choice of specialization to
fit the hardware and workload in question.
Of course, even generalized counting is quite specialized.
We need to do a great number of other things with computers.
The next section relates what we have learned from counters to
topics taken up later in this book.
\subsection{Parallel Counting Lessons}
\label{sec:count:Parallel Counting Lessons}
The opening paragraph of this chapter promised that our study of counting
would provide an excellent introduction to parallel programming.
This section makes explicit connections between the lessons from
this chapter and the material presented in a number of later chapters.
The examples in this chapter have shown that an important scalability
and performance tool is \emph{partitioning}.
The counters might be fully partitioned, as in the statistical counters
discussed in Section~\ref{sec:count:Statistical Counters},
or partially partitioned as in the limit counters discussed in
Sections~\ref{sec:count:Approximate Limit Counters} and
\ref{sec:count:Exact Limit Counters}.
Partitioning will be considered in far greater depth in
Chapter~\ref{cha:Partitioning and Synchronization Design},
and partial parallelization in particular in
Section~\ref{sec:SMPdesign:Parallel Fastpath}, where it is called
\emph{parallel fastpath}.
\QuickQuiz{}
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
\QuickQuizAnswer{
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism:
\begin{enumerate}
\item Only the most performance-critical portions of the
application must be partitioned, and such portions
are usually a small fraction of the application.
\item Although cache misses are quite slow compared to
individual register-to-register instructions,
they are typically considerably faster than
inter-process-communication primitives, which in
turn are considerably faster than things like
TCP/IP networking.
\item Shared-memory multiprocessors are readily available
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of
shared-memory parallelism.
\end{enumerate}
As always, use the right tool for the job!
} \QuickQuizEnd
The partially partitioned counting algorithms used locking to
guard the global data, and locking is the subject of
Chapter~\ref{chp:Locking}.
In contrast, the partitioned data tended to be fully under the control of
the corresponding thread, so that no synchronization whatsoever was required.
This \emph{data ownership} will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}
and discussed in more detail in
Chapter~\ref{chp:Data Ownership}.
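The following condensed sketch distills the update side of this
chapter's statistical counters down to the essence of data ownership;
the increment requires no synchronization because only the owning
thread ever writes its counter:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
__thread unsigned long counter; /* owned by
                                 * this thread */

static inline void inc_count(void)
{
  counter++; /* owner-only write, so no
              * synchronization is needed */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}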
Because integer addition and subtraction are extremely cheap operations
compared to typical synchronization operations, achieving reasonable
scalability requires that synchronization operations be used sparingly.
One way of achieving this is to batch the addition and subtraction
operations, so that a great many of these cheap operations are handled
by a single synchronization operation.
Batching optimizations of one sort or another are used by each of
the counting algorithms listed in
Tables~\ref{tab:count:Statistical Counter Performance on Power-6}
and~\ref{tab:count:Limit Counter Performance on Power-6}.
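For example, the following minimal sketch of batching (the
\co{BATCH_SIZE} constant and the \co{inc_count_batched()} function
are hypothetical) pays for one atomic operation per \co{BATCH_SIZE}
cheap thread-local increments:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define BATCH_SIZE 100 /* hypothetical */

atomic_t global_count;
__thread int batch;

void inc_count_batched(void)
{
  if (++batch >= BATCH_SIZE) {
    atomic_add(batch, &global_count);
    batch = 0; /* one expensive operation
                * amortized over the batch */
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Of course, as with this chapter's other batched algorithms, a thread
must flush any partial batch before exiting, for example, from
\co{count_unregister_thread()}.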
Finally, the eventually consistent statistical counter discussed in
Section~\ref{sec:count:Eventually Consistent Implementation}
showed how deferring activity (in that case, updating the global
counter) can provide substantial performance and scalability benefits.
This approach allows common-case code to use much cheaper synchronization
operations than would otherwise be possible.
Chapter~\ref{chp:Deferred Processing} will examine a number of additional
ways that deferral can improve performance, scalability, and even
real-time response.
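The following condensed sketch shows the key idea; it assumes this
book's per-thread-variable API and is not the verbatim implementation
from that section.
A housekeeping thread periodically folds the per-thread counters into
a single variable that readers can then fetch with a single load:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
unsigned long global_count; /* one-load read */
DEFINE_PER_THREAD(unsigned long, counter);

void *eventual(void *arg) /* housekeeping */
{
  int t;
  unsigned long sum;

  for (;;) {
    sum = 0;
    for_each_thread(t)
      sum += per_thread(counter, t);
    global_count = sum; /* racy, but eventually
                         * consistent by design */
    poll(NULL, 0, 1); /* defer: ~1ms refresh */
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}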
Summarizing the summary:
\begin{enumerate}
\item Partitioning promotes performance and scalability.
\item Partial partitioning, that is, partitioning applied only to
common code paths, works almost as well.
\item Partial partitioning can be applied to code (as in
Section~\ref{sec:count:Statistical Counters}'s statistical
counters' partitioned updates and non-partitioned reads), but also
across time (as in
Section~\ref{sec:count:Approximate Limit Counters}'s and
Section~\ref{sec:count:Exact Limit Counters}'s
limit counters running fast when far from
the limit, but slowly when close to the limit).
\item Partitioning across time often batches updates locally
in order to reduce the number of expensive global operations,
thereby decreasing synchronization overhead, in turn
improving performance and scalability.
All the algorithms shown in
Tables~\ref{tab:count:Statistical Counter Performance on Power-6}
and~\ref{tab:count:Limit Counter Performance on Power-6}
make heavy use of batching.
\item Read-only code paths should remain read-only: Spurious
synchronization writes to shared memory kill performance
and scalability, as seen in the \path{count_end.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}.
\item Judicious use of delay promotes performance and scalability, as
seen in Section~\ref{sec:count:Eventually Consistent Implementation}.
\item Parallel performance and scalability are usually a balancing act:
Beyond a certain point, optimizing some code paths will degrade
others.
The \path{count_stat.c} and \path{count_end_rcu.c} rows of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
illustrate this point.
\item Different levels of performance and scalability will affect
algorithm and data-structure design, as do a large number of
other factors.
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
illustrates this point: Atomic increment might be completely
acceptable for a two-CPU system, but be completely inadequate for an
eight-CPU system.
\end{enumerate}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/FourTaskOrderOpt}}
\caption{Optimization and the Four Parallel-Programming Tasks}
\label{fig:count:Optimization and the Four Parallel-Programming Tasks}
\end{figure}
Summarizing still further, we have the ``big three'' methods of
increasing performance and scalability, namely
(1)~\emph{partitioning} over CPUs or threads,
(2)~\emph{batching} so that more work can be done by each expensive
synchronization operation, and
(3)~\emph{weakening} synchronization operations where feasible.
As a rough rule of thumb, you should apply these methods in this order,
as was noted earlier in the discussion of
Figure~\ref{fig:intro:Ordering of Parallel-Programming Tasks}
on
page~\pageref{fig:intro:Ordering of Parallel-Programming Tasks}.
The partitioning optimization applies to the
``Resource Partitioning and Replication'' bubble,
the batching optimization to the ``Work Partitioning'' bubble,
and the weakening optimization to the ``Parallel Access Control'' bubble,
as shown in
Figure~\ref{fig:count:Optimization and the Four Parallel-Programming Tasks}.
Of course, if you are using special-purpose hardware such as
digital signal processors (DSPs), field-programmable gate arrays (FPGAs),
or general-purpose graphical processing units (GPGPUs), you may need
to pay close attention to the ``Interacting With Hardware'' bubble
throughout the design process.
For example, the structure of a GPGPU's hardware threads and memory
connectivity might richly reward very careful partitioning
and batching design decisions.
In short, as noted at the beginning of this chapter, the simplicity
of counting has allowed us to explore many
fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures.
Such synchronization primitives and data structures are covered
in later chapters.