% count/count.tex
\QuickQuizChapter{chp:Counting}{Counting}
%
\Epigraph{As easy as 1, 2, 3!}{\emph{Unknown}}
Counting is perhaps the simplest and most natural thing a computer can do.
However, counting efficiently and scalably on a large
shared-memory multiprocessor can be quite challenging.
Furthermore, the simplicity of the underlying concept of counting
allows us to explore the fundamental issues of concurrency without
the distractions
of elaborate data structures or complex synchronization primitives.
Counting therefore provides an excellent introduction to
parallel programming.
This chapter covers a number of special cases for which there are simple,
fast, and scalable counting algorithms.
But first, let us find out how much you already know about concurrent
counting.
\QuickQuiz{}
Why on earth should efficient and scalable counting be hard?
After all, computers have special hardware for the sole purpose
of doing counting,
addition, subtraction, and lots more besides, don't they???
\QuickQuizAnswer{
Because the straightforward counting algorithms, for example,
atomic operations on a shared counter, either are slow and scale
badly, or are inaccurate, as will be seen in
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
} \QuickQuizEnd
\QuickQuiz{}
{ \bfseries Network-packet counting problem. }
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
\QuickQuizAnswer{
Hint: The act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate---after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1\% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section~\ref{sec:count:Statistical Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQstatcnt}
\QuickQuiz{}
{ \bfseries Approximate structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
\QuickQuizAnswer{
Hint: The act of updating the counter must again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it must distinguish approximately
between values below the limit and values greater than or
equal to the limit.
See Section~\ref{sec:count:Approximate Limit Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQapproxcnt}
\QuickQuiz{}
{ \bfseries Exact structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(again, say 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that there is almost
always at least one structure in use.
Suppose further still that it is necessary to know exactly when
this counter reaches zero, for example, in order to free up some
memory that is not required unless there is at least one structure
in use.
\QuickQuizAnswer{
Hint: The act of updating the counter must once again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section~\ref{sec:count:Exact Limit Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQexactcnt}
\QuickQuiz{}
{ \bfseries Removable I/O device access-count problem. }
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to remove the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
\QuickQuizAnswer{
Hint: Yet again, the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when
the device is in the process of being removed.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
} \QuickQuizEnd
\QuickQuizLabel{\QcountQIOcnt}
The remainder of this chapter will develop answers to these questions.
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
asks why counting on multicore systems isn't trivial, and
Section~\ref{sec:count:Statistical Counters}
looks into ways of solving the network-packet counting problem.
Section~\ref{sec:count:Approximate Limit Counters}
investigates the approximate structure-allocation limit problem, while
Section~\ref{sec:count:Exact Limit Counters}
takes on the exact structure-allocation limit problem.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses how to use the various specialized parallel counters
introduced in the preceding sections.
Finally, Section~\ref{sec:count:Parallel Counting Discussion}
concludes the chapter with performance measurements.
Sections~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}
and~\ref{sec:count:Statistical Counters}
contain introductory material, while the remaining sections
are more appropriate for advanced students.
\section{Why Isn't Concurrent Counting Trivial?}
\label{sec:count:Why Isn't Concurrent Counting Trivial?}
\begin{figure}[bp]
{ \scriptsize
\begin{verbbox}
1 long counter = 0;
2
3 void inc_count(void)
4 {
5 counter++;
6 }
7
8 long read_count(void)
9 {
10 return counter;
11 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Just Count!}
\label{fig:count:Just Count!}
\end{figure}
Let's start with something simple, for example, the straightforward
use of arithmetic shown in
Figure~\ref{fig:count:Just Count!} (\path{count_nonatomic.c}).
Here, we have a counter on line~1, we increment it on line~5, and we
read out its value on line~10.
What could be simpler?
This approach has the additional advantage of being blazingly fast if
you are doing lots of reading and almost no incrementing, and on small
systems, the performance is excellent.
There is just one large fly in the ointment: this approach can lose
counts.
On my dual-core laptop, a short run invoked \co{inc_count()}
100,014,000 times, but the final value of the counter was only
52,909,118.
Although approximate values do have their place in computing,
accuracies far greater than 50\% are almost always necessary.
\QuickQuiz{}
But doesn't the \co{++} operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
\QuickQuizAnswer{
Although the \co{++} operator \emph{could} be atomic, there
is no requirement that it be so.
And indeed, \co{gcc} often
chooses to load the value to a register, increment
the register, then store the value to memory, which is
decidedly non-atomic.
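For example, although the generated code varies with compiler
and flags, the non-atomic expansion of \co{counter++} typically
behaves like the following sketch:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
long tmp;

tmp = counter;  /* load */
tmp = tmp + 1;  /* increment */
counter = tmp;  /* store */
\end{verbatim}
\end{minipage}
\vspace{5pt}

If two threads each load the same value of \co{counter} before
either stores, one of the two increments is lost, which accounts
for the missing counts noted earlier.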
} \QuickQuizEnd
\QuickQuiz{}
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
\QuickQuizAnswer{
Not only are there very few
trivial parallel programs, and most days I am
not so sure that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested
it, it does not work.
And even if you have tested it, Murphy's Law says that there will
be at least a few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their
place, they never will replace testing, including the
\path{counttorture.h} test setup used here.
After all, proofs are only as good as the assumptions that they
are based on.
Furthermore, proofs can have bugs just as easily as programs can!
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic_t counter = ATOMIC_INIT(0);
2
3 void inc_count(void)
4 {
5 atomic_inc(&counter);
6 }
7
8 long read_count(void)
9 {
10 return atomic_read(&counter);
11 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Just Count Atomically!}
\label{fig:count:Just Count Atomically!}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/count/atomic}}
\caption{Atomic Increment Scalability on Nehalem}
\label{fig:count:Atomic Increment Scalability on Nehalem}
\end{figure}
The straightforward way to count accurately is to use atomic operations,
as shown in
Figure~\ref{fig:count:Just Count Atomically!} (\path{count_atomic.c}).
Line~1 defines an atomic variable, line~5 atomically increments it, and
line~10 reads it out.
Because this is atomic, it keeps perfect count.
However, it is slower: on an Intel Core Duo laptop, it is about
six times slower than non-atomic increment
when a single thread is incrementing, and more than \emph{ten times}
slower if two threads are incrementing.\footnote{
Interestingly enough, a pair of threads non-atomically incrementing
a counter will cause the counter to increase more quickly than
a pair of threads atomically incrementing the counter.
Of course, if your only goal is to make the counter increase
quickly, an easier approach is to simply assign a large value
to the counter.
Nevertheless, there is likely to be a role for algorithms that
use carefully relaxed notions of correctness in order to gain
greater performance and
scalability~\cite{Andrews91textbook,Arcangeli03,DavidUngar2011unsync}.}
This poor performance should not be a surprise, given the discussion in
Chapter~\ref{chp:Hardware and its Habits},
nor should it be a surprise that the performance of atomic increment
gets slower as the number of CPUs and threads increases, as shown in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}.
In this figure, the horizontal dashed line resting on the x~axis
is the ideal performance that would be achieved
by a perfectly scalable algorithm: with such an algorithm, a given
increment would incur the same overhead that it would in a single-threaded
program.
Atomic increment of a single global variable is clearly
non-ideal, and gets worse as you add CPUs.
\QuickQuiz{}
Why doesn't the dashed line on the x~axis meet the
diagonal line at $x=1$?
\QuickQuizAnswer{
Because of the overhead of the atomic operation.
The dashed line on the x~axis represents the overhead of
a single \emph{non-atomic} increment.
After all, an \emph{ideal} algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good
enough for Linus Torvalds, it is good enough for you.
} \QuickQuizEnd
\QuickQuiz{}
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me; after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
\QuickQuizAnswer{
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessors.
In addition, as noted at the beginning of this chapter,
counting provides an excellent view of the
issues encountered in shared-memory parallel programs.
} \QuickQuizEnd
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/GlobalInc}}
\caption{Data Flow For Global Atomic Increment}
\label{fig:count:Data Flow For Global Atomic Increment}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/r-2014-One-one-thousand}}
\caption{Waiting to Count}
\ContributedBy{Figure}{fig:count:Waiting to Count}{Melissa Broussard}
\end{figure}
For another perspective on global atomic increment, consider
Figure~\ref{fig:count:Data Flow For Global Atomic Increment}.
In order for each CPU to get a chance to increment a given
global variable, the cache line containing that variable must
circulate among all the CPUs, as shown by the red arrows.
Such circulation will take significant time, resulting in
the poor performance seen in
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
which might be thought of as shown in
Figure~\ref{fig:count:Waiting to Count}.
The following sections discuss high-performance counting, which
avoids the delays inherent in such circulation.
\QuickQuiz{}
But why can't CPU designers simply ship the addition operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
\QuickQuizAnswer{
It might well be possible to do this in some cases.
However, there are a few complications:
\begin{enumerate}
\item If the value of the variable is required, then the
thread will be forced to wait for the operation
to be shipped to the data, and then for the result
to be shipped back.
\item If the atomic increment must be ordered with respect
to prior and/or subsequent operations, then the thread
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation
completed to be shipped back.
\item Shipping operations among CPUs will likely require
more lines in the system interconnect, which will consume
more die area and more electrical power.
\end{enumerate}
But what if neither of the first two conditions holds?
Then you should think carefully about the algorithms discussed
in Section~\ref{sec:count:Statistical Counters}, which achieve
near-ideal performance on commodity hardware.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}}
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
\end{figure}
If either or both of the first two conditions hold, there is
\emph{some} hope for improved hardware.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as $O(\log N)$,
where $N$ is the number of CPUs, as shown in
Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
And CPUs with this sort of hardware optimization are starting to
appear as of 2011.
This is a great improvement over the $O(N)$ performance
of current hardware shown in
Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
and it is possible that hardware latencies might decrease
further if innovations such as three-dimensional fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do \emph{much} better.
} \QuickQuizEnd
\section{Statistical Counters}
\label{sec:count:Statistical Counters}
This section covers the common special case of statistical counters, where
the count is updated extremely frequently and the value is read out
rarely, if ever.
These will be used to solve the network-packet counting problem
posed in \QuickQuizRef{\QcountQstatcnt}.
\subsection{Design}
Statistical counting is typically handled by providing a counter per
thread (or CPU, when running in the kernel), so that each thread
updates its own counter.
The aggregate value of the counters is read out by simply summing up
all of the threads' counters,
relying on the commutative and associative properties of addition.
This is an example of the Data Ownership pattern that will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}.
\QuickQuiz{}
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
\QuickQuizAnswer{
No, because modulo addition is still commutative and associative.
At least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior, never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days.
Unfortunately, compilers frequently carry out optimizations that
assume that signed integers will not overflow, so if your code
allows signed integers to overflow, you can run into trouble
even on twos-complement hardware.
That said, one potential source of additional complexity arises
when attempting to gather (say) a 64-bit sum from 32-bit
per-thread counters.
Dealing with this added complexity is left as
an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful.
} \QuickQuizEnd
\subsection{Array-Based Implementation}
\label{sec:count:Array-Based Implementation}
One way to provide per-thread variables is to allocate an array with
one element per
thread (presumably cache aligned and padded to avoid false sharing).
\QuickQuiz{}
An array???
But doesn't that limit the number of threads?
\QuickQuizAnswer{
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads,
for example, using the \co{gcc} \co{__thread} facility,
as shown in
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_PER_THREAD(long, counter);
2
3 void inc_count(void)
4 {
5 __get_thread_var(counter)++;
6 }
7
8 long read_count(void)
9 {
10 int t;
11 long sum = 0;
12
13 for_each_thread(t)
14 sum += per_thread(counter, t);
15 return sum;
16 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Array-Based Per-Thread Statistical Counters}
\label{fig:count:Array-Based Per-Thread Statistical Counters}
\end{figure}
Such an array can be wrapped into per-thread primitives, as shown in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
(\path{count_stat.c}).
Line~1 defines an array containing a set of per-thread counters of
type \co{long} named, creatively enough, \co{counter}.
Lines~3-6 show a function that increments the counters, using the
\co{__get_thread_var()} primitive to locate the currently running
thread's element of the \co{counter} array.
Because this element is modified only by the corresponding thread,
non-atomic increment suffices.
Lines~8-16 show a function that reads out the aggregate value of the counter,
using the \co{for_each_thread()} primitive to iterate over the list of
currently running threads, and using the \co{per_thread()} primitive
to fetch the specified thread's counter.
Because the hardware can fetch and store a properly aligned \co{long}
atomically, and because gcc is kind enough to make use of this capability,
normal loads suffice, and no special atomic instructions are required.
\QuickQuiz{}
What other choice does gcc have, anyway???
\QuickQuizAnswer{
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a \co{long}.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
Alternatively, use of volatile accesses such as those provided
by \co{ACCESS_ONCE()}~\cite{JonCorbet2012ACCESS:ONCE}
can help constrain the compiler, at least
in cases where the hardware is capable of accessing the value
with a single memory-reference instruction.
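For reference, the Linux kernel's classic definition of
\co{ACCESS_ONCE()} is simply a volatile cast:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ACCESS_ONCE(x) \
        (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}

This forces the compiler to emit exactly one load or store for
each access, though it does nothing to constrain the hardware.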
} \QuickQuizEnd
\QuickQuiz{}
How does the per-thread \co{counter} variable in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
get initialized?
\QuickQuizAnswer{
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of \co{counter}
will be zero.
Furthermore, in the common case where the user is interested only
in differences between consecutive reads
from statistical counters, the initial value is irrelevant.
} \QuickQuizEnd
\QuickQuiz{}
How is the code in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
supposed to permit more than one counter?
\QuickQuizAnswer{
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
} \QuickQuizEnd
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/PerThreadInc}}
\caption{Data Flow For Per-Thread Increment}
\label{fig:count:Data Flow For Per-Thread Increment}
\end{figure}
This approach scales linearly with increasing number of updater threads
invoking \co{inc_count()}.
As is shown by the green arrows on each CPU in
Figure~\ref{fig:count:Data Flow For Per-Thread Increment},
the reason for this is that each CPU can make rapid progress incrementing
its thread's variable, without any expensive cross-system communication.
As such, this section solves the network-packet counting problem presented
at the beginning of this chapter.
\QuickQuiz{}
The read operation takes time to sum up the per-thread values,
and during that time, the counter could well be changing.
This means that the value returned by
\co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
will not necessarily be exact.
Assume that the counter is being incremented at rate
$r$ counts per unit time, and that \co{read_count()}'s
execution consumes $\Delta$ units of time.
What is the expected error in the return value?
\QuickQuizAnswer{
Let's do worst-case analysis first, followed by a less
conservative analysis.
In the worst case, the read operation completes immediately,
but is then delayed for $\Delta$ time units before returning,
in which case the worst-case error is simply $r \Delta$.
This worst-case behavior is rather unlikely, so let us instead
consider the case where the reads from each of the $N$
counters are spaced equally over the time period $\Delta$,
and where the aggregate increment rate $r$ is spread evenly
over the $N$ counters, so that each counter is incremented
at rate $r/N$.
There will be $N+1$ intervals of duration $\frac{\Delta}{N+1}$
between the $N$ reads.
The error due to the delay after the read from the last thread's
counter will be given by $\frac{r \Delta}{N \left( N + 1 \right)}$,
the second-to-last thread's counter by
$\frac{2 r \Delta}{N \left( N + 1 \right)}$,
the third-to-last by
$\frac{3 r \Delta}{N \left( N + 1 \right)}$,
and so on.
The total error is given by the sum of the errors due to the
reads from each thread's counter, which is:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\sum_{i = 1}^N i
\end{equation}
Expressing the summation in closed form yields:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\frac{N \left( N + 1 \right)}{2}
\end{equation}
Cancelling yields the intuitively expected result:
\begin{equation}
\frac{r \Delta}{2}
\label{eq:count:CounterErrorAverage}
\end{equation}
It is important to remember that error continues accumulating
as the caller executes code making use of the count returned
by the read operation.
For example, if the caller spends time $t$ executing some
computation based on the result of the returned count, the
worst-case error will have increased to $r \left(\Delta + t\right)$.
The expected error will have similarly increased to:
\begin{equation}
r \left( \frac{\Delta}{2} + t \right)
\end{equation}
Of course, it is sometimes unacceptable for the counter to
continue incrementing during the read operation.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses a way to handle this situation.
Thus far, we have been considering a counter that is only
increased, never decreased.
If the counter value is being changed by $r$ counts per unit
time, but in either direction, we should expect the error
to reduce.
However, the worst case is unchanged because although the
counter \emph{could} move in either direction, the worst
case is when the read operation completes immediately,
but then is delayed for $\Delta$ time units, during which
time all the changes in the counter's value move it in
the same direction, again giving us an absolute error
of $r \Delta$.
There are a number of ways to compute the average error,
based on a variety of assumptions about the patterns of
increments and decrements.
For simplicity, let's assume that a fraction $f$ of
the operations are decrements, and that the error of interest
is the deviation from the counter's long-term trend line.
Under this assumption, if $f$ is less than or equal to 0.5,
each decrement will be cancelled by an increment, so that
$2f$ of the operations will cancel each other, leaving
$1-2f$ of the operations being uncancelled increments.
On the other hand, if $f$ is greater than 0.5, $1-f$ of
the decrements are cancelled by increments, so that the
counter moves in the negative direction by $-1+2\left(1-f\right)$,
which simplifies to $1-2f$, so that the counter moves an average
of $1-2f$ per operation in either case.
Therefore, the long-term
movement of the counter is given by $\left( 1-2f \right) r$.
Plugging this into
Equation~\ref{eq:count:CounterErrorAverage} yields:
\begin{equation}
\frac{\left( 1 - 2 f \right) r \Delta}{2}
\end{equation}
All that aside, in most uses of statistical counters, the
error in the value returned by \co{read_count()} is
irrelevant.
This irrelevance is due to the fact that the time required
for \co{read_count()} to execute is normally extremely
small compared to the time interval between successive
calls to \co{read_count()}.
} \QuickQuizEnd
However, this excellent update-side scalability comes at great read-side
expense for large numbers of threads.
The next section shows one way to reduce read-side expense while
still retaining the update-side scalability.
\subsection{Eventually Consistent Implementation}
\label{sec:count:Eventually Consistent Implementation}
One way to retain update-side scalability while greatly improving
read-side performance is to weaken consistency requirements.
The counting algorithm in the previous section is guaranteed to
return a value between the value that an ideal counter would have
taken on near the beginning of \co{read_count()}'s execution and
that near the end of \co{read_count()}'s execution.
\emph{Eventual consistency}~\cite{WernerVogels:2009:EventuallyConsistent}
provides a weaker
guarantee: in absence of calls to \co{inc_count()}, calls to
\co{read_count()} will eventually return an accurate count.
We exploit eventual consistency by maintaining a global counter.
However, updaters only manipulate their per-thread counters.
A separate thread is provided to transfer counts from the per-thread
counters to the global counter.
Readers simply access the value of the global counter.
If updaters are active, the value used by the readers will be out of
date.
However, once updates cease, the global counter will eventually
converge on the true value---hence this approach qualifies as
eventually consistent.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 DEFINE_PER_THREAD(unsigned long, counter);
2 unsigned long global_count;
3 int stopflag;
4
5 void inc_count(void)
6 {
7 ACCESS_ONCE(__get_thread_var(counter))++;
8 }
9
10 unsigned long read_count(void)
11 {
12 return ACCESS_ONCE(global_count);
13 }
14
15 void *eventual(void *arg)
16 {
17 int t;
18 unsigned long sum;
19
20 while (stopflag < 3) {
21 sum = 0;
22 for_each_thread(t)
23 sum += ACCESS_ONCE(per_thread(counter, t));
24 ACCESS_ONCE(global_count) = sum;
25 poll(NULL, 0, 1);
26 if (stopflag) {
27 smp_mb();
28 stopflag++;
29 }
30 }
31 return NULL;
32 }
33
34 void count_init(void)
35 {
36 thread_id_t tid;
37
38 if (pthread_create(&tid, NULL, eventual, NULL)) {
39 perror("count_init:pthread_create");
40 exit(-1);
41 }
42 }
43
44 void count_cleanup(void)
45 {
46 stopflag = 1;
47 while (stopflag < 3)
48 poll(NULL, 0, 1);
49 smp_mb();
50 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Array-Based Per-Thread Eventually Consistent Counters}
\label{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
\end{figure}
The implementation is shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
(\path{count_stat_eventual.c}).
Lines~1-2 show the per-thread variable and the global variable that
track the counter's value, and line~3 shows \co{stopflag},
which is used to coordinate termination (for the case where we want
to terminate the program with an accurate counter value).
The \co{inc_count()} function shown on lines~5-8 is similar to its
counterpart in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}.
The \co{read_count()} function shown on lines~10-13 simply returns the
value of the \co{global_count} variable.
However, the \co{count_init()} function on lines~34-42
creates the \co{eventual()} thread shown on lines~15-32, which
cycles through all the threads,
summing the per-thread local \co{counter} and storing the
sum to the \co{global_count} variable.
The \co{eventual()} thread waits an arbitrarily chosen one millisecond
between passes.
The \co{count_cleanup()} function on lines~44-50 coordinates termination.
This approach gives extremely fast counter read-out while still
supporting linear counter-update performance.
However, this excellent read-side performance and update-side scalability
comes at the cost of the additional thread running \co{eventual()}.
\QuickQuiz{}
Why doesn't \co{inc_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
need to use atomic instructions?
After all, we now have multiple threads accessing the per-thread
counters!
\QuickQuizAnswer{
Because one of the two threads only reads, and because the
variable is aligned and machine-sized, non-atomic instructions
suffice.
That said, the \co{ACCESS_ONCE()} macro is used to prevent
compiler optimizations that might otherwise prevent the
counter updates from becoming visible to
\co{eventual()}~\cite{JonCorbet2012ACCESS:ONCE}.
An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing out that
they are in fact unnecessary.
That said, atomic instructions would be needed in cases where
the per-thread \co{counter} variables were smaller than the
global \co{global_count}.
However, note that on a 32-bit system,
the per-thread \co{counter} variables
might need to be limited to 32 bits in order to sum them accurately,
while the \co{global_count} variable must be 64 bits in order
to avoid overflow.
In this case, it is necessary to zero the per-thread
\co{counter} variables periodically in order to avoid overflow.
It is extremely important to note that this zeroing cannot
be delayed too long or overflow of the smaller per-thread
variables will result.
This approach therefore imposes real-time requirements on the
underlying system, and in turn must be used with extreme care.
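One way to implement the periodic zeroing is for \co{eventual()}
to atomically exchange each per-thread counter with zero,
accumulating the old values into the 64-bit global.
The following sketch assumes a Linux-kernel-style
\co{atomic_xchg()} primitive and per-thread \co{atomic_t}
counters; it is an illustration, not the book's code:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
DEFINE_PER_THREAD(atomic_t, counter);
unsigned long long global_count;

void inc_count(void)
{
  atomic_inc(&__get_thread_var(counter));
}

/* One pass of eventual(): must run often
 * enough that no 32-bit counter can wrap
 * between passes. */
static void eventual_pass(void)
{
  int t;
  unsigned long long sum = 0;

  for_each_thread(t)
    sum += atomic_xchg(&per_thread(counter, t), 0);
  global_count += sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Note also that readers of the 64-bit \co{global_count} would need
additional care on a 32-bit system, where a 64-bit load is not
necessarily atomic.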
In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum
will be modulo the word size.
} \QuickQuizEnd
\QuickQuiz{}
Won't the single global thread in the function \co{eventual()} of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
be just as severe a bottleneck as a global lock would be?
\QuickQuizAnswer{
In this case, no.
What will happen instead is that as the number of threads increases,
the estimate of the counter
value returned by \co{read_count()} will become more inaccurate.
} \QuickQuizEnd
\QuickQuiz{}
Won't the estimate returned by \co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
become increasingly
inaccurate as the number of threads rises?
\QuickQuizAnswer{
Yes.
If this proves problematic, one fix is to provide multiple
\co{eventual()} threads, each covering its own subset of
the other threads.
In more extreme cases, a tree-like hierarchy of
\co{eventual()} threads might be required.
} \QuickQuizEnd
\QuickQuiz{}
Given that in the eventually\-/consistent algorithm shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
both reads and updates have extremely low overhead
and are extremely scalable, why would anyone bother with the
implementation described in
Section~\ref{sec:count:Array-Based Implementation},
given its costly read-side code?
\QuickQuizAnswer{
The thread executing \co{eventual()} consumes CPU time.
As more of these eventually\-/consistent counters are added,
the resulting \co{eventual()} threads will eventually
consume all available CPUs.
This implementation therefore suffers a different sort of
scalability limitation, with the scalability limit being in
terms of the number of eventually consistent counters rather
than in terms of the number of threads or CPUs.
Of course, it is possible to make other tradeoffs.
For example, a single thread could be created to handle all
eventually\-/consistent counters, which would limit the
overhead to a single CPU, but would result in increasing
update-to-read latencies as the number of counters increased.
Alternatively, that single thread could track the update rates
of the counters, visiting the frequently\-/updated counters
more frequently.
In addition, the number of threads handling the counters could
be set to some fraction of the total number of CPUs, and
perhaps also adjusted at runtime.
Finally, each counter could specify its latency, and
deadline\-/scheduling techniques could be used to provide
the required latencies to each counter.
There are no doubt many other tradeoffs that could be made.
} \QuickQuizEnd
\subsection{Per-Thread-Variable-Based Implementation}
\label{sec:count:Per-Thread-Variable-Based Implementation}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 long finalcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum;
15
16 spin_lock(&final_mutex);
17 sum = finalcount;
18 for_each_thread(t)
19 if (counterp[t] != NULL)
20 sum += *counterp[t];
21 spin_unlock(&final_mutex);
22 return sum;
23 }
24
25 void count_register_thread(void)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&final_mutex);
30 counterp[idx] = &counter;
31 spin_unlock(&final_mutex);
32 }
33
34 void count_unregister_thread(int nthreadsexpected)
35 {
36 int idx = smp_thread_id();
37
38 spin_lock(&final_mutex);
39 finalcount += counter;
40 counterp[idx] = NULL;
41 spin_unlock(&final_mutex);
42 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread Statistical Counters}
\label{fig:count:Per-Thread Statistical Counters}
\end{figure}
Fortunately, gcc provides an \co{__thread} storage class that offers
per-thread storage.
This can be used as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} (\path{count_end.c})
to implement
a statistical counter that not only scales, but that also incurs little
or no performance penalty to incrementers compared to simple non-atomic
increment.
Lines~1-4 define needed variables: \co{counter} is the per-thread counter
variable, the \co{counterp[]} array allows threads to access each others'
counters, \co{finalcount} accumulates the total as individual threads exit,
and \co{final_mutex} coordinates between threads accumulating the total
value of the counter and exiting threads.
\QuickQuiz{}
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a \co{per_thread()} interface, similar
to the Linux kernel's \co{per_cpu()} primitive, to allow
threads to more easily access each others' per-thread variables?
\QuickQuizAnswer{
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel
gets to ignore.
When a user-level thread exits, its per-thread variables all
disappear, which complicates the problem of per-thread-variable
access, particularly before the advent of user-level RCU
(see Section~\ref{sec:defer:Read-Copy Update (RCU)}).
In contrast, in the Linux kernel, when a CPU goes offline,
that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its
per-thread variables suddenly come into existence.
In contrast, in the Linux kernel, all per-CPU variables are
mapped and initialized at boot time, regardless of whether
the corresponding CPU exists yet, or indeed, whether the
corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time
maximum bound on the number of CPUs, namely, \co{CONFIG_NR_CPUS},
along with a typically tighter boot-time bound of \co{nr_cpu_ids}.
In contrast, in user space, there is no hard-coded upper limit
on the number of threads.
Of course, both environments must handle dynamically loaded
code (dynamic libraries in user space, kernel modules in the
Linux kernel), which increases the complexity of per-thread
variables.
These complications make it significantly harder for user-space
environments to provide access to other threads' per-thread
variables.
Nevertheless, such access is highly useful, and it is hoped
that it will someday appear.
} \QuickQuizEnd
The \co{inc_count()} function used by updaters is quite simple, as can
be seen on lines~6-9.
The \co{read_count()} function used by readers is a bit more complex.
Line~16 acquires a lock to exclude exiting threads, and line~21 releases
it.
Line~17 initializes the sum to the count accumulated by those threads that
have already exited, and lines~18-20 sum the counts being accumulated
by threads currently running.
Finally, line~22 returns the sum.
\QuickQuiz{}
Doesn't the check for \co{NULL} on line~19 of
Figure~\ref{fig:count:Per-Thread Statistical Counters}
add extra branch mispredictions?
Why not have a variable set permanently to zero, and point
unused counter-pointers to that variable rather than setting
them to \co{NULL}?
\QuickQuizAnswer{
This is a reasonable strategy.
Checking for the performance difference is left as an exercise
for the reader.
However, please keep in mind that the fastpath is not
\co{read_count()}, but rather \co{inc_count()}.
} \QuickQuizEnd
\QuickQuiz{}
Why on earth do we need something as heavyweight as a \emph{lock}
guarding the summation in the function \co{read_count()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
\QuickQuizAnswer{
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter~\ref{chp:Deferred Processing} will introduce even
lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread
variable, which, as Alexey Roytman notes, would eliminate
the tests against \co{NULL}.
However, array accesses are often slower than accesses to
per-thread variables, and use of an array would imply a
fixed upper bound on the number of threads.
Also, note that neither tests nor locks are needed on the
\co{inc_count()} fastpath.
} \QuickQuizEnd
Lines~25-32 show the \co{count_register_thread()} function, which
must be called by each thread before its first use of this counter.
This function simply sets up this thread's element of the \co{counterp[]}
array to point to its per-thread \co{counter} variable.
\QuickQuiz{}
Why on earth do we need to acquire the lock in
\co{count_register_thread()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
\QuickQuizAnswer{
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, there is no need to get fancy.
} \QuickQuizEnd
Lines~34-42 show the \co{count_unregister_thread()} function, which
must be called prior to exit by each thread that previously called
\co{count_register_thread()}.
Line~38 acquires the lock, and line~41 releases it, thus excluding any
calls to \co{read_count()} as well as other calls to
\co{count_unregister_thread()}.
Line~39 adds this thread's \co{counter} to the global \co{finalcount},
and then line~40 \co{NULL}s out its \co{counterp[]} array entry.
A subsequent call to \co{read_count()} will see the exiting thread's
count in the global \co{finalcount}, and will skip the exiting thread
when sequencing through the \co{counterp[]} array, thus obtaining
the correct total.
This approach gives updaters almost exactly the same performance as
a non-atomic add, and also scales linearly.
On the other hand, concurrent reads contend for a single global lock,
and therefore perform poorly and scale abysmally.
However, this is not a problem for statistical counters, where incrementing
happens often and readout happens almost never.
Of course, this approach is considerably more complex than the
array-based scheme, due to the fact that a given thread's per-thread
variables vanish when that thread exits.
\QuickQuiz{}
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
\QuickQuizAnswer{
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline---even
if the corresponding CPU never existed and never will exist.
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 int finalthreadcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum = 0;
15
16 for_each_thread(t)
17 if (counterp[t] != NULL)
18 sum += *counterp[t];
19 return sum;
20 }
21
22 void count_init(void)
23 {
24 }
25
26 void count_register_thread(void)
27 {
28 counterp[smp_thread_id()] = &counter;
29 }
30
31 void count_unregister_thread(int nthreadsexpected)
32 {
33 spin_lock(&final_mutex);
34 finalthreadcount++;
35 spin_unlock(&final_mutex);
36 while (finalthreadcount < nthreadsexpected)
37 poll(NULL, 0, 1);
38 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
\end{figure}
One workaround is to ensure that each thread continues to exist
until all threads are finished, as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}
(\path{count_tstat.c}).
Analysis of this code is left as an exercise to the reader;
however, please note that it does not fit well into the
\path{counttorture.h} counter-evaluation scheme.
(Why not?)
Chapter~\ref{chp:Deferred Processing} will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
} \QuickQuizEnd
\subsection{Discussion}
These three implementations show that it is possible to obtain uniprocessor
performance for statistical counters, despite running on a parallel
machine.
\QuickQuiz{}
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
\QuickQuizAnswer{
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter?
Because in the increment-by-one case, the value returned will
be exact in the sense that the counter must necessarily have
taken on that value at some point in time, even if it is impossible
to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might
return values that are inconsistent with any global ordering
of operations.
To see this, suppose that thread~0 adds the value three to its
counter, thread~1 adds the value five to its counter, and
threads~2 and 3 sum the counters.
If the system is ``weakly ordered'' or if the compiler
uses aggressive optimizations, thread~2 might find the
sum to be three and thread~3 might find the sum to be five.
The only possible global orders of the sequence of values
of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone.
Michael Scott used this question to stump Paul E.~McKenney
during Paul's Ph.D. defense.
} \QuickQuizEnd
\QuickQuiz{}
Given that the reader must sum all the threads' counters,
this summation could take a long time when there are large
numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
\QuickQuizAnswer{
One approach would be to maintain a global approximation
to the value.
Updaters would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
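A minimal sketch of this approach, using a hypothetical flush
threshold and the atomic primitives seen earlier in this chapter:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define FLUSH_LIM 1000 /* hypothetical */

long __thread counter = 0;
atomic_t global_counter = ATOMIC_INIT(0);

void inc_count(void)
{
  if (++counter >= FLUSH_LIM) {
    atomic_add(counter, &global_counter);
    counter = 0;
  }
}

long read_count(void)
{
  return atomic_read(&global_counter);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Larger values of \co{FLUSH_LIM} reduce the frequency of the
atomic operation, but increase the worst-case error in
\co{read_count()}'s return value.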
The reader is encouraged to think up and try out other approaches,
for example, using a combining tree.
} \QuickQuizEnd
Given what has been presented in this section, you should now be able
to answer the Quick Quiz about statistical counters for networking
near the beginning of this chapter.
\section{Approximate Limit Counters}
\label{sec:count:Approximate Limit Counters}
Another special case of counting involves limit-checking.
For example, as noted in the approximate structure-allocation limit
problem in \QuickQuizRef{\QcountQapproxcnt},
suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations once the number
of structures in use exceeds a limit, in this case, 10,000.
Suppose further that these structures are short-lived, that this
limit is rarely exceeded, and that this limit is approximate in
that it is OK to exceed it sometimes by some bounded amount
(see Section~\ref{sec:count:Exact Limit Counters}
if you instead need the limit to be exact).
\subsection{Design}
One possible design for limit counters is to divide the limit of 10,000
by the number of threads, and give each thread a fixed pool of structures.
For example, given 100 threads, each thread would manage its own pool
of 100 structures.
This approach is simple, and in some cases works well, but it does not
handle the common case where a given structure is allocated by one
thread and freed by another~\cite{McKenney93}.
On the one hand, if a given thread takes credit for any structures it
frees, then the thread doing most of the allocating runs out
of structures, while the threads doing most of the freeing have lots
of credits that they cannot use.
On the other hand, if freed structures are credited to the CPU that
allocated them, it will be necessary for CPUs to manipulate each
others' counters, which will require expensive atomic instructions
or other means of communicating between threads.\footnote{
That said, if each structure will always be freed
by the same CPU (or thread) that allocated it, then
this simple partitioning approach works extremely well.}
In short, for many important workloads, we cannot fully partition the counter.
Given that partitioning the counters was what brought the excellent
update-side performance for the three schemes discussed in
Section~\ref{sec:count:Statistical Counters}, this might be grounds
for some pessimism.
However, the eventually consistent algorithm presented in
Section~\ref{sec:count:Eventually Consistent Implementation}
provides an interesting hint.
Recall that this algorithm kept two sets of books, a
per-thread \co{counter} variable for updaters and a \co{global_count}
variable for readers, with an \co{eventual()} thread that periodically
updated \co{global_count} to be eventually consistent with the values
of the per-thread \co{counter}.
The per-thread \co{counter} perfectly partitioned the counter value, while
\co{global_count} kept the full value.
For limit counters, we can use a variation on this theme, in that
we \emph{partially partition} the counter.
For example, each of four threads could have a per-thread
\co{counter}, but each could also have a per-thread maximum value
(call it \co{countermax}).
But then what happens if a given thread needs to increment its
\co{counter}, but \co{counter} is equal to its \co{countermax}?
The trick here is to move half of that thread's \co{counter} value
to a \co{globalcount}, then increment \co{counter}.
For example, if a given thread's \co{counter} and \co{countermax}
variables were both equal to 10, we do the following:
\begin{enumerate}
\item Acquire a global lock.
\item Add five to \co{globalcount}.
\item To balance out the addition, subtract five from this
thread's \co{counter}.
\item Release the global lock.
\item Increment this thread's \co{counter}, resulting in a value of six.
\end{enumerate}
Although this procedure still requires a global lock, that lock need only be
acquired once for every five increment operations, greatly reducing
that lock's level of contention.
We can reduce this contention as low as we wish by increasing the
value of \co{countermax}.
However, the corresponding penalty for increasing the value of
\co{countermax} is reduced accuracy of \co{globalcount}.
To see this, note that on a four-CPU system, if \co{countermax}
is equal to ten, \co{globalcount} will be in error by at most
40 counts.
In contrast, if \co{countermax} is increased to 100, \co{globalcount}
might be in error by as much as 400 counts.
This raises the question of just how much we care about \co{globalcount}'s
deviation from the aggregate value of the counter, where this aggregate value
is the sum of \co{globalcount} and each thread's \co{counter} variable.
The answer to this question depends on how far the aggregate value is
from the counter's limit (call it \co{globalcountmax}).
The larger the difference between these two values, the larger \co{countermax}
can be without risk of exceeding the \co{globalcountmax} limit.
This means that the
value of a given thread's \co{countermax} variable can be set
based on this difference.
When far from the limit, the \co{countermax} per-thread variables are
set to large values to optimize for performance and scalability, while
when close to the limit, these same variables are set to small values
to minimize the error in the checks against the \co{globalcountmax} limit.
This design is an example of \emph{parallel fastpath}, which is an important
design pattern in which the common case executes with no expensive
instructions and no interactions between threads, but where occasional
use is also made of a more conservatively designed
(and higher overhead) global algorithm.
This design pattern is covered in more detail in
Section~\ref{sec:SMPdesign:Parallel Fastpath}.
\subsection{Simple Limit Counter Implementation}
\label{sec:count:Simple Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Variables}
\label{fig:count:Simple Limit Counter Variables}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/count_lim}}
\caption{Simple Limit Counter Variable Relationships}
\label{fig:count:Simple Limit Counter Variable Relationships}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Variables}
shows both the per-thread and global variables used by this
implementation.
The per-thread \co{counter} and \co{countermax} variables are the
corresponding thread's local counter and the upper bound on that
counter, respectively.
The \co{globalcountmax} variable on line~3 contains the upper
bound for the aggregate counter, and the \co{globalcount} variable
on line~4 is the global counter.
The sum of \co{globalcount} and each thread's \co{counter} gives
the aggregate value of the overall counter.
The \co{globalreserve} variable on line~5 is the sum of all of the
per-thread \co{countermax} variables.
The relationship among these variables is shown by
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}:
\begin{enumerate}
\item The sum of \co{globalcount} and \co{globalreserve} must
be less than or equal to \co{globalcountmax}.
\item The sum of all threads' \co{countermax} values must be
less than or equal to \co{globalreserve}.
\item Each thread's \co{counter} must be less than or equal to
that thread's \co{countermax}.
\end{enumerate}
Each element of the \co{counterp[]} array references the corresponding
thread's \co{counter} variable, and, finally, the \co{gblcnt_mutex}
spinlock guards all of the global variables, in other words, no thread
is permitted to access or modify any of the global variables unless it
has acquired \co{gblcnt_mutex}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 if (countermax - counter >= delta) {
4 counter += delta;
5 return 1;
6 }
7 spin_lock(&gblcnt_mutex);
8 globalize_count();
9 if (globalcountmax -
10 globalcount - globalreserve < delta) {
11 spin_unlock(&gblcnt_mutex);
12 return 0;
13 }
14 globalcount += delta;
15 balance_count();
16 spin_unlock(&gblcnt_mutex);
17 return 1;
18 }
19
20 int sub_count(unsigned long delta)
21 {
22 if (counter >= delta) {
23 counter -= delta;
24 return 1;
25 }
26 spin_lock(&gblcnt_mutex);
27 globalize_count();
28 if (globalcount < delta) {
29 spin_unlock(&gblcnt_mutex);
30 return 0;
31 }
32 globalcount -= delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }
37
38 unsigned long read_count(void)
39 {
40 int t;
41 unsigned long sum;
42
43 spin_lock(&gblcnt_mutex);
44 sum = globalcount;
45 for_each_thread(t)
46 if (counterp[t] != NULL)
47 sum += *counterp[t];
48 spin_unlock(&gblcnt_mutex);
49 return sum;
50 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Add, Subtract, and Read}
\label{fig:count:Simple Limit Counter Add, Subtract, and Read}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
shows the \co{add_count()}, \co{sub_count()}, and \co{read_count()}
functions (\path{count_lim.c}).
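Before delving into the implementation, a usage sketch might help.
The \co{foo} structure and its allocation wrappers below are
hypothetical, but show how the return values of \co{add_count()}
and \co{sub_count()} are intended to be used:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
struct foo *foo_alloc(void)
{
  struct foo *p;

  if (!add_count(1))
    return NULL;  /* Limit exceeded. */
  p = malloc(sizeof(*p));
  if (p == NULL)
    sub_count(1); /* malloc() failed. */
  return p;
}

void foo_free(struct foo *p)
{
  free(p);
  sub_count(1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}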
\QuickQuiz{}
Why does
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
provide \co{add_count()} and \co{sub_count()} instead of the
	\co{inc_count()} and \co{dec_count()} interfaces shown in
Section~\ref{sec:count:Statistical Counters}?
\QuickQuizAnswer{
Because structures come in different sizes.
Of course, a limit counter corresponding to a specific size
of structure might still be able to use
\co{inc_count()} and \co{dec_count()}.
} \QuickQuizEnd
Lines~1-18 show \co{add_count()}, which adds the specified value \co{delta}
to the counter.
Line~3 checks to see if there is room for \co{delta} on this thread's
\co{counter}, and, if so, line~4 adds it and line~5 returns success.
This is the \co{add_count()} fastpath, and it does no atomic operations,
references only per-thread variables, and should not incur any cache misses.
\QuickQuiz{}
What is with the strange form of the condition on line~3 of
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not the following more intuitive form of the fastpath?
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
  3   if (counter + delta <= countermax) {
4 counter += delta;
5 return 1;
6 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickQuizAnswer{
Two words.
``Integer overflow.''
Try the above formulation with \co{counter} equal to 10 and
\co{delta} equal to \co{ULONG_MAX}.
Then try it again with the code shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
A good understanding of integer overflow will be required for
the rest of this example, so if you have never dealt with
integer overflow before, please try several examples to get
the hang of it.
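	For example, the following toy program (a minimal sketch, not
	part of \path{count_lim.c}) demonstrates the difference:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <limits.h>
#include <stdio.h>

int main(void)
{
  unsigned long counter = 10;
  unsigned long countermax = 100;
  unsigned long delta = ULONG_MAX;

  /* Wraps around to 9, wrongly passing: */
  printf("%d\n", counter + delta <= countermax);
  /* 90 >= ULONG_MAX is false, correctly failing: */
  printf("%d\n", countermax - counter >= delta);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}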
Integer overflow can sometimes be more difficult to get right
than parallel algorithms!
} \QuickQuizEnd
If the test on line~3 fails, we must access global variables, and thus
must acquire \co{gblcnt_mutex} on line~7, which we release on line~11
in the failure case or on line~16 in the success case.
Line~8 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which clears the thread-local variables, adjusting the global variables
as needed, thus simplifying global processing.
(But don't take \emph{my} word for it: try coding it yourself!)
Lines~9 and 10 check to see if addition of \co{delta} can be accommodated,
with the meaning of the expression preceding the less-than sign shown in
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
as the difference in height of the two red (leftmost) bars.
If the addition of \co{delta} cannot be accommodated, then
line~11 (as noted earlier) releases \co{gblcnt_mutex} and line~12
returns indicating failure.
Otherwise, line~14 adds \co{delta} to \co{globalcount}, and then
line~15 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both the global and the per-thread variables.
This call to \co{balance_count()}
will usually set this thread's \co{countermax} to re-enable the fastpath.
Line~16 then releases
\co{gblcnt_mutex} (again, as noted earlier), and, finally,
line~17 returns indicating success.
\QuickQuiz{}
Why does \co{globalize_count()} zero the per-thread variables,
only to later call \co{balance_count()} to refill them in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not just leave the per-thread variables non-zero?
\QuickQuizAnswer{
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
} \QuickQuizEnd
Lines~20-36 show \co{sub_count()}, which subtracts the specified
\co{delta} from the counter.
Line~22 checks to see if the per-thread counter can accommodate
this subtraction, and, if so, line~23 does the subtraction and
line~24 returns success.
These lines form \co{sub_count()}'s fastpath, and, as with
\co{add_count()}, this fastpath executes no costly operations.
If the fastpath cannot accommodate subtraction of \co{delta},
execution proceeds to the slowpath on lines~26-35.
Because the slowpath must access global state, line~26
acquires \co{gblcnt_mutex}, which is released either by line~29
(in case of failure) or by line~34 (in case of success).
Line~27 invokes \co{globalize_count()}, shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
which again clears the thread-local variables, adjusting the global variables
as needed.
Line~28 checks to see if the counter can accommodate subtracting
\co{delta}, and, if not, line~29 releases \co{gblcnt_mutex}
(as noted earlier) and line~30 returns failure.
\QuickQuiz{}
Given that \co{globalreserve} counted against us in \co{add_count()},
why doesn't it count for us in \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
\QuickQuizAnswer{
The \co{globalreserve} variable tracks the sum of all threads'
\co{countermax} variables.
The sum of these threads' \co{counter} variables might be anywhere
from zero to \co{globalreserve}.
We must therefore take a conservative approach, assuming that
all threads' \co{counter} variables are full in \co{add_count()}
and that they are all empty in \co{sub_count()}.
But remember this question, as we will come back to it later.
} \QuickQuizEnd
\QuickQuiz{}
Suppose that one thread invokes \co{add_count()} shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read},
and then another thread invokes \co{sub_count()}.
Won't \co{sub_count()} return failure even though the value of
the counter is non-zero?
\QuickQuizAnswer{
Indeed it will!
In many cases, this will be a problem, as discussed in
Section~\ref{sec:count:Simple Limit Counter Discussion}, and
in those cases the algorithms from
Section~\ref{sec:count:Exact Limit Counters}
will likely be preferable.
} \QuickQuizEnd
If, on the other hand, line~28 finds that the counter \emph{can}
accommodate subtracting \co{delta}, we complete the slowpath.
Line~32 does the subtraction and then
line~33 invokes \co{balance_count()} (shown in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions})
in order to update both global and per-thread variables
(hopefully re-enabling the fastpath).
Then line~34 releases \co{gblcnt_mutex}, and line~35 returns success.
\QuickQuiz{}
Why have both \co{add_count()} and \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not simply pass a negative number to \co{add_count()}?
\QuickQuizAnswer{
Given that \co{add_count()} takes an \co{unsigned} \co{long}
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
} \QuickQuizEnd
Lines~38-50 show \co{read_count()}, which returns the aggregate value
of the counter.
It acquires \co{gblcnt_mutex} on line~43 and releases it on line~48,
excluding global operations from \co{add_count()} and \co{sub_count()},
and, as we will see, also excluding thread creation and exit.
Line~44 initializes local variable \co{sum} to the value of
\co{globalcount}, and then the loop spanning lines~45-47 sums the
per-thread \co{counter} variables.
Line~49 then returns the sum.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void balance_count(void)
10 {
11 countermax = globalcountmax -
12 globalcount - globalreserve;
13 countermax /= num_online_threads();
14 globalreserve += countermax;
15 counter = countermax / 2;
16 if (counter > globalcount)
17 counter = globalcount;
18 globalcount -= counter;
19 }
20
21 void count_register_thread(void)
22 {
23 int idx = smp_thread_id();
24
25 spin_lock(&gblcnt_mutex);
26 counterp[idx] = &counter;
27 spin_unlock(&gblcnt_mutex);
28 }
29
30 void count_unregister_thread(int nthreadsexpected)
31 {
32 int idx = smp_thread_id();
33
34 spin_lock(&gblcnt_mutex);
35 globalize_count();
36 counterp[idx] = NULL;
37 spin_unlock(&gblcnt_mutex);
38 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Simple Limit Counter Utility Functions}
\label{fig:count:Simple Limit Counter Utility Functions}
\end{figure}
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}
shows a number of utility functions used by the \co{add_count()},
\co{sub_count()}, and \co{read_count()} primitives shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
Lines~1-7 show \co{globalize_count()}, which zeros the current thread's
per-thread counters, adjusting the global variables appropriately.
It is important to note that this function does not change the aggregate
value of the counter, but instead changes how the counter's current value
is represented.
Line~3 adds the thread's \co{counter} variable to \co{globalcount},
and line~4 zeroes \co{counter}.
Similarly, line~5 subtracts the per-thread \co{countermax} from
\co{globalreserve}, and line~6 zeroes \co{countermax}.
It is helpful to refer to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}
when reading both this function and \co{balance_count()}, which is next.
Lines~9-19 show \co{balance_count()}, which is roughly speaking
the inverse of \co{globalize_count()}.
This function's job is to set the current thread's
\co{countermax} variable to the largest value that avoids the risk
of the counter exceeding the \co{globalcountmax} limit.
Changing the current thread's \co{countermax} variable of course
requires corresponding adjustments to \co{counter}, \co{globalcount}
and \co{globalreserve}, as can be seen by referring back to
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
By doing this, \co{balance_count()} maximizes use of
\co{add_count()}'s and \co{sub_count()}'s low-overhead fastpaths.
As with \co{globalize_count()}, \co{balance_count()} is not permitted
to change the aggregate value of the counter.
Lines~11-13 compute this thread's share of that portion of
\co{globalcountmax} that is not already covered by either
\co{globalcount} or \co{globalreserve}, and assign the
computed quantity to this thread's \co{countermax}.
Line~14 makes the corresponding adjustment to \co{globalreserve}.
Line~15 sets this thread's \co{counter} to the middle of the range
from zero to \co{countermax}.
Line~16 checks to see whether \co{globalcount} can in fact accommodate
this value of \co{counter}, and, if not, line~17 decreases \co{counter}
accordingly.
Finally, in either case, line~18 makes the corresponding adjustment to
\co{globalcount}.
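To make this concrete, consider a purely hypothetical example in which
\co{globalcountmax} is 10,000, \co{globalcount} is 2,000,
\co{globalreserve} is 2,000, and there are four threads.
Lines~11-13 then set this thread's \co{countermax} to
(10,000 - 2,000 - 2,000) / 4 = 1,500,
line~14 increases \co{globalreserve} to 3,500,
line~15 sets \co{counter} to 750,
the test on line~16 does not trigger, and
line~18 decreases \co{globalcount} to 1,250.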
\QuickQuiz{}
Why set \co{counter} to \co{countermax / 2} in line~15 of
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}?
Wouldn't it be simpler to just take \co{countermax} counts?
\QuickQuizAnswer{
First, it really is reserving \co{countermax} counts
(see line~14), however,
it adjusts so that only half of these are actually in use
by the thread at the moment.
This allows the thread to carry out at least \co{countermax / 2}
increments or decrements before having to refer back to
\co{globalcount} again.
Note that the accounting in \co{globalcount} remains accurate,
thanks to the adjustment in line~18.
} \QuickQuizEnd
\begin{figure*}[tb]
\centering
\resizebox{5in}{!}{\includegraphics{count/globbal}}
\caption{Schematic of Globalization and Balancing}
\label{fig:count:Schematic of Globalization and Balancing}
\end{figure*}
It is helpful to look at a schematic depicting how the relationship
of the counters changes with the execution of first
\co{globalize_count()} and then \co{balance_count()}, as shown in
Figure~\ref{fig:count:Schematic of Globalization and Balancing}.
Time advances from left to right, with the leftmost configuration
roughly that of
Figure~\ref{fig:count:Simple Limit Counter Variable Relationships}.
The center configuration shows the relationship of these same counters
after \co{globalize_count()} is executed by thread~0.
As can be seen from the figure, thread~0's \co{counter} (``c~0'' in
the figure) is added to \co{globalcount}, while the value of
\co{globalreserve} is reduced by this same amount.
Both thread~0's \co{counter} and its \co{countermax}
(``cm~0'' in the figure) are reduced to zero.
The other three threads' counters are unchanged.
Note that this change did not affect the overall value of the counter,
as indicated by the bottommost dotted line connecting the leftmost
and center configurations.
In other words, the sum of \co{globalcount} and the four threads'
\co{counter} variables is the same in both configurations.
Similarly, this change did not affect the sum of \co{globalcount} and
\co{globalreserve}, as indicated by the upper dotted line.
The rightmost configuration shows the relationship of these counters
after \co{balance_count()} is executed, again by thread~0.
One-quarter of the remaining count, denoted by the vertical line extending
up from all three configurations, is added to thread~0's
\co{countermax} and half of that to thread~0's \co{counter}.
The amount added to thread~0's \co{counter} is also subtracted from
\co{globalcount} in order to avoid changing the overall value of the
counter (which is again the sum of \co{globalcount} and the four
threads' \co{counter} variables), again as indicated by the lowermost
of the two dotted lines connecting the center and rightmost configurations.
The \co{globalreserve} variable is also adjusted so that this variable
remains equal to the sum of the four threads' \co{countermax}
variables.
Because thread~0's \co{counter} is less than its \co{countermax},
thread~0 can once again increment the counter locally.
\QuickQuiz{}
In Figure~\ref{fig:count:Schematic of Globalization and Balancing},
even though a quarter of the remaining count up to the limit is
assigned to thread~0, only an eighth of the remaining count is
consumed, as indicated by the uppermost dotted line connecting
the center and the rightmost configurations.
Why is that?
\QuickQuizAnswer{
The reason this happened is that thread~0's \co{counter} was
set to half of its \co{countermax}.
Thus, of the quarter assigned to thread~0, half of that quarter
(one eighth) came from \co{globalcount}, leaving the other half
(again, one eighth) to come from the remaining count.
There are two purposes for taking this approach:
(1)~To allow thread~0 to use the fastpath for decrements as
well as increments, and
(2)~To reduce the inaccuracies if all threads are monotonically
incrementing up towards the limit.
To see this last point, step through the algorithm and watch
what it does.
} \QuickQuizEnd
Lines~21-28 show \co{count_register_thread()}, which sets up state for
newly created threads.
This function simply installs
a pointer to the newly created thread's \co{counter} variable into
the corresponding entry of the \co{counterp[]} array under the protection
of \co{gblcnt_mutex}.
Finally, lines~30-38 show \co{count_unregister_thread()}, which tears down
state for a soon-to-be-exiting thread.
Line~34 acquires \co{gblcnt_mutex} and line~37 releases it.
Line~35 invokes \co{globalize_count()} to clear out this thread's
counter state, and line~36 clears this thread's entry in the
\co{counterp[]} array.
\subsection{Simple Limit Counter Discussion}
\label{sec:count:Simple Limit Counter Discussion}
This type of counter is quite fast when aggregate values are near zero,
with some overhead due to the comparison and branch in both
\co{add_count()}'s and \co{sub_count()}'s fastpaths.
However, the use of a per-thread \co{countermax} reserve means that
\co{add_count()} can fail even when
the aggregate value of the counter is nowhere near \co{globalcountmax}.
Similarly, \co{sub_count()} can fail
even when the aggregate value of the counter is nowhere near zero.
In many cases, this is unacceptable.
Even if the \co{globalcountmax} is intended to be an approximate limit,
there is usually a limit to exactly how much approximation can be tolerated.
One way to limit the degree of approximation is to impose an upper limit
on the value of the per-thread \co{countermax} instances.
This task is undertaken in the next section.
\subsection{Approximate Limit Counter Implementation}
\label{sec:count:Approximate Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long __thread counter = 0;
2 unsigned long __thread countermax = 0;
3 unsigned long globalcountmax = 10000;
4 unsigned long globalcount = 0;
5 unsigned long globalreserve = 0;
6 unsigned long *counterp[NR_THREADS] = { NULL };
7 DEFINE_SPINLOCK(gblcnt_mutex);
8 #define MAX_COUNTERMAX 100
\end{verbbox}
}
\centering
\theverbbox
\caption{Approximate Limit Counter Variables}
\label{fig:count:Approximate Limit Counter Variables}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void balance_count(void)
2 {
3 countermax = globalcountmax -
4 globalcount - globalreserve;
5 countermax /= num_online_threads();
6 if (countermax > MAX_COUNTERMAX)
7 countermax = MAX_COUNTERMAX;
8 globalreserve += countermax;
9 counter = countermax / 2;
10 if (counter > globalcount)
11 counter = globalcount;
12 globalcount -= counter;
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Approximate Limit Counter Balancing}
\label{fig:count:Approximate Limit Counter Balancing}
\end{figure}
Because this implementation (\path{count_lim_app.c}) is quite similar to
that in the previous section
(Figures~\ref{fig:count:Simple Limit Counter Variables},
\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}, and
\ref{fig:count:Simple Limit Counter Utility Functions}),
only the changes are shown here.
Figure~\ref{fig:count:Approximate Limit Counter Variables}
is identical to
Figure~\ref{fig:count:Simple Limit Counter Variables},
with the addition of \co{MAX_COUNTERMAX}, which sets the maximum
permissible value of the per-thread \co{countermax} variable.
Similarly,
Figure~\ref{fig:count:Approximate Limit Counter Balancing}
is identical to the \co{balance_count()} function in
Figure~\ref{fig:count:Simple Limit Counter Utility Functions},
with the addition of lines~6 and 7, which enforce the
\co{MAX_COUNTERMAX} limit on the per-thread \co{countermax} variable.
\subsection{Approximate Limit Counter Discussion}
These changes greatly reduce the limit inaccuracy seen in the previous version,
but present another problem: any given value of \co{MAX_COUNTERMAX}
will cause a workload-dependent fraction of accesses to fall off the
fastpath.
As the number of threads increases, non-fastpath execution will become both
a performance and a scalability problem.
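For example, if \co{MAX_COUNTERMAX} is 100, then a thread that has just
executed \co{balance_count()} holds at most 100 reserved counts, half of
which are initially in use, so that thread can carry out at most about
50 increments (or 50 decrements) before falling off the fastpath.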
However, we will defer this problem and turn instead to counters
with exact limits.
\section{Exact Limit Counters}
\label{sec:count:Exact Limit Counters}
To solve the exact structure-allocation limit problem noted in
\QuickQuizRef{\QcountQexactcnt},
we need a limit counter that can tell exactly when its limits are
exceeded.
One way of implementing such a limit counter is to
cause threads that have reserved counts to give them up.
One way to do this is to use atomic instructions.
Of course, atomic instructions will slow down the fastpath, but on the
other hand, it would be silly not to at least give them a try.
\subsection{Atomic Limit Counter Implementation}
\label{sec:count:Atomic Limit Counter Implementation}
Unfortunately,
if one thread is to safely remove counts from another thread,
both threads will need to atomically manipulate that thread's
\co{counter} and \co{countermax} variables.
The usual way to do this is to combine these two variables into a
single variable,
for example, given a 32-bit variable, using the high-order 16 bits to
represent \co{counter} and the low-order 16 bits to represent
\co{countermax}.
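For example, given 16 bits per field, a \co{counter} value of 3 and a
\co{countermax} value of 5 would be represented by \co{0x00030005}.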
\QuickQuiz{}
Why is it necessary to atomically manipulate the thread's
\co{counter} and \co{countermax} variables as a unit?
Wouldn't it be good enough to atomically manipulate them
individually?
\QuickQuizAnswer{
This might well be possible, but great care is required.
Note that removing \co{counter} without first zeroing
\co{countermax} could result in the corresponding thread
increasing \co{counter} immediately after it was zeroed,
completely negating the effect of zeroing the counter.
The opposite ordering, namely zeroing \co{countermax} and then
removing \co{counter}, can also result in a non-zero
\co{counter}.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~A fetches its \co{countermax}, and finds that
it is non-zero.
\item Thread~B zeroes Thread~A's \co{countermax}.
\item Thread~B removes Thread~A's \co{counter}.
\item Thread~A, having found that its \co{countermax}
is non-zero, proceeds to add to its \co{counter},
resulting in a non-zero value for \co{counter}.
\end{enumerate}
Again, it might well be possible to atomically manipulate
\co{countermax} and \co{counter} as separate variables,
but it is clear that great care is required.
It is also quite likely that doing so will slow down the
fastpath.
	Exploring these possibilities is left as an exercise for
	the reader.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic_t __thread ctrandmax = ATOMIC_INIT(0);
2 unsigned long globalcountmax = 10000;
3 unsigned long globalcount = 0;
4 unsigned long globalreserve = 0;
5 atomic_t *counterp[NR_THREADS] = { NULL };
6 DEFINE_SPINLOCK(gblcnt_mutex);
7 #define CM_BITS (sizeof(atomic_t) * 4)
8 #define MAX_COUNTERMAX ((1 << CM_BITS) - 1)
9
10 static void
11 split_ctrandmax_int(int cami, int *c, int *cm)
12 {
13 *c = (cami >> CM_BITS) & MAX_COUNTERMAX;
14 *cm = cami & MAX_COUNTERMAX;
15 }
16
17 static void
18 split_ctrandmax(atomic_t *cam, int *old,
19 int *c, int *cm)
20 {
21 unsigned int cami = atomic_read(cam);
22
23 *old = cami;
24 split_ctrandmax_int(cami, c, cm);
25 }
26
27 static int merge_ctrandmax(int c, int cm)
28 {
29 unsigned int cami;
30
31 cami = (c << CM_BITS) | cm;
32 return ((int)cami);
33 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Variables and Access Functions}
\label{fig:count:Atomic Limit Counter Variables and Access Functions}
\end{figure}
The variables and access functions for a simple atomic limit counter
are shown in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
(\path{count_lim_atomic.c}).
The \co{counter} and \co{countermax} variables in earlier algorithms
are combined into the single variable \co{ctrandmax} shown on
line~1, with \co{counter} in the upper half and \co{countermax} in
the lower half.
This variable is of type \co{atomic_t}, which has an underlying
representation of \co{int}.
Lines~2-6 show the definitions for \co{globalcountmax}, \co{globalcount},
\co{globalreserve}, \co{counterp}, and \co{gblcnt_mutex}, all of which
take on roles similar to their counterparts in
Figure~\ref{fig:count:Approximate Limit Counter Variables}.
Line~7 defines \co{CM_BITS}, which gives the number of bits in each half
of \co{ctrandmax}, and line~8 defines \co{MAX_COUNTERMAX}, which
gives the maximum value that may be held in either half of
\co{ctrandmax}.
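For example, on a typical system where \co{sizeof(atomic_t)} is four,
\co{CM_BITS} is 16 and \co{MAX_COUNTERMAX} is 65,535.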
\QuickQuiz{}
In what way does line~7 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
violate the C standard?
\QuickQuizAnswer{
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
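	One approach (a sketch only) would be to define \co{CM_BITS} as
	\co{sizeof(atomic_t) * CHAR_BIT / 2}, with \co{CHAR_BIT} taken
	from \co{limits.h}, though this still assumes that \co{atomic_t}
	contains no padding bits.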
} \QuickQuizEnd
Lines~10-15 show the \co{split_ctrandmax_int()} function, which,
when given the underlying \co{int} from the \co{atomic_t
ctrandmax} variable, splits it into its \co{counter} (\co{c})
and \co{countermax} (\co{cm}) components.
Line~13 isolates the most-significant half of this \co{int},
placing the result as specified by argument \co{c},
and line~14 isolates the least-significant half of this \co{int},
placing the result as specified by argument \co{cm}.
Lines~17-25 show the \co{split_ctrandmax()} function, which
picks up the underlying \co{int} from the specified variable
on line~21, stores it as specified by the \co{old} argument on
line~23, and then invokes \co{split_ctrandmax_int()} to split
it on line~24.
\QuickQuiz{}
Given that there is only one \co{ctrandmax} variable,
why bother passing in a pointer to it on line~18 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
\QuickQuizAnswer{
There is only one \co{ctrandmax} variable \emph{per thread}.
Later, we will see code that needs to pass other threads'
\co{ctrandmax} variables to \co{split_ctrandmax()}.
} \QuickQuizEnd
Lines~27-33 show the \co{merge_ctrandmax()} function, which
can be thought of as the inverse of \co{split_ctrandmax()}.
Line~31 merges the \co{counter} and \co{countermax}
values passed in \co{c} and \co{cm}, respectively, and returns
the result.
\QuickQuiz{}
Why does \co{merge_ctrandmax()} in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
return an \co{int} rather than storing directly into an
\co{atomic_t}?
\QuickQuizAnswer{
Later, we will see that we need the \co{int} return to pass
to the \co{atomic_cmpxchg()} primitive.
} \QuickQuizEnd
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 int c;
4 int cm;
5 int old;
6 int new;
7
8 do {
9 split_ctrandmax(&ctrandmax, &old, &c, &cm);
10 if (delta > MAX_COUNTERMAX || c + delta > cm)
11 goto slowpath;
12 new = merge_ctrandmax(c + delta, cm);
13 } while (atomic_cmpxchg(&ctrandmax,
14 old, new) != old);
15 return 1;
16 slowpath:
17 spin_lock(&gblcnt_mutex);
18 globalize_count();
19 if (globalcountmax - globalcount -
20 globalreserve < delta) {
21 flush_local_count();
22 if (globalcountmax - globalcount -
23 globalreserve < delta) {
24 spin_unlock(&gblcnt_mutex);
25 return 0;
26 }
27 }
28 globalcount += delta;
29 balance_count();
30 spin_unlock(&gblcnt_mutex);
31 return 1;
32 }
33
34 int sub_count(unsigned long delta)
35 {
36 int c;
37 int cm;
38 int old;
39 int new;
40
41 do {
42 split_ctrandmax(&ctrandmax, &old, &c, &cm);
43 if (delta > c)
44 goto slowpath;
45 new = merge_ctrandmax(c - delta, cm);
46 } while (atomic_cmpxchg(&ctrandmax,
47 old, new) != old);
48 return 1;
49 slowpath:
50 spin_lock(&gblcnt_mutex);
51 globalize_count();
52 if (globalcount < delta) {
53 flush_local_count();
54 if (globalcount < delta) {
55 spin_unlock(&gblcnt_mutex);
56 return 0;
57 }
58 }
59 globalcount -= delta;
60 balance_count();
61 spin_unlock(&gblcnt_mutex);
62 return 1;
63 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Add and Subtract}
\label{fig:count:Atomic Limit Counter Add and Subtract}
\end{figure}
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
shows the \co{add_count()} and \co{sub_count()} functions.
Lines~1-32 show \co{add_count()}, whose fastpath spans lines~8-15,
with the remainder of the function being the slowpath.
Lines~8-14 of the fastpath form a compare-and-swap (CAS) loop, with
the \co{atomic_cmpxchg()} primitives on lines~13-14 performing the
actual CAS.
Line~9 splits the current thread's \co{ctrandmax} variable into its
\co{counter} (in \co{c}) and \co{countermax} (in \co{cm}) components,
while placing the underlying \co{int} into \co{old}.
Line~10 checks whether the amount \co{delta} can be accommodated
locally (taking care to avoid integer overflow), and if not,
line~11 transfers to the slowpath.
Otherwise, line~12 combines an updated \co{counter} value with the
original \co{countermax} value into \co{new}.
The \co{atomic_cmpxchg()} primitive on lines~13-14 then atomically
compares this thread's \co{ctrandmax} variable to \co{old},
updating its value to \co{new} if the comparison succeeds.
If the comparison succeeds, line~15 returns success; otherwise,
execution continues in the loop at line~9.
\QuickQuiz{}
Yecch!
Why the ugly \co{goto} on line~11 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
Haven't you heard of the \co{break} statement???
\QuickQuizAnswer{
Replacing the \co{goto} with a \co{break} would require keeping
a flag to determine whether or not line~15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the \co{goto} that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
} \QuickQuizEnd
\QuickQuiz{}
Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
ever fail?
After all, we picked up its old value on line~9 and have not
changed it!
\QuickQuizAnswer{
Later, we will see how the \co{flush_local_count()} function in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
might update this thread's \co{ctrandmax} variable concurrently
with the execution of the fastpath on lines~8-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
} \QuickQuizEnd
Lines~16-31 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{add_count()}'s slowpath, which is protected by \co{gblcnt_mutex},
acquired on line~17 and released on lines~24 and 30.
Line~18 invokes \co{globalize_count()}, which moves this thread's
state to the global counters.
Lines~19-20 check whether the \co{delta} value can be accommodated by
the current global state, and, if not, line~21 invokes
\co{flush_local_count()} to flush all threads' local state to the
global counters, and then lines~22-23 recheck whether \co{delta} can
be accommodated.
If, after all that, the addition of \co{delta} still cannot be accommodated,
then line~24 releases \co{gblcnt_mutex} (as noted earlier), and
then line~25 returns failure.
Otherwise, line~28 adds \co{delta} to the global counter, line~29
spreads counts to the local state if appropriate, line~30 releases
\co{gblcnt_mutex} (again, as noted earlier), and finally, line~31
returns success.
Lines~34-63 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
show \co{sub_count()}, which is structured similarly to
\co{add_count()}, having a fastpath on lines~41-48 and a slowpath on
lines~49-62.
A line-by-line analysis of this function is left as an exercise to
the reader.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long read_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6 int t;
7 unsigned long sum;
8
9 spin_lock(&gblcnt_mutex);
10 sum = globalcount;
11 for_each_thread(t)
12 if (counterp[t] != NULL) {
13 split_ctrandmax(counterp[t], &old, &c, &cm);
14 sum += c;
15 }
16 spin_unlock(&gblcnt_mutex);
17 return sum;
18 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Read}
\label{fig:count:Atomic Limit Counter Read}
\end{figure}
Figure~\ref{fig:count:Atomic Limit Counter Read} shows \co{read_count()}.
Line~9 acquires \co{gblcnt_mutex} and line~16 releases it.
Line~10 initializes local variable \co{sum} to the value of
\co{globalcount}, and the loop spanning lines~11-15 adds the
per-thread counters to this sum, isolating each per-thread counter
using \co{split_ctrandmax()} on line~13.
Finally, line~17 returns the sum.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6
7 split_ctrandmax(&ctrandmax, &old, &c, &cm);
8 globalcount += c;
9 globalreserve -= cm;
10 old = merge_ctrandmax(0, 0);
11 atomic_set(&ctrandmax, old);
12 }
13
14 static void flush_local_count(void)
15 {
16 int c;
17 int cm;
18 int old;
19 int t;
20 int zero;
21
22 if (globalreserve == 0)
23 return;
24 zero = merge_ctrandmax(0, 0);
25 for_each_thread(t)
26 if (counterp[t] != NULL) {
27 old = atomic_xchg(counterp[t], zero);
28 split_ctrandmax_int(old, &c, &cm);
29 globalcount += c;
30 globalreserve -= cm;
31 }
32 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Utility Functions 1}
\label{fig:count:Atomic Limit Counter Utility Functions 1}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 static void balance_count(void)
2 {
3 int c;
4 int cm;
5 int old;
6 unsigned long limit;
7
8 limit = globalcountmax - globalcount -
9 globalreserve;
10 limit /= num_online_threads();
11 if (limit > MAX_COUNTERMAX)
12 cm = MAX_COUNTERMAX;
13 else
14 cm = limit;
15 globalreserve += cm;
16 c = cm / 2;
17 if (c > globalcount)
18 c = globalcount;
19 globalcount -= c;
20 old = merge_ctrandmax(c, cm);
21 atomic_set(&ctrandmax, old);
22 }
23
24 void count_register_thread(void)
25 {
26 int idx = smp_thread_id();
27
28 spin_lock(&gblcnt_mutex);
29 counterp[idx] = &ctrandmax;
30 spin_unlock(&gblcnt_mutex);
31 }
32
33 void count_unregister_thread(int nthreadsexpected)
34 {
35 int idx = smp_thread_id();
36
37 spin_lock(&gblcnt_mutex);
38 globalize_count();
39 counterp[idx] = NULL;
40 spin_unlock(&gblcnt_mutex);
41 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Limit Counter Utility Functions 2}
\label{fig:count:Atomic Limit Counter Utility Functions 2}
\end{figure}
Figures~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
and~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
show the utility functions
\co{globalize_count()},
\co{flush_local_count()},
\co{balance_count()},
\co{count_register_thread()}, and
\co{count_unregister_thread()}.
The code for \co{globalize_count()} is shown on lines~1-12
of Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1} and
is similar to that of previous algorithms, with the addition of
line~7, which is now required to split out \co{counter} and
\co{countermax} from \co{ctrandmax}.
The code for \co{flush_local_count()}, which moves all threads' local
counter state to the global counter, is shown on lines~14-32.
Line~22 checks to see if the value of \co{globalreserve} permits
any per-thread counts, and, if not, line~23 returns.
Otherwise, line~24 initializes local variable \co{zero} to a combined
zeroed \co{counter} and \co{countermax}.
The loop spanning lines~25-31 sequences through each thread.
Line~26 checks to see if the current thread has counter state,
and, if so, lines~27-30 move that state to the global counters.
Line~27 atomically fetches the current thread's state
while replacing it with zero.
Line~28 splits this state into its \co{counter} (in local variable \co{c})
and \co{countermax} (in local variable \co{cm}) components.
Line~29 adds this thread's \co{counter} to \co{globalcount}, while
line~30 subtracts this thread's \co{countermax} from \co{globalreserve}.
\QuickQuiz{}
What stops a thread from simply refilling its
\co{ctrandmax} variable immediately after
\co{flush_local_count()} on line~14 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
empties it?
\QuickQuizAnswer{
This other thread cannot refill its \co{ctrandmax}
until the caller of \co{flush_local_count()} releases the
\co{gblcnt_mutex}.
By that time, the caller of \co{flush_local_count()} will have
finished making use of the counts, so there will be no problem
with this other thread refilling---assuming that the value
of \co{globalcount} is large enough to permit a refill.
} \QuickQuizEnd
\QuickQuiz{}
What prevents concurrent execution of the fastpath of either
\co{add_count()} or \co{sub_count()} from interfering with
the \co{ctrandmax} variable while
\co{flush_local_count()} is accessing it on line~27 of
	Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}?
\QuickQuizAnswer{
Nothing.
Consider the following three cases:
\begin{enumerate}
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
before the \co{split_ctrandmax()} of either fastpath,
then the fastpath will see a zero \co{counter} and
\co{countermax}, and will thus transfer to the slowpath
(unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{split_ctrandmax()} of either fastpath,
but before that fastpath's \co{atomic_cmpxchg()},
then the \co{atomic_cmpxchg()} will fail, causing the
fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{atomic_cmpxchg()} of either fastpath,
then the fastpath will (most likely) complete successfully
before \co{flush_local_count()} zeroes the thread's
\co{ctrandmax} variable.
\end{enumerate}
Either way, the race is resolved correctly.
} \QuickQuizEnd
Lines~1-22 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
show the code for \co{balance_count()}, which refills
the calling thread's local \co{ctrandmax} variable.
This function is quite similar to that of the preceding algorithms,
with changes required to handle the merged \co{ctrandmax} variable.
Detailed analysis of the code is left as an exercise for the reader,
as it is with the \co{count_register_thread()} function starting on
line~24 and the \co{count_unregister_thread()} function starting on
line~33.
\QuickQuiz{}
Given that the \co{atomic_set()} primitive does a simple
store to the specified \co{atomic_t}, how can line~21 of
\co{balance_count()} in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
work correctly in face of concurrent \co{flush_local_count()}
updates to this variable?
\QuickQuizAnswer{
	The callers of both \co{balance_count()} and
	\co{flush_local_count()} hold \co{gblcnt_mutex}, so
	only one of the two may be executing at a given time.
} \QuickQuizEnd
The next section qualitatively evaluates this design.
\subsection{Atomic Limit Counter Discussion}
This is the first implementation that actually allows the counter to
be run all the way to either of its limits, but it does so at the
expense of adding atomic operations to the fastpaths, which slow down
the fastpaths significantly on some systems.
Although some workloads might tolerate this slowdown, it is worthwhile
looking for algorithms with better read-side performance.
One such algorithm uses a signal handler to steal counts from other
threads.
Because signal handlers run in the context of the signaled thread,
atomic operations are not necessary, as shown in the next section.
\QuickQuiz{}
But signal handlers can be migrated to some other
CPU while running.
	Doesn't this possibility mean that atomic instructions
	and memory barriers are required to reliably communicate
between a thread and a signal handler that interrupts that
thread?
\QuickQuizAnswer{
No.
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
} \QuickQuizEnd
\subsection{Signal-Theft Limit Counter Design}
\label{sec:count:Signal-Theft Limit Counter Design}
\begin{figure}[tb]
\centering
\resizebox{2in}{!}{\includegraphics{count/sig-theft}}
\caption{Signal-Theft State Machine}
\label{fig:count:Signal-Theft State Machine}
\end{figure}
Even though per-thread state will now be manipulated only by the
corresponding thread, there will still need to be synchronization
with the signal handlers.
This synchronization is provided by the state machine shown in
Figure~\ref{fig:count:Signal-Theft State Machine}.
The state machine starts out in the IDLE state, and when \co{add_count()}
or \co{sub_count()} find that the combination of the local thread's count
and the global count cannot accommodate the request, the corresponding
slowpath sets each thread's \co{theft} state to REQ (unless that thread
has no count, in which case it transitions directly to READY).
Only the slowpath, which holds the \co{gblcnt_mutex} lock, is permitted to
transition from the IDLE state, as indicated by the green color.\footnote{
For those with black-and-white versions of this book,
IDLE and READY are green, REQ is red, and ACK is blue.}
The slowpath then sends a signal to each thread, and the corresponding
signal handler checks the corresponding thread's \co{theft} and
\co{counting} variables.
If the \co{theft} state is not REQ, then the signal handler is not
permitted to change the state, and therefore simply returns.
Otherwise, if the \co{counting} variable is set, indicating that
the current thread's fastpath is in progress, the signal handler
sets the \co{theft} state to ACK, otherwise to READY.
If the \co{theft} state is ACK,
only the fastpath is permitted to change
the \co{theft} state, as indicated by the blue color.
When the fastpath completes, it sets the \co{theft} state to READY.
Once the slowpath sees a thread's \co{theft} state is READY, the
slowpath is permitted to steal that thread's count.
The slowpath then sets that thread's \co{theft} state to IDLE.
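The following summary (distilled from the above description) lists the
permitted transitions along with the code that may carry each one out:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
IDLE  -> REQ    slowpath (thread has count)
IDLE  -> READY  slowpath (thread has no count)
REQ   -> ACK    signal handler (fastpath running)
REQ   -> READY  signal handler (no fastpath)
ACK   -> READY  fastpath, upon completion
READY -> IDLE   slowpath, after stealing the count
\end{verbatim}
\end{minipage}
\vspace{5pt}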
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
the REQ \co{theft} state colored red?
\QuickQuizAnswer{
	To indicate that only the signal handler is permitted to change the
\co{theft} state, and that if the thread remains in this
state for too long, the thread running the slowpath will
resend the POSIX signal.
} \QuickQuizEnd
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
the point of having separate REQ and ACK \co{theft} states?
Why not simplify the state machine by collapsing
them into a single REQACK state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
\QuickQuizAnswer{
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
\begin{enumerate}
\item The slowpath uses the REQ and ACK states to determine
whether the signal should be retransmitted.
If the states were collapsed, the slowpath would have
no choice but to send redundant signals, which would
have the unhelpful effect of needlessly slowing down
the fastpath.
\item The following race would result:
\begin{enumerate}
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and
notes the REQACK state.
\item The thread receives the signal, which also notes
the REQACK state, and, because there is no fastpath
in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the
count, and sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling
further fastpath execution for this thread.
\end{enumerate}
The basic problem here is that the combined REQACK state
can be referenced by both the signal handler and the
fastpath.
The clear separation maintained by the four-state
setup ensures orderly state transitions.
\end{enumerate}
That said, you might well be able to make a three-state setup
work correctly.
If you do succeed, compare carefully to the four-state setup.
Is the three-state solution really preferable, and why or why not?
} \QuickQuizEnd
\subsection{Signal-Theft Limit Counter Implementation}
\label{sec:count:Signal-Theft Limit Counter Implementation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 #define THEFT_IDLE 0
2 #define THEFT_REQ 1
3 #define THEFT_ACK 2
4 #define THEFT_READY 3
5
6 int __thread theft = THEFT_IDLE;
7 int __thread counting = 0;
8 unsigned long __thread counter = 0;
9 unsigned long __thread countermax = 0;
10 unsigned long globalcountmax = 10000;
11 unsigned long globalcount = 0;
12 unsigned long globalreserve = 0;
13 unsigned long *counterp[NR_THREADS] = { NULL };
14 unsigned long *countermaxp[NR_THREADS] = { NULL };
15 int *theftp[NR_THREADS] = { NULL };
16 DEFINE_SPINLOCK(gblcnt_mutex);
17 #define MAX_COUNTERMAX 100
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Data}
\label{fig:count:Signal-Theft Limit Counter Data}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Data}
(\path{count_lim_sig.c})
shows the data structures used by the signal-theft based counter
implementation.
Lines~1-7 define the states and values for the per-thread theft state machine
described in the preceding section.
Lines~8-17 are similar to earlier implementations, with the addition of
lines~14 and 15 to allow remote access to a thread's \co{countermax}
and \co{theft} variables, respectively.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 static void globalize_count(void)
2 {
3 globalcount += counter;
4 counter = 0;
5 globalreserve -= countermax;
6 countermax = 0;
7 }
8
9 static void flush_local_count_sig(int unused)
10 {
11 if (ACCESS_ONCE(theft) != THEFT_REQ)
12 return;
13 smp_mb();
14 ACCESS_ONCE(theft) = THEFT_ACK;
15 if (!counting) {
16 ACCESS_ONCE(theft) = THEFT_READY;
17 }
18 smp_mb();
19 }
20
21 static void flush_local_count(void)
22 {
23 int t;
24 thread_id_t tid;
25
26 for_each_tid(t, tid)
27 if (theftp[t] != NULL) {
28 if (*countermaxp[t] == 0) {
29 ACCESS_ONCE(*theftp[t]) = THEFT_READY;
30 continue;
31 }
32 ACCESS_ONCE(*theftp[t]) = THEFT_REQ;
33 pthread_kill(tid, SIGUSR1);
34 }
35 for_each_tid(t, tid) {
36 if (theftp[t] == NULL)
37 continue;
38 while (ACCESS_ONCE(*theftp[t]) != THEFT_READY) {
39 poll(NULL, 0, 1);
40 if (ACCESS_ONCE(*theftp[t]) == THEFT_REQ)
41 pthread_kill(tid, SIGUSR1);
42 }
43 globalcount += *counterp[t];
44 *counterp[t] = 0;
45 globalreserve -= *countermaxp[t];
46 *countermaxp[t] = 0;
47 ACCESS_ONCE(*theftp[t]) = THEFT_IDLE;
48 }
49 }
50
51 static void balance_count(void)
52 {
53 countermax = globalcountmax -
54 globalcount - globalreserve;
55 countermax /= num_online_threads();
56 if (countermax > MAX_COUNTERMAX)
57 countermax = MAX_COUNTERMAX;
58 globalreserve += countermax;
59 counter = countermax / 2;
60 if (counter > globalcount)
61 counter = globalcount;
62 globalcount -= counter;
63 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Value-Migration Functions}
\label{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
shows the functions responsible for migrating counts between per-thread
variables and the global variables.
Lines~1-7 show \co{globalize_count()}, which is identical to earlier
implementations.
Lines~9-19 show \co{flush_local_count_sig()}, which is the signal
handler used in the theft process.
Lines~11 and 12 check to see if the \co{theft} state is REQ, and, if not,
line~12 returns without change.
Line~13 executes a memory barrier to ensure that the sampling of the
\co{theft} variable happens before any change to that variable.
Line~14 sets the \co{theft} state to ACK, and, if line~15 sees that
this thread's fastpaths are not running, line~16 sets the \co{theft}
state to READY.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
function \co{flush_local_count_sig()}, why are there
\co{ACCESS_ONCE()} wrappers around the uses of the
\co{theft} per-thread variable?
\QuickQuizAnswer{
The first one (on line~11) can be argued to be unnecessary.
The last two (lines~14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines~14-17 as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This would be fatal, as the slowpath might see the transient
value of \co{THEFT_READY}, and start stealing before the
corresponding thread was ready.
} \QuickQuizEnd
Lines~21-49 show \co{flush_local_count()}, which is called from the
slowpath to flush all threads' local counts.
The loop spanning lines~26-34 advances the \co{theft} state for each
thread that has local count, and also sends that thread a signal.
Line~27 skips any non-existent threads.
Otherwise, line~28 checks to see if the current thread holds any local
count, and, if not, line~29 sets the thread's \co{theft} state to READY
and line~30 skips to the next thread.
Otherwise, line~32 sets the thread's \co{theft} state to REQ and
line~33 sends the thread a signal.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why is it safe for line~28 to directly access the other thread's
\co{countermax} variable?
\QuickQuizAnswer{
Because the other thread is not permitted to change the value
of its \co{countermax} variable unless it holds the
\co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its \co{countermax} variable.
We can therefore safely access it---but not change it.
} \QuickQuizEnd
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why doesn't line~33 check for the current thread sending itself
a signal?
\QuickQuizAnswer{
There is no need for an additional check.
The caller of \co{flush_local_count()} has already invoked
\co{globalize_count()}, so the check on line~28 will have
succeeded, skipping the later \co{pthread_kill()}.
} \QuickQuizEnd
\QuickQuiz{}
The code in
	Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
\QuickQuizAnswer{
The \co{theft} variable must be of type \co{sig_atomic_t}
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
} \QuickQuizEnd
The loop spanning lines~35-48 waits until each thread reaches READY state,
then steals that thread's count.
Lines~36-37 skip any non-existent threads, and the loop spanning
lines~38-42 waits until the current thread's \co{theft} state becomes READY.
Line~39 blocks for a millisecond to avoid priority-inversion problems,
and if line~40 determines that the thread's signal has not yet arrived,
line~41 resends the signal.
Execution reaches line~43 when the thread's \co{theft} state becomes
READY, so lines~43-46 do the thieving.
Line~47 then sets the thread's \co{theft} state back to IDLE.
\QuickQuiz{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
\QuickQuizAnswer{
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but
irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
\emph{Your} user application hanging!
} \QuickQuizEnd
Lines~51-63 show \co{balance_count()}, which is similar to that of
earlier examples.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 int add_count(unsigned long delta)
2 {
3 int fastpath = 0;
4
5 counting = 1;
6 barrier();
7 if (countermax - counter >= delta &&
8 ACCESS_ONCE(theft) <= THEFT_REQ) {
9 counter += delta;
10 fastpath = 1;
11 }
12 barrier();
13 counting = 0;
14 barrier();
15 if (ACCESS_ONCE(theft) == THEFT_ACK) {
16 smp_mb();
17 ACCESS_ONCE(theft) = THEFT_READY;
18 }
19 if (fastpath)
20 return 1;
21 spin_lock(&gblcnt_mutex);
22 globalize_count();
23 if (globalcountmax - globalcount -
24 globalreserve < delta) {
25 flush_local_count();
26 if (globalcountmax - globalcount -
27 globalreserve < delta) {
28 spin_unlock(&gblcnt_mutex);
29 return 0;
30 }
31 }
32 globalcount += delta;
33 balance_count();
34 spin_unlock(&gblcnt_mutex);
35 return 1;
36 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Add Function}
\label{fig:count:Signal-Theft Limit Counter Add Function}
\end{figure}
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
38 int sub_count(unsigned long delta)
39 {
40 int fastpath = 0;
41
42 counting = 1;
43 barrier();
44 if (counter >= delta &&
45 ACCESS_ONCE(theft) <= THEFT_REQ) {
46 counter -= delta;
47 fastpath = 1;
48 }
49 barrier();
50 counting = 0;
51 barrier();
52 if (ACCESS_ONCE(theft) == THEFT_ACK) {
53 smp_mb();
54 ACCESS_ONCE(theft) = THEFT_READY;
55 }
56 if (fastpath)
57 return 1;
58 spin_lock(&gblcnt_mutex);
59 globalize_count();
60 if (globalcount < delta) {
61 flush_local_count();
62 if (globalcount < delta) {
63 spin_unlock(&gblcnt_mutex);
64 return 0;
65 }
66 }
67 globalcount -= delta;
68 balance_count();
69 spin_unlock(&gblcnt_mutex);
70 return 1;
71 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Subtract Function}
\label{fig:count:Signal-Theft Limit Counter Subtract Function}
\end{figure}
Figure~\ref{fig:count:Signal-Theft Limit Counter Add Function}
shows the \co{add_count()} function.
The fastpath spans lines~5-20, and the slowpath lines~21-35.
Line~5 sets the per-thread \co{counting} variable to 1 so that
any subsequent signal handlers interrupting this thread will
set the \co{theft} state to ACK rather than READY, allowing this
fastpath to complete properly.
Line~6 prevents the compiler from reordering any of the fastpath body
to precede the setting of \co{counting}.
Lines~7 and 8 check to see if the per-thread data can accommodate
the \co{add_count()} and that there is no theft in progress,
and, if so, line~9 does the fastpath addition and line~10 notes that
the fastpath was taken.
In either case, line~12 prevents the compiler from reordering the
fastpath body to follow line~13, which permits any subsequent signal
handlers to undertake theft.
Line~14 again disables compiler reordering, and then line~15
checks to see if the signal handler deferred the \co{theft}
state-change to READY, and, if so, line~16 executes a memory
barrier to ensure that any CPU that sees line~17 setting state to
READY also sees the effects of line~9.
If the fastpath addition at line~9 was executed, then line~20 returns
success.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 unsigned long read_count(void)
2 {
3 int t;
4 unsigned long sum;
5
6 spin_lock(&gblcnt_mutex);
7 sum = globalcount;
8 for_each_thread(t)
9 if (counterp[t] != NULL)
10 sum += *counterp[t];
11 spin_unlock(&gblcnt_mutex);
12 return sum;
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Read Function}
\label{fig:count:Signal-Theft Limit Counter Read Function}
\end{figure}
Otherwise, we fall through to the slowpath starting at line~21.
The structure of the slowpath is similar to those of earlier examples,
so its analysis is left as an exercise to the reader.
Similarly, the structure of \co{sub_count()} in
Figure~\ref{fig:count:Signal-Theft Limit Counter Subtract Function}
is the same
as that of \co{add_count()}, so the analysis of \co{sub_count()} is also
left as an exercise for the reader, as is the analysis of
\co{read_count()} in
Figure~\ref{fig:count:Signal-Theft Limit Counter Read Function}.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 void count_init(void)
2 {
3 struct sigaction sa;
4
5 sa.sa_handler = flush_local_count_sig;
6 sigemptyset(&sa.sa_mask);
7 sa.sa_flags = 0;
8 if (sigaction(SIGUSR1, &sa, NULL) != 0) {
9 perror("sigaction");
10 exit(-1);
11 }
12 }
13
14 void count_register_thread(void)
15 {
16 int idx = smp_thread_id();
17
18 spin_lock(&gblcnt_mutex);
19 counterp[idx] = &counter;
20 countermaxp[idx] = &countermax;
21 theftp[idx] = &theft;
22 spin_unlock(&gblcnt_mutex);
23 }
24
25 void count_unregister_thread(int nthreadsexpected)
26 {
27 int idx = smp_thread_id();
28
29 spin_lock(&gblcnt_mutex);
30 globalize_count();
31 counterp[idx] = NULL;
32 countermaxp[idx] = NULL;
33 theftp[idx] = NULL;
34 spin_unlock(&gblcnt_mutex);
35 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Signal-Theft Limit Counter Initialization Functions}
\label{fig:count:Signal-Theft Limit Counter Initialization Functions}
\end{figure}
Lines~1-12 of
Figure~\ref{fig:count:Signal-Theft Limit Counter Initialization Functions}
show \co{count_init()}, which sets up \co{flush_local_count_sig()}
as the signal handler for \co{SIGUSR1},
enabling the \co{pthread_kill()} calls in \co{flush_local_count()}
to invoke \co{flush_local_count_sig()}.
The code for thread registry and unregistry is similar to that of
earlier examples, so its analysis is left as an exercise for the
reader.
\subsection{Signal-Theft Limit Counter Discussion}
The signal-theft implementation runs more than twice as fast as the
atomic implementation on my Intel Core Duo laptop.
Is it always preferable?
The signal-theft implementation would be vastly preferable on Pentium-4
systems, given their slow atomic instructions, but the old 80386-based
Sequent Symmetry systems would do much better with the shorter path
length of the atomic implementation.
However, this increased update-side performance comes at the
price of higher read-side overhead: Those POSIX signals are not free.
If ultimate performance is of the essence, you will need to measure
both implementations on the system on which your application will be deployed.
\QuickQuiz{}
Not only are POSIX signals slow, sending one to each thread
simply does not scale.
What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
\QuickQuizAnswer{
One approach is to use the techniques shown in
Section~\ref{sec:count:Eventually Consistent Implementation},
summarizing an approximation to the overall counter value in
a single variable.
Another approach would be to use multiple threads to carry
out the reads, with each such thread interacting with a
specific subset of the updating threads.
} \QuickQuizEnd
This is but one reason why high-quality APIs are so important:
they permit implementations to be changed as required by ever-changing
hardware performance characteristics.
\QuickQuiz{}
What if you want an exact limit counter to be exact only for
its lower limit, but to allow the upper limit to be inexact?
\QuickQuizAnswer{
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
} \QuickQuizEnd
\section{Applying Specialized Parallel Counters}
\label{sec:count:Applying Specialized Parallel Counters}
Although the exact limit counter implementations in
Section~\ref{sec:count:Exact Limit Counters}
can be very useful, they are not much help if the counter's value
remains near zero at all times, as it might when counting the number
of outstanding accesses to an I/O device.
The high overhead of such near-zero counting is especially painful
given that we normally don't care how many references there are.
As noted in the removable I/O device access-count problem posed by
\QuickQuizRef{\QcountQIOcnt},
the number of accesses is irrelevant except in those rare cases when
someone is actually trying to remove the device.
One simple solution to this problem is to add a large ``bias''
(for example, one billion) to the
counter in order to ensure that the value is far enough from zero that
the counter can operate efficiently.
When someone wants to remove the device, this bias is subtracted from
the counter value.
Counting the last few accesses will be quite inefficient,
but the important point is that the many prior accesses will have been
counted at full speed.
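For example, here is a minimal sketch of this bias technique,
reusing this chapter's \co{add_count()} and \co{sub_count()}
primitives.
The \co{ACCESS_BIAS} constant and both function names are
hypothetical:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ACCESS_BIAS 1000000000UL /* hypothetical */

void bias_apply(void) /* call before any I/O */
{
  if (!add_count(ACCESS_BIAS)) {
    fprintf(stderr, "limit too small for bias\n");
    exit(-1);
  }
}

void bias_remove(void) /* call at device removal */
{
  sub_count(ACCESS_BIAS);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}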
\QuickQuiz{}
What else had you better have done when using a biased counter?
\QuickQuizAnswer{
You had better have set the upper limit to be large enough to
accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
} \QuickQuizEnd
Although a biased counter can be quite helpful, it is only a
partial solution to the removable I/O device access-count problem
called out on
page~\pageref{chp:Counting}.
When attempting to remove a device, we must not only know the precise
number of current I/O accesses, we also need to prevent any future
accesses from starting.
One way to accomplish this is to read-acquire a reader-writer lock
when updating the counter, and to write-acquire that same reader-writer
lock when checking the counter.
Code for doing I/O might be as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 read_lock(&mylock);
2 if (removing) {
3 read_unlock(&mylock);
4 cancel_io();
5 } else {
6 add_count(1);
7 read_unlock(&mylock);
8 do_io();
9 sub_count(1);
10 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~1 read-acquires the lock, and either line~3 or 7 releases it.
Line~2 checks to see if the device is being removed, and, if so,
line~3 releases the lock and line~4 cancels the I/O, or takes whatever
action is appropriate given that the device is to be removed.
Otherwise, line~6 increments the access count, line~7 releases the
lock, line~8 performs the I/O, and line~9 decrements the access count.
\QuickQuiz{}
This is ridiculous!
We are \emph{read}-acquiring a reader-writer lock to
\emph{update} the counter?
What are you playing at???
\QuickQuizAnswer{
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
} \QuickQuizEnd
The code to remove the device might be as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 write_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 write_unlock(&mylock);
5 while (read_count() != 0) {
6 poll(NULL, 0, 1);
7 }
8 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~1 write-acquires the lock and line~4 releases it.
Line~2 notes that the device is being removed, and the loop spanning
lines~5-7 waits for any I/O operations to complete.
Finally, line~8 does any additional processing needed to prepare for
device removal.
\QuickQuiz{}
What other issues would need to be accounted for in a real system?
\QuickQuizAnswer{
A huge number!
Here are a few to start with:
\begin{enumerate}
\item There could be any number of devices, so that the
global variables are inappropriate, as are the
lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems.
In many cases, it is far better to have the last
completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely
need a return value.
\item If the device fails, the last I/O might never complete.
In such cases, there might need to be some sort of
timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can
fail, but their return values are not checked.
\item Reader-writer locks do not scale well.
One way of avoiding the high read-acquisition costs
of reader-writer locks is presented in
Chapters~\ref{chp:Locking}
and~\ref{chp:Deferred Processing}.
\item The polling loops result in poor energy efficiency.
An event-driven design is preferable.
\end{enumerate}
} \QuickQuizEnd
\section{Parallel Counting Discussion}
\label{sec:count:Parallel Counting Discussion}
This chapter has presented the reliability, performance, and
scalability problems with traditional counting primitives.
The C-language \co{++} operator is not guaranteed to function reliably in
multithreaded code, and atomic operations on a single variable neither
perform nor scale well.
This chapter therefore presented a number of counting algorithms that
perform and scale extremely well in certain special cases.
It is well worth reviewing the lessons from these counting algorithms.
To that end,
Section~\ref{sec:count:Parallel Counting Performance}
summarizes performance and scalability,
Section~\ref{sec:count:Parallel Counting Specializations}
discusses the need for specialization,
and finally,
Section~\ref{sec:count:Parallel Counting Lessons}
enumerates lessons learned and calls attention to later chapters that
will expand on these lessons.
\subsection{Parallel Counting Performance}
\label{sec:count:Parallel Counting Performance}
\begin{table*}
\small
\centering
\begin{tabular}{l|r|r|r|r}
& & & \multicolumn{2}{c}{Reads} \\
\cline{4-5}
Algorithm & Section & Updates & 1 Core & 32 Cores \\
\hline
\hline
\path{count_stat.c} & \ref{sec:count:Array-Based Implementation} &
11.5 ns & 408 ns & 409 ns \\
\path{count_stat_eventual.c} & \ref{sec:count:Eventually Consistent Implementation} &
11.6 ns & 1 ns & 1 ns \\
\path{count_end.c} & \ref{sec:count:Per-Thread-Variable-Based Implementation} &
6.3 ns & 389 ns & 51,200 ns \\
\path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} &
5.7 ns & 354 ns & 501 ns \\
\end{tabular}
\caption{Statistical Counter Performance on Power-6}
\label{tab:count:Statistical Counter Performance on Power-6}
\end{table*}
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
shows the performance of the four parallel statistical counting
algorithms.
All four algorithms provide near-perfect linear scalability for updates.
The per-thread-variable implementation (\path{count_end.c})
is significantly faster on
updates than the array-based implementation
(\path{count_stat.c}), but is slower at reads on large numbers of cores,
and suffers severe lock contention when there are many parallel readers.
This contention can be addressed using the deferred-processing
techniques introduced in
Chapter~\ref{chp:Deferred Processing},
as shown on the \path{count_end_rcu.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}.
Deferred processing also shines on the \path{count_stat_eventual.c} row,
courtesy of eventual consistency.
\QuickQuiz{}
On the \path{count_stat.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
we see that the read-side scales linearly with the number of
threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
\QuickQuizAnswer{
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding \co{__thread} variable.
} \QuickQuizEnd
\QuickQuiz{}
Even on the last row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
\QuickQuizAnswer{
``Use the right tool for the job.''
As can be seen from
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, an eventually consistent design
featuring a single atomically incremented
variable that can be read out using a single load,
similar to the approach used in
Section~\ref{sec:count:Eventually Consistent Implementation}.
} \QuickQuizEnd
\begin{table*}
\small
\centering
\begin{tabular}{l|r|c|r|r|r}
& & & & \multicolumn{2}{c}{Reads} \\
\cline{5-6}
Algorithm & Section & Exact? & Updates & 1 Core & 64 Cores \\
\hline
\hline
\path{count_lim.c} & \ref{sec:count:Simple Limit Counter Implementation} &
N & 3.6 ns & 375 ns & 50,700 ns \\
\path{count_lim_app.c} & \ref{sec:count:Approximate Limit Counter Implementation} &
N & 11.7 ns & 369 ns & 51,000 ns \\
\path{count_lim_atomic.c} & \ref{sec:count:Atomic Limit Counter Implementation} &
Y & 51.4 ns & 427 ns & 49,400 ns \\
\path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} &
Y & 10.2 ns & 370 ns & 54,000 ns \\
\end{tabular}
\caption{Limit Counter Performance on Power-6}
\label{tab:count:Limit Counter Performance on Power-6}
\end{table*}
Table~\ref{tab:count:Limit Counter Performance on Power-6}
shows the performance of the parallel limit-counting algorithms.
Exact enforcement of the limits incurs a substantial performance
penalty, although on this 4.7GHz Power-6 system that penalty can be reduced
by substituting signals for atomic operations.
All of these implementations suffer from read-side lock contention
in the face of concurrent readers.
\QuickQuiz{}
Given the performance data shown in
Table~\ref{tab:count:Limit Counter Performance on Power-6},
we should always prefer signals over atomic operations, right?
\QuickQuizAnswer{
That depends on the workload.
Note that on a 64-core system, you need more than
one hundred non-atomic operations (with roughly
a 40-nanosecond performance gain) to make up for even one
signal (with almost a 5-\emph{microsecond} performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular
workload.
In addition, although memory barriers have historically been
expensive compared to ordinary instructions, you should
check this on the specific hardware you will be running.
The properties of computer hardware do change over time,
and algorithms must change accordingly.
} \QuickQuizEnd
\QuickQuiz{}
Can advanced techniques be applied to address the lock
contention for readers seen in
Table~\ref{tab:count:Limit Counter Performance on Power-6}?
\QuickQuizAnswer{
One approach is to give up some update-side performance, as is
done with scalable non-zero indicators
(SNZI)~\cite{FaithEllen:2007:SNZI}.
There are a number of other ways one might go about this, and these
are left as exercises for the reader.
Any number of approaches that apply hierarchy, replacing frequent
global-lock acquisitions with local lock acquisitions at lower levels
of the hierarchy, should work quite well.
} \QuickQuizEnd
In short, this chapter has demonstrated a number of counting algorithms
that perform and scale extremely well in a number of special cases.
But must our parallel counting be confined to special cases?
Wouldn't it be better to have a general algorithm that operated
efficiently in all cases?
The next section looks at these questions.
\subsection{Parallel Counting Specializations}
\label{sec:count:Parallel Counting Specializations}
The fact that these algorithms only work well in their respective special
cases might be considered a major problem with parallel programming in
general.
After all, the C-language \co{++} operator works just fine in single-threaded
code, and not just for special cases, but in general, right?
This line of reasoning does contain a grain of truth, but is in essence
misguided.
The problem is not parallelism as such, but rather scalability.
To understand this, first consider the C-language \co{++} operator.
The fact is that it does \emph{not} work in general, only for a restricted
range of numbers.
If you need to deal with 1,000-digit decimal numbers, the C-language \co{++}
operator will not work for you.
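To see this, note that even a modest fixed-width counter silently
wraps.
The following trivial fragment is hypothetical, and relies on the
well-defined wrapping behavior of C's unsigned types:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void overflow_demo(void)
{
  unsigned char c = 255; /* largest 8-bit value */

  c++; /* wraps to zero: ++ "works" only
        * within the type's range */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}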
\QuickQuiz{}
The \co{++} operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
\QuickQuizAnswer{
In the C++ language, you might well be able to use \co{++}
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
} \QuickQuizEnd
This problem is not specific to arithmetic.
Suppose you need to store and query data.
Should you use an ASCII file?
XML?
A relational database?
A linked list?
A dense array?
A B-tree?
A radix tree?
Or one of the plethora of other data
structures and environments that permit data to be stored and queried?
It depends on what you need to do, how fast you need it done, and how
large your data set is---even on sequential systems.
Similarly, if you need to count, your solution will depend on how large
a number you need to work with, how many CPUs need to be manipulating
a given number concurrently, how the number is to be used, and what
level of performance and scalability you will need.
Nor is this problem specific to software.
The design for a bridge meant to allow people to walk across a small brook
might be as simple as a single wooden plank.
But you would probably not use a plank to span the kilometers-wide mouth of
the Columbia River, nor would such a design be advisable for bridges
carrying concrete trucks.
In short, just as bridge design must change with increasing span and load,
so must software design change as the number of CPUs increases.
That said, it would be good to automate this process, so that the
software adapts to changes in hardware configuration and in workload.
There has in fact been some research into this sort of
automation~\cite{Appavoo03a,Soules03a}, and the Linux kernel does some
boot-time reconfiguration, including limited binary rewriting.
This sort of adaptation will become increasingly important as the
number of CPUs on mainstream systems continues to increase.
In short, as discussed in
Chapter~\ref{chp:Hardware and its Habits},
the laws of physics constrain parallel software just as surely as they
constrain mechanical artifacts such as bridges.
These constraints force specialization, though in the case of software
it might be possible to automate the choice of specialization to
fit the hardware and workload in question.
Of course, even generalized counting is quite specialized.
We need to do a great number of other things with computers.
The next section relates what we have learned from counters to
topics taken up later in this book.
\subsection{Parallel Counting Lessons}
\label{sec:count:Parallel Counting Lessons}
The opening paragraph of this chapter promised that our study of counting
would provide an excellent introduction to parallel programming.
This section makes explicit connections between the lessons from
this chapter and the material presented in a number of later chapters.
The examples in this chapter have shown that an important scalability
and performance tool is \emph{partitioning}.
The counters might be fully partitioned, as in the statistical counters
discussed in Section~\ref{sec:count:Statistical Counters},
or partially partitioned as in the limit counters discussed in
Sections~\ref{sec:count:Approximate Limit Counters} and
\ref{sec:count:Exact Limit Counters}.
Partitioning will be considered in far greater depth in
Chapter~\ref{cha:Partitioning and Synchronization Design},
and partial parallelization in particular in
Section~\ref{sec:SMPdesign:Parallel Fastpath}, where it is called
\emph{parallel fastpath}.
\QuickQuiz{}
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
\QuickQuizAnswer{
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism:
\begin{enumerate}
\item Only the most performance-critical portions of the
application must be partitioned, and such portions
are usually a small fraction of the application.
\item Although cache misses are quite slow compared to
individual register-to-register instructions,
they are typically considerably faster than
inter-process-communication primitives, which in
turn are considerably faster than things like
TCP/IP networking.
\item Shared-memory multiprocessors are readily available
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of
shared-memory parallelism.
\end{enumerate}
As always, use the right tool for the job!
} \QuickQuizEnd
The partially partitioned counting algorithms used locking to
guard the global data, and locking is the subject of
Chapter~\ref{chp:Locking}.
In contrast, the partitioned data tended to be fully under the control of
the corresponding thread, so that no synchronization whatsoever was required.
This \emph{data ownership} will be introduced in
Section~\ref{sec:SMPdesign:Data Ownership}
and discussed in more detail in
Chapter~\ref{chp:Data Ownership}.
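The following condensed sketch distills the update side of this
chapter's statistical counters down to the essence of data ownership;
the increment requires no synchronization because only the owning
thread ever writes its counter:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
__thread unsigned long counter; /* owned by
                                 * this thread */

static inline void inc_count(void)
{
  counter++; /* owner-only write, so no
              * synchronization is needed */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}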
Because integer addition and subtraction are extremely cheap operations
compared to typical synchronization operations, achieving reasonable
scalability requires that synchronization operations be used sparingly.
One way of achieving this is to batch the addition and subtraction
operations, so that a great many of these cheap operations are handled
by a single synchronization operation.
Batching optimizations of one sort or another are used by each of
the counting algorithms listed in
Tables~\ref{tab:count:Statistical Counter Performance on Power-6}
and~\ref{tab:count:Limit Counter Performance on Power-6}.
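For example, the following minimal sketch of batching (the
\co{BATCH_SIZE} constant and the \co{inc_count_batched()} function
are hypothetical) pays for one atomic operation per \co{BATCH_SIZE}
cheap thread-local increments:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define BATCH_SIZE 100 /* hypothetical */

atomic_t global_count;
__thread int batch;

void inc_count_batched(void)
{
  if (++batch >= BATCH_SIZE) {
    atomic_add(batch, &global_count);
    batch = 0; /* one expensive operation
                * amortized over the batch */
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Of course, as with this chapter's other batched algorithms, a thread
must flush any partial batch before exiting, for example, from
\co{count_unregister_thread()}.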
Finally, the eventually consistent statistical counter discussed in
Section~\ref{sec:count:Eventually Consistent Implementation}
showed how deferring activity (in that case, updating the global
counter) can provide substantial performance and scalability benefits.
This approach allows common-case code to use much cheaper synchronization
operations than would otherwise be possible.
Chapter~\ref{chp:Deferred Processing} will examine a number of additional
ways that deferral can improve performance, scalability, and even
real-time response.
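The following condensed sketch shows the key idea; it assumes this
book's per-thread-variable API and is not the verbatim implementation
from that section.
A housekeeping thread periodically folds the per-thread counters into
a single variable that readers can then fetch with a single load:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
unsigned long global_count; /* one-load read */
DEFINE_PER_THREAD(unsigned long, counter);

void *eventual(void *arg) /* housekeeping */
{
  int t;
  unsigned long sum;

  for (;;) {
    sum = 0;
    for_each_thread(t)
      sum += per_thread(counter, t);
    global_count = sum; /* racy, but eventually
                         * consistent by design */
    poll(NULL, 0, 1); /* defer: ~1ms refresh */
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}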
Summarizing the summary:
\begin{enumerate}
\item Partitioning promotes performance and scalability.
\item Partial partitioning, that is, partitioning applied only to
common code paths, works almost as well.
\item Partial partitioning can be applied to code (as in
Section~\ref{sec:count:Statistical Counters}'s statistical
counters' partitioned updates and non-partitioned reads), but also
across time (as in
Section~\ref{sec:count:Approximate Limit Counters}'s and
Section~\ref{sec:count:Exact Limit Counters}'s
limit counters running fast when far from
the limit, but slowly when close to the limit).
\item Partitioning across time often batches updates locally
in order to reduce the number of expensive global operations,
thereby decreasing synchronization overhead, in turn
improving performance and scalability.
All the algorithms shown in
Tables~\ref{tab:count:Statistical Counter Performance on Power-6}
and~\ref{tab:count:Limit Counter Performance on Power-6}
make heavy use of batching.
\item Read-only code paths should remain read-only: Spurious
synchronization writes to shared memory kill performance
and scalability, as seen in the \path{count_end.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}.
\item Judicious use of delay promotes performance and scalability, as
seen in Section~\ref{sec:count:Eventually Consistent Implementation}.
\item Parallel performance and scalability are usually a balancing act:
Beyond a certain point, optimizing some code paths will degrade
others.
The \path{count_stat.c} and \path{count_end_rcu.c} rows of
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
illustrate this point.
\item Different levels of performance and scalability will affect
algorithm and data-structure design, as do a large number of
other factors.
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
illustrates this point: Atomic increment might be completely
acceptable for a two-CPU system, but be completely inadequate for an
eight-CPU system.
\end{enumerate}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{count/FourTaskOrderOpt}}
\caption{Optimization and the Four Parallel-Programming Tasks}
\label{fig:count:Optimization and the Four Parallel-Programming Tasks}
\end{figure}
Summarizing still further, we have the ``big three'' methods of
increasing performance and scalability, namely
(1)~\emph{partitioning} over CPUs or threads,
(2)~\emph{batching} so that more work can be done by each expensive
synchronization operation, and
(3)~\emph{weakening} synchronization operations where feasible.
As a rough rule of thumb, you should apply these methods in this order,
as was noted earlier in the discussion of
Figure~\ref{fig:intro:Ordering of Parallel-Programming Tasks}
on
page~\pageref{fig:intro:Ordering of Parallel-Programming Tasks}.
The partitioning optimization applies to the
``Resource Partitioning and Replication'' bubble,
the batching optimization to the ``Work Partitioning'' bubble,
and the weakening optimization to the ``Parallel Access Control'' bubble,
as shown in
Figure~\ref{fig:count:Optimization and the Four Parallel-Programming Tasks}.
Of course, if you are using special-purpose hardware such as
digital signal processors (DSPs), field-programmable gate arrays (FPGAs),
or general-purpose graphical processing units (GPGPUs), you may need
to pay close attention to the ``Interacting With Hardware'' bubble
throughout the design process.
For example, the structure of a GPGPU's hardware threads and memory
connectivity might richly reward very careful partitioning
and batching design decisions.
In short, as noted at the beginning of this chapter, the simplicity
of counting has allowed us to explore many
fundamental concurrency issues without the distraction of
complex synchronization primitives or elaborate data structures.
Such synchronization primitives and data structures are covered
in later chapters.