% together/applyrcu.tex
\section{RCU Rescues}
\label{sec:together:RCU Rescues}
This section shows how to apply RCU to some examples discussed earlier
in this book.
In some cases, RCU provides simpler code, in other cases better
performance and scalability, and in still other cases, both.
\subsection{RCU and Per-Thread-Variable-Based Statistical Counters}
\label{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}
described an implementation of statistical counters that provided
excellent
performance, roughly that of simple increment (as in the C \co{++}
operator), and linear scalability---but only for incrementing
via \co{inc_count()}.
Unfortunately, threads needing to read out the value via \co{read_count()}
were required to acquire a global
lock, and thus incurred high overhead and suffered poor scalability.
The code for the lock-based implementation is shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
\QuickQuiz{}
Why on earth did we need that global lock in the first place?
\QuickQuizAnswer{
A given thread's \co{__thread} variables vanish when that
thread exits.
It is therefore necessary to synchronize any operation that
accesses other threads' \co{__thread} variables with
thread exit.
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
} \QuickQuizEnd
\subsubsection{Design}
The hope is to use RCU rather than \co{final_mutex} to protect the
thread traversal in \co{read_count()} in order to obtain excellent
performance and scalability from \co{read_count()}, rather than just
from \co{inc_count()}.
However, we do not want to give up any accuracy in the computed sum.
In particular, when a given thread exits, we absolutely cannot
lose the exiting thread's count, nor can we double-count it.
Such an error could result in inaccuracies equal to the full
precision of the result; in other words, such an error would
render the result completely useless.
And in fact, one of the purposes of \co{final_mutex} is to
ensure that threads do not come and go in the middle of \co{read_count()}
execution.
\QuickQuiz{}
Just what is the accuracy of \co{read_count()}, anyway?
\QuickQuizAnswer{
Refer to
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
Clearly, if there are no concurrent invocations of \co{inc_count()},
\co{read_count()} will return an exact result.
However, if there \emph{are} concurrent invocations of
\co{inc_count()}, then the sum is in fact changing as
\co{read_count()} performs its summation.
That said, because thread creation and exit are excluded by
\co{final_mutex}, the pointers in \co{counterp} remain constant.
Let's imagine a mythical machine that is able to take an
instantaneous snapshot of its memory.
Suppose that this machine takes such a snapshot at the
beginning of \co{read_count()}'s execution, and another
snapshot at the end of \co{read_count()}'s execution.
Then \co{read_count()} will access each thread's counter
at some time between these two snapshots, and will therefore
obtain a result that is bounded by those of the two snapshots,
inclusive.
The overall sum will therefore be bounded by the pair of sums that
would have been obtained from each of the two snapshots (again,
inclusive).
The expected error is therefore half of the difference between
the pair of sums that would have been obtained from each of the
two snapshots, that is to say, half of the execution time of
\co{read_count()} multiplied by the number of expected calls to
\co{inc_count()} per unit time.
Or, for those who prefer equations:
\begin{equation}
\epsilon = \frac{T_r R_i}{2}
\end{equation}
where $\epsilon$ is the expected error in \co{read_count()}'s
return value,
$T_r$ is the time that \co{read_count()} takes to execute,
and $R_i$ is the rate of \co{inc_count()} calls per unit time.
(And of course, $T_r$ and $R_i$ should use the same units of
time: microseconds and calls per microsecond, seconds and calls
per second, or whatever, as long as they are the same units.)
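For example, if \co{read_count()} takes 5~microseconds to
execute and \co{inc_count()} is invoked 20~times per microsecond,
the expected error is $5 \times 20 / 2 = 50$ counts.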
} \QuickQuizEnd
Therefore, if we are to dispense with \co{final_mutex}, we will need
to come up with some other method for ensuring consistency.
One approach is to place the total count for all previously exited
threads and the array of pointers to the per-thread counters into a single
structure.
Such a structure, once made available to \co{read_count()}, is
held constant, ensuring that \co{read_count()} sees consistent data.
\subsubsection{Implementation}
\begin{figure}[bp]
{ \scriptsize
\begin{verbbox}
 1 struct countarray {
 2   unsigned long total;
 3   unsigned long *counterp[NR_THREADS];
 4 };
 5
 6 long __thread counter = 0;
 7 struct countarray *countarrayp = NULL;
 8 DEFINE_SPINLOCK(final_mutex);
 9
10 void inc_count(void)
11 {
12   counter++;
13 }
14
15 long read_count(void)
16 {
17   struct countarray *cap;
18   unsigned long sum;
19   int t;
20
21   rcu_read_lock();
22   cap = rcu_dereference(countarrayp);
23   sum = cap->total;
24   for_each_thread(t)
25     if (cap->counterp[t] != NULL)
26       sum += *cap->counterp[t];
27   rcu_read_unlock();
28   return sum;
29 }
30
31 void count_init(void)
32 {
33   countarrayp = malloc(sizeof(*countarrayp));
34   if (countarrayp == NULL) {
35     fprintf(stderr, "Out of memory\n");
36     exit(-1);
37   }
38   memset(countarrayp, '\0', sizeof(*countarrayp));
39 }
40
41 void count_register_thread(void)
42 {
43   int idx = smp_thread_id();
44
45   spin_lock(&final_mutex);
46   countarrayp->counterp[idx] = &counter;
47   spin_unlock(&final_mutex);
48 }
49
50 void count_unregister_thread(int nthreadsexpected)
51 {
52   struct countarray *cap;
53   struct countarray *capold;
54   int idx = smp_thread_id();
55
56   cap = malloc(sizeof(*countarrayp));
57   if (cap == NULL) {
58     fprintf(stderr, "Out of memory\n");
59     exit(-1);
60   }
61   spin_lock(&final_mutex);
62   *cap = *countarrayp;
63   cap->total += counter;
64   cap->counterp[idx] = NULL;
65   capold = countarrayp;
66   rcu_assign_pointer(countarrayp, cap);
67   spin_unlock(&final_mutex);
68   synchronize_rcu();
69   free(capold);
70 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU and Per-Thread Statistical Counters}
\label{fig:together:RCU and Per-Thread Statistical Counters}
\end{figure}
Lines~1-4 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
show the \co{countarray} structure, which contains a
\co{->total} field for the count from previously exited threads,
and a \co{counterp[]} array of pointers to the per-thread
\co{counter} for each currently running thread.
This structure allows a given execution of \co{read_count()}
to see a total that is consistent with the indicated set of running
threads.
Lines~6-8 contain the definition of the per-thread \co{counter}
variable, the global pointer \co{countarrayp} referencing
the current \co{countarray} structure, and
the \co{final_mutex} spinlock.
Lines~10-13 show \co{inc_count()}, which is unchanged from
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Lines~15-29 show \co{read_count()}, which has changed significantly.
Lines~21 and~27 substitute \co{rcu_read_lock()} and
\co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}.
Line~22 uses \co{rcu_dereference()} to snapshot the
current \co{countarray} structure into local variable \co{cap}.
Proper use of RCU will guarantee that this \co{countarray} structure
will remain with us through at least the end of the current RCU
read-side critical section at line~27.
Line~23 initializes \co{sum} to \co{cap->total}, which is the
sum of the counts of threads that have previously exited.
Lines~24-26 add up the per-thread counters corresponding to currently
running threads, and, finally, line~28 returns the sum.
The initial value for \co{countarrayp} is
provided by \co{count_init()} on lines~31-39.
This function runs before the first thread is created, and its job
is to allocate
and zero the initial structure, and then assign it to \co{countarrayp}.
Lines~41-48 show the \co{count_register_thread()} function, which
is invoked by each newly created thread.
Line~43 picks up the current thread's index, line~45 acquires
\co{final_mutex}, line~46 installs a pointer to this thread's
\co{counter}, and line~47 releases \co{final_mutex}.
\QuickQuiz{}
Hey!!!
Line~46 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
modifies a value in a pre-existing \co{countarray} structure!
Didn't you say that this structure, once made available to
\co{read_count()}, remained constant???
\QuickQuizAnswer{
Indeed I did say that.
And it would be possible to make \co{count_register_thread()}
allocate a new structure, much as \co{count_unregister_thread()}
currently does.
But this is unnecessary.
Recall the derivation of the error bounds of \co{read_count()}
that was based on the snapshots of memory.
Because new threads start with initial \co{counter} values of
zero, the derivation holds even if we add a new thread partway
through \co{read_count()}'s execution.
So, interestingly enough, when adding a new thread, this
implementation gets the effect of allocating a new structure,
but without actually having to do the allocation.
} \QuickQuizEnd
Lines~50-70 show \co{count_unregister_thread()}, which is invoked
by each thread just before it exits.
Lines~56-60 allocate a new \co{countarray} structure,
line~61 acquires \co{final_mutex} and line~67 releases it.
Line~62 copies the contents of the current \co{countarray} into
the newly allocated version, line~63 adds the exiting thread's \co{counter}
to the new structure's \co{->total}, and line~64 \co{NULL}s the exiting thread's
\co{counterp[]} array element.
Line~65 then retains a pointer to the current (soon to be old)
\co{countarray} structure, and line~66 uses \co{rcu_assign_pointer()}
to install the new version of the \co{countarray} structure.
Line~68 waits for a grace period to elapse, so that any threads that
might be concurrently executing in \co{read_count()}, and thus might
have references to the old \co{countarray} structure, will be allowed
to exit their RCU read-side critical sections, thus dropping any such
references.
Line~69 can then safely free the old \co{countarray} structure.
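To illustrate how these functions fit together, here is a minimal
sketch of a thread's lifetime, using a hypothetical \co{my_thread()}
function and an assumed global \co{nthreads} count (note that the
code in the figure makes no use of the \co{nthreadsexpected}
argument):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void *my_thread(void *arg) /* hypothetical sketch */
{
  count_register_thread();
  /* ... call inc_count() as events occur ... */
  inc_count();
  count_unregister_thread(nthreads); /* just before exit */
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}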
\subsubsection{Discussion}
\QuickQuiz{}
Wow!
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
contains 69 lines of code, compared to only 42 in
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Is this extra complexity really worth it?
\QuickQuizAnswer{
This of course needs to be decided on a case-by-case basis.
If you need an implementation of \co{read_count()} that
scales linearly, then the lock-based implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
simply will not work for you.
On the other hand, if calls to \co{read_count()} are sufficiently
rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due
to the structure definition, memory allocation, and \co{NULL}
return checking.
Of course, a better question is ``Why doesn't the language
implement cross-thread access to \co{__thread} variables?''
After all, such an implementation would make both the locking
and the use of RCU unnecessary.
This would in turn enable an implementation that
was even simpler than the one shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}, but
with all the scalability and performance benefits of the
implementation shown in
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}!
} \QuickQuizEnd
Use of RCU enables exiting threads to wait until other threads are
guaranteed to be done using the exiting threads' \co{__thread} variables.
This allows the \co{read_count()} function to dispense with locking,
thereby providing
excellent performance and scalability for both the \co{inc_count()}
and \co{read_count()} functions.
However, this performance and scalability come at the cost of some increase
in code complexity.
It is hoped that compiler and library writers employ user-level
RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread
access to \co{__thread} variables, greatly reducing the
complexity seen by users of \co{__thread} variables.
\subsection{RCU and Counters for Removable I/O Devices}
\label{sec:together:RCU and Counters for Removable I/O Devices}
Section~\ref{sec:count:Applying Specialized Parallel Counters}
showed a fanciful pair of code fragments for dealing with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
This section shows how RCU may be used to avoid this overhead.
The code for performing an I/O is quite similar to the original, with
an RCU read-side critical section substituted for the reader-writer
lock read-side critical section:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
rcu_read_lock();
if (removing) {
  rcu_read_unlock();
  cancel_io();
} else {
  add_count(1);
  rcu_read_unlock();
  do_io();
  sub_count(1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
The RCU read-side primitives have minimal overhead, thus speeding up
the fastpath, as desired.
The updated code fragment removing a device is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 spin_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 spin_unlock(&mylock);
5 synchronize_rcu();
6 while (read_count() != 0) {
7   poll(NULL, 0, 1);
8 }
9 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here we replace the reader-writer lock with an exclusive spinlock and
add a \co{synchronize_rcu()} to wait for all pre-existing RCU read-side
critical sections to complete.
Because of the \co{synchronize_rcu()},
once we reach line~6, we know that all remaining I/Os have been accounted
for.
Of course, the overhead of \co{synchronize_rcu()} can be large,
but given that device removal is quite rare, this is usually a good
tradeoff.
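
These fragments rely on variables defined in
Section~\ref{sec:count:Applying Specialized Parallel Counters}.
The following declarations are a hypothetical sketch of one
plausible setup, with the bias presumably added to the count when
the device was installed:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Hypothetical declarations assumed above. */
DEFINE_SPINLOCK(mylock); /* serializes removal */
int removing;            /* set once removal begins */
long mybias = 0x40000000;/* bias held while installed */
\end{verbatim}
\end{minipage}
\vspace{5pt}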
\subsection{Array and Length}
\label{sec:together:Array and Length}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo {
  int length;
  char *a;
};
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU-Protected Variable-Length Array}
\label{fig:together:RCU-Protected Variable-Length Array}
\end{figure}
Suppose we have an RCU-protected variable-length array, as shown in
Figure~\ref{fig:together:RCU-Protected Variable-Length Array}.
The length of the array \co{->a[]} can change dynamically, and at any
given time, its length is given by the field \co{->length}.
Of course, this introduces the following race condition:
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new 8-character block of memory into \co{->a[]}.
\item CPU~0 picks up the new pointer from \co{->a[]}, and stores a
new value into element 12.
Because the array has only 8 characters, this results in
a SEGV or (worse yet) memory corruption.
\end{enumerate}
How can we prevent this?
One approach is to make careful use of memory barriers, which are
covered in Section~\ref{sec:advsync:Memory Barriers}.
This works, but incurs read-side overhead and, perhaps worse, requires
use of explicit memory barriers.
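For example, here is a minimal sketch of the shrink case, assuming
a hypothetical global pointer \co{gp} referencing the \co{struct foo}
shown in the figure.
Growing the array would require the opposite orderings, and the old
block still could not be freed until all readers were done with it,
which is one reason this approach is so error-prone:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Updater shrinking ->a[] from 16 to 8 characters. */
gp->length = 8;      /* publish smaller length first */
smp_wmb();           /* order length before pointer */
gp->a = new_block_8; /* then publish the new block */

/* Reader: load the pointer before the length, so
 * that seeing the new block implies seeing the
 * new, smaller length. */
p = gp->a;
smp_rmb();           /* pairs with updater's smp_wmb() */
len = gp->length;    /* len never exceeds p's extent */
\end{verbatim}
\end{minipage}
\vspace{5pt}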
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo_a {
  int length;
  char a[0];
};

struct foo {
  struct foo_a *fa;
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Improved RCU-Protected Variable-Length Array}
\label{fig:together:Improved RCU-Protected Variable-Length Array}
\end{figure}
A better approach is to put the length and the array into the same structure,
as shown in
Figure~\ref{fig:together:Improved RCU-Protected Variable-Length Array}.
Allocating a new array (\co{foo_a} structure) then automatically provides
a new place for the array length.
This means that if any CPU picks up a reference to \co{->fa}, it is
guaranteed that the \co{->length} will match the \co{->a[]}
length~\cite{Arcangeli03}.
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->fa}, obtaining a pointer to
the structure containing the value 16 and the 16-byte array.
\item CPU~0 loads the value of \co{->fa->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new \co{foo_a} structure containing an 8-character block
of memory into \co{->fa}.
\item CPU~0 stores a new value into element 12 of the array,
using the \co{->fa} pointer that it loaded in the second step.
But because CPU~0 is still referencing the old \co{foo_a}
structure, which contains the full 16-byte array, all is well.
\end{enumerate}
Of course, in both cases, CPU~1 must wait for a grace period before
freeing the old array.
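For concreteness, here is a minimal sketch of an RCU-based resize
operation, again assuming a hypothetical global pointer \co{gp}
referencing a \co{struct foo}, with updates serialized by the caller:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
int resize_foo(int new_length)
{
  struct foo_a *old_fa = gp->fa;
  struct foo_a *new_fa;

  new_fa = malloc(sizeof(*new_fa) + new_length);
  if (new_fa == NULL)
    return -1;
  new_fa->length = new_length;
  /* ... copy data from old_fa->a[] as needed ... */
  rcu_assign_pointer(gp->fa, new_fa);
  synchronize_rcu(); /* wait for readers of old_fa */
  free(old_fa);
  return 0;
}

/* Reader: ->length always matches ->a[]. */
rcu_read_lock();
fa = rcu_dereference(gp->fa);
if (i < fa->length)
  c = fa->a[i];
rcu_read_unlock();
\end{verbatim}
\end{minipage}
\vspace{5pt}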
A more general version of this approach is presented in the next section.
\subsection{Correlated Fields}
\label{sec:together:Correlated Fields}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct animal {
  char name[40];
  double age;
  double meas_1;
  double meas_2;
  double meas_3;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Uncorrelated Measurement Fields}
\label{fig:together:Uncorrelated Measurement Fields}
\end{figure}
Suppose that each of Schr\"odinger's animals is represented by the
data element shown in
Figure~\ref{fig:together:Uncorrelated Measurement Fields}.
The \co{meas_1}, \co{meas_2}, and \co{meas_3} fields are a set
of correlated measurements that are updated periodically.
It is critically important that readers see these three values from
a single measurement update: If a reader sees an old value of
\co{meas_1} but new values of \co{meas_2} and \co{meas_3}, that
reader will become fatally confused.
How can we guarantee that readers will see coordinated sets of these
three values?
One approach would be to allocate a new \co{animal} structure,
copy the old structure into the new structure, update the new
structure's \co{meas_1}, \co{meas_2}, and \co{meas_3} fields,
and then replace the old structure with a new one by updating
the pointer.
This does guarantee that all readers see coordinated sets of
measurement values, but it requires copying a large structure due
to the \co{->photo[]} field.
This copying might incur unacceptably large overhead.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct measurement {
  double meas_1;
  double meas_2;
  double meas_3;
};

struct animal {
  char name[40];
  double age;
  struct measurement *mp;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Correlated Measurement Fields}
\label{fig:together:Correlated Measurement Fields}
\end{figure}
Another approach is to insert a level of indirection, as shown in
Figure~\ref{fig:together:Correlated Measurement Fields}.
When a new measurement is taken, a new \co{measurement} structure
is allocated, filled in with the measurements, and the \co{animal}
structure's \co{->mp} field is updated to point to this new
\co{measurement} structure using \co{rcu_assign_pointer()}.
After a grace period elapses, the old \co{measurement} structure
can be freed.
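A minimal sketch of such an update, assuming that the caller holds
a hypothetical update-side lock, might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
int animal_update_meas(struct animal *ap, double m1,
                       double m2, double m3)
{
  struct measurement *new_mp;
  struct measurement *old_mp;

  new_mp = malloc(sizeof(*new_mp));
  if (new_mp == NULL)
    return -1;
  new_mp->meas_1 = m1;
  new_mp->meas_2 = m2;
  new_mp->meas_3 = m3;
  old_mp = ap->mp;
  rcu_assign_pointer(ap->mp, new_mp);
  synchronize_rcu(); /* wait for readers of old_mp */
  free(old_mp);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Readers then use \co{rcu_dereference()} to fetch \co{->mp} within
an RCU read-side critical section, and are thus guaranteed to see
all three measurements from a single update.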
\QuickQuiz{}
But can't the approach shown in
Figure~\ref{fig:together:Correlated Measurement Fields}
result in extra cache misses, in turn resulting in additional
read-side overhead?
\QuickQuizAnswer{
Indeed it can.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct measurement {
  double meas_1;
  double meas_2;
  double meas_3;
};

struct animal {
  char name[40];
  double age;
  struct measurement *mp;
  struct measurement meas;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Localized Correlated Measurement Fields}
\label{fig:together:Localized Correlated Measurement Fields}
\end{figure}
One way to avoid this cache-miss overhead is shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}:
Simply embed an instance of a \co{measurement} structure
named \co{meas}
into the \co{animal} structure, and point the \co{->mp}
field at this \co{->meas} field.
Measurement updates can then be carried out as follows:
\begin{enumerate}
\item Allocate a new \co{measurement} structure and place
the new measurements into it.
\item Use \co{rcu_assign_pointer()} to point \co{->mp} to
this new structure.
\item Wait for a grace period to elapse, for example using
either \co{synchronize_rcu()} or \co{call_rcu()}.
\item Copy the measurements from the new \co{measurement}
structure into the embedded \co{->meas} field.
\item Use \co{rcu_assign_pointer()} to point \co{->mp}
back to the old embedded \co{->meas} field.
\item After another grace period elapses, free up the
new \co{measurement} structure.
\end{enumerate}
This approach uses a heavier-weight update procedure to eliminate
the extra cache miss in the common case.
The extra cache miss will be incurred only while an update is
actually in progress.
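\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
int animal_update_meas(struct animal *ap,
                       struct measurement m)
{
  struct measurement *new_mp;

  new_mp = malloc(sizeof(*new_mp)); /* step 1 */
  if (new_mp == NULL)
    return -1;
  *new_mp = m;
  rcu_assign_pointer(ap->mp, new_mp);    /* step 2 */
  synchronize_rcu();                     /* step 3 */
  ap->meas = *new_mp;                    /* step 4 */
  rcu_assign_pointer(ap->mp, &ap->meas); /* step 5 */
  synchronize_rcu();                     /* step 6: */
  free(new_mp);                          /* now safe */
  return 0;
}
\end{verbbox}
}
\centering
\theverbbox
\caption{Sketch of Localized Correlated Measurement Update}
\label{fig:together:Sketch of Localized Correlated Measurement Update}
\end{figure}
Figure~\ref{fig:together:Sketch of Localized Correlated Measurement Update}
shows a minimal sketch of this update procedure, assuming the
structures shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}
and a hypothetical update-side lock held by the caller, with the
comments keyed to the steps in the list above.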
} \QuickQuizEnd
This approach enables readers to see correlated values for selected
fields with minimal read-side overhead.
% Birthstone/tombstone for moving records when readers cannot be permitted
% to see extraneous records.
% Flag for deletion (if not already covered in the defer chapter).