% together/applyrcu.tex
\section{RCU Rescues}
\label{sec:together:RCU Rescues}
This section shows how to apply RCU to some examples discussed earlier
in this book.
In some cases, RCU provides simpler code, in other cases better
performance and scalability, and in still other cases, both.
\subsection{RCU and Per-Thread-Variable-Based Statistical Counters}
\label{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}
described an implementation of statistical counters that provided
excellent
performance, roughly that of simple increment (as in the C \co{++}
operator), and linear scalability---but only for incrementing
via \co{inc_count()}.
Unfortunately, threads needing to read out the value via \co{read_count()}
were required to acquire a global
lock, and thus incurred high overhead and suffered poor scalability.
The code for the lock-based implementation is shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
\QuickQuiz{}
Why on earth did we need that global lock in the first place?
\QuickQuizAnswer{
A given thread's \co{__thread} variables vanish when that
thread exits.
It is therefore necessary to synchronize any operation that
accesses other threads' \co{__thread} variables with
thread exit.
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
} \QuickQuizEnd
\subsubsection{Design}
The hope is to use RCU rather than \co{final_mutex} to protect the
thread traversal in \co{read_count()} in order to obtain excellent
performance and scalability from \co{read_count()}, rather than just
from \co{inc_count()}.
However, we do not want to give up any accuracy in the computed sum.
In particular, when a given thread exits, we absolutely cannot
lose the exiting thread's count, nor can we double-count it.
Such an error could result in inaccuracies equal to the full
precision of the result; in other words, such an error would
render the result completely useless.
And in fact, one of the purposes of \co{final_mutex} is to
ensure that threads do not come and go in the middle of \co{read_count()}
execution.
\QuickQuiz{}
Just what is the accuracy of \co{read_count()}, anyway?
\QuickQuizAnswer{
Refer to
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
Clearly, if there are no concurrent invocations of \co{inc_count()},
\co{read_count()} will return an exact result.
However, if there \emph{are} concurrent invocations of
\co{inc_count()}, then the sum is in fact changing as
\co{read_count()} performs its summation.
That said, because thread creation and exit are excluded by
\co{final_mutex}, the pointers in \co{counterp} remain constant.
Let's imagine a mythical machine that is able to take an
instantaneous snapshot of its memory.
Suppose that this machine takes such a snapshot at the
beginning of \co{read_count()}'s execution, and another
snapshot at the end of \co{read_count()}'s execution.
Then \co{read_count()} will access each thread's counter
at some time between these two snapshots, and will therefore
obtain a result that is bounded by those of the two snapshots,
inclusive.
The overall sum will therefore be bounded by the pair of sums that
would have been obtained from each of the two snapshots (again,
inclusive).
The expected error is therefore half of the difference between
the pair of sums that would have been obtained from each of the
two snapshots, that is to say, half of the execution time of
\co{read_count()} multiplied by the number of expected calls to
\co{inc_count()} per unit time.
Or, for those who prefer equations:
\begin{equation}
\epsilon = \frac{T_r R_i}{2}
\end{equation}
where $\epsilon$ is the expected error in \co{read_count()}'s
return value,
$T_r$ is the time that \co{read_count()} takes to execute,
and $R_i$ is the rate of \co{inc_count()} calls per unit time.
(And of course, $T_r$ and $R_i$ should use the same units of
time: microseconds and calls per microsecond, seconds and calls
per second, or whatever, as long as they are the same units.)
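For example, if \co{read_count()} takes 5~microseconds to
execute and \co{inc_count()} is invoked 20~times per microsecond,
the expected error is $5 \times 20 / 2 = 50$ counts.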
} \QuickQuizEnd
Therefore, if we are to dispense with \co{final_mutex}, we will need
to come up with some other method for ensuring consistency.
One approach is to place the total count for all previously exited
threads and the array of pointers to the per-thread counters into a single
structure.
Such a structure, once made available to \co{read_count()}, is
held constant, ensuring that \co{read_count()} sees consistent data.
\subsubsection{Implementation}
\begin{figure}[bp]
{ \scriptsize
\begin{verbbox}
 1 struct countarray {
 2   unsigned long total;
 3   unsigned long *counterp[NR_THREADS];
 4 };
 5
 6 long __thread counter = 0;
 7 struct countarray *countarrayp = NULL;
 8 DEFINE_SPINLOCK(final_mutex);
 9
10 void inc_count(void)
11 {
12   counter++;
13 }
14
15 long read_count(void)
16 {
17   struct countarray *cap;
18   unsigned long sum;
19   int t;
20
21   rcu_read_lock();
22   cap = rcu_dereference(countarrayp);
23   sum = cap->total;
24   for_each_thread(t)
25     if (cap->counterp[t] != NULL)
26       sum += *cap->counterp[t];
27   rcu_read_unlock();
28   return sum;
29 }
30
31 void count_init(void)
32 {
33   countarrayp = malloc(sizeof(*countarrayp));
34   if (countarrayp == NULL) {
35     fprintf(stderr, "Out of memory\n");
36     exit(-1);
37   }
38   memset(countarrayp, '\0', sizeof(*countarrayp));
39 }
40
41 void count_register_thread(void)
42 {
43   int idx = smp_thread_id();
44
45   spin_lock(&final_mutex);
46   countarrayp->counterp[idx] = &counter;
47   spin_unlock(&final_mutex);
48 }
49
50 void count_unregister_thread(int nthreadsexpected)
51 {
52   struct countarray *cap;
53   struct countarray *capold;
54   int idx = smp_thread_id();
55
56   cap = malloc(sizeof(*countarrayp));
57   if (cap == NULL) {
58     fprintf(stderr, "Out of memory\n");
59     exit(-1);
60   }
61   spin_lock(&final_mutex);
62   *cap = *countarrayp;
63   cap->total += counter;
64   cap->counterp[idx] = NULL;
65   capold = countarrayp;
66   rcu_assign_pointer(countarrayp, cap);
67   spin_unlock(&final_mutex);
68   synchronize_rcu();
69   free(capold);
70 }
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU and Per-Thread Statistical Counters}
\label{fig:together:RCU and Per-Thread Statistical Counters}
\end{figure}
Lines~1-4 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
show the \co{countarray} structure, which contains a
\co{->total} field for the count from previously exited threads,
and a \co{counterp[]} array of pointers to the per-thread
\co{counter} for each currently running thread.
This structure allows a given execution of \co{read_count()}
to see a total that is consistent with the indicated set of running
threads.
Lines~6-8 contain the definition of the per-thread \co{counter}
variable, the global pointer \co{countarrayp} referencing
the current \co{countarray} structure, and
the \co{final_mutex} spinlock.
Lines~10-13 show \co{inc_count()}, which is unchanged from
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Lines~15-29 show \co{read_count()}, which has changed significantly.
Lines~21 and~27 substitute \co{rcu_read_lock()} and
\co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}.
Line~22 uses \co{rcu_dereference()} to snapshot the
current \co{countarray} structure into local variable \co{cap}.
Proper use of RCU will guarantee that this \co{countarray} structure
will remain with us through at least the end of the current RCU
read-side critical section at line~27.
Line~23 initializes \co{sum} to \co{cap->total}, which is the
sum of the counts of threads that have previously exited.
Lines~24-26 add up the per-thread counters corresponding to currently
running threads, and, finally, line~28 returns the sum.
The initial value for \co{countarrayp} is
provided by \co{count_init()} on lines~31-39.
This function runs before the first thread is created, and its job
is to allocate
and zero the initial structure, and then assign it to \co{countarrayp}.
Lines~41-48 show the \co{count_register_thread()} function, which
is invoked by each newly created thread.
Line~43 picks up the current thread's index, line~45 acquires
\co{final_mutex}, line~46 installs a pointer to this thread's
\co{counter}, and line~47 releases \co{final_mutex}.
\QuickQuiz{}
Hey!!!
Line~46 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
modifies a value in a pre-existing \co{countarray} structure!
Didn't you say that this structure, once made available to
\co{read_count()}, remained constant???
\QuickQuizAnswer{
Indeed I did say that.
And it would be possible to make \co{count_register_thread()}
allocate a new structure, much as \co{count_unregister_thread()}
currently does.
But this is unnecessary.
Recall the derivation of the error bounds of \co{read_count()}
that was based on the snapshots of memory.
Because new threads start with initial \co{counter} values of
zero, the derivation holds even if we add a new thread partway
through \co{read_count()}'s execution.
So, interestingly enough, when adding a new thread, this
implementation gets the effect of allocating a new structure,
but without actually having to do the allocation.
} \QuickQuizEnd
Lines~50-70 show \co{count_unregister_thread()}, which is invoked
by each thread just before it exits.
Lines~56-60 allocate a new \co{countarray} structure,
line~61 acquires \co{final_mutex} and line~67 releases it.
Line~62 copies the contents of the current \co{countarray} into
the newly allocated version, line~63 adds the exiting thread's \co{counter}
to the new structure's \co{->total}, and line~64 \co{NULL}s the exiting thread's
\co{counterp[]} array element.
Line~65 then retains a pointer to the current (soon to be old)
\co{countarray} structure, and line~66 uses \co{rcu_assign_pointer()}
to install the new version of the \co{countarray} structure.
Line~68 waits for a grace period to elapse, so that any threads that
might be concurrently executing in \co{read_count()}, and thus might
have references to the old \co{countarray} structure, will be allowed
to exit their RCU read-side critical sections, thus dropping any such
references.
Line~69 can then safely free the old \co{countarray} structure.
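To illustrate how these functions fit together, here is a minimal
sketch of a thread's lifetime, using a hypothetical \co{my_thread()}
function and an assumed global \co{nthreads} count (note that the
code in the figure makes no use of the \co{nthreadsexpected}
argument):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void *my_thread(void *arg) /* hypothetical sketch */
{
  count_register_thread();
  /* ... call inc_count() as events occur ... */
  inc_count();
  count_unregister_thread(nthreads); /* just before exit */
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}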
\subsubsection{Discussion}
\QuickQuiz{}
Wow!
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
contains 69 lines of code, compared to only 42 in
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Is this extra complexity really worth it?
\QuickQuizAnswer{
This of course needs to be decided on a case-by-case basis.
If you need an implementation of \co{read_count()} that
scales linearly, then the lock-based implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
simply will not work for you.
On the other hand, if calls to \co{read_count()} are sufficiently
rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due
to the structure definition, memory allocation, and \co{NULL}
return checking.
Of course, a better question is ``Why doesn't the language
implement cross-thread access to \co{__thread} variables?''
After all, such an implementation would make both the locking
and the use of RCU unnecessary.
This would in turn enable an implementation that
was even simpler than the one shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}, but
with all the scalability and performance benefits of the
implementation shown in
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}!
} \QuickQuizEnd
Use of RCU enables exiting threads to wait until other threads are
guaranteed to be done using the exiting threads' \co{__thread} variables.
This allows the \co{read_count()} function to dispense with locking,
thereby providing
excellent performance and scalability for both the \co{inc_count()}
and \co{read_count()} functions.
However, this performance and scalability come at the cost of some increase
in code complexity.
It is hoped that compiler and library writers employ user-level
RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread
access to \co{__thread} variables, greatly reducing the
complexity seen by users of \co{__thread} variables.
\subsection{RCU and Counters for Removable I/O Devices}
\label{sec:together:RCU and Counters for Removable I/O Devices}
Section~\ref{sec:count:Applying Specialized Parallel Counters}
showed a fanciful pair of code fragments for dealing with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
This section shows how RCU may be used to avoid this overhead.
The code for performing an I/O is quite similar to the original, with
an RCU read-side critical section substituted for the reader-writer
lock read-side critical section:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
rcu_read_lock();
if (removing) {
  rcu_read_unlock();
  cancel_io();
} else {
  add_count(1);
  rcu_read_unlock();
  do_io();
  sub_count(1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
The RCU read-side primitives have minimal overhead, thus speeding up
the fastpath, as desired.
The updated code fragment removing a device is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 spin_lock(&mylock);
2 removing = 1;
3 sub_count(mybias);
4 spin_unlock(&mylock);
5 synchronize_rcu();
6 while (read_count() != 0) {
7   poll(NULL, 0, 1);
8 }
9 remove_device();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here we replace the reader-writer lock with an exclusive spinlock and
add a \co{synchronize_rcu()} to wait for all pre-existing RCU read-side
critical sections to complete.
Because of the \co{synchronize_rcu()},
once we reach line~6, we know that all remaining I/Os have been accounted
for.
Of course, the overhead of \co{synchronize_rcu()} can be large,
but given that device removal is quite rare, this is usually a good
tradeoff.
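
These fragments rely on variables defined in
Section~\ref{sec:count:Applying Specialized Parallel Counters}.
The following declarations are a hypothetical sketch of one
plausible setup, with the bias presumably added to the count when
the device was installed:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Hypothetical declarations assumed above. */
DEFINE_SPINLOCK(mylock); /* serializes removal */
int removing;            /* set once removal begins */
long mybias = 0x40000000;/* bias held while installed */
\end{verbatim}
\end{minipage}
\vspace{5pt}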
\subsection{Array and Length}
\label{sec:together:Array and Length}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo {
  int length;
  char *a;
};
\end{verbbox}
}
\centering
\theverbbox
\caption{RCU-Protected Variable-Length Array}
\label{fig:together:RCU-Protected Variable-Length Array}
\end{figure}
Suppose we have an RCU-protected variable-length array, as shown in
Figure~\ref{fig:together:RCU-Protected Variable-Length Array}.
The length of the array \co{->a[]} can change dynamically, and at any
given time, its length is given by the field \co{->length}.
Of course, this introduces the following race condition:
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new 8-character block of memory into \co{->a[]}.
\item CPU~0 picks up the new pointer from \co{->a[]}, and stores a
new value into element 12.
Because the array has only 8 characters, this results in
a SEGV or (worse yet) memory corruption.
\end{enumerate}
How can we prevent this?
One approach is to make careful use of memory barriers, which are
covered in Section~\ref{sec:advsync:Memory Barriers}.
This works, but incurs read-side overhead and, perhaps worse, requires
use of explicit memory barriers.
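For example, here is a minimal sketch of the shrink case, assuming
a hypothetical global pointer \co{gp} referencing the \co{struct foo}
shown in the figure.
Growing the array would require the opposite orderings, and the old
block still could not be freed until all readers were done with it,
which is one reason this approach is so error-prone:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Updater shrinking ->a[] from 16 to 8 characters. */
gp->length = 8;      /* publish smaller length first */
smp_wmb();           /* order length before pointer */
gp->a = new_block_8; /* then publish the new block */

/* Reader: load the pointer before the length, so
 * that seeing the new block implies seeing the
 * new, smaller length. */
p = gp->a;
smp_rmb();           /* pairs with updater's smp_wmb() */
len = gp->length;    /* len never exceeds p's extent */
\end{verbatim}
\end{minipage}
\vspace{5pt}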
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct foo_a {
  int length;
  char a[0];
};

struct foo {
  struct foo_a *fa;
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Improved RCU-Protected Variable-Length Array}
\label{fig:together:Improved RCU-Protected Variable-Length Array}
\end{figure}
A better approach is to put the length and the array into the same structure,
as shown in
Figure~\ref{fig:together:Improved RCU-Protected Variable-Length Array}.
Allocating a new array (\co{foo_a} structure) then automatically provides
a new place for the array length.
This means that if any CPU picks up a reference to \co{->fa}, it is
guaranteed that the \co{->length} will match the \co{->a[]}
length~\cite{Arcangeli03}.
\begin{enumerate}
\item The array is initially 16 characters long, and thus \co{->length}
is equal to 16.
\item CPU~0 loads the value of \co{->fa}, obtaining a pointer to
the structure containing the value 16 and the 16-byte array.
\item CPU~0 loads the value of \co{->fa->length}, obtaining the value 16.
\item CPU~1 shrinks the array to be of length 8, and assigns a pointer
to a new \co{foo_a} structure containing an 8-character block
of memory into \co{->fa}.
\item CPU~0 stores a new value into element 12 of the array,
using the \co{->fa} pointer that it loaded in the second step.
But because CPU~0 is still referencing the old \co{foo_a}
structure, which contains the full 16-byte array, all is well.
\end{enumerate}
Of course, in both cases, CPU~1 must wait for a grace period before
freeing the old array.
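For concreteness, here is a minimal sketch of an RCU-based resize
operation, again assuming a hypothetical global pointer \co{gp}
referencing a \co{struct foo}, with updates serialized by the caller:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
int resize_foo(int new_length)
{
  struct foo_a *old_fa = gp->fa;
  struct foo_a *new_fa;

  new_fa = malloc(sizeof(*new_fa) + new_length);
  if (new_fa == NULL)
    return -1;
  new_fa->length = new_length;
  /* ... copy data from old_fa->a[] as needed ... */
  rcu_assign_pointer(gp->fa, new_fa);
  synchronize_rcu(); /* wait for readers of old_fa */
  free(old_fa);
  return 0;
}

/* Reader: ->length always matches ->a[]. */
rcu_read_lock();
fa = rcu_dereference(gp->fa);
if (i < fa->length)
  c = fa->a[i];
rcu_read_unlock();
\end{verbatim}
\end{minipage}
\vspace{5pt}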
A more general version of this approach is presented in the next section.
\subsection{Correlated Fields}
\label{sec:together:Correlated Fields}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct animal {
  char name[40];
  double age;
  double meas_1;
  double meas_2;
  double meas_3;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Uncorrelated Measurement Fields}
\label{fig:together:Uncorrelated Measurement Fields}
\end{figure}
Suppose that each of Schr\"odinger's animals is represented by the
data element shown in
Figure~\ref{fig:together:Uncorrelated Measurement Fields}.
The \co{meas_1}, \co{meas_2}, and \co{meas_3} fields are a set
of correlated measurements that are updated periodically.
It is critically important that readers see these three values from
a single measurement update: If a reader sees an old value of
\co{meas_1} but new values of \co{meas_2} and \co{meas_3}, that
reader will become fatally confused.
How can we guarantee that readers will see coordinated sets of these
three values?
One approach would be to allocate a new \co{animal} structure,
copy the old structure into the new structure, update the new
structure's \co{meas_1}, \co{meas_2}, and \co{meas_3} fields,
and then replace the old structure with a new one by updating
the pointer.
This does guarantee that all readers see coordinated sets of
measurement values, but it requires copying a large structure due
to the \co{->photo[]} field.
This copying might incur unacceptably large overhead.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct measurement {
  double meas_1;
  double meas_2;
  double meas_3;
};

struct animal {
  char name[40];
  double age;
  struct measurement *mp;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Correlated Measurement Fields}
\label{fig:together:Correlated Measurement Fields}
\end{figure}
Another approach is to insert a level of indirection, as shown in
Figure~\ref{fig:together:Correlated Measurement Fields}.
When a new measurement is taken, a new \co{measurement} structure
is allocated, filled in with the measurements, and the \co{animal}
structure's \co{->mp} field is updated to point to this new
\co{measurement} structure using \co{rcu_assign_pointer()}.
After a grace period elapses, the old \co{measurement} structure
can be freed.
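A minimal sketch of such an update, assuming that the caller holds
a hypothetical update-side lock, might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
int animal_update_meas(struct animal *ap, double m1,
                       double m2, double m3)
{
  struct measurement *new_mp;
  struct measurement *old_mp;

  new_mp = malloc(sizeof(*new_mp));
  if (new_mp == NULL)
    return -1;
  new_mp->meas_1 = m1;
  new_mp->meas_2 = m2;
  new_mp->meas_3 = m3;
  old_mp = ap->mp;
  rcu_assign_pointer(ap->mp, new_mp);
  synchronize_rcu(); /* wait for readers of old_mp */
  free(old_mp);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Readers then use \co{rcu_dereference()} to fetch \co{->mp} within
an RCU read-side critical section, and are thus guaranteed to see
all three measurements from a single update.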
\QuickQuiz{}
But can't the approach shown in
Figure~\ref{fig:together:Correlated Measurement Fields}
result in extra cache misses, in turn resulting in additional
read-side overhead?
\QuickQuizAnswer{
Indeed it can.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
struct measurement {
  double meas_1;
  double meas_2;
  double meas_3;
};

struct animal {
  char name[40];
  double age;
  struct measurement *mp;
  struct measurement meas;
  char photo[0]; /* large bitmap. */
};
\end{verbbox}
}
\centering
\theverbbox
\caption{Localized Correlated Measurement Fields}
\label{fig:together:Localized Correlated Measurement Fields}
\end{figure}
One way to avoid this cache-miss overhead is shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}:
Simply embed an instance of a \co{measurement} structure
named \co{meas}
into the \co{animal} structure, and point the \co{->mp}
field at this \co{->meas} field.
Measurement updates can then be carried out as follows:
\begin{enumerate}
\item Allocate a new \co{measurement} structure and place
the new measurements into it.
\item Use \co{rcu_assign_pointer()} to point \co{->mp} to
this new structure.
\item Wait for a grace period to elapse, for example using
either \co{synchronize_rcu()} or \co{call_rcu()}.
\item Copy the measurements from the new \co{measurement}
structure into the embedded \co{->meas} field.
\item Use \co{rcu_assign_pointer()} to point \co{->mp}
back to the old embedded \co{->meas} field.
\item After another grace period elapses, free up the
new \co{measurement} structure.
\end{enumerate}
This approach uses a heavier-weight update procedure to eliminate
the extra cache miss in the common case.
The extra cache miss will be incurred only while an update is
actually in progress.
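\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
int animal_update_meas(struct animal *ap,
                       struct measurement m)
{
  struct measurement *new_mp;

  new_mp = malloc(sizeof(*new_mp)); /* step 1 */
  if (new_mp == NULL)
    return -1;
  *new_mp = m;
  rcu_assign_pointer(ap->mp, new_mp);    /* step 2 */
  synchronize_rcu();                     /* step 3 */
  ap->meas = *new_mp;                    /* step 4 */
  rcu_assign_pointer(ap->mp, &ap->meas); /* step 5 */
  synchronize_rcu();                     /* step 6: */
  free(new_mp);                          /* now safe */
  return 0;
}
\end{verbbox}
}
\centering
\theverbbox
\caption{Sketch of Localized Correlated Measurement Update}
\label{fig:together:Sketch of Localized Correlated Measurement Update}
\end{figure}
Figure~\ref{fig:together:Sketch of Localized Correlated Measurement Update}
shows a minimal sketch of this update procedure, assuming the
structures shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}
and a hypothetical update-side lock held by the caller, with the
comments keyed to the steps in the list above.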
} \QuickQuizEnd
This approach enables readers to see correlated values for selected
fields with minimal read-side overhead.
% Birthstone/tombstone for moving records when readers cannot be permitted
% to see extraneous records.
% Flag for deletion (if not already covered in the defer chapter).