| % together/applyrcu.tex |
| |
| \section{RCU Rescues} |
| \label{sec:together:RCU Rescues} |
| |
| This section shows how to apply RCU to some examples discussed earlier |
| in this book. |
| In some cases, RCU provides simpler code, in other cases better |
| performance and scalability, and in still other cases, both. |
| |
| \subsection{RCU and Per-Thread-Variable-Based Statistical Counters} |
| \label{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} |
| |
| Section~\ref{sec:count:Per-Thread-Variable-Based Implementation} |
| described an implementation of statistical counters that provided |
| excellent |
| performance, roughly that of simple increment (as in the C \co{++} |
| operator), and linear scalability---but only for incrementing |
| via \co{inc_count()}. |
| Unfortunately, threads needing to read out the value via \co{read_count()} |
| were required to acquire a global |
| lock, and thus incurred high overhead and suffered poor scalability. |
| The code for the lock-based implementation is shown in |
| Figure~\ref{fig:count:Per-Thread Statistical Counters} on |
| Page~\pageref{fig:count:Per-Thread Statistical Counters}. |
| |
| \QuickQuiz{} |
| Why on earth did we need that global lock in the first place? |
| \QuickQuizAnswer{ |
| A given thread's \co{__thread} variables vanish when that |
| thread exits. |
| It is therefore necessary to synchronize any operation that |
| accesses other threads' \co{__thread} variables with |
| thread exit. |
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
| } \QuickQuizEnd |
| |
| \subsubsection{Design} |
| |
| The hope is to use RCU rather than \co{final_mutex} to protect the |
| thread traversal in \co{read_count()} in order to obtain excellent |
| performance and scalability from \co{read_count()}, rather than just |
| from \co{inc_count()}. |
| However, we do not want to give up any accuracy in the computed sum. |
| In particular, when a given thread exits, we absolutely cannot |
| lose the exiting thread's count, nor can we double-count it. |
Such an error could result in inaccuracies as large as the
result itself; in other words, such an error would
make the result completely useless.
| And in fact, one of the purposes of \co{final_mutex} is to |
| ensure that threads do not come and go in the middle of \co{read_count()} |
| execution. |
| |
| \QuickQuiz{} |
| Just what is the accuracy of \co{read_count()}, anyway? |
| \QuickQuizAnswer{ |
| Refer to |
| Figure~\ref{fig:count:Per-Thread Statistical Counters} on |
| Page~\pageref{fig:count:Per-Thread Statistical Counters}. |
| Clearly, if there are no concurrent invocations of \co{inc_count()}, |
| \co{read_count()} will return an exact result. |
| However, if there \emph{are} concurrent invocations of |
| \co{inc_count()}, then the sum is in fact changing as |
| \co{read_count()} performs its summation. |
| That said, because thread creation and exit are excluded by |
| \co{final_mutex}, the pointers in \co{counterp} remain constant. |
| |
| Let's imagine a mythical machine that is able to take an |
| instantaneous snapshot of its memory. |
| Suppose that this machine takes such a snapshot at the |
| beginning of \co{read_count()}'s execution, and another |
| snapshot at the end of \co{read_count()}'s execution. |
| Then \co{read_count()} will access each thread's counter |
| at some time between these two snapshots, and will therefore |
| obtain a result that is bounded by those of the two snapshots, |
| inclusive. |
| The overall sum will therefore be bounded by the pair of sums that |
| would have been obtained from each of the two snapshots (again, |
| inclusive). |
| |
| The expected error is therefore half of the difference between |
| the pair of sums that would have been obtained from each of the |
| two snapshots, that is to say, half of the execution time of |
| \co{read_count()} multiplied by the number of expected calls to |
| \co{inc_count()} per unit time. |
| |
| Or, for those who prefer equations: |
| \begin{equation} |
| \epsilon = \frac{T_r R_i}{2} |
| \end{equation} |
| where $\epsilon$ is the expected error in \co{read_count()}'s |
| return value, |
| $T_r$ is the time that \co{read_count()} takes to execute, |
| and $R_i$ is the rate of \co{inc_count()} calls per unit time. |
| (And of course, $T_r$ and $R_i$ should use the same units of |
| time: microseconds and calls per microsecond, seconds and calls |
| per second, or whatever, as long as they are the same units.) |
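
For example, if \co{read_count()} takes $T_r = 10$~microseconds
to execute while \co{inc_count()} is invoked $R_i = 20$ times
per microsecond, the expected error is
$\epsilon = \left( 10 \times 20 \right) / 2 = 100$ counts.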
| } \QuickQuizEnd |
| |
| Therefore, if we are to dispense with \co{final_mutex}, we will need |
| to come up with some other method for ensuring consistency. |
| One approach is to place the total count for all previously exited |
| threads and the array of pointers to the per-thread counters into a single |
| structure. |
| Such a structure, once made available to \co{read_count()}, is |
| held constant, ensuring that \co{read_count()} sees consistent data. |
| |
| \subsubsection{Implementation} |
| |
| \begin{figure}[bp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct countarray { |
| 2 unsigned long total; |
| 3 unsigned long *counterp[NR_THREADS]; |
| 4 }; |
| 5 |
| 6 long __thread counter = 0; |
| 7 struct countarray *countarrayp = NULL; |
| 8 DEFINE_SPINLOCK(final_mutex); |
| 9 |
| 10 void inc_count(void) |
| 11 { |
| 12 counter++; |
| 13 } |
| 14 |
| 15 long read_count(void) |
| 16 { |
| 17 struct countarray *cap; |
| 18 unsigned long sum; |
| 19 int t; |
| 20 |
| 21 rcu_read_lock(); |
| 22 cap = rcu_dereference(countarrayp); |
| 23 sum = cap->total; |
| 24 for_each_thread(t) |
| 25 if (cap->counterp[t] != NULL) |
| 26 sum += *cap->counterp[t]; |
| 27 rcu_read_unlock(); |
| 28 return sum; |
| 29 } |
| 30 |
| 31 void count_init(void) |
| 32 { |
| 33 countarrayp = malloc(sizeof(*countarrayp)); |
| 34 if (countarrayp == NULL) { |
| 35 fprintf(stderr, "Out of memory\n"); |
| 36 exit(-1); |
| 37 } |
| 38 memset(countarrayp, '\0', sizeof(*countarrayp)); |
| 39 } |
| 40 |
| 41 void count_register_thread(void) |
| 42 { |
| 43 int idx = smp_thread_id(); |
| 44 |
| 45 spin_lock(&final_mutex); |
| 46 countarrayp->counterp[idx] = &counter; |
| 47 spin_unlock(&final_mutex); |
| 48 } |
| 49 |
| 50 void count_unregister_thread(int nthreadsexpected) |
| 51 { |
| 52 struct countarray *cap; |
| 53 struct countarray *capold; |
| 54 int idx = smp_thread_id(); |
| 55 |
| 56 cap = malloc(sizeof(*countarrayp)); |
| 57 if (cap == NULL) { |
| 58 fprintf(stderr, "Out of memory\n"); |
| 59 exit(-1); |
| 60 } |
| 61 spin_lock(&final_mutex); |
| 62 *cap = *countarrayp; |
| 63 cap->total += counter; |
| 64 cap->counterp[idx] = NULL; |
| 65 capold = countarrayp; |
| 66 rcu_assign_pointer(countarrayp, cap); |
| 67 spin_unlock(&final_mutex); |
| 68 synchronize_rcu(); |
| 69 free(capold); |
| 70 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU and Per-Thread Statistical Counters} |
| \label{fig:together:RCU and Per-Thread Statistical Counters} |
| \end{figure} |
| |
| Lines~1-4 of |
| Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters} |
| show the \co{countarray} structure, which contains a |
| \co{->total} field for the count from previously exited threads, |
| and a \co{counterp[]} array of pointers to the per-thread |
| \co{counter} for each currently running thread. |
| This structure allows a given execution of \co{read_count()} |
| to see a total that is consistent with the indicated set of running |
| threads. |
| |
| Lines~6-8 contain the definition of the per-thread \co{counter} |
| variable, the global pointer \co{countarrayp} referencing |
| the current \co{countarray} structure, and |
| the \co{final_mutex} spinlock. |
| |
| Lines~10-13 show \co{inc_count()}, which is unchanged from |
| Figure~\ref{fig:count:Per-Thread Statistical Counters}. |
| |
| Lines~15-29 show \co{read_count()}, which has changed significantly. |
| Lines~21 and~27 substitute \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}. |
| Line~22 uses \co{rcu_dereference()} to snapshot the |
| current \co{countarray} structure into local variable \co{cap}. |
| Proper use of RCU will guarantee that this \co{countarray} structure |
| will remain with us through at least the end of the current RCU |
| read-side critical section at line~27. |
| Line~23 initializes \co{sum} to \co{cap->total}, which is the |
| sum of the counts of threads that have previously exited. |
| Lines~24-26 add up the per-thread counters corresponding to currently |
| running threads, and, finally, line~28 returns the sum. |
| |
| The initial value for \co{countarrayp} is |
| provided by \co{count_init()} on lines~31-39. |
| This function runs before the first thread is created, and its job |
| is to allocate |
| and zero the initial structure, and then assign it to \co{countarrayp}. |
| |
| Lines~41-48 show the \co{count_register_thread()} function, which |
| is invoked by each newly created thread. |
| Line~43 picks up the current thread's index, line~45 acquires |
| \co{final_mutex}, line~46 installs a pointer to this thread's |
| \co{counter}, and line~47 releases \co{final_mutex}. |
| |
| \QuickQuiz{} |
| Hey!!! |
| Line~46 of |
| Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters} |
| modifies a value in a pre-existing \co{countarray} structure! |
| Didn't you say that this structure, once made available to |
| \co{read_count()}, remained constant??? |
| \QuickQuizAnswer{ |
| Indeed I did say that. |
| And it would be possible to make \co{count_register_thread()} |
| allocate a new structure, much as \co{count_unregister_thread()} |
| currently does. |
| |
| But this is unnecessary. |
| Recall the derivation of the error bounds of \co{read_count()} |
| that was based on the snapshots of memory. |
| Because new threads start with initial \co{counter} values of |
| zero, the derivation holds even if we add a new thread partway |
| through \co{read_count()}'s execution. |
| So, interestingly enough, when adding a new thread, this |
| implementation gets the effect of allocating a new structure, |
| but without actually having to do the allocation. |
| } \QuickQuizEnd |
| |
Lines~50-70 show \co{count_unregister_thread()}, which is invoked
| by each thread just before it exits. |
| Lines~56-60 allocate a new \co{countarray} structure, |
| line~61 acquires \co{final_mutex} and line~67 releases it. |
| Line~62 copies the contents of the current \co{countarray} into |
| the newly allocated version, line~63 adds the exiting thread's \co{counter} |
to the new structure's \co{->total}, and line~64 \co{NULL}s the exiting thread's
| \co{counterp[]} array element. |
| Line~65 then retains a pointer to the current (soon to be old) |
| \co{countarray} structure, and line~66 uses \co{rcu_assign_pointer()} |
| to install the new version of the \co{countarray} structure. |
| Line~68 waits for a grace period to elapse, so that any threads that |
| might be concurrently executing in \co{read_count()}, and thus might |
| have references to the old \co{countarray} structure, will be allowed |
| to exit their RCU read-side critical sections, thus dropping any such |
| references. |
| Line~69 can then safely free the old \co{countarray} structure. |
| |
| \subsubsection{Discussion} |
| |
| \QuickQuiz{} |
| Wow! |
| Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters} |
| contains 69 lines of code, compared to only 42 in |
| Figure~\ref{fig:count:Per-Thread Statistical Counters}. |
| Is this extra complexity really worth it? |
| \QuickQuizAnswer{ |
| This of course needs to be decided on a case-by-case basis. |
| If you need an implementation of \co{read_count()} that |
| scales linearly, then the lock-based implementation shown in |
| Figure~\ref{fig:count:Per-Thread Statistical Counters} |
| simply will not work for you. |
On the other hand, if calls to \co{read_count()} are sufficiently
| rare, then the lock-based version is simpler and might thus be |
| better, although much of the size difference is due |
| to the structure definition, memory allocation, and \co{NULL} |
| return checking. |
| |
| Of course, a better question is ``Why doesn't the language |
| implement cross-thread access to \co{__thread} variables?'' |
| After all, such an implementation would make both the locking |
| and the use of RCU unnecessary. |
| This would in turn enable an implementation that |
| was even simpler than the one shown in |
| Figure~\ref{fig:count:Per-Thread Statistical Counters}, but |
| with all the scalability and performance benefits of the |
| implementation shown in |
| Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}! |
| } \QuickQuizEnd |
| |
| Use of RCU enables exiting threads to wait until other threads are |
| guaranteed to be done using the exiting threads' \co{__thread} variables. |
| This allows the \co{read_count()} function to dispense with locking, |
| thereby providing |
| excellent performance and scalability for both the \co{inc_count()} |
| and \co{read_count()} functions. |
| However, this performance and scalability come at the cost of some increase |
| in code complexity. |
It is hoped that compiler and library writers will employ user-level
| RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread |
| access to \co{__thread} variables, greatly reducing the |
| complexity seen by users of \co{__thread} variables. |
| |
| \subsection{RCU and Counters for Removable I/O Devices} |
| \label{sec:together:RCU and Counters for Removable I/O Devices} |
| |
| Section~\ref{sec:count:Applying Specialized Parallel Counters} |
| showed a fanciful pair of code fragments for dealing with counting |
| I/O accesses to removable devices. |
| These code fragments suffered from high overhead on the fastpath |
| (starting an I/O) due to the need to acquire a reader-writer |
| lock. |
| |
| This section shows how RCU may be used to avoid this overhead. |
| |
| The code for performing an I/O is quite similar to the original, with |
| an RCU read-side critical section being substituted for the reader-writer |
| lock read-side critical section in the original: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 rcu_read_lock(); |
| 2 if (removing) { |
| 3 rcu_read_unlock(); |
| 4 cancel_io(); |
| 5 } else { |
| 6 add_count(1); |
| 7 rcu_read_unlock(); |
| 8 do_io(); |
| 9 sub_count(1); |
| 10 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The RCU read-side primitives have minimal overhead, thus speeding up |
| the fastpath, as desired. |
| |
| The updated code fragment removing a device is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 spin_lock(&mylock); |
| 2 removing = 1; |
| 3 sub_count(mybias); |
| 4 spin_unlock(&mylock); |
| 5 synchronize_rcu(); |
| 6 while (read_count() != 0) { |
| 7 poll(NULL, 0, 1); |
| 8 } |
| 9 remove_device(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Here we replace the reader-writer lock with an exclusive spinlock and |
| add a \co{synchronize_rcu()} to wait for all of the RCU read-side |
| critical sections to complete. |
| Because of the \co{synchronize_rcu()}, |
| once we reach line~6, we know that all remaining I/Os have been accounted |
| for. |
| |
| Of course, the overhead of \co{synchronize_rcu()} can be large, |
| but given that device removal is quite rare, this is usually a good |
| tradeoff. |
| |
| \subsection{Array and Length} |
| \label{sec:together:Array and Length} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct foo { |
| 2 int length; |
| 3 char *a; |
| 4 }; |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{RCU-Protected Variable-Length Array} |
| \label{fig:together:RCU-Protected Variable-Length Array} |
| \end{figure} |
| |
| Suppose we have an RCU-protected variable-length array, as shown in |
| Figure~\ref{fig:together:RCU-Protected Variable-Length Array}. |
| The length of the array \co{->a[]} can change dynamically, and at any |
| given time, its length is given by the field \co{->length}. |
| Of course, this introduces the following race condition: |
| |
| \begin{enumerate} |
| \item The array is initially 16 characters long, and thus \co{->length} |
| is equal to 16. |
| \item CPU~0 loads the value of \co{->length}, obtaining the value 16. |
| \item CPU~1 shrinks the array to be of length 8, and assigns a pointer |
| to a new 8-character block of memory into \co{->a[]}. |
| \item CPU~0 picks up the new pointer from \co{->a[]}, and stores a |
| new value into element 12. |
| Because the array has only 8 characters, this results in |
| a SEGV or (worse yet) memory corruption. |
| \end{enumerate} |
| |
| How can we prevent this? |
| |
| One approach is to make careful use of memory barriers, which are |
| covered in Section~\ref{sec:advsync:Memory Barriers}. |
| This works, but incurs read-side overhead and, perhaps worse, requires |
| use of explicit memory barriers. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct foo_a { |
| 2 int length; |
| 3 char a[0]; |
| 4 }; |
| 5 |
| 6 struct foo { |
| 7 struct foo_a *fa; |
| 8 }; |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Improved RCU-Protected Variable-Length Array} |
| \label{fig:together:Improved RCU-Protected Variable-Length Array} |
| \end{figure} |
| |
A better approach is to put the array and its length into the same structure,
| as shown in |
| Figure~\ref{fig:together:Improved RCU-Protected Variable-Length Array}. |
| Allocating a new array (\co{foo_a} structure) then automatically provides |
| a new place for the array length. |
| This means that if any CPU picks up a reference to \co{->fa}, it is |
| guaranteed that the \co{->length} will match the \co{->a[]} |
| length~\cite{Arcangeli03}. |
| |
| \begin{enumerate} |
| \item The array is initially 16 characters long, and thus \co{->length} |
| is equal to 16. |
| \item CPU~0 loads the value of \co{->fa}, obtaining a pointer to |
| the structure containing the value 16 and the 16-byte array. |
| \item CPU~0 loads the value of \co{->fa->length}, obtaining the value 16. |
| \item CPU~1 shrinks the array to be of length 8, and assigns a pointer |
| to a new \co{foo_a} structure containing an 8-character block |
| of memory into \co{->fa}. |
\item CPU~0 stores a new value into element 12 of the
\co{->a[]} array in the structure that it loaded in step~2.
| But because CPU~0 is still referencing the old \co{foo_a} |
| structure that contains the 16-byte array, all is well. |
| \end{enumerate} |
| |
| Of course, in both cases, CPU~1 must wait for a grace period before |
| freeing the old array. |
| |
| A more general version of this approach is presented in the next section. |
| |
| \subsection{Correlated Fields} |
| \label{sec:together:Correlated Fields} |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct animal { |
| 2 char name[40]; |
| 3 double age; |
| 4 double meas_1; |
| 5 double meas_2; |
| 6 double meas_3; |
| 7 char photo[0]; /* large bitmap. */ |
| 8 }; |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Uncorrelated Measurement Fields} |
| \label{fig:together:Uncorrelated Measurement Fields} |
| \end{figure} |
| |
Suppose that each of Schr\"odinger's animals is represented by the
| data element shown in |
| Figure~\ref{fig:together:Uncorrelated Measurement Fields}. |
| The \co{meas_1}, \co{meas_2}, and \co{meas_3} fields are a set |
| of correlated measurements that are updated periodically. |
| It is critically important that readers see these three values from |
| a single measurement update: If a reader sees an old value of |
| \co{meas_1} but new values of \co{meas_2} and \co{meas_3}, that |
| reader will become fatally confused. |
| How can we guarantee that readers will see coordinated sets of these |
| three values? |
| |
| One approach would be to allocate a new \co{animal} structure, |
| copy the old structure into the new structure, update the new |
| structure's \co{meas_1}, \co{meas_2}, and \co{meas_3} fields, |
| and then replace the old structure with a new one by updating |
| the pointer. |
| This does guarantee that all readers see coordinated sets of |
| measurement values, but it requires copying a large structure due |
| to the \co{->photo[]} field. |
| This copying might incur unacceptably large overhead. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct measurement { |
| 2 double meas_1; |
| 3 double meas_2; |
| 4 double meas_3; |
| 5 }; |
| 6 |
| 7 struct animal { |
| 8 char name[40]; |
| 9 double age; |
| 10 struct measurement *mp; |
| 11 char photo[0]; /* large bitmap. */ |
| 12 }; |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Correlated Measurement Fields} |
| \label{fig:together:Correlated Measurement Fields} |
| \end{figure} |
| |
| Another approach is to insert a level of indirection, as shown in |
| Figure~\ref{fig:together:Correlated Measurement Fields}. |
| When a new measurement is taken, a new \co{measurement} structure |
| is allocated, filled in with the measurements, and the \co{animal} |
| structure's \co{->mp} field is updated to point to this new |
| \co{measurement} structure using \co{rcu_assign_pointer()}. |
| After a grace period elapses, the old \co{measurement} structure |
| can be freed. |
| |
| \QuickQuiz{} |
But can't the approach shown in
| Figure~\ref{fig:together:Correlated Measurement Fields} |
| result in extra cache misses, in turn resulting in additional |
| read-side overhead? |
| \QuickQuizAnswer{ |
| Indeed it can. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct measurement { |
| 2 double meas_1; |
| 3 double meas_2; |
| 4 double meas_3; |
| 5 }; |
| 6 |
| 7 struct animal { |
| 8 char name[40]; |
| 9 double age; |
| 10 struct measurement *mp; |
| 11 struct measurement meas; |
| 12 char photo[0]; /* large bitmap. */ |
| 13 }; |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Localized Correlated Measurement Fields} |
| \label{fig:together:Localized Correlated Measurement Fields} |
| \end{figure} |
| |
| One way to avoid this cache-miss overhead is shown in |
| Figure~\ref{fig:together:Localized Correlated Measurement Fields}: |
| Simply embed an instance of a \co{measurement} structure |
| named \co{meas} |
| into the \co{animal} structure, and point the \co{->mp} |
| field at this \co{->meas} field. |
| |
| Measurement updates can then be carried out as follows: |
| |
| \begin{enumerate} |
| \item Allocate a new \co{measurement} structure and place |
| the new measurements into it. |
| \item Use \co{rcu_assign_pointer()} to point \co{->mp} to |
| this new structure. |
| \item Wait for a grace period to elapse, for example using |
| either \co{synchronize_rcu()} or \co{call_rcu()}. |
| \item Copy the measurements from the new \co{measurement} |
| structure into the embedded \co{->meas} field. |
| \item Use \co{rcu_assign_pointer()} to point \co{->mp} |
| back to the old embedded \co{->meas} field. |
| \item After another grace period elapses, free up the |
| new \co{measurement} structure. |
| \end{enumerate} |
| |
This approach uses a heavier-weight update procedure to eliminate
| the extra cache miss in the common case. |
| The extra cache miss will be incurred only while an update is |
| actually in progress. |
| } \QuickQuizEnd |
| |
| This approach enables readers to see correlated values for selected |
| fields with minimal read-side overhead. |
| |
| % Birthstone/tombstone for moving records when readers cannot be permitted |
| % to see extraneous records. |
| |
| % Flag for deletion (if not already covered in the defer chapter). |