| % together/applyrcu.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{RCU Rescues} |
| \label{sec:together:RCU Rescues} |
| % |
| \epigraph{With great doubts comes great understanding, with little doubts |
| comes little understanding.} |
| {Chinese proverb} |
| |
| This section shows how to apply RCU to some examples discussed earlier |
| in this book. |
| In some cases, RCU provides simpler code, in other cases better |
| performance and scalability, and in still other cases, both. |
| |
| \subsection{RCU and Per-Thread-Variable-Based Statistical Counters} |
| \label{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} |
| |
| \Cref{sec:count:Per-Thread-Variable-Based Implementation} |
| described an implementation of statistical counters that provided |
| excellent |
| performance, roughly that of simple increment (as in the C \co{++} |
| operator), and linear scalability---but only for incrementing |
| via \co{inc_count()}. |
| Unfortunately, threads needing to read out the value via \co{read_count()} |
| were required to acquire a global |
| lock, and thus incurred high overhead and suffered poor scalability. |
| The code for the lock-based implementation is shown in |
| \cref{lst:count:Per-Thread Statistical Counters} on |
| \cpageref{lst:count:Per-Thread Statistical Counters}. |
| |
| \QuickQuiz{ |
| Why on earth did we need that global lock in the first place? |
| }\QuickQuizAnswer{ |
| A given thread's \co{__thread} variables vanish when that |
| thread exits. |
| It is therefore necessary to synchronize any operation that |
| accesses other threads' \co{__thread} variables with |
| thread exit. |
Without such synchronization, accesses to the \co{__thread}
variables of a just-exited thread will result in segmentation faults.
| }\QuickQuizEnd |
| |
| \subsubsection{Design} |
| |
| The hope is to use RCU rather than \co{final_mutex} to protect the |
| thread traversal in \co{read_count()} in order to obtain excellent |
| performance and scalability from \co{read_count()}, rather than just |
| from \co{inc_count()}. |
| However, we do not want to give up any accuracy in the computed sum. |
| In particular, when a given thread exits, we absolutely cannot |
| lose the exiting thread's count, nor can we double-count it. |
Such an error could result in inaccuracies equal to the full
precision of the result; in other words, it would
make the result completely useless.
| And in fact, one of the purposes of \co{final_mutex} is to |
| ensure that threads do not come and go in the middle of \co{read_count()} |
| execution. |
| |
| Therefore, if we are to dispense with \co{final_mutex}, we will need |
| to come up with some other method for ensuring consistency. |
| One approach is to place the total count for all previously exited |
| threads and the array of pointers to the per-thread counters into a single |
| structure. |
| Such a structure, once made available to \co{read_count()}, is |
| held constant, ensuring that \co{read_count()} sees consistent data. |
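
For concreteness, a minimal sketch of such a structure, assuming a
fixed \co{NR_THREADS} compile-time limit on the number of threads,
might look like this:

\begin{VerbatimN}[tabsize=8]
struct countarray {
        unsigned long total;
        unsigned long *counterp[NR_THREADS];
};
\end{VerbatimN}
\vspace{5pt}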
| |
| \subsubsection{Implementation} |
| |
| \begin{fcvref}[ln:count:count_end_rcu:whole] |
| \Clnrefrange{struct:b}{struct:e} of |
| \cref{lst:together:RCU and Per-Thread Statistical Counters} |
| show the \co{countarray} structure, which contains a |
| \co{->total} field for the count from previously exited threads, |
| and a \co{counterp[]} array of pointers to the per-thread |
| \co{counter} for each currently running thread. |
| This structure allows a given execution of \co{read_count()} |
| to see a total that is consistent with the indicated set of running |
| threads. |
| |
| \begin{listing} |
| \ebresizeverb{.71}{\input{CodeSamples/count/count_end_rcu@whole.fcv}} |
| \caption{RCU and Per-Thread Statistical Counters} |
| \label{lst:together:RCU and Per-Thread Statistical Counters} |
| \end{listing} |
| |
| \Clnrefrange{perthread:b}{perthread:e} |
| contain the definition of the per-thread \co{counter} |
| variable, the global pointer \co{countarrayp} referencing |
| the current \co{countarray} structure, and |
| the \co{final_mutex} spinlock. |
| |
| \Clnrefrange{inc:b}{inc:e} show \co{inc_count()}, which is unchanged from |
| \cref{lst:count:Per-Thread Statistical Counters}. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:count:count_end_rcu:whole:read] |
| \Clnrefrange{b}{e} show \co{read_count()}, which has changed significantly. |
| \Clnref{rrl,rru} substitute \co{rcu_read_lock()} and |
| \co{rcu_read_unlock()} for acquisition and release of \co{final_mutex}. |
| \Clnref{deref} uses \co{rcu_dereference()} to snapshot the |
| current \co{countarray} structure into local variable \co{cap}. |
| Proper use of RCU will guarantee that this \co{countarray} structure |
| will remain with us through at least the end of the current RCU |
| read-side critical section at \clnref{rru}. |
| \Clnref{init} initializes \co{sum} to \co{cap->total}, which is the |
| sum of the counts of threads that have previously exited. |
| \Clnrefrange{add:b}{add:e} add up the per-thread counters corresponding |
| to currently |
| running threads, and, finally, \clnref{ret} returns the sum. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:count:count_end_rcu:whole:init] |
| The initial value for \co{countarrayp} is |
| provided by \co{count_init()} on \clnrefrange{b}{e}. |
| This function runs before the first thread is created, and its job |
| is to allocate |
| and zero the initial structure, and then assign it to \co{countarrayp}. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:count:count_end_rcu:whole:reg] |
| \Clnrefrange{b}{e} show the \co{count_register_thread()} function, which |
| is invoked by each newly created thread. |
| \Clnref{idx} picks up the current thread's index, \clnref{acq} acquires |
| \co{final_mutex}, \clnref{set} installs a pointer to this thread's |
| \co{counter}, and \clnref{rel} releases \co{final_mutex}. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:count:count_end_rcu:whole:reg] |
| Hey!!! |
| \Clnref{set} of |
| \cref{lst:together:RCU and Per-Thread Statistical Counters} |
| modifies a value in a pre-existing \co{countarray} structure! |
| Didn't you say that this structure, once made available to |
| \co{read_count()}, remained constant??? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| Indeed I did say that. |
| And it would be possible to make \co{count_register_thread()} |
| allocate a new structure, much as \co{count_unregister_thread()} |
| currently does. |
| |
| But this is unnecessary. |
| Recall the derivation of the error bounds of \co{read_count()} |
| that was based on the snapshots of memory. |
| Because new threads start with initial \co{counter} values of |
| zero, the derivation holds even if we add a new thread partway |
| through \co{read_count()}'s execution. |
| So, interestingly enough, when adding a new thread, this |
| implementation gets the effect of allocating a new structure, |
| but without actually having to do the allocation. |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:count:count_end_rcu:whole:unreg] |
| \Clnrefrange{b}{e} show \co{count_unregister_thread()}, which is invoked |
| by each thread just before it exits. |
| \Clnrefrange{alloc:b}{alloc:e} allocate a new \co{countarray} structure, |
| \clnref{acq} acquires \co{final_mutex} and \clnref{rel} releases it. |
| \Clnref{copy} copies the contents of the current \co{countarray} into |
| the newly allocated version, \clnref{add} adds the exiting thread's \co{counter} |
to the new structure's \co{->total}, and \clnref{null} \co{NULL}s the exiting thread's
| \co{counterp[]} array element. |
| \Clnref{retain} then retains a pointer to the current (soon to be old) |
| \co{countarray} structure, and \clnref{assign} uses \co{rcu_assign_pointer()} |
| to install the new version of the \co{countarray} structure. |
| \Clnref{sync} waits for a grace period to elapse, so that any threads that |
| might be concurrently executing in \co{read_count()}, and thus might |
| have references to the old \co{countarray} structure, will be allowed |
| to exit their RCU read-side critical sections, thus dropping any such |
| references. |
| \Clnref{free} can then safely free the old \co{countarray} structure. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| Given the fixed-size \co{counterp} array, exactly how does this |
| code avoid a fixed upper bound on the number of threads??? |
| }\QuickQuizAnswer{ |
You are quite right: that array does in fact reimpose a fixed
upper limit.
| This limit may be avoided by tracking threads with a linked list, |
| as is done in userspace RCU~\cite{MathieuDesnoyers2012URCU}. |
| Doing something similar for this code is left as an exercise for |
| the reader. |
| }\QuickQuizEnd |
| |
| \subsubsection{Discussion} |
| |
| \EQuickQuiz{ |
| Wow! |
| \Cref{lst:together:RCU and Per-Thread Statistical Counters} |
| contains 70 lines of code, compared to only 42 in |
| \cref{lst:count:Per-Thread Statistical Counters}. |
| Is this extra complexity really worth it? |
| }\EQuickQuizAnswer{ |
| This of course needs to be decided on a case-by-case basis. |
| If you need an implementation of \co{read_count()} that |
| scales linearly, then the lock-based implementation shown in |
| \cref{lst:count:Per-Thread Statistical Counters} |
| simply will not work for you. |
| On the other hand, if calls to \co{read_count()} are sufficiently |
| rare, then the lock-based version is simpler and might thus be |
| better, although much of the size difference is due |
| to the structure definition, memory allocation, and \co{NULL} |
| return checking. |
| |
| Of course, a better question is ``Why doesn't the language |
| implement cross-thread access to \co{__thread} variables?'' |
| After all, such an implementation would make both the locking |
| and the use of RCU unnecessary. |
| This would in turn enable an implementation that |
| was even simpler than the one shown in |
| \cref{lst:count:Per-Thread Statistical Counters}, but |
| with all the scalability and performance benefits of the |
| implementation shown in |
| \cref{lst:together:RCU and Per-Thread Statistical Counters}! |
| }\EQuickQuizEnd |
| |
| Use of RCU enables exiting threads to wait until other threads are |
| guaranteed to be done using the exiting threads' \co{__thread} variables. |
| This allows the \co{read_count()} function to dispense with locking, |
| thereby providing |
| excellent performance and scalability for both the \co{inc_count()} |
| and \co{read_count()} functions. |
| However, this performance and scalability come at the cost of some increase |
| in code complexity. |
| It is hoped that compiler and library writers employ user-level |
| RCU~\cite{MathieuDesnoyers2009URCU} to provide safe cross-thread |
| access to \co{__thread} variables, greatly reducing the |
| complexity seen by users of \co{__thread} variables. |
| |
| \subsection{RCU and Counters for Removable I/O Devices} |
| \label{sec:together:RCU and Counters for Removable I/O Devices} |
| |
| \Cref{sec:count:Applying Exact Limit Counters} |
| showed a fanciful pair of code fragments for dealing with counting |
| I/O accesses to removable devices. |
| These code fragments suffered from high overhead on the fastpath |
| (starting an I/O) due to the need to acquire a reader-writer |
| lock. |
| |
| This section shows how RCU may be used to avoid this overhead. |
| |
| The code for performing an I/O is quite similar to the original, with |
| an RCU read-side critical section being substituted for the reader-writer |
| lock read-side critical section in the original: |
| |
| \begin{VerbatimN}[tabsize=8] |
| rcu_read_lock(); |
| if (removing) { |
| rcu_read_unlock(); |
| cancel_io(); |
| } else { |
| add_count(1); |
| rcu_read_unlock(); |
| do_io(); |
| sub_count(1); |
| } |
| \end{VerbatimN} |
| \vspace{5pt} |
| |
| The RCU read-side primitives have minimal overhead, thus speeding up |
| the fastpath, as desired. |
| |
| The updated code fragment removing a device is as follows: |
| |
| \begin{fcvlabel}[ln:together:applyrcu:Removing Device] |
| \begin{VerbatimN}[tabsize=8,commandchars=\\\[\]] |
| spin_lock(&mylock); |
| removing = 1; |
| sub_count(mybias); |
| spin_unlock(&mylock); |
| synchronize_rcu(); |
| while (read_count() != 0) { \lnlbl[nextofsync] |
| poll(NULL, 0, 1); |
| } |
| remove_device(); |
| \end{VerbatimN} |
| \end{fcvlabel} |
| |
| \begin{fcvref}[ln:together:applyrcu:Removing Device] |
| Here we replace the reader-writer lock with an exclusive spinlock and |
| add a \co{synchronize_rcu()} to wait for all of the RCU read-side |
| critical sections to complete. |
| Because of the \co{synchronize_rcu()}, |
| once we reach \clnref{nextofsync}, |
| we know that all remaining I/Os have been accounted for. |
| |
| Of course, the overhead of \co{synchronize_rcu()} can be large, |
| but given that device removal is quite rare, this is usually a good |
| tradeoff. |
| \end{fcvref} |
| |
| \FloatBarrier |
| \subsection{Array and Length} |
| \label{sec:together:Array and Length} |
| |
| Suppose we have an RCU-protected variable-length array, as shown in |
| \cref{lst:together:RCU-Protected Variable-Length Array}. |
| The length of the array \co{->a[]} can change dynamically, and at any |
| given time, its length is given by the field \co{->length}. |
| Of course, this introduces the following \IX{race condition}: |
| |
| \begin{listing} |
| \begin{VerbatimL}[tabsize=8] |
| struct foo { |
| int length; |
| char *a; |
| }; |
| \end{VerbatimL} |
| \caption{RCU-Protected Variable-Length Array} |
| \label{lst:together:RCU-Protected Variable-Length Array} |
| \end{listing} |
| |
| \begin{enumerate} |
| \item The array is initially 16 characters long, and thus \co{->length} |
| is equal to 16. |
| \item CPU~0 loads the value of \co{->length}, obtaining the value 16. |
| \item CPU~1 shrinks the array to be of length 8, and assigns a pointer |
| to a new 8-character block of memory into \co{->a[]}. |
| \item CPU~0 picks up the new pointer from \co{->a[]}, and stores a |
| new value into element 12. |
| Because the array has only 8 characters, this results in |
| a SEGV or (worse yet) memory corruption. |
| \end{enumerate} |
| |
| How can we prevent this? |
| |
| One approach is to make careful use of memory barriers, which are |
| covered in \cref{chp:Advanced Synchronization: Memory Ordering}. |
| This works, but incurs read-side overhead and, perhaps worse, requires |
| use of explicit memory barriers. |
| |
| \begin{listing} |
| \begin{VerbatimL}[tabsize=8] |
| struct foo_a { |
| int length; |
| char a[0]; |
| }; |
| |
| struct foo { |
| struct foo_a *fa; |
| }; |
| \end{VerbatimL} |
| \caption{Improved RCU-Protected Variable-Length Array} |
| \label{lst:together:Improved RCU-Protected Variable-Length Array} |
| \end{listing} |
| |
| A better approach is to put the value and the array into the same structure, |
| as shown in |
| \cref{lst:together:Improved RCU-Protected Variable-Length Array}~\cite{Arcangeli03}. |
| Allocating a new array (\co{foo_a} structure) then automatically provides |
| a new place for the array length. |
| This means that if any CPU picks up a reference to \co{->fa}, it is |
| guaranteed that the \co{->length} will match the \co{->a[]}. |
| |
| \begin{enumerate} |
| \item The array is initially 16 characters long, and thus \co{->length} |
| is equal to 16. |
| \item CPU~0 loads the value of \co{->fa}, obtaining a pointer to |
| the structure containing the value 16 and the 16-byte array. |
| \item CPU~0 loads the value of \co{->fa->length}, obtaining the value 16. |
| \item CPU~1 shrinks the array to be of length 8, and assigns a pointer |
| to a new \co{foo_a} structure containing an 8-character block |
| of memory into \co{->fa}. |
\item CPU~0 stores a new value into element 12, using the \co{->a[]}
array of the \co{foo_a} structure obtained in step 2.
Because CPU~0 is still referencing the old \co{foo_a}
structure that contains the 16-byte array, all is well.
| \end{enumerate} |
| |
| Of course, in both cases, CPU~1 must wait for a grace period before |
| freeing the old array. |
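
For example, a reader might access element \co{i} as follows.
This is a minimal sketch, assuming a hypothetical RCU-protected
global pointer \co{fp} to the enclosing \co{foo} structure:

\begin{VerbatimN}[tabsize=8]
struct foo_a *fap;
char c = 0;

rcu_read_lock();
fap = rcu_dereference(fp->fa);
if (i < fap->length)
        c = fap->a[i];
rcu_read_unlock();
\end{VerbatimN}
\vspace{5pt}

The key point is that both \co{->length} and \co{->a[]} are reached
via the single \co{fap} pointer, so they cannot disagree.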
| |
| A more general version of this approach is presented in the next section. |
| |
| \subsection{Correlated Fields} |
| \label{sec:together:Correlated Fields} |
| \OriginallyPublished{sec:together:Correlated Fields}{Correlated Fields}{Oregon Graduate Institute}{PaulEdwardMcKenneyPhD} |
| |
Suppose that each of Schr\"odinger's animals is represented by the
| data element shown in |
| \cref{lst:together:Uncorrelated Measurement Fields}. |
| The \co{meas_1}, \co{meas_2}, and \co{meas_3} fields are a set |
| of correlated measurements that are updated periodically. |
| It is critically important that readers see these three values from |
| a single measurement update: |
| If a reader sees an old value of \co{meas_1} but new values of |
| \co{meas_2} and \co{meas_3}, that reader will become fatally confused. |
| How can we guarantee that readers will see coordinated sets of these |
| three values?\footnote{ |
| This situation is similar to that described in |
| \cref{sec:together:Correlated Data Elements}, |
| except that here readers need only see a consistent view of a |
| given single data element, not the consistent view of a |
| group of data elements that was required in that earlier |
| \lcnamecref{sec:together:Correlated Data Elements}.} |
| |
| \begin{listing} |
| \begin{VerbatimL}[tabsize=8] |
| struct animal { |
| char name[40]; |
| double age; |
| double meas_1; |
| double meas_2; |
| double meas_3; |
| char photo[0]; /* large bitmap. */ |
| }; |
| \end{VerbatimL} |
| \caption{Uncorrelated Measurement Fields} |
| \label{lst:together:Uncorrelated Measurement Fields} |
| \end{listing} |
| |
| One approach would be to allocate a new \co{animal} structure, |
| copy the old structure into the new structure, update the new |
| structure's \co{meas_1}, \co{meas_2}, and \co{meas_3} fields, |
| and then replace the old structure with a new one by updating |
| the pointer. |
| This does guarantee that all readers see coordinated sets of |
| measurement values, but it requires copying a large structure due |
| to the \co{->photo[]} field. |
| This copying might incur unacceptably large overhead. |
| |
| \begin{listing} |
| \begin{VerbatimL}[tabsize=8] |
| struct measurement { |
| double meas_1; |
| double meas_2; |
| double meas_3; |
| }; |
| |
| struct animal { |
| char name[40]; |
| double age; |
| struct measurement *mp; |
| char photo[0]; /* large bitmap. */ |
| }; |
| \end{VerbatimL} |
| \caption{Correlated Measurement Fields} |
| \label{lst:together:Correlated Measurement Fields} |
| \end{listing} |
| |
| Another approach is to impose a level of indirection, as shown in |
| \cref{lst:together:Correlated Measurement Fields}~\cite[Section 5.3.4]{PaulEdwardMcKenneyPhD}. |
| When a new measurement is taken, a new \co{measurement} structure |
| is allocated, filled in with the measurements, and the \co{animal} |
| structure's \co{->mp} field is updated to point to this new |
| \co{measurement} structure using \co{rcu_assign_pointer()}. |
| After a grace period elapses, the old \co{measurement} structure |
| can be freed. |
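
A minimal sketch of this update procedure, assuming a hypothetical
pointer \co{ap} to the \co{animal} structure being updated, new
measurement values \co{m1}, \co{m2}, and \co{m3}, and an update-side
lock held by the caller:

\begin{VerbatimN}[tabsize=8]
struct measurement *mp_new, *mp_old;

mp_new = malloc(sizeof(*mp_new));
mp_new->meas_1 = m1;
mp_new->meas_2 = m2;
mp_new->meas_3 = m3;
mp_old = ap->mp;
rcu_assign_pointer(ap->mp, mp_new);
synchronize_rcu(); /* Wait for readers of mp_old. */
free(mp_old);
\end{VerbatimN}
\vspace{5pt}

Readers need only a single \co{rcu_dereference()} of \co{->mp} to
obtain all three values from the same measurement update.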
| |
| \QuickQuiz{ |
But can't the approach shown in
| \cref{lst:together:Correlated Measurement Fields} |
| result in extra cache misses, in turn resulting in additional |
| read-side overhead? |
| }\QuickQuizAnswer{ |
| Indeed it can. |
| |
| \begin{listing} |
| \begin{VerbatimL}[tabsize=8] |
| struct measurement { |
| double meas_1; |
| double meas_2; |
| double meas_3; |
| }; |
| |
| struct animal { |
| char name[40]; |
| double age; |
| struct measurement *mp; |
| struct measurement meas; |
| char photo[0]; /* large bitmap. */ |
| }; |
| \end{VerbatimL} |
| \caption{Localized Correlated Measurement Fields} |
| \label{lst:together:Localized Correlated Measurement Fields} |
| \end{listing} |
| |
| One way to avoid this cache-miss overhead is shown in |
| \cref{lst:together:Localized Correlated Measurement Fields}: |
| Simply embed an instance of a \co{measurement} structure |
| named \co{meas} |
| into the \co{animal} structure, and point the \co{->mp} |
| field at this \co{->meas} field. |
| |
Measurement updates can then be carried out as follows
(a code sketch appears after this list):
| |
| \begin{enumerate} |
| \item Allocate a new \co{measurement} structure and place |
| the new measurements into it. |
| \item Use \co{rcu_assign_pointer()} to point \co{->mp} to |
| this new structure. |
| \item Wait for a grace period to elapse, for example using |
| either \co{synchronize_rcu()} or \co{call_rcu()}. |
| \item Copy the measurements from the new \co{measurement} |
| structure into the embedded \co{->meas} field. |
| \item Use \co{rcu_assign_pointer()} to point \co{->mp} |
| back to the old embedded \co{->meas} field. |
| \item After another grace period elapses, free up the |
| new \co{measurement} structure. |
| \end{enumerate} |
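
A sketch of these steps, again assuming a hypothetical \co{animal}
pointer \co{ap}, new measurement values \co{m1}, \co{m2}, and
\co{m3}, and an update-side lock held by the caller:

\begin{VerbatimN}[tabsize=8]
struct measurement *mp_new;

mp_new = malloc(sizeof(*mp_new));      /* Step 1. */
mp_new->meas_1 = m1;
mp_new->meas_2 = m2;
mp_new->meas_3 = m3;
rcu_assign_pointer(ap->mp, mp_new);    /* Step 2. */
synchronize_rcu();                     /* Step 3. */
ap->meas = *mp_new;                    /* Step 4. */
rcu_assign_pointer(ap->mp, &ap->meas); /* Step 5. */
synchronize_rcu();                     /* Step 6... */
free(mp_new);                          /* ...then free. */
\end{VerbatimN}
\vspace{5pt}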
| |
This approach uses a heavier-weight update procedure to eliminate
the extra cache miss in the common case.
| The extra cache miss will be incurred only while an update is |
| actually in progress. |
| }\QuickQuizEnd |
| |
This approach enables readers to see correlated values for selected
fields while incurring minimal read-side overhead.
| This per-data-element consistency suffices in the common case where |
| a reader looks only at a single data element. |
| |
| % Flag for deletion (if not already covered in the defer chapter). |
| % @@@ Issaquah Challenge. |
| % @@@ RLU & MV-RLU (Eventually the corresponding patents.) |
| |
| \subsection{Update-Friendly Traversal} |
| \label{sec:together:Update-Friendly Traversal} |
| |
| Suppose that a statistical scan of all elements in a hash table is |
| required. |
| For example, Schr\"odinger might wish to compute the average |
| length-to-weight ratio over all of his animals.\footnote{ |
| Why would such a quantity be useful? |
| Beats me! |
| But group statistics are often useful.} |
| Suppose further that Schr\"odinger is willing to ignore slight |
| errors due to animals being added to and removed from the hash |
| table while this statistical scan is being carried out. |
| What should Schr\"odinger do to control concurrency? |
| |
| One approach is to enclose the statistical scan in an RCU read-side |
| critical section. |
| This permits updates to proceed concurrently without unduly impeding |
| the scan. |
| In particular, the scan does not block the updates and vice versa, |
which allows scans of hash tables containing very large numbers of
| elements to be supported gracefully, even in the face of very high |
| update rates. |
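
For example, the scan might be sketched as follows, using the
userspace-RCU list API and assuming hypothetical \co{->length} and
\co{->weight} fields in the \co{animal} structure, a hypothetical
\co{->node} list member, and a hypothetical \co{HASH_SIZE}-bucket
\co{hashtab[]} array:

\begin{VerbatimN}[tabsize=8]
double sum = 0.0;
long n = 0;
struct animal *ap;
int i;

rcu_read_lock();
for (i = 0; i < HASH_SIZE; i++) {
        cds_list_for_each_entry_rcu(ap, &hashtab[i], node) {
                sum += ap->length / ap->weight;
                n++;
        }
}
rcu_read_unlock();
/* Average ratio is sum / n, modulo concurrent updates. */
\end{VerbatimN}
\vspace{5pt}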
| |
| \QuickQuiz{ |
| But how does this scan work while a resizable hash table |
| is being resized? |
| In that case, neither the old nor the new hash table is |
| guaranteed to contain all the elements in the hash table! |
| }\QuickQuizAnswer{ |
| True, resizable hash tables as described in |
| \cref{sec:datastruct:Non-Partitionable Data Structures} |
| cannot be fully scanned while being resized. |
| One simple way around this is to acquire the |
| \co{hashtab} structure's \co{->ht_lock} while scanning, |
| but this prevents more than one scan from proceeding |
| concurrently. |
| |
| Another approach is for updates to mutate the old hash |
| table as well as the new one while resizing is in |
| progress. |
| This would allow scans to find all elements in the old |
| hash table. |
| Implementing this is left as an exercise for the reader. |
| }\QuickQuizEnd |
| |
| \subsection{Scalable Reference Count Two} |
| \label{sec:together:Scalable Reference Count Two} |
| |
| Suppose a \IX{reference count} is becoming a performance or scalability |
| bottleneck. |
| What can you do? |
| |
| One approach is to use per-CPU counters for each reference count, |
| somewhat similar to the algorithms in \cref{chp:Counting}, in particular, |
| the exact limit counters described in |
| \cref{sec:count:Exact Limit Counters}. |
| The need to switch between per-CPU and global modes for these counters |
| results either in expensive increments and decrements on the one hand |
| (\cref{sec:count:Atomic Limit Counter Implementation}) |
| or in the use of POSIX signals on the other |
| (\cref{sec:count:Signal-Theft Limit Counter Design}). |
| |
| Another approach is to use RCU to mediate the switch between per-CPU |
| and global counting modes. |
| Each update is carried out within an RCU read-side critical section, |
| and each update checks a flag to determine whether to update the |
per-CPU counters on the one hand or the global counter on the other.
| To switch modes, update the flag, wait for a grace period, and then |
| move any remaining counts from the per-CPU counters to the global |
| counter or vice versa. |
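
A minimal sketch of the update side and of one direction of the mode
switch, using Linux-kernel primitives and hypothetical variable names:

\begin{VerbatimN}[tabsize=8]
static bool percpu_mode = true;
static DEFINE_PER_CPU(long, ref_pcpu);
static atomic_long_t ref_global;

static void ref_adjust(long delta)
{
        rcu_read_lock();
        if (READ_ONCE(percpu_mode))
                this_cpu_add(ref_pcpu, delta);
        else
                atomic_long_add(delta, &ref_global);
        rcu_read_unlock();
}

static void switch_to_global_mode(void)
{
        int cpu;

        WRITE_ONCE(percpu_mode, false);
        synchronize_rcu(); /* Wait out per-CPU-mode updates. */
        /* Fold the per-CPU counts into the global counter. */
        for_each_possible_cpu(cpu)
                atomic_long_add(per_cpu(ref_pcpu, cpu),
                                &ref_global);
}
\end{VerbatimN}
\vspace{5pt}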
| |
| The Linux kernel uses this RCU-mediated approach in its \co{percpu_ref} |
| style of reference counter. |
| Code using this reference counter must initialize the \co{percpu_ref} |
| structure using \co{percpu_ref_init()}, which takes as arguments |
| a pointer to the structure, a pointer to a function to invoke when |
| the reference count reaches zero, a set of mode flags, and a set |
| of \co{kmalloc()} \co{GFP_} flags. |
| After normal initialization, the structure has one reference and |
| is in per-CPU mode. |
| |
| The mode flags are usually zero, but can include the |
| \co{PERCPU_REF_INIT_ATOMIC} bit if the counter is to start in slow |
| non-per-CPU (that is, atomic) mode. |
| There is also a \co{PERCPU_REF_ALLOW_REINIT} bit that allows |
| a given \co{percpu_ref} counter to be reused via a call to |
| \co{percpu_ref_reinit()} without needing to be freed and reallocated. |
| Regardless of how the \co{percpu_ref} structure is initialized, |
| \co{percpu_ref_get()} may be used to acquire a reference and |
| \co{percpu_ref_put()} may be used to release a reference. |
| |
| When in per-CPU mode, the \co{percpu_ref} structure cannot determine whether |
| or not its value has reached zero. |
| When such a determination is necessary, \co{percpu_ref_kill()} may |
| be invoked. |
| This function switches the structure into atomic mode and removes the |
| initial reference installed by the call to \co{percpu_ref_init()}. |
| Of course, when in atomic mode, calls to \co{percpu_ref_get()} and |
| \co{percpu_ref_put()} are quite expensive, but \co{percpu_ref_put()} |
| can tell when the value reaches zero. |
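
The resulting lifecycle might be sketched as follows, assuming a
hypothetical enclosing \co{mydev} structure and a hypothetical
\co{do_something()} function:

\begin{VerbatimN}[tabsize=8]
static void mydev_release(struct percpu_ref *r)
{
        struct mydev *p = container_of(r, struct mydev, ref);

        kfree(p); /* Last reference dropped. */
}

/* Initialization: one reference held, per-CPU mode. */
if (percpu_ref_init(&p->ref, mydev_release, 0, GFP_KERNEL))
        return -ENOMEM;

/* Fastpath. */
percpu_ref_get(&p->ref);
do_something(p);
percpu_ref_put(&p->ref);

/* Teardown: atomic mode, drop the initial reference. */
percpu_ref_kill(&p->ref);
\end{VerbatimN}
\vspace{5pt}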
| |
| Readers desiring more \co{percpu_ref} information are referred to |
| the Linux-kernel documentation and source code. |
| |
| \subsection{Retriggered Grace Periods} |
| \label{sec:together:Retriggered Grace Periods} |
| |
| There is no \co{call_rcu_cancel()}, so once an \co{rcu_head} structure |
| is passed to \co{call_rcu()}, there is no calling it back. |
| It must be left alone until the callback is invoked. |
| In the common case, this is as it should be because the \co{rcu_head} |
| structure is on a one-way journey to deallocation. |
| |
| However, there are use cases that combine RCU and explicit \co{open()} |
| and \co{close()} calls. |
| After a \co{close()} call, readers are not supposed to begin new accesses |
| to the data structure, but there might well be readers completing their |
| traversal. |
| This situation can be handled in the usual manner: |
| Wait for a grace period following the \co{close()} call before freeing |
| the data structures. |
| |
| But what if \co{open()} is called before the grace period ends? |
| |
| Again, there is no \co{call_rcu_cancel()}, so another approach is to |
| set a flag that is checked by the callback function, which can opt out |
| of actually freeing anything. |
| Problem solved! |
| |
| But what if \co{open()} and then another \co{close()} are both called |
| before the grace period ends? |
| |
| One approach is to have a second value for the flag that causes the |
| callback to requeue itself. |
| |
But what if there is not only an \co{open()} and then another \co{close()},
| but also another \co{open()} before the grace period ends? |
| |
In this case, the callback needs to set state indicating that the
last \co{open()} is still in effect.
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{together/retriggergp}} |
| \caption{Retrigger-Grace-Period State Machine} |
| \label{fig:count:Retrigger-Grace-Period State Machine} |
| \end{figure} |
| |
| Continuing this line of thought leads us to the state machine |
| shown in \cref{fig:count:Retrigger-Grace-Period State Machine}. |
| The initial state is CLOSED and the operational state is OPEN\@. |
| The diamond-shaped arrowheads denote \co{call_rcu()} invocation, while |
| the arrows labeled ``CB'' denote callback invocation. |
| |
| The normal path through this state machine traverses the states CLOSED, |
| OPEN, CLOSING (with an invocation of \co{call_rcu()}), and back to CLOSED |
| once the callback has been invoked. |
| If \co{open()} is invoked before the grace period completes, the |
| state machine traverses the cycle OPEN, CLOSING (with |
| an invocation of \co{call_rcu()}), REOPENING, and back to OPEN once the |
| callback has been invoked. |
| If \co{open()} and then \co{close()} are invoked before the grace period |
| completes, the state machine traverses the cycle OPEN, CLOSING (with |
| an invocation of \co{call_rcu()}), REOPENING, RECLOSING, and back to CLOSING once the |
| callback has been invoked. |
| |
| Given an indefinite alternating sequence of \co{close()} and \co{open()} |
invocations, the state machine would traverse OPEN and CLOSING (with
| an invocation of \co{call_rcu()}), followed by alternating sojourns in |
| the REOPENING and RECLOSING states. |
| Once the grace period ends, the state machine would transition to |
either the CLOSING or the OPEN state, depending on which of the
| RECLOSING or REOPENING states the callback was invoked in. |
| |
| \begin{listing} |
| \ebresizeverb{.88}{\input{CodeSamples/together/retrigger-gp@whole.fcv}} |
| \caption{Retriggering a Grace Period (Pseudocode)} |
| \label{lst:together:Retriggering a Grace Period} |
| \end{listing} |
| |
| Rough pseudocode of this state machine is shown in |
| \cref{lst:together:Retriggering a Grace Period}. |
| \begin{fcvref}[ln:together:retrigger-gp:whole] |
The five states are shown on \clnrefrange{states:b}{states:e},
and the current state is held in \co{rtrg_status} on \clnref{status},
which is protected by the lock on \clnref{lock}.
| |
| The three CB transitions (emanating from states CLOSING, REOPENING, |
| and RECLOSING) are implemented by the \co{close_cb()} function shown |
| on \clnrefrange{close_cb:b}{close_cb:e}. |
| \Clnref{cleanup} invokes a user-supplied \co{close_cleanup()} to |
| take any final cleanup actions such as freeing memory when |
| transitioning to the CLOSED state. |
| \Clnref{call_rcu1} contains the \co{call_rcu()} invocation that |
| causes a later transition to the CLOSED state. |
| |
| The \co{open()} function on \clnrefrange{open:b}{open:e} implements |
| the transitions to the OPEN, CLOSING, and REOPENING states, with |
\clnref{do_open} invoking a \co{do_open()} function to implement
allocation and initialization of any needed data structures.
| |
| The \co{close()} function on \clnrefrange{close:b}{close:e} |
| implements the transitions to the CLOSING and RECLOSING states, |
| with \clnref{do_close} invoking a \co{do_close()} function to take |
| any actions that might be required to finalize this transition, |
| for example, causing later read-only traversals to return errors. |
| \Clnref{call_rcu2} contains the \co{call_rcu()} invocation that |
| causes a later transition to the CLOSED state. |
| \end{fcvref} |
| |
This state machine and pseudocode show how to get the effect of a
| \co{call_rcu_cancel()} in those rare situations needing such semantics. |
| |
| \subsection{Long-Duration Accesses Two} |
| \label{sec:together:Long-Duration Accesses Two} |
| |
| Suppose a reader-writer-locking reader is holding the lock for so |
| long that updates are excessively delayed. |
| Suppose further that this reader cannot reasonably be converted to |
| use reference counting |
| (otherwise, see \cref{sec:together:Long-Duration Accesses}). |
| |
| If that reader can be reasonably converted to use RCU, that might |
| solve the problem. |
| The reason is that RCU readers do not completely block updates, but |
| rather block only the cleanup portions of those updates (including |
| memory reclamation). |
| Therefore, if the system has ample memory, converting the reader-writer |
| lock to RCU may suffice. |
| |
| However, converting to RCU does not always suffice. |
| For example, the code might traverse an extremely large linked data |
| structure within a single RCU read-side critical section, which might |
| so greatly extend the RCU grace period that the system runs out of memory. |
| These situations can be handled in a couple of different ways: |
| \begin{enumerate*}[(1)] |
| \item Use SRCU instead of RCU and |
| \item Acquire a reference to exit the RCU reader. |
| \end{enumerate*} |
| |
| \subsubsection{Use SRCU} |
| \label{sec:together:Use SRCU} |
| |
| In the Linux kernel, RCU is global. |
| In other words, any long-running RCU reader anywhere in the kernel will |
| delay the current RCU grace period. |
| If the long-running RCU reader is traversing a small data structure, |
| that small amount of data is delaying freeing of all other data structures, |
| which can result in memory exhaustion. |
| |
| One way to avoid this problem is to use SRCU for that long-running RCU |
| reader's data structure, with its own \co{srcu_struct} structure. |
| The resulting long-running SRCU readers will then delay only that |
| \co{srcu_struct} structure's grace periods, and not those of RCU, |
| thus avoiding memory exhaustion. |
| For more details, see the SRCU API in \cref{sec:defer:RCU Linux-Kernel API}. |
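
For example, a minimal sketch using a private SRCU domain:

\begin{VerbatimN}[tabsize=8]
DEFINE_SRCU(my_srcu);
int idx;

/* Reader, which might run for a long time. */
idx = srcu_read_lock(&my_srcu);
/* ... lengthy traversal ... */
srcu_read_unlock(&my_srcu, idx);

/* Updater waits only for my_srcu readers. */
synchronize_srcu(&my_srcu);
\end{VerbatimN}
\vspace{5pt}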
| |
| Unfortunately, this approach does have some drawbacks. |
| For one thing, SRCU readers are not subject to priority boosting, which can |
| result in additional delays to low-priority SRCU readers on busy |
| systems. |
| Worse yet, defining a separate \co{srcu_struct} structure reduces the |
| number of RCU updaters, which in turn increases the grace-period |
| overhead per updater. |
| This means that giving each current Linux-kernel RCU use case its own |
| \co{srcu_struct} structure could multiply system-wide grace-period |
| overhead by the number of such structures. |
| |
| Therefore, it is often better to acquire some sort of non-RCU reference |
| on the needed data to permit a momentary exit from the RCU read-side |
| critical section, as described in the next section. |
| |
| \subsubsection{Acquire a Reference} |
| \label{sec:together:Acquire a Reference} |
| |
| If the RCU read-side critical section is too long, shorten it! |
| |
| In some cases, this can be done trivially. |
| For example, code that scans all of the hash chains of a statically |
| allocated array of hash buckets can just as easily scan each hash chain |
| within its own critical section. |
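
A sketch of this per-chain approach, assuming a hypothetical
\co{HASH_SIZE}-bucket \co{buckets[]} array and a hypothetical
\co{do_something_with()} function:

\begin{VerbatimN}[tabsize=8]
for (i = 0; i < HASH_SIZE; i++) {
        rcu_read_lock(); /* One critical section per chain. */
        cds_list_for_each_entry_rcu(p, &buckets[i], node)
                do_something_with(p);
        rcu_read_unlock(); /* Grace periods can complete here. */
}
\end{VerbatimN}
\vspace{5pt}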
| |
This works because hash chains are, by design, normally quite short.
| When traversing long linked structures, it is necessary to have some |
| way of stopping in the middle and resuming later. |
| |
| For example, in Linux kernel v5.16, the \co{khugepaged_scan_file()} |
| function checks to see if some other task needs the current CPU |
| using \co{need_resched()}, and if so invokes \co{xas_pause()} to |
| adjust the traversal's iterator appropriately, and then invokes |
| \co{cond_resched_rcu()} to yield the CPU\@. |
| In turn, the \co{cond_resched_rcu()} function invokes \co{rcu_read_unlock()}, |
| \co{cond_resched()}, and finally \co{rcu_read_lock()} to drop out of |
| the RCU read-side critical section in order to yield the CPU. |
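
The overall pattern might be sketched as follows, assuming an XArray
iterator \co{xas} that has already been initialized using
\co{XA_STATE()}:

\begin{VerbatimN}[tabsize=8]
void *entry;

rcu_read_lock();
xas_for_each(&xas, entry, ULONG_MAX) {
        /* ... process entry ... */
        if (need_resched()) {
                xas_pause(&xas); /* Allow safe restart. */
                rcu_read_unlock();
                cond_resched();  /* Yield the CPU. */
                rcu_read_lock();
        }
}
rcu_read_unlock();
\end{VerbatimN}
\vspace{5pt}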
| |
| Of course, where feasible, another approach would be to switch to a |
| data structure such as a hash table that is more friendly to momentarily |
| dropping out of an RCU read-side critical section. |
| |
| \QuickQuiz{ |
| But how would this work with a resizable hash table, such |
| as the one described in |
| \cref{sec:datastruct:Non-Partitionable Data Structures}? |
| }\QuickQuizAnswer{ |
| In this case, more care is required because the hash table |
| might well be resized during the time that we momentarily |
| exited the RCU read-side critical section. |
| Worse yet, the resize operation can be expected to free the |
| old hash buckets, leaving us pointing to the freelist. |
| |
| But it is not sufficient to prevent the old hash buckets |
| from being freed. |
| It is also necessary to ensure that those buckets continue |
| to be updated. |
| |
| One way to handle this is to have a reference count on each |
| set of buckets, which is initially set to the value one. |
| A full-table scan would acquire a reference at the beginning of |
| the scan (but only if the reference is non-zero) and release it |
| at the end of the scan. |
| The resizing would populate the new buckets, release the |
| reference, wait for a grace period, and then wait for the |
| reference to go to zero. |
| Once the reference was zero, the resizing could let updaters |
forget about the old hash buckets and then free them.
| |
| Actual implementation is left to the interested reader, who will |
| gain much insight from this task. |
| }\QuickQuizEnd |
| |
| % @@@ RCU link counts |