| % defer/refcnt.tex |
| |
| \section{Reference Counting} |
| \label{sec:defer:Reference Counting} |
| |
| Reference counting tracks the number of references |
| to a given object in order to prevent that object from being prematurely |
| freed. |
| Although this is a conceptually simple technique, many devils hide in |
| the details. |
| After all, if the object was not subject to premature disposal, |
| there would be no need for the reference counter in the first place. |
| But if the object can be disposed of, what prevents disposal during |
| the reference-acquisition process itself? |
| |
| There are a number of possible answers to this question, including: |
| |
| \begin{enumerate} |
| \item A lock residing outside of the object must be held while |
| manipulating the reference count. |
| \item The object is created with a non-zero reference count, and new |
| references may be acquired only when the current value of |
| the reference counter is non-zero. |
| If a thread does not have a reference to a given object, |
| it may obtain one with the help of another thread that |
| already has a reference. |
| \item An existence guarantee is provided for the object, preventing |
| it from being freed while some other |
| entity might be attempting to acquire a reference. |
| Existence guarantees are often provided by automatic |
| garbage collectors, and, as will be seen in |
| Section~\ref{sec:defer:Read-Copy Update (RCU)}, by RCU. |
| \item A type-safety guarantee is provided for the object. |
| An additional identity check must be performed once |
| the reference is acquired. |
| Type-safety guarantees can be provided by special-purpose |
| memory allocators, for example, by the |
| \co{SLAB_DESTROY_BY_RCU} feature within the Linux kernel, |
| as will be seen in Section~\ref{sec:defer:Read-Copy Update (RCU)}. |
| \end{enumerate} |
| |
| Of course, any mechanism that provides existence guarantees |
| by definition also provides type-safety guarantees. |
This section will therefore group the last two answers together under the
rubric of RCU, leaving us with three general categories of
reference-acquisition protection: locking, reference counting, and RCU.
| |
| \QuickQuiz{} |
| Why not implement reference-acquisition using |
| a simple compare-and-swap operation that only |
| acquires a reference if the reference counter is |
| non-zero? |
| \QuickQuizAnswer{ |
| Although this can resolve the race between the release of |
| the last reference and acquisition of a new reference, |
| it does absolutely nothing to prevent the data structure |
| from being freed and reallocated, possibly as some completely |
| different type of structure. |
| It is quite likely that the ``simple compare-and-swap |
| operation'' would give undefined results if applied to the |
| differently typed structure. |
| |
| In short, use of atomic operations such as compare-and-swap |
| absolutely requires either type-safety or existence guarantees. |
| } \QuickQuizEnd |
| |
| \begin{table} |
| \centering |
| \begin{tabular}{l||c|c|c} |
| & \multicolumn{3}{c}{Release Synchronization} \\ |
| \cline{2-4} |
| Acquisition & & Reference & \\ |
| Synchronization & Locking & Counting & RCU \\ |
| \hline |
| \hline |
| Locking & - & CAM & CA \\ |
| \hline |
| Reference & A & AM & A \\ |
| Counting & & & \\ |
| \hline |
| RCU & CA & MCA & CA \\ |
| \end{tabular} |
| \caption{Reference Counting and Synchronization Mechanisms} |
| \label{tab:defer:Reference Counting and Synchronization Mechanisms} |
| \end{table} |
| |
| Given that the key reference-counting issue |
| is synchronization between acquisition |
| of a reference and freeing of the object, we have nine possible |
| combinations of mechanisms, as shown in |
| Table~\ref{tab:defer:Reference Counting and Synchronization Mechanisms}. |
| This table |
| divides reference-counting mechanisms into the following broad categories: |
| \begin{enumerate} |
| \item Simple counting with neither atomic operations, memory |
| barriers, nor alignment constraints \makebox{(``-'')}. |
| \item Atomic counting without memory barriers (``A''). |
| \item Atomic counting, with memory barriers required only on release |
| (``AM''). |
| \item Atomic counting with a check combined with the atomic acquisition |
| operation, and with memory barriers required only on release |
| (``CAM''). |
| \item Atomic counting with a check combined with the atomic acquisition |
| operation (``CA''). |
| \item Atomic counting with a check combined with the atomic acquisition |
| operation, and with memory barriers also required on acquisition |
| (``MCA''). |
| \end{enumerate} |
| However, because all Linux-kernel atomic operations that return a |
| value are defined to contain memory barriers, all release operations |
| contain memory barriers, and all checked acquisition operations also |
| contain memory barriers. |
| Therefore, cases ``CA'' and ``MCA'' are equivalent to ``CAM'', so that |
| there are sections below for only the first four cases: |
| \makebox{``-''}, ``A'', ``AM'', and ``CAM''. |
| The Linux primitives that support reference counting are presented in |
| Section~\ref{sec:defer:Linux Primitives Supporting Reference Counting}. |
Later sections cite optimizations that can improve performance
if reference acquisition and release are very frequent, and the
reference count need be checked for zero only very rarely.
| |
| \subsection{Implementation of Reference-Counting Categories} |
| \label{sec:defer:Implementation of Reference-Counting Categories} |
| |
Simple counting protected by locking (\makebox{``-''}) is described in
Section~\ref{sec:defer:Simple Counting},
atomic counting with no memory barriers (``A'') is described in
Section~\ref{sec:defer:Atomic Counting},
atomic counting with release memory barrier (``AM'') is described in
Section~\ref{sec:defer:Atomic Counting With Release Memory Barrier},
and
atomic counting with check and release memory barrier (``CAM'') is described in
Section~\ref{sec:defer:Atomic Counting With Check and Release Memory Barrier}.
| |
| \subsubsection{Simple Counting} |
| \label{sec:defer:Simple Counting} |
| |
| Simple counting, with neither atomic operations nor memory barriers, |
| can be used when the reference-counter acquisition and release are |
| both protected by the same lock. |
| In this case, it should be clear that the reference count itself |
| may be manipulated non-atomically, because the lock provides any |
| necessary exclusion, memory barriers, atomic instructions, and disabling |
| of compiler optimizations. |
| This is the method of choice when the lock is required to protect |
| other operations in addition to the reference count, but where |
| a reference to the object must be held after the lock is released. |
| Figure~\ref{fig:defer:Simple Reference-Count API} shows a simple |
| API that might be used to implement simple non-atomic reference |
| counting -- although simple reference counting is almost always |
| open-coded instead. |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct sref { |
| 2 int refcount; |
| 3 }; |
| 4 |
| 5 void sref_init(struct sref *sref) |
| 6 { |
| 7 sref->refcount = 1; |
| 8 } |
| 9 |
| 10 void sref_get(struct sref *sref) |
| 11 { |
| 12 sref->refcount++; |
| 13 } |
| 14 |
| 15 int sref_put(struct sref *sref, |
| 16 void (*release)(struct sref *sref)) |
| 17 { |
| 18 WARN_ON(release == NULL); |
| 19 WARN_ON(release == (void (*)(struct sref *))kfree); |
| 20 |
| 21 if (--sref->refcount == 0) { |
| 22 release(sref); |
| 23 return 1; |
| 24 } |
| 25 return 0; |
| 26 } |
| \end{verbatim} |
| } |
| \caption{Simple Reference-Count API} |
| \label{fig:defer:Simple Reference-Count API} |
| \end{figure} |
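For concreteness, the following user-space sketch shows how such an API
might be used.
The \co{shape} object, its \co{shape_get()} and \co{shape_put()} wrappers,
and the test-and-set spinlock standing in for the external lock are all
illustrative inventions rather than Linux-kernel code, and the figure's
\co{WARN_ON()} checks are omitted:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified sref, mirroring the figure (WARN_ON() checks omitted). */
struct sref { int refcount; };

static void sref_init(struct sref *sref) { sref->refcount = 1; }
static void sref_get(struct sref *sref) { sref->refcount++; }

static int sref_put(struct sref *sref, void (*release)(struct sref *sref))
{
	if (--sref->refcount == 0) {
		release(sref);
		return 1;
	}
	return 0;
}

/* Trivial test-and-set spinlock, standing in for the external lock. */
static void lock_acquire(int *l) { while (__sync_lock_test_and_set(l, 1)) ; }
static void lock_release(int *l) { __sync_lock_release(l); }

/* Hypothetical reference-counted object guarded by an external lock. */
struct shape {
	int *lock;		/* lock protecting the reference count */
	struct sref ref;
};

static int shapes_released;	/* counts release() invocations */

static void shape_release(struct sref *sref)
{
	shapes_released++;	/* a real release() would free the object */
}

static void shape_get(struct shape *s)
{
	lock_acquire(s->lock);	/* hold the lock across sref_get() */
	sref_get(&s->ref);
	lock_release(s->lock);
}

static int shape_put(struct shape *s)
{
	int ret;

	lock_acquire(s->lock);	/* hold the lock across sref_put() */
	ret = sref_put(&s->ref, shape_release);
	lock_release(s->lock);
	return ret;
}
```

Because every manipulation of \co{refcount} happens with the lock held,
the plain \co{++} and \co{--} in \co{sref_get()} and \co{sref_put()}
suffice.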
| |
| \subsubsection{Atomic Counting} |
| \label{sec:defer:Atomic Counting} |
| |
| Simple atomic counting may be used in cases where any CPU acquiring |
| a reference must already hold a reference. |
This style is used when a single CPU creates an object for its
own private use, but must allow other CPUs, tasks, timer handlers,
or I/O completion handlers that it later spawns to also access this object.
| Any CPU that hands the object off must first acquire a new reference |
| on behalf of the recipient object. |
| In the Linux kernel, the \co{kref} primitives are used to implement |
| this style of reference counting, as shown in |
| Figure~\ref{fig:defer:Linux Kernel kref API}. |
| |
| Atomic counting is required |
| because locking is not used to protect all reference-count operations, |
| which means that it is possible for two different CPUs to concurrently |
| manipulate the reference count. |
| If normal increment and decrement were used, a pair of CPUs might both |
| fetch the reference count concurrently, perhaps both obtaining |
| the value ``3''. |
| If both of them increment their value, they will both obtain ``4'', |
| and both will store this value back into the counter. |
| Since the new value of the counter should instead be ``5'', one |
| of the two increments has been lost. |
| Therefore, atomic operations must be used both for counter increments |
| and for counter decrements. |
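This lost update can be replayed deterministically in a single thread
by interleaving the two CPUs' fetch, increment, and store steps by hand.
The following user-space sketch does exactly that, using GCC's
\co{__atomic_fetch_add()} builtin to play the role of an atomic increment:

```c
#include <assert.h>

/* Deterministic single-threaded replay of the lost-update interleaving:
 * both "CPUs" load 3, both compute 4, and both store 4, so one of the
 * two increments is lost. */
static int racy_increments(int initial)
{
	int counter = initial;
	int cpu0_tmp = counter;	/* CPU 0 fetches the counter */
	int cpu1_tmp = counter;	/* CPU 1 fetches it concurrently */

	cpu0_tmp++;		/* CPU 0 computes the new value */
	cpu1_tmp++;		/* CPU 1 computes the same new value */
	counter = cpu0_tmp;	/* CPU 0 stores 4 */
	counter = cpu1_tmp;	/* CPU 1 overwrites with 4: increment lost */
	return counter;
}

/* With atomic read-modify-write operations, the load, increment,
 * and store cannot interleave, so no increment can be lost. */
static int atomic_increments(int initial)
{
	int counter = initial;

	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);
	return counter;
}
```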
| |
| If releases are guarded by locking or RCU, |
| memory barriers are \emph{not} required, but for different reasons. |
| In the case of locking, the locks provide any needed memory barriers |
| (and disabling of compiler optimizations), and the locks also |
| prevent a pair of releases from running concurrently. |
| In the case of RCU, cleanup must be deferred until all currently |
| executing RCU read-side critical sections have completed, and |
| any needed memory barriers or disabling of compiler optimizations |
| will be provided by the RCU infrastructure. |
| Therefore, if two CPUs release the final two references concurrently, |
| the actual cleanup will be deferred until both CPUs exit their |
| RCU read-side critical sections. |
| |
| \QuickQuiz{} |
| Why isn't it necessary to guard against cases where one CPU |
| acquires a reference just after another CPU releases the last |
| reference? |
| \QuickQuizAnswer{ |
| Because a CPU must already hold a reference in order |
| to legally acquire another reference. |
| Therefore, if one CPU releases the last reference, |
| there cannot possibly be any CPU that is permitted |
| to acquire a new reference. |
This same fact allows the non-atomic check in line~12
of Figure~\ref{fig:defer:Linux Kernel kref API}.
| } \QuickQuizEnd |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct kref { |
| 2 atomic_t refcount; |
| 3 }; |
| 4 |
| 5 void kref_init(struct kref *kref) |
| 6 { |
| 7 atomic_set(&kref->refcount, 1); |
| 8 } |
| 9 |
| 10 void kref_get(struct kref *kref) |
| 11 { |
| 12 WARN_ON(!atomic_read(&kref->refcount)); |
| 13 atomic_inc(&kref->refcount); |
| 14 } |
| 15 |
| 16 static inline int |
| 17 kref_sub(struct kref *kref, unsigned int count, |
| 18 void (*release)(struct kref *kref)) |
| 19 { |
| 20 WARN_ON(release == NULL); |
| 21 |
| 22 if (atomic_sub_and_test((int) count, |
| 23 &kref->refcount)) { |
| 24 release(kref); |
| 25 return 1; |
| 26 } |
| 27 return 0; |
| 28 } |
| \end{verbatim} |
| } |
| \caption{Linux Kernel kref API} |
| \label{fig:defer:Linux Kernel kref API} |
| \end{figure} |
| |
| The \co{kref} structure itself, consisting of a single atomic |
| data item, is shown in lines~1-3 of |
| Figure~\ref{fig:defer:Linux Kernel kref API}. |
| The \co{kref_init()} function on lines~5-8 initializes the counter |
| to the value ``1''. |
Note that the \co{atomic_set()} primitive is a simple
assignment; the name stems from the \co{atomic_t} data type
rather than from the operation.
| The \co{kref_init()} function must be invoked during object creation, |
| before the object has been made available to any other CPU. |
| |
| The \co{kref_get()} function on lines~10-14 unconditionally atomically |
| increments the counter. |
| The \co{atomic_inc()} primitive does not necessarily explicitly |
| disable compiler |
| optimizations on all platforms, but the fact that the \co{kref} |
| primitives are in a separate module and that the Linux kernel build |
| process does no cross-module optimizations has the same effect. |
| |
The \co{kref_sub()} function on lines~16-28 atomically decrements the
counter, and if the result is zero, line~24 invokes the specified
\co{release()} function and line~25 returns, informing the caller
that \co{release()} was invoked.
Otherwise, \co{kref_sub()} returns zero, informing the caller that
\co{release()} was not called.
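A usage sketch may help.
The following user-space code embeds the counter in a hypothetical
\co{widget} structure, with GCC atomic builtins standing in for the
kernel's \co{atomic_t} operations and a \co{container_of()} macro
matching the kernel's definition; the \co{widget} functions are
illustrative inventions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-ins for the kernel's kref primitives. */
struct kref { int refcount; };	/* atomic_t in the kernel */

static void kref_init(struct kref *kref)
{
	__atomic_store_n(&kref->refcount, 1, __ATOMIC_RELAXED);
}

static void kref_get(struct kref *kref)
{
	__atomic_fetch_add(&kref->refcount, 1, __ATOMIC_RELAXED);
}

static int kref_put(struct kref *kref, void (*release)(struct kref *))
{
	if (__atomic_sub_fetch(&kref->refcount, 1, __ATOMIC_ACQ_REL) == 0) {
		release(kref);
		return 1;
	}
	return 0;
}

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical reference-counted object embedding the counter. */
struct widget {
	int data;
	struct kref ref;
};

static int widgets_freed;

static void widget_release(struct kref *kref)
{
	struct widget *w = container_of(kref, struct widget, ref);

	free(w);
	widgets_freed++;
}

static struct widget *widget_create(int data)
{
	struct widget *w = malloc(sizeof(*w));

	w->data = data;
	kref_init(&w->ref);	/* creator holds the initial reference */
	return w;
}
```

A creator hands the widget off by first invoking \co{kref_get()} on
behalf of the recipient; the object is freed only when the last
holder's \co{kref_put()} drives the counter to zero.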
| |
| \QuickQuiz{} |
| Suppose that just after the \co{atomic_sub_and_test()} |
| on line~22 of |
| Figure~\ref{fig:defer:Linux Kernel kref API} is invoked, |
| that some other CPU invokes \co{kref_get()}. |
| Doesn't this result in that other CPU now having an illegal |
| reference to a released object? |
| \QuickQuizAnswer{ |
| This cannot happen if these functions are used correctly. |
| It is illegal to invoke \co{kref_get()} unless you already |
| hold a reference, in which case the \co{kref_sub()} could |
| not possibly have decremented the counter to zero. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Suppose that \co{kref_sub()} returns zero, indicating that |
| the \co{release()} function was not invoked. |
| Under what conditions can the caller rely on the continued |
| existence of the enclosing object? |
| \QuickQuizAnswer{ |
| The caller cannot rely on the continued existence of the |
| object unless it knows that at least one reference will |
| continue to exist. |
| Normally, the caller will have no way of knowing this, and |
must therefore carefully avoid referencing the object after
| the call to \co{kref_sub()}. |
| } \QuickQuizEnd |
| |
| \subsubsection{Atomic Counting With Release Memory Barrier} |
| \label{sec:defer:Atomic Counting With Release Memory Barrier} |
| |
This style of reference counting is used in the Linux kernel's networking
| layer to track the destination caches that are used in packet routing. |
| The actual implementation is quite a bit more involved; this section |
| focuses on the aspects of \co{struct dst_entry} reference-count |
handling that match this use case,
| shown in Figure~\ref{fig:defer:Linux Kernel dst-clone API}. |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 static inline |
| 2 struct dst_entry * dst_clone(struct dst_entry * dst) |
| 3 { |
| 4 if (dst) |
| 5 atomic_inc(&dst->__refcnt); |
| 6 return dst; |
| 7 } |
| 8 |
| 9 static inline |
| 10 void dst_release(struct dst_entry * dst) |
| 11 { |
| 12 if (dst) { |
| 13 WARN_ON(atomic_read(&dst->__refcnt) < 1); |
| 14 smp_mb__before_atomic_dec(); |
| 15 atomic_dec(&dst->__refcnt); |
| 16 } |
| 17 } |
| \end{verbatim} |
| } |
| \caption{Linux Kernel dst\_clone API} |
| \label{fig:defer:Linux Kernel dst-clone API} |
| \end{figure} |
| |
| The \co{dst_clone()} primitive may be used if the caller |
| already has a reference to the specified \co{dst_entry}, |
| in which case it obtains another reference that may be handed off |
| to some other entity within the kernel. |
| Because a reference is already held by the caller, \co{dst_clone()} |
| need not execute any memory barriers. |
| The act of handing the \co{dst_entry} to some other entity might |
| or might not require a memory barrier, but if such a memory barrier |
| is required, it will be embedded in the mechanism used to hand the |
| \co{dst_entry} off. |
| |
| The \co{dst_release()} primitive may be invoked from any environment, |
| and the caller might well reference elements of the \co{dst_entry} |
| structure immediately prior to the call to \co{dst_release()}. |
| The \co{dst_release()} primitive therefore contains a memory |
| barrier on line~14 preventing both the compiler and the CPU |
| from misordering accesses. |
| |
| Please note that the programmer making use of \co{dst_clone()} and |
| \co{dst_release()} need not be aware of the memory barriers, only |
| of the rules for using these two primitives. |
| |
| \subsubsection{Atomic Counting With Check and Release Memory Barrier} |
| \label{sec:defer:Atomic Counting With Check and Release Memory Barrier} |
| |
| Consider a situation where the caller must be able to acquire a new |
| reference to an object to which it does not already hold a reference. |
| The fact that initial reference-count acquisition can now run concurrently |
| with reference-count release adds further complications. |
| Suppose that a reference-count release finds that the new |
| value of the reference count is zero, signalling that it is |
| now safe to clean up the reference-counted object. |
| We clearly cannot allow a reference-count acquisition to |
| start after such clean-up has commenced, so the acquisition |
| must include a check for a zero reference count. |
| This check must be part of the atomic increment operation, |
| as shown below. |
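One plausible way to combine the check with the increment is a
compare-and-swap retry loop.
The following user-space sketch is patterned on, but not identical to,
the kernel's \co{atomic_inc_not_zero()}; it refuses to increment a
counter that has already reached zero:

```c
#include <assert.h>

/* Checked atomic increment: acquire a reference only if the counter
 * is currently non-zero, using a compare-and-swap retry loop. */
static int inc_not_zero(int *var)
{
	int old = __atomic_load_n(var, __ATOMIC_RELAXED);

	for (;;) {
		if (old == 0)
			return 0;	/* object headed for cleanup */
		/* On failure, __atomic_compare_exchange_n() updates
		 * old to the counter's current value and we retry. */
		if (__atomic_compare_exchange_n(var, &old, old + 1, 0,
						__ATOMIC_ACQUIRE,
						__ATOMIC_RELAXED))
			return 1;	/* reference acquired */
	}
}
```

Because the check and the increment are a single atomic operation, no
release operation can slip in between them and drive the counter to zero.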
| |
| \QuickQuiz{} |
| Why can't the check for a zero reference count be |
| made in a simple ``if'' statement with an atomic |
| increment in its ``then'' clause? |
| \QuickQuizAnswer{ |
| Suppose that the ``if'' condition completed, finding |
| the reference counter value equal to one. |
| Suppose that a release operation executes, decrementing |
| the reference counter to zero and therefore starting |
| cleanup operations. |
| But now the ``then'' clause can increment the counter |
| back to a value of one, allowing the object to be |
| used after it has been cleaned up. |
| } \QuickQuizEnd |
| |
| The Linux kernel's \co{fget()} and \co{fput()} primitives |
| use this style of reference counting. |
| Simplified versions of these functions are shown in |
| Figure~\ref{fig:defer:Linux Kernel fget/fput API}. |
| |
| \begin{figure}[htbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct file *fget(unsigned int fd) |
| 2 { |
| 3 struct file *file; |
| 4 struct files_struct *files = current->files; |
| 5 |
| 6 rcu_read_lock(); |
| 7 file = fcheck_files(files, fd); |
| 8 if (file) { |
| 9 if (!atomic_inc_not_zero(&file->f_count)) { |
| 10 rcu_read_unlock(); |
| 11 return NULL; |
| 12 } |
| 13 } |
| 14 rcu_read_unlock(); |
| 15 return file; |
| 16 } |
| 17 |
| 18 struct file * |
| 19 fcheck_files(struct files_struct *files, unsigned int fd) |
| 20 { |
| 21 struct file * file = NULL; |
| 22 struct fdtable *fdt = rcu_dereference((files)->fdt); |
| 23 |
| 24 if (fd < fdt->max_fds) |
| 25 file = rcu_dereference(fdt->fd[fd]); |
| 26 return file; |
| 27 } |
| 28 |
| 29 void fput(struct file *file) |
| 30 { |
| 31 if (atomic_dec_and_test(&file->f_count)) |
| 32 call_rcu(&file->f_u.fu_rcuhead, file_free_rcu); |
| 33 } |
| 34 |
| 35 static void file_free_rcu(struct rcu_head *head) |
| 36 { |
| 37 struct file *f; |
| 38 |
| 39 f = container_of(head, struct file, f_u.fu_rcuhead); |
| 40 kmem_cache_free(filp_cachep, f); |
| 41 } |
| \end{verbatim} |
| } |
| \caption{Linux Kernel fget/fput API} |
| \label{fig:defer:Linux Kernel fget/fput API} |
| \end{figure} |
| |
| Line~4 of \co{fget()} fetches the pointer to the current |
| process's file-descriptor table, which might well be shared |
| with other processes. |
| Line~6 invokes \co{rcu_read_lock()}, which |
| enters an RCU read-side critical section. |
| The callback function from any subsequent \co{call_rcu()} primitive |
| will be deferred until a matching \co{rcu_read_unlock()} is reached |
| (line~10 or 14 in this example). |
| Line~7 looks up the file structure corresponding to the file |
| descriptor specified by the \co{fd} argument, as will be |
| described later. |
| If there is an open file corresponding to the specified file descriptor, |
| then line~9 attempts to atomically acquire a reference count. |
| If it fails to do so, lines~10-11 exit the RCU read-side critical |
| section and report failure. |
| Otherwise, if the attempt is successful, lines~14-15 exit the read-side |
| critical section and return a pointer to the file structure. |
| |
| The \co{fcheck_files()} primitive is a helper function for |
| \co{fget()}. |
| It uses the \co{rcu_dereference()} primitive to safely fetch an |
| RCU-protected pointer for later dereferencing (this emits a |
| memory barrier on CPUs such as DEC Alpha in which data dependencies |
| do not enforce memory ordering). |
| Line~22 uses \co{rcu_dereference()} to fetch a pointer to this |
| task's current file-descriptor table, |
| and line~24 checks to see if the specified file descriptor is in range. |
| If so, line~25 fetches the pointer to the file structure, again using |
| the \co{rcu_dereference()} primitive. |
| Line~26 then returns a pointer to the file structure or \co{NULL} |
| in case of failure. |
| |
| The \co{fput()} primitive releases a reference to a file structure. |
| Line~31 atomically decrements the reference count, and, if the result |
was zero, line~32 invokes the \co{call_rcu()} primitive in order to
| free up the file structure (via the \co{file_free_rcu()} function |
| specified in \co{call_rcu()}'s second argument), |
| but only after all currently-executing |
| RCU read-side critical sections complete. |
| The time period required for all currently-executing RCU read-side |
| critical sections to complete is termed a ``grace period''. |
| Note that the \co{atomic_dec_and_test()} primitive contains |
| a memory barrier. |
| This memory barrier is not necessary in this example, since the structure |
| cannot be destroyed until the RCU read-side critical section completes, |
| but in Linux, all atomic operations that return a result must |
| by definition contain memory barriers. |
| |
| Once the grace period completes, the \co{file_free_rcu()} function |
| obtains a pointer to the file structure on line~39, and frees it |
| on line~40. |
| |
This approach is also used by Linux's virtual-memory system;
see \co{get_page_unless_zero()} and \co{put_page_testzero()} for
page structures, as well as \co{try_to_unuse()} and \co{mmput()}
for memory-map structures.
| |
| \subsection{Hazard Pointers} |
| \label{sec:defer:Hazard Pointers} |
| |
| All of the reference-counting mechanisms discussed in the previous |
| section require some other mechanism to prevent the data element from |
| being deleted while the reference count is being acquired. |
| This other mechanism might be a pre-existing reference held on that |
| data element, locking, RCU, or atomic operations, but all of them |
| either degrade performance and scalability or restrict use cases. |
| |
| One way of avoiding these problems is to implement the reference counters |
| inside out, that is, rather than incrementing an integer stored in the |
| data element, instead store a pointer to that data element in |
| per-CPU (or per-thread) lists. |
| Each element of these lists is called a |
| \emph{hazard pointer}~\cite{MagedMichael04a}.\footnote{ |
| Also independently invented by others~\cite{HerlihyLM02}.} |
| The value of a given data element's ``virtual reference counter'' can |
| then be obtained by counting the number of hazard pointers referencing |
| that element. |
| Therefore, if that element has been rendered inaccessible to readers, |
| and there are no longer any hazard pointers referencing it, that element |
| may safely be freed. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 int hp_store(void **p, void **hp) |
| 2 { |
| 3 void *tmp; |
| 4 |
| 5 tmp = ACCESS_ONCE(*p); |
| 6 ACCESS_ONCE(*hp) = tmp; |
| 7 smp_mb(); |
| 8 if (tmp != ACCESS_ONCE(*p) || |
| 9 tmp == HAZPTR_POISON) { |
| 10 ACCESS_ONCE(*hp) = NULL; |
| 11 return 0; |
| 12 } |
| 13 return 1; |
| 14 } |
| 15 |
| 16 void hp_erase(void **hp) |
| 17 { |
| 18 smp_mb(); |
| 19 ACCESS_ONCE(*hp) = NULL; |
| 20 hp_free(hp); |
| 21 } |
| \end{verbatim} |
| } |
| \caption{Hazard-Pointer Storage and Erasure} |
| \label{fig:defer:Hazard-Pointer Storage and Erasure} |
| \end{figure} |
| |
| Of course, this means that hazard-pointer acquisition must be carried |
| out quite carefully in order to avoid destructive races with concurrent |
| deletion. |
| One implementation is shown in |
| Figure~\ref{fig:defer:Hazard-Pointer Storage and Erasure}, |
which shows \co{hp_store()} on lines~1-14 and \co{hp_erase()} on
lines~16-21.
| The \co{smp_mb()} primitive will be described in detail in |
| Section~\ref{sec:advsync:Memory Barriers}, but may be ignored for |
| the purposes of this brief overview. |
| |
| The \co{hp_store()} function records a hazard pointer at \co{hp} for the data |
| element whose pointer is referenced by \co{p}, while checking for |
| concurrent modifications. |
| If a concurrent modification occurred, \co{hp_store()} refuses to record |
| a hazard pointer, and returns zero to indicate that the caller must |
| restart its traversal from the beginning. |
| Otherwise, \co{hp_store()} returns one to indicate that it successfully |
| recorded a hazard pointer for the data element. |
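The restart protocol can be sketched as follows.
This user-space rendition replaces \co{ACCESS_ONCE()} and \co{smp_mb()}
with GCC atomic builtins, and for brevity uses a single hazard pointer;
a real traversal would alternate between two hazard pointers so that the
current element remains protected while the next one is acquired:

```c
#include <assert.h>
#include <stddef.h>

#define HAZPTR_POISON ((void *)0x1)	/* marks deleted elements */

/* User-space rendition of the figure's hp_store(). */
static int hp_store(void **p, void **hp)
{
	void *tmp;

	tmp = __atomic_load_n(p, __ATOMIC_RELAXED);
	__atomic_store_n(hp, tmp, __ATOMIC_RELAXED);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	if (tmp != __atomic_load_n(p, __ATOMIC_RELAXED) ||
	    tmp == HAZPTR_POISON) {
		__atomic_store_n(hp, NULL, __ATOMIC_RELAXED);
		return 0;	/* caller must restart its traversal */
	}
	return 1;
}

struct node {
	int key;
	struct node *next;
};

/* Traversal skeleton: restart from the head whenever hp_store() fails. */
static struct node *find(struct node **head, int key, void **hp)
{
	struct node **pp;
	struct node *np;

retry:
	pp = head;
	for (;;) {
		if (!hp_store((void **)pp, hp))
			goto retry;	/* concurrent deletion detected */
		np = *hp;		/* hazard pointer now protects np */
		if (np == NULL)
			return NULL;	/* end of list: key not present */
		if (np->key == key)
			return np;
		pp = &np->next;
	}
}
```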
| |
| \QuickQuiz{} |
| Why does \co{hp_store()} in |
| Figure~\ref{fig:defer:Hazard-Pointer Storage and Erasure} |
| take a double indirection to the data element? |
| Why not \co{void *} instead of \co{void **}? |
| \QuickQuizAnswer{ |
Because \co{hp_store()} must check for concurrent modifications.
| To do that job, it needs a pointer to a pointer to the element, |
| so that it can check for a modification to the pointer to the |
| element. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Why does \co{hp_store()}'s caller need to restart its |
| traversal from the beginning in case of failure? |
| Isn't that inefficient for large data structures? |
| \QuickQuizAnswer{ |
| It might be inefficient in some sense, but the fact is that |
| such restarting is absolutely required for correctness. |
To see this, consider a hazard-pointer-protected linked list
containing elements~A, B, and~C that is subjected to the
following sequence of events:
| |
| \begin{enumerate} |
| \item Thread~0 stores a hazard pointer to element~B |
| (having presumably traversed to element~B from element~A). |
| \item Thread~1 removes element~B from the list, which sets |
| the pointer from element~B to element~C to a special |
| \co{HAZPTR_POISON} value in order to mark the deletion. |
| Because Thread~0 has a hazard pointer to element~B, |
| it cannot yet be freed. |
| \item Thread~1 removes element~C from the list. |
| Because there are no hazard pointers referencing element~C, |
| it is immediately freed. |
| \item Thread~0 attempts to acquire a hazard pointer to |
| now-removed element~B's successor, but sees the |
| \co{HAZPTR_POISON} value, and thus returns zero, |
| forcing the caller to restart its traversal from the |
| beginning of the list. |
| \end{enumerate} |
| |
This restart is a very good thing, because otherwise Thread~0
would have attempted to access the now-freed element~C,
which might have resulted in arbitrarily horrible
memory corruption, especially if the memory for
element~C had since been re-allocated for some other
purpose.
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Given that papers on hazard pointers use the bottom bits |
| of each pointer to mark deleted elements, what is up with |
| \co{HAZPTR_POISON}? |
| \QuickQuizAnswer{ |
| The published implementations of hazard pointers used |
non-blocking synchronization techniques for insertion
| and deletion. |
| These techniques require that readers traversing the |
| data structure ``help'' updaters complete their updates, |
| which in turn means that readers need to look at the successor |
| of a deleted element. |
| |
| In contrast, we will be using locking to synchronize updates, |
| which does away with the need for readers to help updaters |
| complete their updates, which in turn allows us to leave |
| pointers' bottom bits alone. |
| This approach allows read-side code to be simpler and faster. |
| } \QuickQuizEnd |
| |
| Because algorithms using hazard pointers might be restarted at any |
| step of their traversal through the data structure, such algorithms |
| must typically take care to avoid making any changes to the data |
| structure until after they have acquired all relevant hazard pointers. |
| |
| \QuickQuiz{} |
| But don't these restrictions on hazard pointers also apply |
| to other forms of reference counting? |
| \QuickQuizAnswer{ |
| These restrictions apply only to reference-counting mechanisms whose |
| reference acquisition can fail. |
| } \QuickQuizEnd |
| |
| In exchange for these restrictions, hazard pointers offer excellent |
| performance and scalability for readers. |
| Performance comparisons with other mechanisms may be found in |
| Chapter~\ref{chp:Data Structures} |
| and in other publications~\cite{ThomasEHart2007a,McKenney:2013:SDS:2483852.2483867,MagedMichael04a}. |
| |
| \subsection{Linux Primitives Supporting Reference Counting} |
| \label{sec:defer:Linux Primitives Supporting Reference Counting} |
| |
| The Linux-kernel primitives used in the above examples are |
| summarized in the following list. |
| \IfInBook{}{The RCU primitives may be unfamiliar to some readers, |
| so a brief overview with citations is included in |
| Section~\ref{sec:defer:Background on RCU}.} |
| |
| \begin{itemize} |
| \item \co{atomic_t} |
| Type definition for 32-bit quantity to be manipulated atomically. |
| \item \co{void atomic_dec(atomic_t *var);} |
| Atomically decrements the referenced variable without necessarily |
| issuing a memory barrier or disabling compiler optimizations. |
| \item \co{int atomic_dec_and_test(atomic_t *var);} |
| Atomically decrements the referenced variable, returning |
| \co{true} (non-zero) if the result is zero. |
| Issues a memory barrier and disables compiler optimizations that |
| might otherwise move memory references across this primitive. |
| \item \co{void atomic_inc(atomic_t *var);} |
| Atomically increments the referenced variable without necessarily |
| issuing a memory barrier or disabling compiler optimizations. |
| \item \co{int atomic_inc_not_zero(atomic_t *var);} |
Atomically increments the referenced variable, but only if its
value is non-zero, returning \co{true} (non-zero) if the
increment occurred.
| Issues a memory barrier and disables compiler optimizations that |
| might otherwise move memory references across this primitive. |
| \item \co{int atomic_read(atomic_t *var);} |
| Returns the integer value of the referenced variable. |
| This is not an atomic operation, and it does not issue any |
| memory-barrier instructions. |
Instead of thinking of it as ``an atomic read,'' think of it as
``a normal read from an atomic variable.''
| \item \co{void atomic_set(atomic_t *var, int val);} |
| Sets the value of the referenced atomic variable to ``val''. |
| This is not an atomic operation, and it neither issues memory |
| barriers nor disables compiler optimizations. |
Instead of thinking of it as ``an atomic set,'' think of it as
``a normal set of an atomic variable.''
| \item \co{void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *head));} |
Invokes \co{func(head)} some time after all currently executing RCU
read-side critical sections complete; the \co{call_rcu()}
primitive itself returns immediately.
| Note that \co{head} is normally a field within an RCU-protected |
| data structure, and that \co{func} is normally a function that |
| frees up this data structure. |
| The time interval between the invocation of \co{call_rcu()} and |
| the invocation of \co{func} is termed a ``grace period''. |
| Any interval of time containing a grace period is itself a |
| grace period. |
| \item \co{type *container_of(p, type, f);} |
| Given a pointer \co{p} to a field \co{f} within a structure |
| of the specified type, return a pointer to the structure. |
| \item \co{void rcu_read_lock(void);} |
| Marks the beginning of an RCU read-side critical section. |
| \item \co{void rcu_read_unlock(void);} |
| Marks the end of an RCU read-side critical section. |
| RCU read-side critical sections may be nested. |
| \item \co{void smp_mb__before_atomic_dec(void);} |
| Issues a memory barrier and disables code-motion compiler |
| optimizations only if the platform's \co{atomic_dec()} |
| primitive does not already do so. |
| \item \co{struct rcu_head} |
| A data structure used by the RCU infrastructure to track |
| objects awaiting a grace period. |
| This is normally included as a field within an RCU-protected |
| data structure. |
| \end{itemize} |
| |
| \QuickQuiz{} |
| An \co{atomic_read()} and an \co{atomic_set()} that are |
| non-atomic? |
| Is this some kind of bad joke??? |
| \QuickQuizAnswer{ |
| It might well seem that way, but in situations where no other |
| CPU has access to the atomic variable in question, the overhead |
| of an actual atomic instruction would be wasteful. |
| Two examples where no other CPU has access are |
| during initialization and cleanup. |
| } \QuickQuizEnd |
| |
| \subsection{Counter Optimizations} |
| \label{sec:defer:Counter Optimizations} |
| |
| In some cases where increments and decrements are common, but checks |
| for zero are rare, it makes sense to maintain per-CPU or per-task |
| counters, as was discussed in Chapter~\ref{chp:Counting}. |
| \IfInBook{See Appendix~\ref{app:rcuimpl:Sleepable RCU Implementation} |
| for an example of this technique applied to RCU.}{ |
| See the paper on sleepable read-copy update |
| (SRCU) for an example of this technique applied to |
| RCU~\cite{PaulEMcKenney2006c}.} |
| This approach eliminates the need for atomic instructions or memory |
| barriers on the increment and decrement primitives, but still requires |
| that code-motion compiler optimizations be disabled. |
| In addition, the primitives such as \co{synchronize_srcu()} |
| that check for the aggregate reference |
| count reaching zero can be quite slow. |
| This underscores the fact that these techniques are designed |
| for situations where the references are frequently acquired and |
| released, but where it is rarely necessary to check for a zero |
| reference count. |
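The split-counter idea can be illustrated with a user-space sketch:
acquisitions and releases update a per-thread counter, and only the rare
zero check sums across threads.
Because a reference may be released by a different thread than acquired
it, individual counters can go negative even though the aggregate never
does.
The \co{NR_THREADS} flat array and the omission of memory ordering on
the slow path are simplifications of this illustration:

```c
#include <assert.h>

#define NR_THREADS 4		/* assumption of this sketch */

static long refs[NR_THREADS];	/* per-thread reference counters */

/* Fast paths: no atomics, no memory barriers. */
static void ref_acquire(int tid) { refs[tid]++; }
static void ref_release(int tid) { refs[tid]--; }  /* may go negative */

/* Rare slow path: the aggregate count is the sum over all threads. */
static long ref_total(void)
{
	long sum = 0;
	int i;

	for (i = 0; i < NR_THREADS; i++)
		sum += refs[i];
	return sum;
}
```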
| |
| % @@@ Difficulty of managing reference counts: leaks, premature freeing. |
| |
| However, it is usually the case that use of reference counts requires |
| writing (often atomically) to a data structure that is otherwise |
| read only. |
In this case, reference counts impose expensive cache misses
on readers.
| |
| \QuickQuiz{} |
| But hazard pointers don't write to the data structure! |
| \QuickQuizAnswer{ |
| Indeed, they do not. |
However, they do write to the hazard pointers themselves,
and, more important, require that the possible failure of
each \co{hp_store()} call be handled.
| Therefore, although hazard pointers are extremely useful, |
| it is still worth looking for improved mechanisms. |
| } \QuickQuizEnd |
| |
| It is therefore worthwhile to look into synchronization mechanisms |
| that do not require readers to do writes at all. |
| One such synchronization mechanism, sequence locks, is covered in |
| the next section. |