% locking/locking.tex
\QuickQuizChapter{chp:Locking}{Locking}
In recent concurrency research, the role of villain is often played
by locking.
In many papers and presentations,
locking stands accused of promoting deadlocks, convoying, starvation,
unfairness, data races, and all manner of other concurrency sins.
Interestingly enough, the role of workhorse in production-quality
shared-memory parallel software is played by, you guessed it, locking.
This chapter will look into this dichotomy between villain and
hero, as fancifully depicted in
Figures~\ref{fig:locking:Locking: Villain or Slob?}
and~\ref{fig:locking:Locking: Workhorse or Hero?}.
There are a number of reasons behind this Jekyll-and-Hyde dichotomy:
\begin{enumerate}
\item Many of locking's sins have pragmatic design solutions that
work well in most cases, for example:
\begin{enumerate}
\item Use of lock hierarchies to avoid deadlock.
\item Deadlock-detection tools, for example, the Linux kernel's
lockdep facility~\cite{JonathanCorbet2006lockdep}.
\item Locking-friendly data structures, such as
arrays, hash tables, and radix trees, which will
be covered in Chapter~\ref{chp:Data Structures}.
\end{enumerate}
\item Some of locking's sins are problems only at high levels of
contention, levels reached only by poorly designed programs.
\item Some of locking's sins are avoided by using other synchronization
mechanisms in concert with locking.
These other mechanisms include
statistical counters
(see Chapter~\ref{chp:Counting}),
reference counters
(see Section~\ref{sec:defer:Reference Counting}),
hazard pointers
(see Section~\ref{sec:defer:Hazard Pointers}),
sequence-locking readers
(see Section~\ref{sec:defer:Sequence Locks}),
RCU
(see Section~\ref{sec:defer:Read-Copy Update (RCU)}),
and simple non-blocking data structures
(see Section~\ref{sec:advsync:Non-Blocking Synchronization}).
\item Until quite recently, almost all large shared-memory parallel
programs were developed in secret, so that it was difficult for
most researchers to learn of these pragmatic solutions.
\item Locking works extremely well for some software artifacts
and extremely poorly for others.
Developers who have worked on artifacts for which locking
works well can be expected to have a much more positive
opinion of locking than those who have worked on artifacts
for which locking works poorly, as will be discussed in
Section~\ref{sec:locking:Locking: Hero or Villain?}.
\item All good stories need a villain, and locking has a long and
honorable history serving as a research-paper whipping boy.
\end{enumerate}
\QuickQuiz{}
Just how can serving as a whipping boy be considered to be
in any way honorable???
\QuickQuizAnswer{
The reason locking serves as a research-paper whipping boy is
because it is heavily used in practice.
In contrast, if no one used or cared about locking, most research
papers would not bother even mentioning it.
} \QuickQuizEnd
This chapter will give an overview of a number of ways to avoid locking's
more serious sins.
\begin{figure}[tb]
\begin{center}
\resizebox{2in}{!}{\includegraphics{cartoons/r-2014-Locking-the-Slob}}
\end{center}
\caption{Locking: Villain or Slob?}
\ContributedBy{Figure}{fig:locking:Locking: Villain or Slob?}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\begin{center}
\resizebox{2in}{!}{\includegraphics{cartoons/r-2014-Locking-the-Hero}}
\end{center}
\caption{Locking: Workhorse or Hero?}
\ContributedBy{Figure}{fig:locking:Locking: Workhorse or Hero?}{Melissa Broussard}
\end{figure}
\section{Staying Alive}
\label{sec:locking:Staying Alive}
Given that locking stands accused of deadlock and starvation,
one important concern for shared-memory parallel developers is
simply staying alive.
The following sections therefore cover deadlock, livelock, starvation,
unfairness, and inefficiency.
\subsection{Deadlock}
\label{sec:locking:Deadlock}
Deadlock occurs when each of a group of threads is holding at least one
lock while at the same time waiting on a lock held by a member
of that same group.
Without some sort of external intervention, deadlock is forever.
No thread can acquire the lock it is waiting on until that
lock is released by the thread holding it, but the thread holding
it cannot release it until the holding thread acquires the lock that
it is waiting on.
\begin{figure}[tb]
\begin{center}
\resizebox{1.5in}{!}{\includegraphics{locking/DeadlockCycle}}
\end{center}
\caption{Deadlock Cycle}
\label{fig:locking:Deadlock Cycle}
\end{figure}
We can create a directed-graph representation of a deadlock scenario
with nodes for threads and locks, as shown in
Figure~\ref{fig:locking:Deadlock Cycle}.
An arrow from a lock to a thread indicates that the thread holds
the lock, for example, Thread~B holds Locks~2 and 4.
An arrow from a thread to a lock indicates that the thread is waiting
on the lock, for example, Thread~B is waiting on Lock~3.
A deadlock scenario will always contain at least one deadlock cycle.
In Figure~\ref{fig:locking:Deadlock Cycle}, this cycle is
Thread~B, Lock~3, Thread~C, Lock~4, and back to Thread~B.
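For a concrete example of how such a cycle arises, the following sketch
(using POSIX threads and illustrative function names) shows two threads
acquiring the same pair of locks in opposite orders; if each thread
acquires its first lock before the other acquires its second, both
block forever:

{ \scriptsize
\begin{verbatim}
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread_ab(void *arg)  /* Acquires A, then B. */
{
  pthread_mutex_lock(&lock_a);
  pthread_mutex_lock(&lock_b);     /* Waits forever if thread_ba holds B. */
  pthread_mutex_unlock(&lock_b);
  pthread_mutex_unlock(&lock_a);
  return NULL;
}

static void *thread_ba(void *arg)  /* Acquires B, then A: opposite order. */
{
  pthread_mutex_lock(&lock_b);
  pthread_mutex_lock(&lock_a);     /* Waits forever if thread_ab holds A. */
  pthread_mutex_unlock(&lock_a);
  pthread_mutex_unlock(&lock_b);
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;

  pthread_create(&t1, NULL, thread_ab, NULL);
  pthread_create(&t2, NULL, thread_ba, NULL);
  pthread_join(t1, NULL);          /* May never return: one thread waits  */
  pthread_join(t2, NULL);          /* on A while the other waits on B.    */
  return 0;
}
\end{verbatim}
}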
\QuickQuiz{}
But the definition of deadlock only said that each thread
was holding at least one lock and waiting on another lock
that was held by some thread.
How do you know that there is a cycle?
\QuickQuizAnswer{
Suppose that there is no cycle in the graph.
We would then have a directed acyclic graph (DAG), which would
have at least one leaf node.
If this leaf node was a lock, then we would have a thread
that was waiting on a lock that wasn't held by any thread,
which violates the definition.
(And in this case the thread would immediately acquire the
lock.)
On the other hand, if this leaf node was a thread, then
we would have a thread that was not waiting on any lock,
again violating the definition.
(And in this case, the thread would either be running or
be blocked on something that is not a lock.)
Therefore, given this definition of deadlock, there must
be a cycle in the corresponding graph.
} \QuickQuizEnd
Although there are some software environments such as database systems
that can repair an existing deadlock, this approach requires either that
one of the threads be killed or that a lock be forcibly stolen from one
of the threads.
This killing and forcible stealing can be appropriate for transactions,
but is often problematic for kernel and application-level use of locking:
dealing with the resulting partially updated structures can be extremely
complex, hazardous, and error-prone.
Kernels and applications therefore work to avoid deadlocks rather than to
recover from them.
There are a number of deadlock-avoidance strategies,
including locking hierarchies
(Section~\ref{sec:locking:Locking Hierarchies}),
local locking hierarchies
(Section~\ref{sec:locking:Local Locking Hierarchies}),
layered locking hierarchies
(Section~\ref{sec:locking:Layered Locking Hierarchies}),
strategies for dealing with APIs containing pointers to locks
(Section~\ref{sec:locking:Locking Hierarchies and Pointers to Locks}),
conditional locking
(Section~\ref{sec:locking:Conditional Locking}),
acquiring all needed locks first
(Section~\ref{sec:locking:Acquire Needed Locks First}),
single-lock-at-a-time designs
(Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}),
and strategies for signal/interrupt handlers
(Section~\ref{sec:locking:Signal/Interrupt Handlers}).
Although there is no deadlock-avoidance strategy that works perfectly
for all situations, there is a good selection of deadlock-avoidance
tools to choose from.
\subsubsection{Locking Hierarchies}
\label{sec:locking:Locking Hierarchies}
Locking hierarchies order the locks and prohibit acquiring locks out
of order.
In Figure~\ref{fig:locking:Deadlock Cycle},
we might order the locks numerically, so that a thread was
forbidden from acquiring a given lock if it already held a lock
with the same or a higher number.
Thread~B has violated this hierarchy because it is attempting to
acquire Lock~3 while holding Lock~4, which permitted the deadlock
to occur.
Again, to apply a locking hierarchy, order the locks and prohibit
out-of-order lock acquisition.
In a large program, it is wise to use tools to enforce your locking
hierarchy~\cite{JonathanCorbet2006lockdep}.
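When locks have no convenient numbering, one common convention is to
order them by address, as in the following sketch, which assumes the
Linux-kernel-style \co{spin_lock()} primitives used elsewhere in this
chapter and ignores the case where both pointers reference the same lock:

{ \scriptsize
\begin{verbatim}
/* Impose a hierarchy by always acquiring the lower-addressed
 * lock first, so that every thread needing both locks
 * acquires them in the same order. */
static void acquire_pair(spinlock_t *lock1, spinlock_t *lock2)
{
  if (lock1 < lock2) {
    spin_lock(lock1);
    spin_lock(lock2);
  } else {
    spin_lock(lock2);
    spin_lock(lock1);
  }
}
\end{verbatim}
}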
\subsubsection{Local Locking Hierarchies}
\label{sec:locking:Local Locking Hierarchies}
However, the global nature of locking hierarchies makes them difficult to
apply to library functions.
After all, the program using a given library function has not even been
written yet, so how can the poor library-function implementor possibly
hope to adhere to the yet-to-be-written program's locking hierarchy?
One special case that is fortunately the common case is when
the library function does not invoke any of the caller's code.
In this case, the caller's locks will never be acquired while holding
any of the library's locks, so that there cannot be a deadlock cycle
containing locks from both the library and the caller.
\QuickQuiz{}
Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and
the caller, even given that the library code never invokes
any of the caller's functions?
\QuickQuizAnswer{
Indeed there are!
Here are a few of them:
\begin{enumerate}
\item If one of the library function's arguments is a pointer
to a lock that this library function acquires, and if
the library function holds one of its locks while
acquiring the caller's lock, then we could have a
deadlock cycle involving both caller and library locks.
\item If one of the library functions returns a pointer to
a lock that is acquired by the caller, and if the
caller acquires one of its locks while holding the
library's lock, we could again have a deadlock
cycle involving both caller and library locks.
\item If one of the library functions acquires a lock and
then returns while still holding it, and if the caller
acquires one of its locks, we have yet another way
to create a deadlock cycle involving both caller
and library locks.
\item If the caller has a signal handler that acquires
locks, then the deadlock cycle can involve both
caller and library locks.
In this case, however, the library's locks are
innocent bystanders in the deadlock cycle.
That said, please note that acquiring a lock from
within a signal handler is a no-no in most
environments---it is not just a bad idea, it
is unsupported.
\end{enumerate}
} \QuickQuizEnd
But suppose that a library function does invoke the caller's code.
For example, the \co{qsort()} function invokes a caller-provided
comparison function.
A concurrent implementation of \co{qsort()} likely uses locking,
which might result in deadlock in the perhaps-unlikely case where
the comparison function is a complicated function that itself
acquires locks.
How can the library function avoid deadlock?
The golden rule in this case is ``release all locks before invoking
unknown code.''
To follow this rule, the \co{qsort()} function must release all
locks before invoking the comparison function.
\QuickQuiz{}
But if \co{qsort()} releases all its locks before invoking
the comparison function, how can it protect against races
with other \co{qsort()} threads?
\QuickQuizAnswer{
By privatizing the data elements being compared
(as discussed in Chapter~\ref{chp:Data Ownership})
or through use of deferral mechanisms such as
reference counting (as discussed in
Chapter~\ref{chp:Deferred Processing}).
} \QuickQuizEnd
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{locking/NonLocalLockHierarchy}}
\end{center}
\caption{Without Local Locking Hierarchy for qsort()}
\label{fig:lock:Without Local Locking Hierarchy for qsort()}
\end{figure}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{locking/LocalLockHierarchy}}
\end{center}
\caption{Local Locking Hierarchy for qsort()}
\label{fig:lock:Local Locking Hierarchy for qsort()}
\end{figure}
To see the benefits of local locking hierarchies, compare
Figures~\ref{fig:lock:Without Local Locking Hierarchy for qsort()} and
\ref{fig:lock:Local Locking Hierarchy for qsort()}.
In both figures, application functions \co{foo()} and \co{bar()}
invoke \co{qsort()} while holding locks~A and B, respectively.
Because this is a parallel implementation of \co{qsort()}, it acquires
lock~C.
Function \co{foo()} passes function \co{cmp()} to \co{qsort()},
and \co{cmp()} acquires lock~B.
Function \co{bar()} passes a simple integer-comparison function (not
shown) to \co{qsort()}, and this simple function does not acquire any
locks.
Now, if \co{qsort()} holds Lock~C while calling \co{cmp()} in violation
of the golden release-all-locks rule above, as shown in
Figure~\ref{fig:lock:Without Local Locking Hierarchy for qsort()},
deadlock can occur.
To see this, suppose that one thread invokes \co{foo()} while a second
thread concurrently invokes \co{bar()}.
The first thread will acquire lock~A and the second thread will acquire
lock~B.
If the first thread's call to \co{qsort()} acquires lock~C, then it
will be unable to acquire lock~B when it calls \co{cmp()}.
But the first thread holds lock~C, so the second thread's call to
\co{qsort()} will be unable to acquire it, and thus unable to release
lock~B, resulting in deadlock.
In contrast, if \co{qsort()} releases lock~C before invoking the
comparison function (which is unknown code from \co{qsort()}'s perspective),
then deadlock is avoided as shown in
Figure~\ref{fig:lock:Local Locking Hierarchy for qsort()}.
If each module releases all locks before invoking unknown code, then
deadlock is avoided if each module separately avoids deadlock.
This rule therefore greatly simplifies deadlock analysis and greatly
improves modularity.
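As a concrete, if simplified, sketch of this rule (not the actual
\co{qsort()} implementation), a library routine might invoke the unknown
comparison function only while holding none of its own locks;
\co{lib_lock} and \co{record_result()} are hypothetical:

{ \scriptsize
\begin{verbatim}
spinlock_t lib_lock;  /* Protects the library's private bookkeeping. */

void lib_apply(void **array, int n, int (*cmp)(void *, void *))
{
  int i;

  for (i = 0; i < n - 1; i++) {
    int out_of_order;

    /* Invoke the unknown comparison function with no locks held. */
    out_of_order = cmp(array[i], array[i + 1]) > 0;
    spin_lock(&lib_lock);
    record_result(i, out_of_order);  /* Hypothetical private bookkeeping. */
    spin_unlock(&lib_lock);
  }
}
\end{verbatim}
}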
\subsubsection{Layered Locking Hierarchies}
\label{sec:locking:Layered Locking Hierarchies}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{locking/LayeredLockHierarchy}}
\end{center}
\caption{Layered Locking Hierarchy for qsort()}
\label{fig:lock:Layered Locking Hierarchy for qsort()}
\end{figure}
Unfortunately, it might not be possible for \co{qsort()} to release
all of its locks before invoking the comparison function.
In this case, we cannot construct a local locking hierarchy by
releasing all locks before invoking unknown code.
However, we can instead construct a layered locking hierarchy, as shown in
Figure~\ref{fig:lock:Layered Locking Hierarchy for qsort()}.
Here, the \co{cmp()} function uses a new lock~D that is acquired after
all of locks~A, B, and C, avoiding deadlock.
We therefore have three layers to the global locking hierarchy, the
first containing locks~A and B, the second containing lock~C, and
the third containing lock~D.
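For illustration, a \co{cmp()} function in such a layered hierarchy might
be sketched as follows, with the new lock~D protecting only \co{cmp()}'s
own shared state and never being held across any acquisition of
locks~A, B, or C (\co{complex_compare()} is hypothetical):

{ \scriptsize
\begin{verbatim}
spinlock_t lock_d;  /* Lowest layer: acquired after locks A, B, and C. */

int cmp(const void *a, const void *b)
{
  int ret;

  spin_lock(&lock_d);           /* Never held across acquisition of A, B, or C. */
  ret = complex_compare(a, b);  /* Hypothetical comparison using shared state. */
  spin_unlock(&lock_d);
  return ret;
}
\end{verbatim}
}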
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 struct locked_list {
 2   spinlock_t s;
 3   struct list_head h;
 4 };
 5
 6 struct list_head *list_start(struct locked_list *lp)
 7 {
 8   spin_lock(&lp->s);
 9   return list_next(lp, &lp->h);
10 }
11
12 struct list_head *list_next(struct locked_list *lp,
13                             struct list_head *np)
14 {
15   struct list_head *ret;
16
17   ret = np->next;
18   if (ret == &lp->h) {
19     spin_unlock(&lp->s);
20     ret = NULL;
21   }
22   return ret;
23 }
\end{verbatim}
}
\caption{Concurrent List Iterator}
\label{fig:locking:Concurrent List Iterator}
\end{figure}
Please note that it is not typically possible to mechanically
change \co{cmp()} to use the new Lock~D.
Quite the opposite: It is often necessary to make profound design-level
modifications.
Nevertheless, the effort required for such modifications is normally
a small price to pay in order to avoid deadlock.
For another example where releasing all locks before invoking unknown
code is impractical, imagine an iterator over a linked list, as shown in
Figure~\ref{fig:locking:Concurrent List Iterator} (\co{locked_list.c}).
The \co{list_start()} function acquires a lock on the list and returns
the first element (if there is one), and
\co{list_next()} either returns a pointer to the next element in the list
or releases the lock and returns \co{NULL} if the end of the list has
been reached.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 struct list_ints {
 2   struct list_head n;
 3   int a;
 4 };
 5
 6 void list_print(struct locked_list *lp)
 7 {
 8   struct list_head *np;
 9   struct list_ints *ip;
10
11   np = list_start(lp);
12   while (np != NULL) {
13     ip = list_entry(np, struct list_ints, n);
14     printf("\t%d\n", ip->a);
15     np = list_next(lp, np);
16   }
17 }
\end{verbatim}
}
\caption{Concurrent List Iterator Usage}
\label{fig:locking:Concurrent List Iterator Usage}
\end{figure}
Figure~\ref{fig:locking:Concurrent List Iterator Usage} shows how
this list iterator may be used.
Lines~1-4 define the \co{list_ints} element containing a single integer,
and lines~6-17 show how to iterate over the list.
Line~11 locks the list and fetches a pointer to the first element,
line~13 provides a pointer to our enclosing \co{list_ints} structure,
line~14 prints the corresponding integer, and
line~15 moves to the next element.
This is quite simple, and hides all of the locking.
That is, the locking remains hidden as long as the code processing each
list element does not itself acquire a lock that is held across some
other call to \co{list_start()} or \co{list_next()}, which results in
deadlock.
We can avoid the deadlock by layering the locking hierarchy
to take the list-iterator locking into account.
This layered approach can be extended to an arbitrarily large number of layers,
but each added layer increases the complexity of the locking design.
Such increases in complexity are particularly inconvenient for some
types of object-oriented designs, in which control passes back and forth
among a large group of objects in an undisciplined manner.\footnote{
One name for this is ``object-oriented spaghetti code.''}
This mismatch between the habits of object-oriented design and the
need to avoid deadlock is an important reason why parallel programming
is perceived by some to be so difficult.
Some alternatives to highly layered locking hierarchies are covered in
Chapter~\ref{chp:Deferred Processing}.
\subsubsection{Locking Hierarchies and Pointers to Locks}
\label{sec:locking:Locking Hierarchies and Pointers to Locks}
Although there are some exceptions, an external API containing a pointer
to a lock is very often a misdesigned API.
Handing an internal lock to some other software component is after all
the antithesis of information hiding, which is in turn a key design
principle.
\QuickQuiz{}
Name one common exception where it is perfectly reasonable
to pass a pointer to a lock into a function.
\QuickQuizAnswer{
Locking primitives, of course!
} \QuickQuizEnd
One exception is functions that hand off some entity,
where the caller's lock must be held until the handoff is complete,
but where the lock must be released before the function returns.
One example of such a function is the POSIX \co{pthread_cond_wait()}
function, where passing a pointer to a \co{pthread_mutex_t}
prevents hangs due to lost wakeups.
\QuickQuiz{}
Doesn't the fact that \co{pthread_cond_wait()} first releases the
mutex and then re-acquires it eliminate the possibility of deadlock?
\QuickQuizAnswer{
Absolutely not!
Consider a program that acquires \co{mutex_a}, and then
\co{mutex_b}, in that order, and then passes \co{mutex_a}
to \co{pthread_cond_wait}.
Now, \co{pthread_cond_wait} will release \co{mutex_a}, but
will re-acquire it before returning.
If some other thread acquires \co{mutex_a} in the meantime
and then blocks on \co{mutex_b}, the program will deadlock.
} \QuickQuizEnd
In short, if you find yourself exporting an API with a pointer to a
lock as an argument or the return value, do yourself a favor and carefully
reconsider your API design.
It might well be the right thing to do, but experience indicates that
this is unlikely.
\subsubsection{Conditional Locking}
\label{sec:locking:Conditional Locking}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 spin_lock(&lock2);
2 layer_2_processing(pkt);
3 nextlayer = layer_1(pkt);
4 spin_lock(&nextlayer->lock1);
5 layer_1_processing(pkt);
6 spin_unlock(&lock2);
7 spin_unlock(&nextlayer->lock1);
\end{verbatim}
}
\caption{Protocol Layering and Deadlock}
\label{fig:locking:Protocol Layering and Deadlock}
\end{figure}
But suppose that there is no reasonable locking hierarchy.
This can happen in real life, for example, in layered network protocol stacks
where packets flow in both directions.
In the networking case, it might be necessary to hold the locks from
both layers when passing a packet from one layer to another.
Given that packets travel both up and down the protocol stack, this
is an excellent recipe for deadlock, as illustrated in
Figure~\ref{fig:locking:Protocol Layering and Deadlock}.
Here, a packet moving down the stack towards the wire must acquire
the next layer's lock out of order.
Given that packets moving up the stack away from the wire are acquiring
the locks in order, the lock acquisition in line~4 of the figure
can result in deadlock.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 retry:
 2 spin_lock(&lock2);
 3 layer_2_processing(pkt);
 4 nextlayer = layer_1(pkt);
 5 if (!spin_trylock(&nextlayer->lock1)) {
 6   spin_unlock(&lock2);
 7   spin_lock(&nextlayer->lock1);
 8   spin_lock(&lock2);
 9   if (layer_1(pkt) != nextlayer) {
10     spin_unlock(&nextlayer->lock1);
11     spin_unlock(&lock2);
12     goto retry;
13   }
14 }
15 layer_1_processing(pkt);
16 spin_unlock(&lock2);
17 spin_unlock(&nextlayer->lock1);
\end{verbatim}
}
\caption{Avoiding Deadlock Via Conditional Locking}
\label{fig:locking:Avoiding Deadlock Via Conditional Locking}
\end{figure}
One way to avoid deadlocks in this case is to impose a locking hierarchy,
but when it is necessary to acquire a lock out of order, acquire it
conditionally, as shown in
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}.
Instead of unconditionally acquiring the layer-1 lock, line~5
conditionally acquires the lock using the \co{spin_trylock()} primitive.
This primitive acquires the lock immediately if the lock is available
(returning non-zero), and otherwise returns zero without acquiring the lock.
If \co{spin_trylock()} was successful, line~15 does the needed
layer-1 processing.
Otherwise, line~6 releases \co{lock2}, and lines~7 and 8 acquire the
two locks in the correct order.
Unfortunately, there might be multiple networking devices on
the system (e.g., Ethernet and WiFi), so that the \co{layer_1()}
function must make a routing decision.
This decision might change at any time, especially if the system
is mobile.\footnote{
And, in contrast to the 1900s, mobility is the common case.}
Therefore, line~9 must recheck the decision, and if it has changed,
must release the locks and start over.
\QuickQuiz{}
Can the transformation from
Figure~\ref{fig:locking:Protocol Layering and Deadlock} to
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
be applied universally?
\QuickQuizAnswer{
Absolutely not!
This transformation assumes that the
\co{layer_2_processing()} function is idempotent, given that
it might be executed multiple times on the same packet when
the \co{layer_1()} routing decision changes.
Therefore, in real life, this transformation can become
arbitrarily complex.
} \QuickQuizEnd
\QuickQuiz{}
But the complexity in
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
is well worthwhile given that it avoids deadlock, right?
\QuickQuizAnswer{
Maybe.
If the routing decision in \co{layer_1()} changes often enough,
the code will always retry, never making forward progress.
This is termed ``livelock'' if no thread makes any forward progress or
``starvation''
if some threads make forward progress but others do not
(see Section~\ref{sec:locking:Livelock and Starvation}).
} \QuickQuizEnd
\subsubsection{Acquire Needed Locks First}
\label{sec:locking:Acquire Needed Locks First}
In an important special case of conditional locking, all needed
locks are acquired before any processing is carried out.
In this case, processing need not be idempotent: if it turns out
to be impossible to acquire a given lock without first releasing
one that was already acquired, just release all the locks and
try again.
Only once all needed locks are held will any processing be carried out.
However, this procedure can result in \emph{livelock}, which will
be discussed in
Section~\ref{sec:locking:Livelock and Starvation}.
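A minimal sketch of this approach, using the \co{spin_trylock()}
primitive introduced earlier and a hypothetical \co{do_processing()}
function, might look as follows:

{ \scriptsize
\begin{verbatim}
/* Acquire both locks before doing any processing.  If the second
 * lock cannot be acquired immediately, release the first and start
 * over, so that no partial updates ever occur.  As noted above,
 * this retry loop can livelock under heavy contention. */
void process_pair(spinlock_t *lock_a, spinlock_t *lock_b)
{
  for (;;) {
    spin_lock(lock_a);
    if (spin_trylock(lock_b))
      break;                 /* Both locks now held. */
    spin_unlock(lock_a);     /* Release and retry. */
  }
  do_processing();           /* Hypothetical: runs only with both locks held. */
  spin_unlock(lock_b);
  spin_unlock(lock_a);
}
\end{verbatim}
}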
\QuickQuiz{}
When using the ``acquire needed locks first'' approach described in
Section~\ref{sec:locking:Acquire Needed Locks First},
how can livelock be avoided?
\QuickQuizAnswer{
Provide an additional global lock.
If a given thread has repeatedly tried and failed to acquire the needed
locks, then have that thread unconditionally acquire the new
global lock, and then unconditionally acquire any needed locks.
(Suggested by Doug Lea.)
} \QuickQuizEnd
A related approach, two-phase locking~\cite{PhilipABernstein1987},
has seen long production use in transactional database systems.
In the first phase of a two-phase locking transaction, locks are
acquired but not released.
Once all needed locks have been acquired, the transaction enters the
second phase, where locks are released, but not acquired.
This locking approach allows databases to provide serializability
guarantees for their transactions, in other words, to guarantee
that all values seen and produced by the transactions are consistent
with some global ordering of all the transactions.
Many such systems rely on the ability to abort transactions, although
this can be simplified by avoiding making any changes to shared data
until all needed locks are acquired.
Livelock and deadlock are issues in such systems, but practical
solutions may be found in any of a number of database textbooks.
\subsubsection{Single-Lock-at-a-Time Designs}
\label{sec:locking:Single-Lock-at-a-Time Designs}
In some cases, it is possible to avoid nesting locks, thus avoiding
deadlock.
For example, if a problem is perfectly partitionable, a single
lock may be assigned to each partition.
Then a thread working on a given partition need only acquire the one
corresponding lock.
Because no thread ever holds more than one lock at a time,
deadlock is impossible.
However, there must be some mechanism to ensure that the needed data
structures remain in existence during the time that neither lock is
held.
One such mechanism is discussed in
Section~\ref{sec:locking:Lock-Based Existence Guarantees}
and several others are presented in
Chapter~\ref{chp:Deferred Processing}.
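For example, a perfectly partitionable update might be sketched as
follows, where each element hashes to exactly one bucket and each thread
therefore holds at most one lock at a time (the bucket structure and
\co{do_update()} are illustrative):

{ \scriptsize
\begin{verbatim}
#define NBUCKETS 128

struct bucket {
  spinlock_t lock;
  struct list_head head;
};
struct bucket buckets[NBUCKETS];

void update_element(unsigned long key)
{
  struct bucket *bp = &buckets[key % NBUCKETS];

  spin_lock(&bp->lock);  /* Only lock held: no nesting, hence no deadlock. */
  do_update(bp, key);    /* Hypothetical per-bucket update. */
  spin_unlock(&bp->lock);
}
\end{verbatim}
}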
\subsubsection{Signal/Interrupt Handlers}
\label{sec:locking:Signal/Interrupt Handlers}
Deadlocks involving signal handlers are often quickly dismissed by
noting that it is not legal to invoke \co{pthread_mutex_lock()} from
within a signal handler~\cite{OpenGroup1997pthreads}.
However, it is possible (though almost always unwise) to hand-craft
locking primitives that can be invoked from signal handlers.
Besides which, almost all operating-system kernels permit locks to
be acquired from within interrupt handlers, which are the kernel analog
to signal handlers.
The trick is to block signals (or disable interrupts, as the case may be)
when acquiring any lock that might be acquired within an interrupt
handler.
Furthermore, if holding such a lock, it is illegal to attempt to
acquire any lock that is ever acquired
outside of a signal handler without blocking signals.
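In a POSIX environment, this trick might be sketched as follows,
assuming a hypothetical hand-crafted \co{handler_lock} that is safe for
use within a signal handler (recall that \co{pthread_mutex_lock()}
itself must not be called there), and assuming that \co{SIGUSR1} is the
signal whose handler acquires that lock:

{ \scriptsize
\begin{verbatim}
#include <signal.h>
#include <pthread.h>

/* handler_lock_acquire()/handler_lock_release() are hypothetical
 * hand-crafted primitives that the SIGUSR1 handler also uses. */

void locked_update(void)
{
  sigset_t mask, oldmask;

  sigemptyset(&mask);
  sigaddset(&mask, SIGUSR1);    /* The signal whose handler takes the lock. */
  pthread_sigmask(SIG_BLOCK, &mask, &oldmask);
  handler_lock_acquire();       /* Safe: SIGUSR1 cannot be handled here. */
  do_work();                    /* Hypothetical critical-section work. */
  handler_lock_release();
  pthread_sigmask(SIG_SETMASK, &oldmask, NULL);
}
\end{verbatim}
}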
\QuickQuiz{}
Why is it illegal to acquire a Lock~A that is acquired outside
of a signal handler without blocking signals while holding
a Lock~B that is acquired within a signal handler?
\QuickQuizAnswer{
Because this would lead to deadlock.
Given that Lock~A is held outside of a signal
handler without blocking signals, a signal might be handled while
holding this lock.
The corresponding signal handler might then acquire
Lock~B, so that Lock~B is acquired while holding Lock~A.
Therefore, if we also acquire Lock~A while holding Lock~B
as called out in the question, we will have a deadlock cycle.
Therefore, it is illegal to acquire a lock that is acquired outside
of a signal handler without blocking signals while holding
another lock that is acquired within a signal handler.
} \QuickQuizEnd
If a lock is acquired by the handlers for several signals, then each
and every one of these signals must be blocked whenever that lock is
acquired, even when that
lock is acquired within a signal handler.
\QuickQuiz{}
How can you legally block signals within a signal handler?
\QuickQuizAnswer{
One of the simplest and fastest ways to do so is to use
the \co{sa_mask} field of the \co{struct sigaction} that
you pass to \co{sigaction()} when setting up the signal.
} \QuickQuizEnd
Unfortunately, blocking and unblocking signals can be expensive in
some operating systems, notably including Linux, so performance
concerns often mean that locks acquired in signal handlers are only
acquired in signal handlers, and that lockless synchronization
mechanisms are used to communicate between application code and
signal handlers.
Or that signal handlers are avoided completely except for handling
fatal errors.
\QuickQuiz{}
If acquiring locks in signal handlers is such a bad idea, why
even discuss ways of making it safe?
\QuickQuizAnswer{
Because these same rules apply to the interrupt handlers used in
operating-system kernels and in some embedded applications.
In many application environments, acquiring locks in signal
handlers is frowned upon~\cite{OpenGroup1997pthreads}.
However, that does not stop clever developers from (usually
unwisely) fashioning home-brew locks out of atomic operations.
And atomic operations are in many cases perfectly legal in
signal handlers.
} \QuickQuizEnd
\subsubsection{Discussion}
\label{sec:locking:Locking Hierarchy Discussion}
There are a large number of deadlock-avoidance strategies available to
the shared-memory parallel programmer, but there are sequential
programs for which none of them is a good fit.
This is one of the reasons that expert programmers have more than
one tool in their toolbox: locking is a powerful concurrency
tool, but there are jobs better addressed with other tools.
\QuickQuiz{}
Given an object-oriented application that passes control freely
among a group of objects such that there is no straightforward
locking hierarchy,\footnote{
Also known as ``object-oriented spaghetti code.''}
layered or otherwise, how can this
application be parallelized?
\QuickQuizAnswer{
There are a number of approaches:
\begin{enumerate}
\item In the case of parametric search via simulation,
where a large number of simulations will be run
in order to converge on (for example) a good design
for a mechanical or electrical device, leave the
simulation single-threaded, but run many instances
of the simulation in parallel.
This retains the object-oriented design,
gains parallelism at a higher level, and likely
also avoids synchronization overhead.
\item Partition the objects into groups such that there
is no need to operate on objects in
more than one group at a given time.
Then associate a lock with each group.
This is an example of a single-lock-at-a-time
design, which is discussed in
Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}.
\item Partition the objects into groups such that threads
can all operate on objects in the groups in some
groupwise ordering.
Then associate a lock with each group, and impose a
locking hierarchy over the groups.
\item Impose an arbitrarily selected hierarchy on the locks,
and then use conditional locking if it is necessary
to acquire a lock out of order, as was discussed in
Section~\ref{sec:locking:Conditional Locking}.
\item Before carrying out a given group of operations, predict
which locks will be acquired, and attempt to acquire them
before actually carrying out any updates.
If the prediction turns out to be incorrect, drop
all the locks and retry with an updated prediction
that includes the benefit of experience.
This approach was discussed in
Section~\ref{sec:locking:Acquire Needed Locks First}.
\item Use transactional memory.
This approach has a number of advantages and disadvantages
which will be discussed in
Section~\ref{sec:future:Transactional Memory}.
\item Refactor the application to be more concurrency-friendly.
This would likely also have the side effect of making
the application run faster even when single-threaded, but might
also make it more difficult to modify the application.
\item Use techniques from later chapters in addition to locking.
\end{enumerate}
} \QuickQuizEnd
Nevertheless, the strategies described in this section have proven
quite useful in many settings.
\subsection{Livelock and Starvation}
\label{sec:locking:Livelock and Starvation}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread1(void)
 2 {
 3 retry:
 4   spin_lock(&lock1);
 5   do_one_thing();
 6   if (!spin_trylock(&lock2)) {
 7     spin_unlock(&lock1);
 8     goto retry;
 9   }
10   do_another_thing();
11   spin_unlock(&lock2);
12   spin_unlock(&lock1);
13 }
14
15 void thread2(void)
16 {
17 retry:
18   spin_lock(&lock2);
19   do_a_third_thing();
20   if (!spin_trylock(&lock1)) {
21     spin_unlock(&lock2);
22     goto retry;
23   }
24   do_a_fourth_thing();
25   spin_unlock(&lock1);
26   spin_unlock(&lock2);
27 }
\end{verbatim}
}
\caption{Abusing Conditional Locking}
\label{fig:locking:Abusing Conditional Locking}
\end{figure}
Although conditional locking can be an effective deadlock-avoidance
mechanism, it can be abused.
Consider for example the beautifully symmetric example shown in
Figure~\ref{fig:locking:Abusing Conditional Locking}.
This example's beauty hides an ugly livelock.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~1 acquires \co{lock1} on line~4, then invokes
\co{do_one_thing()}.
\item Thread~2 acquires \co{lock2} on line~18, then invokes
\co{do_a_third_thing()}.
\item Thread~1 attempts to acquire \co{lock2} on line~6, but fails because
Thread~2 holds it.
\item Thread~2 attempts to acquire \co{lock1} on line~20, but fails because
Thread~1 holds it.
\item Thread~1 releases \co{lock1} on line~7, then jumps to \co{retry}
at line~3.
\item Thread~2 releases \co{lock2} on line~21, and jumps to \co{retry}
at line~17.
\item The livelock dance repeats from the beginning.
\end{enumerate}
\QuickQuiz{}
How can the livelock shown in
Figure~\ref{fig:locking:Abusing Conditional Locking}
be avoided?
\QuickQuizAnswer{
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
provides some good hints.
In many cases, livelocks are a hint that you should revisit your
locking design.
Or visit it in the first place if your locking design
``just grew''.
That said, one good-and-sufficient approach due to Doug Lea
is to use conditional locking as described in
Section~\ref{sec:locking:Conditional Locking}, but combine this
with acquiring all needed locks first, before modifying shared
data, as described in
Section~\ref{sec:locking:Acquire Needed Locks First}.
If a given critical section retries too many times,
unconditionally acquire
a global lock, then unconditionally acquire all the needed locks.
This avoids both deadlock and livelock, and scales reasonably
assuming that the global lock need not be acquired too often.
} \QuickQuizEnd
Livelock can be thought of as an extreme form of starvation where
a group of threads starve, rather than just one of them.\footnote{
Try not to get too hung up on the exact definitions of terms
like livelock, starvation, and unfairness.
Anything that causes a group of threads to fail to make adequate
forward progress is a problem that needs to be fixed, regardless
of what name you choose for it.}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 void thread1(void)
 2 {
 3   unsigned int wait = 1;
 4 retry:
 5   spin_lock(&lock1);
 6   do_one_thing();
 7   if (!spin_trylock(&lock2)) {
 8     spin_unlock(&lock1);
 9     sleep(wait);
10     wait = wait << 1;
11     goto retry;
12   }
13   do_another_thing();
14   spin_unlock(&lock2);
15   spin_unlock(&lock1);
16 }
17
18 void thread2(void)
19 {
20   unsigned int wait = 1;
21 retry:
22   spin_lock(&lock2);
23   do_a_third_thing();
24   if (!spin_trylock(&lock1)) {
25     spin_unlock(&lock2);
26     sleep(wait);
27     wait = wait << 1;
28     goto retry;
29   }
30   do_a_fourth_thing();
31   spin_unlock(&lock1);
32   spin_unlock(&lock2);
33 }
\end{verbatim}
}
\caption{Conditional Locking and Exponential Backoff}
\label{fig:locking:Conditional Locking and Exponential Backoff}
\end{figure}
Livelock and starvation are serious issues in software transactional
memory implementations, and so the concept of \emph{contention
manager} has been introduced to encapsulate these issues.
In the case of locking, simple exponential backoff can often address
livelock and starvation.
The idea is to introduce exponentially increasing delays before each
retry, as shown in
Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}.
\QuickQuiz{}
What problems can you spot in the code in
Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}?
\QuickQuizAnswer{
Here are a couple:
\begin{enumerate}
\item A one-second wait is way too long for most uses.
Wait intervals should begin with roughly the time
required to execute the critical section, which will
normally be in the microsecond or millisecond range.
\item The code does not check for overflow.
On the other hand, this bug is nullified
by the previous bug: 32 bits worth of seconds is
more than 50 years.
\end{enumerate}
} \QuickQuizEnd
However, for better results, the backoff should be bounded, and
even better high-contention results have been obtained via queued
locking~\cite{Anderson90}, which is discussed more in
Section~\ref{sec:locking:Other Exclusive-Locking Implementations}.
Of course, best of all is to use a good parallel design so that lock
contention remains low.
\subsection{Unfairness}
\label{sec:locking:Unfairness}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{cpu/SystemArch}}
\end{center}
\caption{System Architecture and Lock Unfairness}
\label{fig:lock:System Architecture and Lock Unfairness}
\end{figure}
Unfairness can be thought of as a less-severe form of starvation,
where a subset of threads contending for a given lock are granted
the lion's share of the acquisitions.
This can happen on machines with shared caches or NUMA characteristics,
for example, as shown in
Figure~\ref{fig:lock:System Architecture and Lock Unfairness}.
If CPU~0 releases a lock that all the other CPUs are attempting
to acquire, the interconnect shared between CPUs~0 and 1 means that
CPU~1 will have an advantage over CPUs~2-7.
Therefore CPU~1 will likely acquire the lock.
If CPU~1 holds the lock long enough for CPU~0 to be requesting the
lock by the time CPU~1 releases it and vice versa, the lock can
shuttle between CPUs~0 and 1, bypassing CPUs~2-7.
\QuickQuiz{}
Wouldn't it be better just to use a good parallel design
so that lock contention was low enough to avoid unfairness?
\QuickQuizAnswer{
It would be better in some sense, but there are situations
where it can be appropriate to use
designs that sometimes result in high lock contentions.
For example, imagine a system that is subject to a rare error
condition.
It might well be best to have a simple error-handling design
that has poor performance and scalability for the duration of
the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of
those rare error conditions is in effect.
That said, it is usually worth putting some effort into
attempting to produce a design that is both simple and
efficient during error conditions, for example by partitioning
the problem.
} \QuickQuizEnd
\subsection{Inefficiency}
\label{sec:locking:Inefficiency}
Locks are implemented using atomic instructions and memory barriers,
and often involve cache misses.
As we saw in Chapter~\ref{chp:Hardware and its Habits},
these instructions are quite expensive, with overheads roughly two
orders of magnitude greater than those of simple instructions.
This can be a serious problem for locking: If you protect a single
instruction with a lock, you will increase the overhead by a factor
of one hundred.
Even assuming perfect scalability, \emph{one hundred} CPUs would
be required to keep up with a single CPU executing the same code
without locking.
This situation underscores the synchronization-granularity
tradeoff discussed in Section~\ref{sec:SMPdesign:Synchronization Granularity},
especially Figure~\ref{fig:SMPdesign:Synchronization Efficiency}:
Too coarse a granularity will limit scalability, while too fine a
granularity will result in excessive synchronization overhead.
That said, once a lock is held, the data protected by that lock can
be accessed by the lock holder without interference.
Acquiring a lock might be expensive, but once held, the CPU's caches
are an effective performance booster, at least for large critical sections.
\QuickQuiz{}
How might the lock holder be interfered with?
\QuickQuizAnswer{
If the data protected by the lock is in the same cache line
as the lock itself, then attempts by other CPUs to acquire
the lock will result in expensive cache misses on the part
of the CPU holding the lock.
This is a special case of false sharing, which can also occur
if a pair of variables protected by different locks happen
to share a cache line.
In contrast, if the lock is in a different cache line than
the data that it protects, the CPU holding the lock will
usually suffer a cache miss only on first access to a given
variable.
Of course, the downside of placing the lock and data into separate
cache lines is that the code will incur two cache misses rather
than only one in the uncontended case.
} \QuickQuizEnd
\section{Types of Locks}
\label{sec:locking:Types of Locks}
There are a surprising number of types of locks, more than this
short chapter can possibly do justice to.
The following sections discuss
exclusive locks (Section~\ref{sec:locking:Exclusive Locks}),
reader-writer locks (Section~\ref{sec:locking:Reader-Writer Locks}),
multi-role locks (Section~\ref{sec:locking:Beyond Reader-Writer Locks}),
and scoped locking (Section~\ref{sec:locking:Scoped Locking}).
\subsection{Exclusive Locks}
\label{sec:locking:Exclusive Locks}
Exclusive locks are what they say they are: only one thread may hold
the lock at a time.
The holder of such a lock thus has exclusive access to all data protected
by that lock, hence the name.
Of course, this all assumes that this lock is held across all accesses
to data purportedly protected by the lock.
Although there are some tools that can help, the ultimate responsibility
for ensuring that the lock is acquired in all necessary code paths
rests with the developer.
\QuickQuiz{}
Does it ever make sense to have an exclusive lock acquisition
immediately followed by a release of that same lock, that is,
an empty critical section?
\QuickQuizAnswer{
This usage is rare, but is occasionally used.
The point is that the semantics of exclusive locks have two
components: (1)~the familiar data-protection semantic and
(2)~a messaging semantic, where releasing a given lock notifies
a waiting acquisition of that same lock.
An empty critical section uses the messaging component without
the data-protection component.
The rest of this answer provides some example uses of empty
critical sections, however, these examples should be considered
``gray magic.''\footnote{
Thanks to Alexey Roytman for this description.}
As such, empty critical sections are almost never used in practice.
Nevertheless, pressing on into this gray area...
One historical use of empty critical sections appeared in the
networking stack of the 2.4 Linux kernel.
This usage pattern can be thought of as a way of approximating
the effects of read-copy update (RCU), which is discussed in
Section~\ref{sec:defer:Read-Copy Update (RCU)}.
The empty-lock-critical-section idiom can also be used to
reduce lock contention in some situations.
For example, consider a multithreaded user-space application where
each thread processes units of work maintained in a per-thread
list, where threads are prohibited from touching each other's
lists.
There could also be updates that require that all previously
scheduled units of work have completed before the update can
progress.
One way to handle this is to schedule a unit of work on each
thread, so that when all of these units of work complete, the
update may proceed.
In some applications, threads can come and go.
For example, each thread might correspond to one user of the
application, and thus be removed when that user logs out or
otherwise disconnects.
In many applications, threads cannot depart atomically: They must
instead explicitly unravel themselves from various portions of
the application using a specific sequence of actions.
One specific action will be refusing to accept further requests
from other threads, and another specific action will be disposing
of any remaining units of work on its list, for example, by
placing these units of work in a global work-item-disposal list
to be taken by one of the remaining threads.
(Why not just drain the thread's work-item list by executing
each item?
Because a given work item might generate more work items, so
that the list could not be drained in a timely fashion.)
If the application is to perform and scale well, a good locking
design is required.
One common solution is to have a global lock (call it \co{G})
protecting the entire
process of departing (and perhaps other things as well),
with finer-grained locks protecting the
individual unraveling operations.
Now, a departing thread must clearly refuse to accept further
requests before disposing of the work on its list, because
otherwise additional work might arrive after the disposal action,
which would render that disposal action ineffective.
So simplified pseudocode for a departing thread might be as follows:
\begin{enumerate}
\item Acquire lock \co{G}.
\item Acquire the lock guarding communications.
\item Refuse further communications from other threads.
\item Release the lock guarding communications.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all pending work items to the global
work-item-disposal list.
\item Release the lock guarding the global work-item-disposal list.
\item Release lock \co{G}.
\end{enumerate}
Of course, a thread that needs to wait for all pre-existing work
items will need to take departing threads into account.
To see this, suppose that this thread starts waiting for all
pre-existing work items just after a departing thread has refused
further communications from other threads.
How can this thread wait for the departing thread's work items
to complete, keeping in mind that threads are not allowed to
access each others' lists of work items?
One straightforward approach is for this thread to acquire \co{G}
and then the lock guarding the global work-item-disposal list, then
move the work items to its own list.
The thread then releases both locks,
places a work item on the end of its own list,
and then waits for all of the work items that it placed on each thread's
list (including its own) to complete.
This approach does work well in many cases, but if special
processing is required for each work item as it is pulled in
from the global work-item-disposal list, the result could be
excessive contention on \co{G}.
One way to avoid that contention is to acquire \co{G} and then
immediately release it.
Then the process of waiting for all prior work items looks
something like the following:
\begin{enumerate}
\item Set a global counter to one and initialize a condition
variable to zero.
\item Send a message to all threads to cause them to atomically
increment the global counter, and then to enqueue a
work item.
The work item will atomically decrement the global
counter, and if the result is zero, it will set a
condition variable to one.
\item Acquire \co{G}, which will wait on any currently departing
thread to finish departing.
Because only one thread may depart at a time, all the
remaining threads will have already received the message
sent in the preceding step.
\item Release \co{G}.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all work items from the global work-item-disposal list
to this thread's list, processing them as needed along the way.
\item Release the lock guarding the global work-item-disposal list.
\item Enqueue an additional work item onto this thread's list.
(As before, this work item will atomically decrement
the global counter, and if the result is zero, it will
set a condition variable to one.)
\item Wait for the condition variable to take on the value one.
\end{enumerate}
Once this procedure completes, all pre-existing work items are
guaranteed to have completed.
The empty critical sections are using locking for messaging as
well as for protection of data.
} \QuickQuizEnd
\subsection{Reader-Writer Locks}
\label{sec:locking:Reader-Writer Locks}
Reader-writer locks~\cite{Courtois71}
permit any number of readers to hold the lock
concurrently on the one hand or a single writer to hold the lock
on the other.
In theory, then, reader-writer locks should allow excellent scalability
for data that is read often and written rarely.
In practice, the scalability will depend on the reader-writer lock
implementation.
The classic reader-writer lock implementation involves a set of
counters and flags that are manipulated atomically.
This type of implementation suffers from the same problem as does
exclusive locking for short critical sections: The overhead of acquiring
and releasing the lock
is about two orders of magnitude greater than the overhead
of a simple instruction.
Of course, if the critical section is long enough, the overhead of
acquiring and releasing the lock becomes negligible.
However, because only
one thread at a time can be manipulating the lock, the required
critical-section size increases with the number of CPUs.
It is possible to design a reader-writer lock that is much more
favorable to readers through use of per-thread exclusive
locks~\cite{WilsonCHsieh92a}.
To read, a thread acquires only its own lock.
To write, a thread acquires all locks.
In the absence of writers, each reader incurs only atomic-instruction
and memory-barrier overhead, with no cache misses, which is quite
good for a locking primitive.
Unfortunately, writers must incur cache misses as well as atomic-instruction
and memory-barrier overhead---multiplied by the number of threads.
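A rough sketch of such a reader-favoring implementation appears below,
assuming a fixed maximum number of threads, each of which is assigned a
unique index; writers acquire the per-thread locks in index order so
that they cannot deadlock with each other:

{ \scriptsize
\begin{verbatim}
#define MAX_THREADS 64

spinlock_t per_thread_lock[MAX_THREADS];  /* One lock per thread. */

void example_read_lock(int my_thread_id)
{
  spin_lock(&per_thread_lock[my_thread_id]);   /* Touch only own lock. */
}

void example_read_unlock(int my_thread_id)
{
  spin_unlock(&per_thread_lock[my_thread_id]);
}

void example_write_lock(void)
{
  int i;

  for (i = 0; i < MAX_THREADS; i++)    /* Acquire all locks, in index order. */
    spin_lock(&per_thread_lock[i]);
}

void example_write_unlock(void)
{
  int i;

  for (i = MAX_THREADS - 1; i >= 0; i--)
    spin_unlock(&per_thread_lock[i]);
}
\end{verbatim}
}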
In short, reader-writer locks can be quite useful in a number of
situations, but each type of implementation does have its drawbacks.
The canonical use case for reader-writer locking involves very long
read-side critical sections, preferably measured in hundreds of microseconds
or even milliseconds.
\subsection{Beyond Reader-Writer Locks}
\label{sec:locking:Beyond Reader-Writer Locks}
\begin{table}
\small
\begin{center}
\begin{tabular}{l||c|c|c|c|c|c}
~ ~ ~ ~ ~ ~ ~ ~ ~
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Null (Not Held)}
\end{picture}
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Concurrent Read}
\end{picture}
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Concurrent Write}
\end{picture}
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Protected Read}
\end{picture}
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Protected Write}
\end{picture}
& \begin{picture}(6,70)(0,0)
\rotatebox{90}{Exclusive}
\end{picture}
\\
\hline
% NL CR CW PR PW EX
\hline
Null (Not Held) & ~ & ~ & ~ & ~ & ~ & ~ \\
\hline
Concurrent Read & ~ & ~ & ~ & ~ & ~ & X \\
\hline
Concurrent Write & ~ & ~ & ~ & X & X & X \\
\hline
Protected Read & ~ & ~ & X & ~ & X & X \\
\hline
Protected Write & ~ & ~ & X & X & X & X \\
\hline
Exclusive & ~ & X & X & X & X & X \\
\end{tabular}
\end{center}
\caption{VAX/VMS Distributed Lock Manager Policy}
\label{tab:locking:VAX/VMS Distributed Lock Manager Policy}
\end{table}
Reader-writer locks and exclusive locks differ in their admission
policy: exclusive locks allow at most one holder, while reader-writer
locks permit an arbitrary number of read-holders (but only one write-holder).
There is a very large number of possible admission policies, one of
which is that of the VAX/VMS distributed lock
manager (DLM)~\cite{Snaman87}, which is shown in
Table~\ref{tab:locking:VAX/VMS Distributed Lock Manager Policy}.
Blank cells indicate compatible modes, while cells containing ``X''
indicate incompatible modes.
The VAX/VMS DLM uses six modes.
For purposes of comparison, exclusive
locks use two modes (not held and held), while reader-writer locks
use three modes (not held, read held, and write held).
The first mode is null, or not held.
This mode is compatible with all other modes, which is to be expected:
If a thread is not holding a lock, it should not prevent any
other thread from acquiring that lock.
The second mode is concurrent read, which is compatible with every other
mode except for exclusive.
The concurrent-read mode might be used to accumulate approximate
statistics on a data structure, while permitting updates to proceed
concurrently.
The third mode is concurrent write, which is compatible with null,
concurrent read, and concurrent write.
The concurrent-write mode might be used to update approximate statistics,
while still permitting reads and concurrent updates to proceed
concurrently.
The fourth mode is protected read, which is compatible with null,
concurrent read, and protected read.
The protected-read mode might be used to obtain a consistent snapshot
of the data structure, while permitting reads but not updates to
proceed concurrently.
The fifth mode is protected write, which is compatible with null and
concurrent read.
The protected-write mode might be used to carry out updates to a data
structure that could interfere with protected readers but which could
be tolerated by concurrent readers.
The sixth and final mode is exclusive, which is compatible only with null.
The exclusive mode is used when it is necessary to exclude all other accesses.
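One straightforward way to encode such an admission policy in software
(illustrative only, not the VAX/VMS implementation) is as a
compatibility matrix indexed by the currently granted and requested
modes, as in the following sketch, which simply transcribes
Table~\ref{tab:locking:VAX/VMS Distributed Lock Manager Policy}:

{ \scriptsize
\begin{verbatim}
enum dlm_mode { NL, CR, CW, PR, PW, EX };  /* Order follows the table. */

/* compat[held][requested]: 1 if the request may be granted. */
static const int compat[6][6] = {
  /*          NL CR CW PR PW EX */
  /* NL */  {  1, 1, 1, 1, 1, 1 },
  /* CR */  {  1, 1, 1, 1, 1, 0 },
  /* CW */  {  1, 1, 1, 0, 0, 0 },
  /* PR */  {  1, 1, 0, 1, 0, 0 },
  /* PW */  {  1, 1, 0, 0, 0, 0 },
  /* EX */  {  1, 0, 0, 0, 0, 0 },
};

static int may_grant(enum dlm_mode held, enum dlm_mode requested)
{
  return compat[held][requested];
}
\end{verbatim}
}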
It is interesting to note that exclusive locks and reader-writer locks
can be emulated by the VAX/VMS DLM.
Exclusive locks would use only the null and exclusive modes, while
reader-writer locks might use the null, protected-read, and
protected-write modes.
\QuickQuiz{}
Is there any other way for the VAX/VMS DLM to emulate
a reader-writer lock?
\QuickQuizAnswer{
There are in fact several.
One way would be to use the null, protected-read, and exclusive
modes.
Another way would be to use the null, protected-read, and
concurrent-write modes.
A third way would be to use the null, concurrent-read, and
exclusive modes.
} \QuickQuizEnd
Although the VAX/VMS DLM policy has seen widespread production use
for distributed databases, it does not appear to be used much in
shared-memory applications.
One possible reason for this is that the greater communication overheads
of distributed databases can hide the greater overhead of the
VAX/VMS DLM's more-complex admission policy.
Nevertheless, the VAX/VMS DLM is an interesting illustration of just
how flexible the concepts behind locking can be.
It also serves as a very simple introduction to the locking schemes
used by modern DBMSes, which can have more than thirty locking modes,
compared to VAX/VMS's six.
\subsection{Scoped Locking}
\label{sec:locking:Scoped Locking}
The locking primitives discussed thus far require explicit acquisition and
release primitives, for example, \co{spin_lock()} and \co{spin_unlock()},
respectively.
Another approach is to use the object-oriented ``resource acquisition
is initialization'' (RAII) pattern~\cite{MargaretAEllis1990Cplusplus}.\footnote{
Though more clearly expressed at
\url{http://www.stroustrup.com/bs_faq2.html\#finally}.}
This pattern is often applied to auto variables in languages like C++,
where the corresponding \emph{constructor} is invoked upon entry to
the object's scope, and the corresponding \emph{destructor} is invoked
upon exit from that scope.
This can be applied to locking by having the constructor acquire the
lock and the destructor free it.
This approach can be quite useful; in fact, in 1990 I was convinced that it
was the only type of locking that was needed.\footnote{
My later work with parallelism at Sequent Computer Systems very
quickly disabused me of this misguided notion.}
One very nice property of RAII locking is that you don't need to carefully
release the lock on each and every code path that exits that scope,
a property that can eliminate a troublesome set of bugs.
However, RAII locking also has a dark side.
RAII makes it quite difficult to encapsulate lock acquisition and release,
for example, in iterators.
In many iterator implementations, you would like to acquire the lock in
the iterator's ``start'' function and release it in the iterator's ``stop''
function.
RAII locking instead requires that the lock acquisition and release take
place in the same level of scoping, making such encapsulation difficult or
even impossible.
RAII locking also prohibits overlapping critical sections, due to the fact
that scopes must nest.
This prohibition makes it difficult or impossible to express a number of
useful constructs, for example, locking trees
that mediate between multiple concurrent attempts to assert an event.
Of an arbitrarily large group of concurrent attempts, only one need succeed,
and the best strategy for the remaining attempts is for them to fail as
quickly and painlessly as possible.
Otherwise, lock contention becomes pathological on large systems
(where ``large'' is many hundreds of CPUs).
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{locking/rnplock}}
\end{center}
\caption{Locking Hierarchy}
\label{fig:lock:Locking Hierarchy}
\end{figure}
Example data structures (taken from the Linux kernel's implementation
of RCU) are shown in
Figure~\ref{fig:lock:Locking Hierarchy}.
Here, each CPU is assigned a leaf \co{rcu_node} structure, and each
\co{rcu_node} structure has a pointer to its parent (named, oddly
enough, \co{->parent}), up to the root \co{rcu_node} structure,
which has a \co{NULL} \co{->parent} pointer.
The number of child \co{rcu_node} structures per parent can vary,
but is typically 32 or 64.
Each \co{rcu_node} structure also contains a lock named \co{->fqslock}.
The general approach is a \emph{tournament}, where
a given CPU conditionally acquires its
leaf \co{rcu_node} structure's \co{->fqslock}, and, if successful,
attempts to acquire that of the parent, then releases that of the child.
In addition, at each level, the CPU checks a global \co{gp_flags}
variable, and if this variable indicates that some other CPU has
asserted the event, the first CPU drops out of the competition.
This acquire-then-release sequence continues until either the
\co{gp_flags} variable indicates that someone else won the tournament,
one of the attempts to acquire an \co{->fqslock} fails, or
the root \co{rcu_node} structure's \co{->fqslock} has been acquired.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 void force_quiescent_state(struct rcu_node *rnp_leaf)
2 {
3 int ret;
4 struct rcu_node *rnp = rnp_leaf;
5 struct rcu_node *rnp_old = NULL;
6
7 for (; rnp != NULL; rnp = rnp->parent) {
8 ret = (ACCESS_ONCE(gp_flags)) ||
9 !raw_spin_trylock(&rnp->fqslock);
10 if (rnp_old != NULL)
11 raw_spin_unlock(&rnp_old->fqslock);
12 if (ret)
13 return;
14 rnp_old = rnp;
15 }
16 if (!ACCESS_ONCE(gp_flags)) {
17 ACCESS_ONCE(gp_flags) = 1;
18 do_force_quiescent_state();
19 ACCESS_ONCE(gp_flags) = 0;
20 }
21 raw_spin_unlock(&rnp_old->fqslock);
22 }
\end{verbatim}
}
\caption{Conditional Locking to Reduce Contention}
\label{fig:locking:Conditional Locking to Reduce Contention}
\end{figure}
Simplified code to implement this is shown in
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention}.
The purpose of this function is to mediate between CPUs who have concurrently
detected a need to invoke the \co{do_force_quiescent_state()} function.
At any given time, it only makes sense for one instance of
\co{do_force_quiescent_state()} to be active, so if there are multiple
concurrent callers, we need at most one of them to actually invoke
\co{do_force_quiescent_state()}, and we need the rest to (as quickly and
painlessly as possible) give up and leave.
To this end, each pass through the loop spanning lines~7-15 attempts
to advance up one level in the \co{rcu_node} hierarchy.
If the \co{gp_flags} variable is already set (line~8) or if the attempt
to acquire the current \co{rcu_node} structure's \co{->fqslock} is
unsuccessful (line~9), then local variable \co{ret} is set to 1.
If line~10 sees that local variable \co{rnp_old} is non-\co{NULL},
meaning that we hold \co{rnp_old}'s \co{->fqslock},
line~11 releases this lock (but only after the attempt has been made
to acquire the parent \co{rcu_node} structure's \co{->fqslock}).
If line~12 sees that either line~8 or~9 saw a reason to give up,
line~13 returns to the caller.
Otherwise, we must have acquired the current \co{rcu_node} structure's
\co{->fqslock}, so line~14 saves a pointer to this structure in local
variable \co{rnp_old} in preparation for the next pass through the loop.
If control reaches line~16, we have won the tournament and now hold the
root \co{rcu_node} structure's \co{->fqslock}.
If line~16 still sees that the global variable \co{gp_flags} is zero,
line~17 sets \co{gp_flags} to one, line~18 invokes
\co{do_force_quiescent_state()}, and line~19 resets \co{gp_flags}
back to zero.
Either way, line~21 releases the root \co{rcu_node} structure's
\co{->fqslock}.
\QuickQuiz{}
The code in
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention}
is ridiculously complicated!
Why not conditionally acquire a single global lock?
\QuickQuizAnswer{
Conditionally acquiring a single global lock does work very well,
but only for relatively small numbers of CPUs.
To see why it is problematic in systems with many hundreds of
CPUs, look at
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
and extrapolate the delay from eight to 1,000 CPUs.
} \QuickQuizEnd
\QuickQuiz{}
Wait a minute!
If we ``win'' the tournament on line~16 of
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention},
we get to do all the work of \co{do_force_quiescent_state()}.
Exactly how is that a win, really?
\QuickQuizAnswer{
How indeed?
This just shows that in concurrency, just as in life, one
should take care to learn exactly what winning entails before
playing the game.
} \QuickQuizEnd
This function illustrates the not-uncommon pattern of hierarchical
locking.
This pattern is quite difficult to implement using RAII locking,
just like the iterator encapsulation noted earlier,
and so the lock/unlock primitives will be needed for the foreseeable
future.
\section{Locking Implementation Issues}
\label{sec:locking:Locking Implementation Issues}
Developers are almost always best-served by using whatever locking
primitives are provided by the system, for example, the POSIX
pthread mutex locks~\cite{OpenGroup1997pthreads,Butenhof1997pthreads}.
Nevertheless, studying sample implementations can be helpful,
as can considering the challenges posed by extreme workloads and
environments.
\subsection{Sample Exclusive-Locking Implementation Based on Atomic Exchange}
\label{sec:locking:Sample Exclusive-Locking Implementation Based on Atomic Exchange}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 typedef int xchglock_t;
2 #define DEFINE_XCHG_LOCK(n) xchglock_t n = 0
3
4 void xchg_lock(xchglock_t *xp)
5 {
6 while (xchg(xp, 1) == 1) {
7 while (*xp == 1)
8 continue;
9 }
10 }
11
12 void xchg_unlock(xchglock_t *xp)
13 {
14 (void)xchg(xp, 0);
15 }
\end{verbatim}
}
\caption{Sample Lock Based on Atomic Exchange}
\label{fig:locking:Sample Lock Based on Atomic Exchange}
\end{figure}
This section reviews the implementation shown in
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}.
The data structure for this lock is just an \co{int}, as shown on
line~1, but could be any integral type.
The initial value of this lock is zero, meaning ``unlocked'',
as shown on line~2.
\QuickQuiz{}
Why not rely on the C language's default initialization of
zero instead of using the explicit initializer shown on
line~2 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickQuizAnswer{
Because this default initialization does not apply to locks
allocated as auto variables within the scope of a function.
} \QuickQuizEnd
Lock acquisition is carried out by the \co{xchg_lock()} function
shown on lines~4-10.
This function uses a nested loop, with the outer loop repeatedly
atomically exchanging the value of the lock with the value one
(meaning ``locked'').
If the old value was already the value one (in other words, someone
else already holds the lock), then the inner loop (lines~7-8)
spins until the lock is available, at which point the outer loop
makes another attempt to acquire the lock.
\QuickQuiz{}
Why bother with the inner loop on lines~7-8 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
Why not simply repeatedly do the atomic exchange operation
on line~6?
\QuickQuizAnswer{
Suppose that the lock is held and that several threads
are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic
exchange operation, they will ping-pong the cache line
containing the lock among themselves, imposing load
on the interconnect.
In contrast, if these threads are spinning in the inner
loop on lines~7-8, they will each spin within their own
caches, putting negligible load on the interconnect.
} \QuickQuizEnd
Lock release is carried out by the \co{xchg_unlock()} function
shown on lines~12-15.
Line~14 atomically exchanges the value zero (``unlocked'') into
the lock, thus marking it as having been released.
\QuickQuiz{}
Why not simply store zero into the lock word on line 14 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickQuizAnswer{
This can be a legitimate implementation, but only if
this store is preceded by a memory barrier and makes use
of \co{ACCESS_ONCE()}.
The memory barrier is not required when the \co{xchg()}
operation is used because this operation implies a
full memory barrier due to the fact that it returns
a value.
} \QuickQuizEnd
This lock is a simple example of a test-and-set lock~\cite{Segall84},
but very similar
mechanisms have been used extensively as pure spinlocks in production.
\subsection{Other Exclusive-Locking Implementations}
\label{sec:locking:Other Exclusive-Locking Implementations}
There are a great many other possible implementations of locking
based on atomic instructions, many of which are reviewed
by Mellor-Crummey and Scott~\cite{MellorCrummey91a}.
These implementations represent different points in a multi-dimensional
design tradeoff~\cite{McKenney96a}.
For example,
the atomic-exchange-based test-and-set lock presented in the previous
section works well when contention is low and has the advantage
of small memory footprint.
It avoids giving the lock to threads that cannot use it, but as
a result can suffer from unfairness or even starvation at high
contention levels.
In contrast, the ticket lock~\cite{MellorCrummey91a}, which is used in the Linux
kernel, avoids unfairness at high contention levels, but as a
consequence of its first-in-first-out discipline can grant the
lock to a thread that is currently unable to use it, for example,
due to being preempted, interrupted, or otherwise out of action.
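To make this first-in-first-out discipline concrete, here is a minimal
ticket-lock sketch using C11 atomics.
The names and layout are purely illustrative, and in particular are not
those of the Linux kernel's implementation.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>

struct ticket_lock {
  atomic_uint next;    /* next ticket to hand out */
  atomic_uint owner;   /* ticket now being served */
};                     /* both fields zero: unlocked */

void ticket_lock_acquire(struct ticket_lock *lp)
{
  unsigned int t = atomic_fetch_add(&lp->next, 1);

  while (atomic_load(&lp->owner) != t)
    continue;          /* spin until our turn */
}

void ticket_lock_release(struct ticket_lock *lp)
{
  atomic_fetch_add(&lp->owner, 1);  /* FIFO handoff */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Each release simply hands the lock to the holder of the next ticket,
which is what provides the FIFO ordering, but is also what can force
the handoff to a thread that is currently preempted.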
However, it is important to avoid getting too worried about the
possibility of preemption and interruption, given that this preemption
and interruption might just as well happen just after the lock was
acquired.\footnote{
Besides, the best way of handling high lock contention is to avoid
it in the first place!
However, there are some situations where high lock contention
is the lesser of the available evils, and in any case, studying
schemes that deal with high levels of contention is good mental
exercise.}
All locking implementations where waiters spin on a single memory
location, including both test-and-set locks and ticket locks,
suffer from performance problems at high contention levels.
The problem is that the thread releasing the lock must update the
value of the corresponding memory location.
At low contention, this is not a problem: The corresponding cache line
is very likely still local to and writeable by the thread holding
the lock.
In contrast, at high levels of contention, each thread attempting to
acquire the lock will have a read-only copy of the cache line, and
the lock holder will need to invalidate all such copies before it
can carry out the update that releases the lock.
In general, the more CPUs and threads there are, the greater the
overhead incurred when releasing the lock under conditions of
high contention.
This negative scalability has motivated a number of different
queued-lock
implementations~\cite{Anderson90,Graunke90,MellorCrummey91a,Wisniewski94,Craig93,Magnusson94,Takada93}.
Queued locks avoid high cache-invalidation overhead by assigning each
thread a queue element.
These queue elements are linked together into a queue that governs the
order that the lock will be granted to the waiting threads.
The key point is that each thread spins on its own queue element,
so that the lock holder need only invalidate the first element from
the next thread's CPU's cache.
This arrangement greatly reduces the overhead of lock handoff at high
levels of contention.
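For concreteness, here is a minimal MCS-style queued-lock sketch using
C11 atomics, in which each thread supplies its own queue element and
spins only on that element.
This is an illustrative sketch under those assumptions, not any
production implementation.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
  struct mcs_node *_Atomic next;
  atomic_bool locked;
};

struct mcs_lock {
  struct mcs_node *_Atomic tail;  /* NULL: lock free */
};

void mcs_acquire(struct mcs_lock *lp, struct mcs_node *me)
{
  struct mcs_node *prev;

  atomic_store(&me->next, NULL);
  atomic_store(&me->locked, true);
  prev = atomic_exchange(&lp->tail, me);  /* join queue */
  if (prev == NULL)
    return;                    /* lock was free */
  atomic_store(&prev->next, me);
  while (atomic_load(&me->locked))
    continue;                  /* spin on own element */
}

void mcs_release(struct mcs_lock *lp, struct mcs_node *me)
{
  struct mcs_node *succ = atomic_load(&me->next);

  if (succ == NULL) {
    struct mcs_node *expected = me;

    if (atomic_compare_exchange_strong(&lp->tail,
                                       &expected, NULL))
      return;                  /* no one waiting */
    while ((succ = atomic_load(&me->next)) == NULL)
      continue;                /* successor linking in */
  }
  atomic_store(&succ->locked, false);  /* hand off lock */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because each waiter spins on its own \co{locked} field, the release
path invalidates only the successor's cache line rather than a line
shared among all waiters.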
More recent queued-lock implementations also take the system's architecture
into account, preferentially granting locks locally, while also taking
steps to avoid
starvation~\cite{McKenney02e,radovic03hierarchical,radovic02efficient,BenJackson02,McKenney02d}.
Many of these can be thought of as analogous to the elevator algorithms
traditionally used in scheduling disk I/O.
Unfortunately, the same scheduling logic that improves the efficiency
of queued locks at high contention also increases their overhead at
low contention.
Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set
lock with a queued lock, using the test-and-set lock at low levels of
contention and switching to the queued lock at high levels of
contention~\cite{BengHongLim94}, thus getting low overhead at low levels
of contention and getting fairness and high throughput at high levels
of contention.
Browning et al.~took a similar approach, but avoided the use of a separate
flag, so that the test-and-set fast path uses the same sequence of
instructions that would be used in a simple test-and-set
lock~\cite{LukeBrowning2005SimpleLockNUMAAware}.
This approach has been used in production.
Another issue that arises at high levels of contention is delay of
the lock holder, especially delay due to preemption, which can result
in \emph{priority inversion}, where a low-priority thread holding
a lock is preempted by a medium-priority CPU-bound thread, so that
a high-priority process blocks while attempting to acquire the lock.
The result is that the CPU-bound medium-priority process is preventing the
high-priority process from running.
One solution is \emph{priority inheritance}~\cite{Lampson1980Mesa},
which has been widely used for real-time
computing~\cite{Sha1990IEEETransComp,JonathanCorbet2006PriorityInheritance},
despite some lingering controversy over this
practice~\cite{Yodaiken2004FSM,DougLocke2002a}.
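For POSIX threads, priority inheritance can be requested on a
per-mutex basis, assuming that the platform supports the corresponding
POSIX option.
The following sketch is illustrative and abbreviates error handling.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

pthread_mutex_t pi_mutex;

int init_pi_mutex(void)
{
  pthread_mutexattr_t attr;
  int ret;

  ret = pthread_mutexattr_init(&attr);
  if (ret != 0)
    return ret;
  /* Holders inherit the priority of the highest-priority
   * thread blocked on this mutex. */
  ret = pthread_mutexattr_setprotocol(&attr,
                                      PTHREAD_PRIO_INHERIT);
  if (ret == 0)
    ret = pthread_mutex_init(&pi_mutex, &attr);
  pthread_mutexattr_destroy(&attr);
  return ret;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}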
Another way to avoid priority inversion is to prevent preemption
while a lock is held.
Because preventing preemption while locks are held also improves throughput,
most proprietary UNIX kernels offer some form of scheduler-conscious
synchronization mechanism~\cite{Kontothanassis97a},
largely due to the efforts of a certain sizable database vendor.
These mechanisms usually take the form of a hint that preemption
would be inappropriate.
These hints frequently take the form of a bit set in a particular
machine register, which enables extremely low per-lock-acquisition overhead
for these mechanisms.
In contrast, Linux avoids these hints, instead getting
similar results from a mechanism called
\emph{futexes}~\cite{HubertusFrancke2002Futex,IngoMolnar2006RobustFutexes,StevenRostedt2006piFutexes,UlrichDrepper2011Futexes}.
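To give a rough idea of how this works on Linux, here is a minimal
futex-based lock sketch loosely following Drepper's
exposition~\cite{UlrichDrepper2011Futexes}.
The lock word is zero when unlocked, one when held with no waiters,
and two when held with possible waiters, so that uncontended
acquisitions and releases never enter the kernel.
This is a sketch only, omitting error handling and tuning.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

atomic_int lock_word;  /* 0: free, 1: held, 2: held+waiters */

static long futex(atomic_int *uaddr, int op, int val)
{
  return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void futex_lock(void)
{
  int c = 0;

  /* Fast path: 0 -> 1 without entering the kernel. */
  if (atomic_compare_exchange_strong(&lock_word, &c, 1))
    return;
  if (c != 2)
    c = atomic_exchange(&lock_word, 2);
  while (c != 0) {  /* mark contended, sleep until free */
    futex(&lock_word, FUTEX_WAIT, 2);
    c = atomic_exchange(&lock_word, 2);
  }
}

void futex_unlock(void)
{
  /* Wake one waiter only if there might be one. */
  if (atomic_exchange(&lock_word, 0) == 2)
    futex(&lock_word, FUTEX_WAKE, 1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}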
Interestingly enough, atomic instructions are not strictly needed to
implement locks~\cite{Dijkstra65a,Lamport74a}.
An excellent exposition of the issues surrounding locking implementations
based on simple loads and stores may be found in Herlihy's and
Shavit's textbook~\cite{HerlihyShavit2008Textbook}.
The main point echoed here is that such implementations currently
have little practical application, although a careful study of
them can be both entertaining and enlightening.
Nevertheless, with one exception described below, such study is left
as an exercise for the reader.
Gamsa et al.~\cite[Section 5.3]{Gamsa99} describe a token-based
mechanism in which a token circulates among the CPUs.
When the token reaches a given CPU, it has exclusive
access to anything protected by that token.
There are any number of schemes that may be used to implement
the token-based mechanism, for example:
\begin{enumerate}
\item Maintain a per-CPU flag, which is initially
zero for all but one CPU.
When a CPU's flag is non-zero, it holds the token.
When it finishes with the token, it zeroes its flag and
sets the flag of the next CPU to one (or to any other
non-zero value).
A minimal sketch of this scheme appears after this list.
\item Maintain a per-CPU counter, which is initially set
to the corresponding CPU's number, which we assume
to range from zero to $N-1$, where $N$ is the number
of CPUs in the system.
When a CPU's counter is greater than that of the next
CPU (taking counter wrap into account), the first CPU holds the token.
When it is finished with the token, it sets the next
CPU's counter to a value one greater than its own counter.
\end{enumerate}
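Here is a minimal sketch of the first (flag-based) scheme, using
C11 atomics.
The \co{NR_CPUS} constant is illustrative, and the startup code that
seeds CPU~0 with the token is assumed rather than shown.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>

#define NR_CPUS 4              /* illustrative system size */

atomic_int token[NR_CPUS];     /* set token[0] = 1 at boot */

void token_wait(int cpu)       /* wait for the token */
{
  while (!atomic_load(&token[cpu]))
    continue;
}

void token_pass(int cpu)       /* done: pass token along */
{
  atomic_store(&token[cpu], 0);
  atomic_store(&token[(cpu + 1) % NR_CPUS], 1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

A CPU wishing to enter its critical section calls \co{token_wait()},
carries out the protected operation, and then calls \co{token_pass()}.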
\QuickQuiz{}
How can you tell if one counter is greater than another,
while accounting for counter wrap?
\QuickQuizAnswer{
In the C language, the following macro correctly handles this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
\end{verbatim}
\end{minipage}
\vspace{5pt}
Although it is tempting to simply subtract two signed integers,
this should be avoided because signed overflow is undefined
in the C language.
For example, if the compiler knows that one of the values is
positive and the other negative, it is within its rights to
simply assume that the positive number is greater than the
negative number, even though subtracting the negative number
from the positive number might well result in overflow and
thus a negative number.
How could the compiler know the signs of the two numbers?
It might be able to deduce it based on prior assignments
and comparisons.
In this case, if the per-CPU counters were signed, the compiler
could deduce that they were always increasing in value, and
then might assume that they would never go negative.
This assumption could well lead the compiler to generate
unfortunate code~\cite{PaulEMcKenney2012SignedOverflow,JohnRegehr2010UndefinedBehavior}.
} \QuickQuizEnd
\QuickQuiz{}
Which is better, the counter approach or the flag approach?
\QuickQuizAnswer{
The flag approach will normally suffer fewer cache misses,
but a better answer is to try both and see which works best
for your particular workload.
} \QuickQuizEnd
This lock is unusual in that a given CPU cannot necessarily acquire it
immediately, even if no other CPU is using it at the moment.
Instead, the CPU must wait until the token comes around to it.
This is useful in cases where CPUs need periodic access to the critical
section, but can tolerate variances in token-circulation rate.
Gamsa et al.~\cite{Gamsa99} used it to implement a variant of
read-copy update (see Section~\ref{sec:defer:Read-Copy Update (RCU)}),
but it could also be used to protect periodic per-CPU operations such
as flushing per-CPU caches used by memory allocators~\cite{McKenney93},
garbage-collecting per-CPU data structures, or flushing per-CPU
data to shared storage (or to mass storage, for that matter).
As increasing numbers of people gain familiarity with parallel hardware
and parallelize increasing amounts of code, we can expect more
special-purpose locking primitives to appear.
Nevertheless, you should carefully consider this important safety tip:
Use the standard synchronization primitives whenever humanly possible.
The big advantage of the standard synchronization primitives over
roll-your-own efforts is that the standard primitives are typically
\emph{much} less bug-prone.\footnote{
And yes, I have done at least my share of roll-your-own
synchronization primitives.
However, you will notice that my hair is much greyer than
it was before I started doing that sort of work.
Coincidence?
Maybe.
But are you \emph{really} willing to risk your own hair turning
prematurely grey?}
\input{locking/locking-existence}
\section{Locking: Hero or Villain?}
\label{sec:locking:Locking: Hero or Villain?}
As is often the case in real life, locking can be either hero or villain,
depending on how it is used and on the problem at hand.
In my experience, those writing whole applications are happy with
locking, those writing parallel libraries are less happy, and those
parallelizing existing sequential libraries are extremely unhappy.
The following sections discuss some reasons for these differences in
viewpoints.
\subsection{Locking For Applications: Hero!}
\label{sec:locking:Locking For Applications: Hero!}
When writing an entire application (or entire kernel), developers have
full control of the design, including the synchronization design.
Assuming that the design makes good use of partitioning, as discussed in
Chapter~\ref{cha:Partitioning and Synchronization Design}, locking
can be an extremely effective synchronization mechanism, as demonstrated
by the heavy use of locking in production-quality parallel software.
Nevertheless, although such software usually bases most of its
synchronization design on locking, such software also almost always
makes use of other synchronization mechanisms, including
special counting algorithms (Chapter~\ref{chp:Counting}),
data ownership (Chapter~\ref{chp:Data Ownership}),
reference counting (Section~\ref{sec:defer:Reference Counting}),
sequence locking (Section~\ref{sec:defer:Sequence Locks}), and
read-copy update (Section~\ref{sec:defer:Read-Copy Update (RCU)}).
In addition, practitioners use tools for deadlock
detection~\cite{JonathanCorbet2006lockdep},
lock acquisition/release balancing~\cite{JonathanCorbet2004sparse},
cache-miss analysis~\cite{ValgrindHomePage},
hardware-counter-based profiling~\cite{LinuxKernelPerfWiki,OProfileHomePage},
and many more besides.
Given careful design, use of a good combination of synchronization
mechanisms, and good tooling, locking works quite well for applications
and kernels.
\subsection{Locking For Parallel Libraries: Just Another Tool}
\label{sec:locking:Locking For Parallel Libraries: Just Another Tool}
Unlike applications and kernels, the designer of a library cannot
know the locking design of the code that the library will be interacting
with.
In fact, that code might not be written for years to come.
Library designers therefore have less control and must exercise more
care when laying out their synchronization design.
Deadlock is of course of particular concern, and the techniques discussed
in Section~\ref{sec:locking:Deadlock} need to be applied.
One popular deadlock-avoidance strategy is therefore to ensure that
the library's locks are independent subtrees of the enclosing program's
locking hierarchy.
However, this can be harder than it looks.
One complication was discussed in
Section~\ref{sec:locking:Local Locking Hierarchies}, namely
when library functions call into application code, with \co{qsort()}'s
comparison-function argument being a case in point.
Another complication is the interaction with signal handlers.
If an application signal handler is invoked from a signal received within
the library function, deadlock can ensue just as surely as
if the library function had called the signal handler directly.
A final complication occurs for those library functions that can be used
between a \co{fork()}/\co{exec()} pair, for example, due to use of
the \co{system()} function.
In this case, if your library function was holding a lock at the time of
the \co{fork()}, then the child process will begin life with that lock held.
Because the thread that will release the lock is running in the parent
but not the child, if the child calls your library function, deadlock
will ensue.
The following strategies may be used to avoid deadlock problems in these cases:
\begin{enumerate}
\item Don't use either callbacks or signals.
\item Don't acquire locks from within callbacks or signal handlers.
\item Let the caller control synchronization.
\item Parameterize the library API to delegate locking to the caller.
\item Explicitly avoid callback deadlocks.
\item Explicitly avoid signal-handler deadlocks.
\end{enumerate}
Each of these strategies is discussed in one of the following sections.
\subsubsection{Use Neither Callbacks Nor Signals}
\label{sec:locking:Use Neither Callbacks Nor Signals}
If a library function avoids callbacks and the application as a whole
avoids signals, then any locks acquired by that library function will
be leaves of the locking-hierarchy tree.
This arrangement avoids deadlock, as discussed in
Section~\ref{sec:locking:Locking Hierarchies}.
Although this strategy works extremely well where it applies,
there are some applications that must use signal handlers,
and there are some library functions (such as the \co{qsort()} function
discussed in
Section~\ref{sec:locking:Local Locking Hierarchies})
that require callbacks.
The strategy described in the next section can often be used in these cases.
\subsubsection{Avoid Locking in Callbacks and Signal Handlers}
\label{sec:locking:Avoid Locking in Callbacks and Signal Handlers}
If neither callbacks nor signal handlers acquire locks, then they
cannot be involved in deadlock cycles, which allows straightforward
locking hierarchies to once again consider library functions to
be leaves on the locking-hierarchy tree.
This strategy works very well for most uses of \co{qsort()}, whose
callbacks usually simply compare the two values passed in to them.
This strategy also works wonderfully for many signal handlers,
especially given that acquiring locks from within signal handlers
is generally frowned upon~\cite{TheOpenGroup1997SUS},\footnote{
But the standard's words do not stop clever coders from creating
their own home-brew locking primitives from atomic operations.}
but can fail if the application needs to manipulate complex data structures
from a signal handler.
Here are some ways to avoid acquiring locks in signal handlers even
if complex data structures must be manipulated:
\begin{enumerate}
\item Use simple data structures based on non-blocking synchronization,
as will be discussed in
Section~\ref{sec:advsync:Simple NBS}.
\item If the data structures are too complex for reasonable use of
non-blocking synchronization, create a queue that allows
non-blocking enqueue operations.
In the signal handler, instead of manipulating the complex
data structure, add an element to the queue describing the
required change.
A separate thread can then remove elements from the queue and
carry out the required changes using normal locking, as sketched
following this list.
There are a number of readily available implementations of
concurrent
queues~\cite{ChristophMKirsch2012FIFOisntTR,MathieuDesnoyers2009URCU,MichaelScott96}.
\end{enumerate}
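One minimal way to construct such a queue, sketched below, is the
well-known ``self-pipe'' trick: because \co{write()} is
async-signal-safe, the signal handler can enqueue a small request
record by writing it to a non-blocking pipe, and an ordinary worker
thread dequeues records and applies them under normal locking.
The record format, signal number, and omitted error handling are all
illustrative.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

int sig_pipe[2];    /* [0]: read end, [1]: write end */
pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

void handler(int sig)
{
  char req = (char)sig;   /* one-byte request record */

  /* Async-signal-safe, non-blocking enqueue; a full
   * pipe silently drops the request in this sketch. */
  (void)write(sig_pipe[1], &req, 1);
}

void *worker(void *arg)
{
  char req;

  (void)arg;
  while (read(sig_pipe[0], &req, 1) == 1) {
    pthread_mutex_lock(&data_lock);
    /* ... apply the requested change here ... */
    pthread_mutex_unlock(&data_lock);
  }
  return NULL;
}

void setup(void)
{
  (void)pipe(sig_pipe);
  (void)fcntl(sig_pipe[1], F_SETFL, O_NONBLOCK);
  signal(SIGUSR1, handler);
  /* ... start worker() via pthread_create() ... */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}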
This strategy should be enforced with occasional manual or (preferably)
automated inspections of callbacks and signal handlers.
When carrying out these inspections, be wary of clever coders who
might have (unwisely) created home-brew locks from atomic operations.
\subsubsection{Caller Controls Synchronization}
\label{sec:locking:Caller Controls Synchronization}
Let the caller control synchronization.
This works extremely well when the library functions are operating
on independent caller-visible instances of a data structure, each
of which may be synchronized separately.
For example, if the library functions operate on a search tree,
and if the application needs a large number of independent search
trees, then the application can associate a lock with each tree.
The application then acquires and releases locks as needed, so
that the library need not be aware of parallelism at all.
Instead, the application controls the parallelism, so that locking
can work very well, as was discussed in
Section~\ref{sec:locking:Locking For Applications: Hero!}.
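Returning to the search-tree example, the application might wrap each
tree with its own lock along the following lines; the \co{struct tree}
and \co{tree_insert()} names are assumed for illustration rather than
being any real library's API.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

struct tree;                       /* assumed library type */
void tree_insert(struct tree *tp, int key);  /* assumed */

struct locked_tree {
  pthread_mutex_t lock;            /* one lock per tree */
  struct tree *tree;
};

void locked_tree_insert(struct locked_tree *lt, int key)
{
  pthread_mutex_lock(&lt->lock);
  tree_insert(lt->tree, key);      /* library is oblivious */
  pthread_mutex_unlock(&lt->lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}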
However, this strategy fails if the
library implements a data structure that requires internal
concurrency, for example, a hash table or a parallel sort.
In this case, the library absolutely must control its own
synchronization.
\subsubsection{Parameterize Library Synchronization}
\label{sec:locking:Parameterize Library Synchronization}
The idea here is to add arguments to the library's API to specify
which locks to acquire, how to acquire and release them, or both.
This strategy allows the application to take on the global task of
avoiding deadlock by specifying which locks to acquire (by passing in
pointers to the locks in question) and how to
acquire them (by passing in pointers to lock acquisition and release
functions),
but also allows a given library function to control its own concurrency
by deciding where the locks should be acquired and released.
In particular, this strategy allows the lock acquisition and release
functions to block signals as needed without the library code needing to
be concerned with which signals need to be blocked by which locks.
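As a rough sketch of what such an API might look like (the names here
are illustrative), the caller supplies the lock and the acquisition
and release functions, while the library decides where its critical
section begins and ends:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct sync_ops {
  void *lock_arg;                  /* caller's lock, opaque */
  void (*acquire)(void *lock_arg);
  void (*release)(void *lock_arg);
};

struct lib_thing;                  /* hypothetical type */

void lib_process(struct lib_thing *tp,
                 const struct sync_ops *sync)
{
  /* ... preparation needing no mutual exclusion ... */
  sync->acquire(sync->lock_arg);
  /* ... the part that really needs the lock ... */
  sync->release(sync->lock_arg);
  /* ... cleanup needing no mutual exclusion ... */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}

The caller's acquire and release functions can also block and unblock
signals, keeping that knowledge entirely out of the library.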
The separation of concerns used by this strategy can be quite effective,
but in some cases the strategies laid out in the following sections
can work better.
That said, passing explicit pointers to locks to external APIs must
be very carefully considered, as discussed in
Section~\ref{sec:locking:Locking Hierarchies and Pointers to Locks}.
Although this practice is sometimes the right thing to do, you should do
yourself a favor by looking into alternative designs first.
\subsubsection{Explicitly Avoid Callback Deadlocks}
\label{sec:locking:Explicitly Avoid Callback Deadlocks}
The basic rule behind this strategy was discussed in
Section~\ref{sec:locking:Local Locking Hierarchies}: ``Release all
locks before invoking unknown code.''
This is usually the best approach because it allows the application to
ignore the library's locking hierarchy: the library remains a leaf or
isolated subtree of the application's overall locking hierarchy.
In cases where it is not possible to release all locks before invoking
unknown code, the layered locking hierarchies described in
Section~\ref{sec:locking:Layered Locking Hierarchies} can work well.
For example, if the unknown code is a signal handler, this implies that
the library function block signals across all lock acquisitions, which
can be complex and slow.
Therefore, in cases where signal handlers (probably unwisely) acquire
locks, the strategies in the next section may prove helpful.
\subsubsection{Explicitly Avoid Signal-Handler Deadlocks}
\label{sec:locking:Explicitly Avoid Signal-Handler Deadlocks}
Signal-handler deadlocks can be explicitly avoided as follows:
\begin{enumerate}
\item If the application invokes the library function from
within a signal handler, then that signal must be blocked
every time that the library function is invoked from outside
of a signal handler.
\item If the application invokes the library function
while holding a lock acquired within a given signal
handler, then that signal must be blocked every time that the
library function is called outside of a signal handler.
\end{enumerate}
These rules can be enforced by using tools similar to
the Linux kernel's lockdep lock dependency
checker~\cite{JonathanCorbet2006lockdep}.
One of the great strengths of lockdep is that it is not fooled by
human intuition~\cite{StevenRostedt2011locdepCryptic}.
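As a concrete sketch of the first rule, a library function invoked
from thread context might block the relevant signal around its
critical section; the signal number and lock name here are of course
illustrative.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>
#include <signal.h>

pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

void lib_function(void)
{
  sigset_t mask, omask;

  sigemptyset(&mask);
  sigaddset(&mask, SIGUSR1);  /* handler calls the library */
  pthread_sigmask(SIG_BLOCK, &mask, &omask);
  pthread_mutex_lock(&lib_lock);
  /* ... library's critical section ... */
  pthread_mutex_unlock(&lib_lock);
  pthread_sigmask(SIG_SETMASK, &omask, NULL);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}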
\subsubsection{Library Functions Used Between fork() and exec()}
\label{sec:locking:Library Functions Used Between fork() and exec()}
As noted earlier, if a thread executing a library function is holding
a lock at the time that some other thread invokes \co{fork()}, the
fact that the parent's memory is copied to create the child means that
this lock will be born held in the child's context.
The thread that will release this lock is running in the parent, but not
in the child, which means that the child's copy of this lock will never
be released.
Therefore, any attempt on the part of the child to invoke that same
library function will result in deadlock.
One approach to this problem would be to have the library function
check to see if the owner of the lock is still running, and if not,
``break'' the lock by re-initializing and then acquiring it.
However, this approach has a couple of vulnerabilities:
\begin{enumerate}
\item The data structures protected by that lock are likely to
be in some intermediate state, so that naively breaking the lock
might result in arbitrary memory corruption.
\item If the child creates additional threads, two threads might
break the lock concurrently, with the result that both
threads believe they own the lock.
This could again result in arbitrary memory corruption.
\end{enumerate}
The \co{pthread_atfork()} function is provided to help deal with these situations.
The idea is to register a triplet of functions, one to be called by the
parent before the \co{fork()}, one to be called by the parent after the
\co{fork()}, and one to be called by the child after the \co{fork()}.
Appropriate cleanups can then be carried out at these three points.
Be warned, however, that coding of \co{pthread_atfork()} handlers is quite subtle
in general.
The cases where \co{pthread_atfork()} works best are cases where the data structure
in question can simply be re-initialized by the child.
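Here is an illustrative sketch of this usage for a single library lock
whose protected data the child can simply re-initialize; the names and
details are illustrative only.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

pthread_mutex_t lib_mutex = PTHREAD_MUTEX_INITIALIZER;

void lib_prepare(void)   /* parent, just before fork() */
{
  pthread_mutex_lock(&lib_mutex); /* quiesce library state */
}

void lib_parent(void)    /* parent, just after fork() */
{
  pthread_mutex_unlock(&lib_mutex);
}

void lib_child(void)     /* child, just after fork() */
{
  /* The forking thread took lib_mutex in lib_prepare()
   * and is the only thread in the child, so it may
   * release the lock here, then re-initialize the data
   * that the lock protects. */
  pthread_mutex_unlock(&lib_mutex);
  /* ... re-initialize the protected data structures ... */
}

void lib_init(void)
{
  pthread_atfork(lib_prepare, lib_parent, lib_child);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}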
\subsubsection{Parallel Libraries: Discussion}
\label{sec:locking:Parallel Libraries: Discussion}
Regardless of the strategy used, the description of the library's API
must include a clear description of that strategy and how the caller
should interact with that strategy.
In short, constructing parallel libraries using locking is possible,
but not as easy as constructing a parallel application.
\subsection{Locking For Parallelizing Sequential Libraries: Villain!}
\label{sec:locking:Locking For Parallelizing Sequential Libraries: Villain!}
With the advent of readily available low-cost multicore systems,
a common task is parallelizing an existing library that was designed
with only single-threaded use in mind.
This all-too-common disregard for parallelism can result in a library
API that is severely flawed from a parallel-programming viewpoint.
Candidate flaws include:
\begin{enumerate}
\item Implicit prohibition of partitioning.
\item Callback functions requiring locking.
\item Object-oriented spaghetti code.
\end{enumerate}
These flaws and the consequences for locking are discussed in the following
sections.
\subsubsection{Partitioning Prohibited}
\label{sec:locking:Partitioning Prohibited}
Suppose that you were writing a single-threaded hash-table implementation.
It is easy and fast to maintain an exact count of the total number of items
in the hash table, and also easy and fast to return this exact count on each
addition and deletion operation.
So why not?
One reason is that exact counters do not perform or scale well on
multicore systems, as was
seen in Chapter~\ref{chp:Counting}.
As a result, the parallelized implementation of the hash table will not
perform or scale well.
So what can be done about this?
One approach is to return an approximate count, using one of the algorithms
from Chapter~\ref{chp:Counting}.
Another approach is to drop the element count altogether.
Either way, it will be necessary to inspect uses of the hash table to see
why the addition and deletion operations need the exact count.
Here are a few possibilities:
\begin{enumerate}
\item Determining when to resize the hash table.
In this case, an approximate count should work quite well.
It might also be useful to trigger the resizing operation from
the length of the longest chain, which can be computed and
maintained in a nicely partitioned per-chain manner.
\item Producing an estimate of the time required to traverse the
entire hash table.
An approximate count works well in this case, also.
\item For diagnostic purposes, for example, to check for items being
lost when transferring them to and from the hash table.
This clearly requires an exact count.
However, given that this usage is diagnostic in nature, it might
suffice to maintain the lengths of the hash chains, then to
infrequently sum them up while locking out addition and deletion
operations, as sketched following this list.
\end{enumerate}
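Here is a rough sketch of that per-chain approach: each addition or
deletion updates only its own bucket's counter, and an occasional
diagnostic pass sums the counters while holding all of the bucket
locks.
The structure layout and names are illustrative, and the bucket locks
are assumed to be initialized elsewhere.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

#define NBUCKETS 1024            /* illustrative size */

struct ht_elem;                  /* element type not shown */

struct ht_bucket {
  pthread_mutex_t lock;          /* initialized at startup */
  struct ht_elem *chain;
  unsigned long nelems;          /* length of this chain */
};

struct ht_bucket buckets[NBUCKETS];

/* Exact diagnostic count: excludes adds and deletes. */
unsigned long ht_count_exact(void)
{
  unsigned long sum = 0;
  int i;

  for (i = 0; i < NBUCKETS; i++)
    pthread_mutex_lock(&buckets[i].lock);
  for (i = 0; i < NBUCKETS; i++)
    sum += buckets[i].nelems;
  for (i = 0; i < NBUCKETS; i++)
    pthread_mutex_unlock(&buckets[i].lock);
  return sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}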
It turns out that there is now a strong theoretical basis for some of the
constraints that performance and scalability place on a parallel library's
APIs~\cite{HagitAttiya2011LawsOfOrder,Attiya:2011:LOE:1925844.1926442,PaulEMcKenney2011SNC}.
Anyone designing a parallel library needs to pay close attention to
those constraints.
Although it is all too easy to blame locking for what are really problems
due to a concurrency-unfriendly API, doing so is not helpful.
On the other hand, one has little choice but to sympathize with the
hapless developer who made this choice in (say) 1985.
It would have been a rare and courageous developer to anticipate the
need for parallelism at that time, and it would have required an
even more rare combination of brilliance and luck to actually arrive
at a good parallel-friendly API.
Times change, and code must change with them.
That said, there might be a huge number of users of a popular library,
in which case an incompatible change to the API would be quite foolish.
Adding a parallel-friendly API to complement the existing heavily used
sequential-only API is probably the best course of action in this
situation.
Nevertheless, human nature being what it is, we can expect our hapless
developer to be more likely to complain about locking than about his
or her own poor (though understandable) API design choices.
\subsubsection{Deadlock-Prone Callbacks}
\label{sec:locking:Deadlock-Prone Callbacks}
Sections~\ref{sec:locking:Local Locking Hierarchies},
\ref{sec:locking:Layered Locking Hierarchies},
and~\ref{sec:locking:Locking For Parallel Libraries: Just Another Tool}
described how undisciplined use of callbacks can result in locking
woes.
These sections also described how to design your library function to
avoid these problems, but it is unrealistic to expect a 1990s programmer
with no experience in parallel programming to have followed such a design.
Therefore, someone attempting to parallelize an existing callback-heavy
single-threaded library will likely have many opportunities to curse
locking's villainy.
If there are a very large number of uses of a callback-heavy library,
it may be wise to again add a parallel-friendly API to the library in
order to allow existing users to convert their code incrementally.
Alternatively, some advocate use of transactional memory in these cases.
While the jury is still out on transactional memory,
Section~\ref{sec:future:Transactional Memory} discusses its strengths and
weaknesses.
It is important to note that hardware transactional memory
(discussed in
Section~\ref{sec:future:Hardware Transactional Memory})
cannot help here unless the hardware transactional memory implementation
provides forward-progress guarantees, which few do.
Other alternatives that appear to be quite practical (if less heavily
hyped) include the methods discussed in
Sections~\ref{sec:locking:Conditional Locking},
and~\ref{sec:locking:Acquire Needed Locks First},
as well as those that will be discussed in
Chapters~\ref{chp:Data Ownership}
and~\ref{chp:Deferred Processing}.
\subsubsection{Object-Oriented Spaghetti Code}
\label{sec:locking:Object-Oriented Spaghetti Code}
Object-oriented programming went mainstream sometime in the 1980s or
1990s, and as a result there is a huge amount of object-oriented code
in production, much of it single-threaded.
Although object orientation can be a valuable software technique,
undisciplined use of objects can easily result in object-oriented
spaghetti code.
In object-oriented spaghetti code, control flits from object to object
in an essentially random manner, making the code hard to understand
and making it even harder, perhaps impossible, to accommodate a
locking hierarchy.
Although many might argue that such code should be cleaned up in any
case, such things are much easier to say than to do.
If you are tasked with parallelizing such a beast, you can reduce the
number of opportunities to curse locking by using the techniques
described in
Sections~\ref{sec:locking:Conditional Locking},
and~\ref{sec:locking:Acquire Needed Locks First},
as well as those that will be discussed in
Chapters~\ref{chp:Data Ownership}
and~\ref{chp:Deferred Processing}.
This situation appears to be the use case that inspired transactional
memory, so it might be worth a try as well.
That said, the choice of synchronization mechanism should be made in
light of the hardware habits discussed in
Chapter~\ref{chp:Hardware and its Habits}.
After all, if the overhead of the synchronization mechanism is orders of
magnitude more than that of the operations being protected, the results
are not going to be pretty.
And that leads to a question well worth asking in these situations:
Should the code remain sequential?
For example, perhaps parallelism should be introduced at the process level
rather than the thread level.
In general, if a task is proving extremely hard, it is worth some time
spent thinking about not only alternative ways to accomplish that
particular task, but also alternative tasks that might better solve
the problem at hand.
\section{Summary}
\label{sec:locking:Summary}
Locking is perhaps the most widely used and most generally useful
synchronization tool.
However, it works best when designed into an application
or library from the beginning.
Given the large quantity of pre-existing single-threaded code that might
need to one day run in parallel, locking should therefore not be the
only tool in your parallel-programming toolbox.
The next few chapters will discuss other tools, and how they can best
be used in concert with locking and with each other.