| % locking/locking.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Locking}{Locking}{qqzlocking} |
| % |
| \Epigraph{Locking is the worst general-purpose synchronization mechanism |
| except for all those other mechanisms that |
| have been tried from time to time.}{\emph{With apologies |
| to the memory of Winston Churchill and to whoever he was |
| quoting}} |
| |
| In recent concurrency research, locking often plays the role of villain. |
| \IX{Locking} stands accused of inciting deadlocks, convoying, \IX{starvation}, |
| \IX{unfairness}, \IXpl{data race}, and all manner of other concurrency sins. |
| Interestingly enough, the role of workhorse in production-quality |
| shared-memory parallel software is also played by locking. |
| This chapter will look into this dichotomy between villain and |
| hero, as fancifully depicted in |
| \cref{fig:locking:Locking: Villain or Slob?,% |
| fig:locking:Locking: Workhorse or Hero?}. |
| |
| There are a number of reasons behind this Jekyll-and-Hyde dichotomy: |
| |
| \begin{enumerate} |
| \item Many of locking's sins have pragmatic design solutions that |
| work well in most cases, for example: |
| \begin{enumerate} |
| \item Use of lock hierarchies to avoid deadlock. |
| \item Deadlock-detection tools, for example, the Linux kernel's |
| lockdep facility~\cite{JonathanCorbet2006lockdep}. |
| \item Locking-friendly data structures, such as |
| arrays, hash tables, and radix trees, which will |
| be covered in \cref{chp:Data Structures}. |
| \end{enumerate} |
| \item Some of locking's sins are problems only at high levels of |
| contention, levels reached only by poorly designed programs. |
| \item Some of locking's sins are avoided by using other synchronization |
| mechanisms in concert with locking. |
| These other mechanisms include |
| statistical counters |
| (see \cref{chp:Counting}), |
| reference counters |
| (see \cref{sec:defer:Reference Counting}), |
| hazard pointers |
| (see \cref{sec:defer:Hazard Pointers}), |
| sequence-locking readers |
| (see \cref{sec:defer:Sequence Locks}), |
| RCU |
| (see \cref{sec:defer:Read-Copy Update (RCU)}), |
| and simple non-blocking data structures |
| (see \cref{sec:advsync:Non-Blocking Synchronization}). |
| \item Until quite recently, almost all large shared-memory parallel |
| programs were developed in secret, so that it was not easy |
| to learn of these pragmatic solutions. |
| \item Locking works extremely well for some software artifacts |
| and extremely poorly for others. |
| Developers who have worked on artifacts for which locking |
| works well can be expected to have a much more positive |
| opinion of locking than those who have worked on artifacts |
| for which locking works poorly, as will be discussed in |
| \cref{sec:locking:Locking: Hero or Villain?}. |
| \item All good stories need a villain, and locking has a long and |
| honorable history serving as a research-paper whipping boy. |
| \end{enumerate} |
| |
| \QuickQuiz{ |
| Just how can serving as a whipping boy be considered to be |
| in any way honorable??? |
| }\QuickQuizAnswer{ |
| The reason locking serves as a research-paper whipping boy is |
| because it is heavily used in practice. |
| In contrast, if no one used or cared about locking, most research |
| papers would not bother even mentioning it. |
| }\QuickQuizEnd |
| |
| This chapter will give an overview of a number of ways to avoid locking's |
| more serious sins. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/r-2014-Locking-the-Slob}} |
| \caption{Locking: |
| Villain or Slob?} |
| \ContributedBy{Figure}{fig:locking:Locking: Villain or Slob?}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/r-2014-Locking-the-Hero}} |
| \caption{Locking: |
| Workhorse or Hero?} |
| \ContributedBy{Figure}{fig:locking:Locking: Workhorse or Hero?}{Melissa Broussard} |
| \end{figure} |
| |
| \section{Staying Alive} |
| \label{sec:locking:Staying Alive} |
| % |
| \epigraph{I work to stay alive.}{\emph{Bette Davis}} |
| |
| Given that locking stands accused of deadlock and starvation, |
| one important concern for shared-memory parallel developers is |
| simply staying alive. |
| The following sections therefore cover deadlock, \IX{livelock}, starvation, |
| unfairness, and inefficiency. |
| |
| \subsection{Deadlock} |
| \label{sec:locking:Deadlock} |
| |
| \IXB{Deadlock} occurs when each member of a group of threads is holding at |
| least one lock while at the same time waiting on a lock held by a member |
| of that same group. |
| This happens even in groups containing a single thread when that thread |
| attempts to acquire a non-recursive lock that it already holds. |
| Deadlock can therefore occur even given but one thread and one lock! |
| |
| Without some sort of external intervention, deadlock is forever. |
| No thread can acquire the lock it is waiting on until that |
| lock is released by the thread holding it, but the thread holding |
| it cannot release it until the holding thread acquires the lock that |
| it is in turn waiting on. |
| |
| \begin{figure} |
| \centering |
| \resizebox{1.5in}{!}{\includegraphics{locking/DeadlockCycle}} |
| \caption{Deadlock Cycle} |
| \label{fig:locking:Deadlock Cycle} |
| \end{figure} |
| |
| We can create a directed-graph representation of a deadlock scenario |
| with nodes for threads and locks, as shown in |
| \cref{fig:locking:Deadlock Cycle}. |
| An arrow from a lock to a thread indicates that the thread holds |
| the lock, for example, Thread~B holds Locks~2 and~4. |
| An arrow from a thread to a lock indicates that the thread is waiting |
| on the lock, for example, Thread~B is waiting on Lock~3. |
| |
| A deadlock scenario will always contain at least one deadlock cycle. |
| In \cref{fig:locking:Deadlock Cycle}, this cycle is |
| Thread~B, Lock~3, Thread~C, Lock~4, and back to Thread~B. |
| |
| \QuickQuiz{ |
| But the definition of lock-based deadlock only said that each |
| thread was holding at least one lock and waiting on another lock |
| that was held by some thread. |
| How do you know that there is a cycle? |
| }\QuickQuizAnswer{ |
| Suppose that there is no cycle in the graph. |
| We would then have a directed acyclic graph (DAG), which would |
| have at least one leaf node. |
| |
| If this leaf node was a lock, then we would have a thread |
| that was waiting on a lock that wasn't held by any thread, |
| counter to the definition. |
| In this case the thread would immediately acquire the lock. |
| |
| On the other hand, if this leaf node was a thread, then |
| we would have a thread that was not waiting on any lock, |
| again counter to the definition. |
| And in this case, the thread would either be running or |
| be blocked on something that is not a lock. |
| In the first case, in the absence of infinite-loop bugs, |
| the thread will eventually release the lock. |
| In the second case, in the absence of a failure-to-wake bug, |
| the thread will eventually wake up and release the lock.\footnote{ |
| Of course, one type of failure-to-wake bug is a |
| deadlock that involves not only locks, but also non-lock |
| resources. |
| But the question really did say ``lock-based deadlock''!} |
| |
| Therefore, given this definition of lock-based deadlock, there |
| must be a cycle in the corresponding graph. |
| }\QuickQuizEnd |
| |
| Although there are some software environments such as database systems |
| that can recover from an existing deadlock, this approach requires either |
| that one of the threads be killed or that a lock be forcibly stolen from |
| one of the threads. |
| This killing and forcible stealing works well for transactions, |
| but is often problematic for kernel and application-level use of locking: |
| Dealing with the resulting partially updated structures can be extremely |
| complex, hazardous, and error-prone. |
| |
| Therefore, kernels and applications should instead avoid deadlocks. |
| Deadlock-avoidance strategies include locking hierarchies |
| (\cref{sec:locking:Locking Hierarchies}), |
| local locking hierarchies |
| (\cref{sec:locking:Local Locking Hierarchies}), |
| layered locking hierarchies |
| (\cref{sec:locking:Layered Locking Hierarchies}), |
| temporal locking hierarchies |
| (\cref{sec:locking:Temporal Locking Hierarchies}), |
| strategies for dealing with APIs containing pointers to locks |
| (\cref{sec:locking:Locking Hierarchies and Pointers to Locks}), |
| conditional locking |
| (\cref{sec:locking:Conditional Locking}), |
| acquiring all needed locks first |
| (\cref{sec:locking:Acquire Needed Locks First}), |
| single-lock-at-a-time designs |
| (\cref{sec:locking:Single-Lock-at-a-Time Designs}), |
| and strategies for signal/interrupt handlers |
| (\cref{sec:locking:Signal/Interrupt Handlers}). |
| Although there is no deadlock-avoidance strategy that works perfectly |
| for all situations, there is a good selection of tools to choose from. |
| |
| \subsubsection{Locking Hierarchies} |
| \label{sec:locking:Locking Hierarchies} |
| |
| Locking hierarchies order the locks and prohibit acquiring locks out |
| of order. |
| In \cref{fig:locking:Deadlock Cycle}, |
| we might order the locks numerically, thus forbidding a thread |
| from acquiring a given lock if it already holds a lock |
| with the same or a higher number. |
| Thread~B has violated this hierarchy because it is attempting to |
| acquire Lock~3 while holding Lock~4. |
| This violation permitted the deadlock to occur. |
| |
| Again, to apply a locking hierarchy, order the locks and prohibit |
| out-of-order lock acquisition. |
In a large program, it is wise to use tools such as the Linux-kernel
| \co{lockdep}~\cite{JonathanCorbet2006lockdep} |
| to enforce your locking hierarchy. |
| |
| \subsubsection{Local Locking Hierarchies} |
| \label{sec:locking:Local Locking Hierarchies} |
| |
| However, the global nature of locking hierarchies makes them difficult to |
| apply to library functions. |
| After all, when a program using a given library function has not yet |
| been written, how can the poor library-function implementor possibly |
| follow the yet-to-be-defined locking hierarchy? |
| |
| One special (but common) case is when the library function does not |
| invoke any of the caller's code. |
| In this case, the caller's locks will never be acquired while holding |
| any of the library's locks, so that there cannot be a deadlock cycle |
| containing locks from both the library and the caller. |
| |
| \QuickQuiz{ |
| Are there any exceptions to this rule, so that there really could be |
| a deadlock cycle containing locks from both the library and |
| the caller, even given that the library code never invokes |
| any of the caller's functions? |
| }\QuickQuizAnswer{ |
| Indeed there are! |
| Here are a few of them: |
| \begin{enumerate} |
| \item If one of the library function's arguments is a pointer |
| to a lock that this library function acquires, and if |
| the library function holds one of its locks while |
| acquiring the caller's lock, then we could have a |
| deadlock cycle involving both caller and library locks. |
| \item If one of the library functions returns a pointer to |
| a lock that is acquired by the caller, and if the |
| caller acquires one of its locks while holding the |
| library's lock, we could again have a deadlock |
| cycle involving both caller and library locks. |
| \item If one of the library functions acquires a lock and |
| then returns while still holding it, and if the caller |
| acquires one of its locks, we have yet another way |
| to create a deadlock cycle involving both caller |
| and library locks. |
| \item If the caller has a signal handler that acquires |
| locks, then the deadlock cycle can involve both |
| caller and library locks. |
| In this case, however, the library's locks are |
| innocent bystanders in the deadlock cycle. |
| That said, please note that acquiring a lock from |
| within a signal handler is a no-no in most |
| environments---it is not just a bad idea, it |
| is unsupported. |
| But if you absolutely must acquire a lock in a signal |
| handler, be sure to block that signal while holding that |
| same lock in thread context. |
| \end{enumerate} |
| }\QuickQuizEnd |
| |
| But suppose that a library function does invoke the caller's code. |
| For example, \co{qsort()} invokes a caller-provided comparison function. |
| Now, normally this comparison function will operate on unchanging local |
| data, so that it need not acquire locks, as shown in |
| \cref{fig:locking:No qsort() Compare-Function Locking}. |
| But maybe someone is crazy enough to sort a collection whose keys |
| are changing, thus requiring that the comparison function acquire locks, |
| which might result in deadlock, as shown in |
| \cref{fig:locking:Without qsort() Local Locking Hierarchy}. |
| How can the library function avoid this deadlock? |
| |
| The golden rule in this case is ``Release all locks before invoking |
| unknown code.'' |
| To follow this rule, the \co{qsort()} function must release all of its |
| locks before invoking the comparison function. |
| Thus \co{qsort()} will not be holding any of its locks while the comparison |
| function acquires any of the caller's locks, thus avoiding deadlock. |
| |
| \QuickQuiz{ |
| But if \co{qsort()} releases all its locks before invoking |
| the comparison function, how can it protect against races |
| with other \co{qsort()} threads? |
| }\QuickQuizAnswer{ |
| By privatizing the data elements being compared |
| (as discussed in \cref{chp:Data Ownership}) |
| or through use of deferral mechanisms such as |
| reference counting (as discussed in |
| \cref{chp:Deferred Processing}). |
| Or through use of layered locking hierarchies, as described |
| in \cref{sec:locking:Layered Locking Hierarchies}. |
| |
| On the other hand, changing a key in a list that is |
| currently being sorted is at best rather brave. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{locking/NoLockHierarchy}} |
| \caption{No \tco{qsort()} Compare-Function Locking} |
| \label{fig:locking:No qsort() Compare-Function Locking} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{locking/NonLocalLockHierarchy}} |
| \caption{Without \tco{qsort()} Local Locking Hierarchy} |
| \label{fig:locking:Without qsort() Local Locking Hierarchy} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{locking/LocalLockHierarchy}} |
| \caption{Local Locking Hierarchy for \tco{qsort()}} |
| \label{fig:locking:Local Locking Hierarchy for qsort()} |
| \end{figure} |
| |
| To see the benefits of local locking hierarchies, compare |
| \cref{fig:locking:Without qsort() Local Locking Hierarchy,% |
| fig:locking:Local Locking Hierarchy for qsort()}. |
| In both figures, application functions \co{foo()} and \co{bar()} |
| invoke \co{qsort()} while holding Locks~A and~B, respectively. |
| Because this is a parallel implementation of \co{qsort()}, it acquires |
| Lock~C\@. |
| Function \co{foo()} passes function \co{cmp()} to \co{qsort()}, |
| and \co{cmp()} acquires Lock~B\@. |
| Function \co{bar()} passes a simple integer-comparison function (not |
| shown) to \co{qsort()}, and this simple function does not acquire any |
| locks. |
| |
| Now, if \co{qsort()} holds Lock~C while calling \co{cmp()} in violation |
| of the golden release-all-locks rule above, as shown in |
| \cref{fig:locking:Without qsort() Local Locking Hierarchy}, |
| deadlock can occur. |
| To see this, suppose that one thread invokes \co{foo()} while a second |
| thread concurrently invokes \co{bar()}. |
| The first thread will acquire Lock~A and the second thread will acquire |
| Lock~B\@. |
| If the first thread's call to \co{qsort()} acquires Lock~C, then it |
| will be unable to acquire Lock~B when it calls \co{cmp()}. |
| But the first thread holds Lock~C, so the second thread's call to |
| \co{qsort()} will be unable to acquire it, and thus unable to release |
| Lock~B, resulting in deadlock. |
| |
| In contrast, if \co{qsort()} releases Lock~C before invoking the |
| comparison function, which is unknown code from \co{qsort()}'s perspective, |
| then deadlock is avoided as shown in |
| \cref{fig:locking:Local Locking Hierarchy for qsort()}. |
| |
| If each module releases all locks before invoking unknown code, then |
| deadlock is avoided if each module separately avoids deadlock. |
| This rule therefore greatly simplifies deadlock analysis and greatly |
| improves modularity. |
| |
| \subsubsection{Layered Locking Hierarchies} |
| \label{sec:locking:Layered Locking Hierarchies} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{locking/LayeredLockHierarchy}} |
| \caption{Layered Locking Hierarchy for \tco{qsort()}} |
| \label{fig:locking:Layered Locking Hierarchy for qsort()} |
| \end{figure} |
| |
| Unfortunately, it might not be possible for \co{qsort()} to release |
| all of its locks before invoking the comparison function. |
| In this case, we cannot construct a local locking hierarchy by |
| releasing all locks before invoking unknown code. |
| However, we can instead construct a layered locking hierarchy, as shown in |
| \cref{fig:locking:Layered Locking Hierarchy for qsort()}. |
| Here, the \co{cmp()} function uses a new Lock~D that is acquired after |
| all of Locks~A, B, and~C, avoiding deadlock. |
| We therefore have three layers to the global deadlock hierarchy, the |
| first containing Locks~A and~B, the second containing Lock~C, and |
| the third containing Lock~D\@. |
| |
| \begin{listing} |
| \input{CodeSamples/locking/locked_list@start_next.fcv} |
| \caption{Concurrent List Iterator} |
| \label{lst:locking:Concurrent List Iterator} |
| \end{listing} |
| |
| Please note that it is not typically possible to mechanically |
| change \co{cmp()} to use the new Lock~D\@. |
| Quite the opposite: |
| It is often necessary to make profound design-level modifications. |
| Nevertheless, the effort required for such modifications is normally |
| a small price to pay in order to avoid deadlock. |
| More to the point, this potential deadlock should preferably be detected |
| at design time, before any code has been generated! |
| |
| For another example where releasing all locks before invoking unknown |
| code is impractical, imagine an iterator over a linked list, as shown in |
| \cref{lst:locking:Concurrent List Iterator} (\path{locked_list.c}). |
| The \co{list_start()} function acquires a lock on the list and returns |
| the first element (if there is one), and |
| \co{list_next()} either returns a pointer to the next element in the list |
| or releases the lock and returns \co{NULL} if the end of the list has |
| been reached. |
| |
| \begin{listing} |
| \input{CodeSamples/locking/locked_list@list_print.fcv} |
| \caption{Concurrent List Iterator Usage} |
| \label{lst:locking:Concurrent List Iterator Usage} |
| \end{listing} |
| |
| \begin{fcvref}[ln:locking:locked_list:list_print:ints] |
| \Cref{lst:locking:Concurrent List Iterator Usage} shows how |
| this list iterator may be used. |
| \Clnrefrange{b}{e} define the \co{list_ints} element |
| containing a single integer, |
| \end{fcvref} |
| \begin{fcvref}[ln:locking:locked_list:list_print:print] |
| and \clnrefrange{b}{e} show how to iterate over the list. |
| \Clnref{start} locks the list and fetches a pointer to the first element, |
| \clnref{entry} provides a pointer to our enclosing \co{list_ints} structure, |
| \clnref{print} prints the corresponding integer, and |
| \clnref{next} moves to the next element. |
| This is quite simple, and hides all of the locking. |
| \end{fcvref} |
| |
| That is, the locking remains hidden as long as the code processing each |
| list element does not itself acquire a lock that is held across some |
| other call to \co{list_start()} or \co{list_next()}, which results in |
| deadlock. |
| We can avoid the deadlock by layering the locking hierarchy |
| to take the list-iterator locking into account. |
| |
| This layered approach can be extended to an arbitrarily large number of layers, |
| but each added layer increases the complexity of the locking design. |
| Such increases in complexity are particularly inconvenient for some |
| types of object-oriented designs, in which control passes back and forth |
| among a large group of objects in an undisciplined manner.\footnote{ |
| One name for this is ``object-oriented spaghetti code.''} |
| This mismatch between the habits of object-oriented design and the |
| need to avoid deadlock is an important reason why parallel programming |
| is perceived by some to be so difficult. |
| |
| Some alternatives to highly layered locking hierarchies are covered in |
| \cref{chp:Deferred Processing}. |
| |
| \subsubsection{Temporal Locking Hierarchies} |
| \label{sec:locking:Temporal Locking Hierarchies} |
| |
| One way to avoid deadlock is to defer acquisition of one of the |
| conflicting locks. |
| This approach is used in Linux-kernel RCU, whose \apik{call_rcu()} |
| function is invoked by the Linux-kernel scheduler while holding |
| its locks. |
This means that \co{call_rcu()} cannot always safely invoke the scheduler
to do a wakeup, for example, to wake up an RCU kthread that would start
the new grace period required by the callback just queued by
\co{call_rcu()}.
| |
| \QuickQuiz{ |
| What do you mean ``cannot always safely invoke the scheduler''? |
| Either \co{call_rcu()} can or cannot safely invoke the scheduler, |
| right? |
| }\QuickQuizAnswer{ |
| It really does depend. |
| |
| The scheduler locks are always held with interrupts disabled. |
| Therefore, if \co{call_rcu()} is invoked with interrupts |
| enabled, no scheduler locks are held, and \co{call_rcu()} |
| can safely call into the scheduler. |
| Otherwise, if interrupts are disabled, one of the scheduler |
| locks \emph{might} be held, so \co{call_rcu()} must play it |
| safe and refrain from calling into the scheduler. |
| }\QuickQuizEnd |
| |
| However, grace periods last for many milliseconds, so waiting another |
| millisecond before starting a new grace period is not normally a problem. |
| Therefore, if \co{call_rcu()} detects a possible deadlock with the |
| scheduler, it arranges to start the new grace period later, either |
| within a timer handler or within the scheduler-clock interrupt handler, |
| depending on configuration. |
| Because no scheduler locks are held across either handler, deadlock |
| is successfully avoided. |
| |
| The overall approach is thus to adhere to a locking hierarchy by deferring |
| lock acquisition to an environment in which no locks are held. |
| |
| \subsubsection{Locking Hierarchies and Pointers to Locks} |
| \label{sec:locking:Locking Hierarchies and Pointers to Locks} |
| |
| Although there are some exceptions, an external API containing a pointer |
| to a lock is very often a misdesigned API\@. |
| Handing an internal lock to some other software component is after all |
| the antithesis of information hiding, which is in turn a key design |
| principle. |
| |
| \QuickQuiz{ |
| Name one common situation where a pointer to a lock is passed |
| into a function. |
| }\QuickQuizAnswer{ |
| Locking primitives, of course! |
| }\QuickQuizEnd |
| |
| One exception is functions that hand off some entity, |
| where the caller's lock must be held until the handoff is complete, |
| but where the lock must be released before the function returns. |
| One example of such a function is the POSIX \apipx{pthread_cond_wait()} |
function, where passing a pointer to a \apipx{pthread_mutex_t}
| prevents hangs due to lost wakeups. |
| |
| \QuickQuiz{ |
| Doesn't the fact that \co{pthread_cond_wait()} first releases the |
| mutex and then re-acquires it eliminate the possibility of deadlock? |
| }\QuickQuizAnswer{ |
| Absolutely not! |
| |
	Consider a program that acquires \co{mutex_a}, and then
	\co{mutex_b}, in that order, and then passes \co{mutex_a}
	to \co{pthread_cond_wait()}.
	Now, \co{pthread_cond_wait()} will release \co{mutex_a}, but
	will re-acquire it before returning.
	If some other thread acquires \co{mutex_a} in the meantime
	and then blocks on \co{mutex_b}, the program will deadlock.
| }\QuickQuizEnd |
| |
| In short, if you find yourself exporting an API with a pointer to a |
lock as an argument or as the return value, do yourself a favor and
| carefully reconsider your API design. |
| It might well be the right thing to do, but experience indicates that |
| this is unlikely. |
| |
| \subsubsection{Conditional Locking} |
| \label{sec:locking:Conditional Locking} |
| |
But suppose that there is no reasonable locking hierarchy.
This can happen in real life, for example, in some types of layered
network protocol stacks in which packets flow in both directions,
as in implementations of distributed lock managers.
| In the networking case, it might be necessary to hold the locks from |
| both layers when passing a packet from one layer to another. |
| Given that packets travel both up and down the protocol stack, this |
| is an excellent recipe for deadlock, as illustrated in |
| \cref{lst:locking:Protocol Layering and Deadlock}. |
| \begin{fcvref}[ln:locking:Protocol Layering and Deadlock] |
| Here, a packet moving down the stack towards the wire must acquire |
| the next layer's lock out of order. |
| Given that packets moving up the stack away from the wire are acquiring |
| the locks in order, the lock acquisition in \clnref{acq} of the listing |
| can result in deadlock. |
| \end{fcvref} |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:locking:Protocol Layering and Deadlock] |
| \begin{VerbatimL}[commandchars=\\\{\}] |
| spin_lock(&lock2); |
| layer_2_processing(pkt); |
| nextlayer = layer_1(pkt); |
| spin_lock(&nextlayer->lock1); \lnlbl{acq} |
| layer_1_processing(pkt); |
| spin_unlock(&lock2); |
| spin_unlock(&nextlayer->lock1); |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Protocol Layering and Deadlock} |
| \label{lst:locking:Protocol Layering and Deadlock} |
| \end{listing} |
| |
| One way to avoid deadlocks in this case is to impose a locking hierarchy, |
| but when it is necessary to acquire a lock out of order, acquire it |
| conditionally, as shown in |
| \cref{lst:locking:Avoiding Deadlock Via Conditional Locking}. |
| \begin{fcvref}[ln:locking:Avoiding Deadlock Via Conditional Locking] |
| Instead of unconditionally acquiring the layer-1 lock, \clnref{trylock} |
| conditionally acquires the lock using the \apik{spin_trylock()} primitive. |
| This primitive acquires the lock immediately if the lock is available |
| (returning non-zero), and otherwise returns zero without acquiring the lock. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:locking:Avoiding Deadlock Via Conditional Locking] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| retry: |
| spin_lock(&lock2); |
| layer_2_processing(pkt); |
| nextlayer = layer_1(pkt); |
| if (!spin_trylock(&nextlayer->lock1)) { \lnlbl[trylock] |
| spin_unlock(&lock2); \lnlbl[rel2] |
| spin_lock(&nextlayer->lock1); \lnlbl[acq1] |
| spin_lock(&lock2); \lnlbl[acq2] |
| if (layer_1(pkt) != nextlayer) {\lnlbl[recheck] |
| spin_unlock(&nextlayer->lock1); |
| spin_unlock(&lock2); |
| goto retry; |
| } |
| } |
| layer_1_processing(pkt); \lnlbl[l1_proc] |
| spin_unlock(&lock2); |
| spin_unlock(&nextlayer->lock1); |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Avoiding Deadlock Via Conditional Locking} |
| \label{lst:locking:Avoiding Deadlock Via Conditional Locking} |
| \end{listing} |
| |
If \co{spin_trylock()} was successful, \clnref{l1_proc} does the needed
layer-1 processing.
Otherwise, \clnref{rel2} releases \co{lock2}, and
\clnref{acq1,acq2} then acquire the two locks in
the correct order.
| Unfortunately, there might be multiple networking devices on |
| the system (e.g., Ethernet and WiFi), so that the \co{layer_1()} |
| function must make a routing decision. |
| This decision might change at any time, especially if the system |
| is mobile.\footnote{ |
| And, in contrast to the 1900s, mobility is the common case.} |
| Therefore, \clnref{recheck} must recheck the decision, and if it has changed, |
| must release the locks and start over. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Can the transformation from |
| \cref{lst:locking:Protocol Layering and Deadlock} to |
| \cref{lst:locking:Avoiding Deadlock Via Conditional Locking} |
| be applied universally? |
| }\QuickQuizAnswerB{ |
| Absolutely not! |
| |
| This transformation assumes that the |
| \co{layer_2_processing()} function is idempotent, given that |
| it might be executed multiple times on the same packet when |
| the \co{layer_1()} routing decision changes. |
| Therefore, in real life, this transformation can become |
| arbitrarily complex. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But the complexity in |
| \cref{lst:locking:Avoiding Deadlock Via Conditional Locking} |
| is well worthwhile given that it avoids deadlock, right? |
| }\QuickQuizAnswerE{ |
| Maybe. |
| |
| If the routing decision in \co{layer_1()} changes often enough, |
| the code will always retry, never making forward progress. |
| This is termed ``\IX{livelock}'' if no thread makes any |
| forward progress or ``\IX{starvation}'' |
| if some threads make forward progress but others do not |
| (see \cref{sec:locking:Livelock and Starvation}). |
| }\QuickQuizEndE |
| } |
| |
| \subsubsection{Acquire Needed Locks First} |
| \label{sec:locking:Acquire Needed Locks First} |
| |
| In an important special case of conditional locking, all needed |
| locks are acquired before any processing is carried out, where |
| the needed locks might be identified by hashing the addresses |
| of the data structures involved. |
| In this case, processing need not be idempotent: |
| If it turns out to be impossible to acquire a given lock without |
| first releasing one that was already acquired, just release all |
| the locks and try again. |
| Only once all needed locks are held will any processing be carried out. |
| |
| However, this procedure can result in \emph{livelock}, which will |
| be discussed in |
| \cref{sec:locking:Livelock and Starvation}. |
| |
| \QuickQuiz{ |
| When using the ``acquire needed locks first'' approach described in |
| \cref{sec:locking:Acquire Needed Locks First}, |
| how can livelock be avoided? |
| }\QuickQuizAnswer{ |
| Provide an additional global lock. |
| If a given thread has repeatedly tried and failed to acquire the needed |
| locks, then have that thread unconditionally acquire the new |
| global lock, and then unconditionally acquire any needed locks. |
| (Suggested by Doug Lea.) |
| }\QuickQuizEnd |
| |
| A related approach, two-phase locking~\cite{PhilipABernstein1987}, |
| has seen long production use in transactional database systems. |
| In the first phase of a two-phase locking transaction, locks are |
| acquired but not released. |
| Once all needed locks have been acquired, the transaction enters the |
| second phase, where locks are released, but not acquired. |
| This locking approach allows databases to provide serializability |
| guarantees for their transactions, in other words, to guarantee |
| that all values seen and produced by the transactions are consistent |
| with some global ordering of all the transactions. |
| Many such systems rely on the ability to abort transactions, although |
| this can be simplified by avoiding making any changes to shared data |
| until all needed locks are acquired. |
| Livelock and deadlock are issues in such systems, but practical |
| solutions may be found in any of a number of database textbooks. |
| |
| \subsubsection{Single-Lock-at-a-Time Designs} |
| \label{sec:locking:Single-Lock-at-a-Time Designs} |
| |
| In some cases, it is possible to avoid nesting locks, thus avoiding |
| deadlock. |
| For example, if a problem is perfectly partitionable, a single |
| lock may be assigned to each partition. |
| Then a thread working on a given partition need only acquire the one |
| corresponding lock. |
| Because no thread ever holds more than one lock at a time, |
| deadlock is impossible. |
| |
| However, there must be some mechanism to ensure that the needed data |
| structures remain in existence during the time that neither lock is |
| held. |
| One such mechanism is discussed in |
| \cref{sec:locking:Lock-Based Existence Guarantees} |
| and several others are presented in |
| \cref{chp:Deferred Processing}. |
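A partitioned counter illustrates the single-lock-at-a-time pattern.
In this sketch (the \co{struct partition} layout and function names
are illustrative, using pthreads for self-containment), each key maps
to exactly one partition, so no thread ever holds more than one lock:

```c
#include <pthread.h>

#define NPARTITIONS 8

struct partition {
	pthread_mutex_t lock;	/* Guards everything in this partition. */
	long counter;		/* Example per-partition data. */
};

static struct partition parts[NPARTITIONS];

static void parts_init(void)
{
	for (int i = 0; i < NPARTITIONS; i++) {
		pthread_mutex_init(&parts[i].lock, NULL);
		parts[i].counter = 0;
	}
}

/* Each key maps to exactly one partition, so a thread holds at
 * most one lock at a time, making deadlock impossible. */
static void partition_add(int key, long delta)
{
	struct partition *p = &parts[key % NPARTITIONS];

	pthread_mutex_lock(&p->lock);
	p->counter += delta;
	pthread_mutex_unlock(&p->lock);
}
```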
| |
| \subsubsection{Signal/Interrupt Handlers} |
| \label{sec:locking:Signal/Interrupt Handlers} |
| |
| Deadlocks involving signal handlers are often quickly dismissed by |
| noting that it is not legal to invoke \apipx{pthread_mutex_lock()} from |
| within a signal handler~\cite{OpenGroup1997pthreads}. |
| However, it is possible (though often unwise) to hand-craft |
| locking primitives that \emph{can} be invoked from signal handlers. |
| Besides which, almost all operating-system kernels permit locks to |
| be acquired from within interrupt handlers, which are analogous |
| to signal handlers. |
| |
| The trick is to block signals (or disable interrupts, as the case may be) |
| when acquiring any lock that might be acquired within a signal |
| (or an interrupt) handler. |
Furthermore, while holding such a lock, it is illegal to attempt to
acquire any lock that is ever acquired with signals unblocked, that is,
acquired outside of a signal handler without blocking signals.
| |
| \QuickQuiz{ |
| Suppose Lock~A is never acquired within a signal handler, |
| but Lock~B is acquired both from thread context and by signal |
| handlers. |
| Suppose further that Lock~A is sometimes acquired with signals |
| unblocked. |
| Why is it illegal to acquire Lock~A holding Lock~B? |
| }\QuickQuizAnswer{ |
| Because this would lead to deadlock. |
| Given that Lock~A is sometimes held outside of a signal |
| handler without blocking signals, a signal might be handled while |
| holding this lock. |
| The corresponding signal handler might then acquire |
| Lock~B, so that Lock~B is acquired while holding Lock~A\@. |
| Therefore, if we also acquire Lock~A while holding Lock~B, |
| we will have a deadlock cycle. |
| Note that this problem exists even if signals are blocked while |
| holding Lock~B. |
| |
| This is another reason to be very careful with locks that are |
| acquired within interrupt or signal handlers. |
| But the Linux kernel's lock dependency checker knows about this |
| situation and many others as well, so please do make full use |
| of it! |
| }\QuickQuizEnd |
| |
| If a lock is acquired by the handlers for several signals, then each |
| and every one of these signals must be blocked whenever that lock is |
| acquired, even when that |
| lock is acquired within a signal handler. |
| |
| \QuickQuiz{ |
| How can you legally block signals within a signal handler? |
| }\QuickQuizAnswer{ |
| One of the simplest and fastest ways to do so is to use |
| the \co{sa_mask} field of the \co{struct sigaction} that |
| you pass to \co{sigaction()} when setting up the signal. |
| }\QuickQuizEnd |
| |
| Unfortunately, blocking and unblocking signals can be expensive in |
| some operating systems, notably including Linux, so performance |
| concerns often mean that locks acquired in signal handlers are only |
| acquired in signal handlers, and that lockless synchronization |
| mechanisms are used to communicate between application code and |
| signal handlers. |
| |
| Or that signal handlers are avoided completely except for handling |
| fatal errors. |
| |
| \QuickQuiz{ |
| If acquiring locks in signal handlers is such a bad idea, why |
| even discuss ways of making it safe? |
| }\QuickQuizAnswer{ |
| Because these same rules apply to the interrupt handlers used in |
| operating-system kernels and in some embedded applications. |
| |
| In many application environments, acquiring locks in signal |
| handlers is frowned upon~\cite{OpenGroup1997pthreads}. |
| However, that does not stop clever developers from (perhaps |
| unwisely) fashioning home-brew locks out of atomic operations. |
| And atomic operations are in many cases perfectly legal in |
| signal handlers. |
| }\QuickQuizEnd |
| |
| \subsubsection{Discussion} |
| \label{sec:locking:Locking Hierarchy Discussion} |
| |
| There are a large number of deadlock-avoidance strategies available to |
| the shared-memory parallel programmer, but there are sequential |
| programs for which none of them is a good fit. |
| This is one of the reasons that expert programmers have more than |
| one tool in their toolbox: |
| Locking is a powerful concurrency tool, but there are jobs better |
| addressed with other tools. |
| |
| \QuickQuiz{ |
| Given an object-oriented application that passes control freely |
| among a group of objects such that there is no straightforward |
| locking hierarchy,\footnote{ |
| Also known as ``object-oriented spaghetti code.''} |
| layered or otherwise, how can this |
| application be parallelized? |
| }\QuickQuizAnswer{ |
| There are a number of approaches: |
| \begin{enumerate} |
| \item In the case of parametric search via simulation, |
| where a large number of simulations will be run |
| in order to converge on (for example) a good design |
| for a mechanical or electrical device, leave the |
| simulation single-threaded, but run many instances |
| of the simulation in parallel. |
This retains the object-oriented design, gains
parallelism at a higher level, and likely also avoids
both deadlocks and synchronization overhead.
| \item Partition the objects into groups such that there |
| is no need to operate on objects in |
| more than one group at a given time. |
| Then associate a lock with each group. |
| This is an example of a single-lock-at-a-time |
design, which was discussed in
| \cref{sec:locking:Single-Lock-at-a-Time Designs}. |
| \item Partition the objects into groups such that threads |
| can all operate on objects in the groups in some |
| groupwise ordering. |
| Then associate a lock with each group, and impose a |
| locking hierarchy over the groups. |
| \item Impose an arbitrarily selected hierarchy on the locks, |
| and then use conditional locking if it is necessary |
| to acquire a lock out of order, as was discussed in |
| \cref{sec:locking:Conditional Locking}. |
| \item Before carrying out a given group of operations, predict |
| which locks will be acquired, and attempt to acquire them |
| before actually carrying out any updates. |
| If the prediction turns out to be incorrect, drop |
| all the locks and retry with an updated prediction |
| that includes the benefit of experience. |
| This approach was discussed in |
| \cref{sec:locking:Acquire Needed Locks First}. |
| \item Use transactional memory. |
| This approach has a number of advantages and disadvantages |
| which will be discussed in |
| \crefrange{sec:future:Transactional Memory}{sec:future:Hardware Transactional Memory}. |
| \item Refactor the application to be more concurrency-friendly. |
| This would likely also have the side effect of making |
| the application run faster even when single-threaded, but might |
| also make it more difficult to modify the application. |
| \item Use techniques from later chapters in addition to locking. |
| \end{enumerate} |
| }\QuickQuizEnd |
| |
| Nevertheless, the strategies described in this section have proven |
| quite useful in many settings. |
| |
| \subsection{Livelock and Starvation}\ucindex{Livelock|BF}\ucindex{Starvation|BF} |
| \label{sec:locking:Livelock and Starvation} |
| |
| Although conditional locking can be an effective deadlock-avoidance |
| mechanism, it can be abused. |
| Consider for example the beautifully symmetric example shown in |
| \cref{lst:locking:Abusing Conditional Locking}. |
| This example's beauty hides an ugly livelock. |
| To see this, consider the following sequence of events: |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:locking:Abusing Conditional Locking] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| void thread1(void) |
| { |
| retry: \lnlbl[thr1:retry] |
| spin_lock(&lock1); \lnlbl[thr1:acq1] |
| do_one_thing(); |
| if (!spin_trylock(&lock2)) { \lnlbl[thr1:try2] |
| spin_unlock(&lock1); \lnlbl[thr1:rel1] |
| goto retry; |
| } |
| do_another_thing(); |
| spin_unlock(&lock2); |
| spin_unlock(&lock1); |
| } |
| |
| void thread2(void) |
| { |
| retry: \lnlbl[thr2:retry] |
| spin_lock(&lock2); \lnlbl[thr2:acq2] |
| do_a_third_thing(); |
| if (!spin_trylock(&lock1)) { \lnlbl[thr2:try1] |
| spin_unlock(&lock2); \lnlbl[thr2:rel2] |
| goto retry; |
| } |
| do_a_fourth_thing(); |
| spin_unlock(&lock1); |
| spin_unlock(&lock2); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Abusing Conditional Locking} |
| \label{lst:locking:Abusing Conditional Locking} |
| \end{listing} |
| |
| \begin{fcvref}[ln:locking:Abusing Conditional Locking] |
| \begin{enumerate} |
| \item Thread~1 acquires \co{lock1} on \clnref{thr1:acq1}, then invokes |
| \co{do_one_thing()}. |
| \item Thread~2 acquires \co{lock2} on \clnref{thr2:acq2}, then invokes |
| \co{do_a_third_thing()}. |
| \item Thread~1 attempts to acquire \co{lock2} on \clnref{thr1:try2}, |
| but fails because Thread~2 holds it. |
| \item Thread~2 attempts to acquire \co{lock1} on \clnref{thr2:try1}, |
| but fails because Thread~1 holds it. |
| \item Thread~1 releases \co{lock1} on \clnref{thr1:rel1}, |
| then jumps to \co{retry} at \clnref{thr1:retry}. |
| \item Thread~2 releases \co{lock2} on \clnref{thr2:rel2}, |
| and jumps to \co{retry} at \clnref{thr2:retry}. |
| \item The livelock dance repeats from the beginning. |
| \end{enumerate} |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| How can the livelock shown in |
| \cref{lst:locking:Abusing Conditional Locking} |
| be avoided? |
| }\QuickQuizAnswer{ |
| \Cref{lst:locking:Avoiding Deadlock Via Conditional Locking} |
| provides some good hints. |
| In many cases, livelocks are a hint that you should revisit your |
| locking design. |
| Or visit it in the first place if your locking design |
| ``just grew''. |
| |
| That said, one good-and-sufficient approach due to Doug Lea |
| is to use conditional locking as described in |
| \cref{sec:locking:Conditional Locking}, but combine this |
| with acquiring all needed locks first, before modifying shared |
| data, as described in |
| \cref{sec:locking:Acquire Needed Locks First}. |
| If a given critical section retries too many times, |
| unconditionally acquire |
| a global lock, then unconditionally acquire all the needed locks. |
| This avoids both deadlock and livelock, and scales reasonably |
| assuming that the global lock need not be acquired too often. |
| }\QuickQuizEnd |
| |
| Livelock can be thought of as an extreme form of starvation where |
| a group of threads starves, rather than just one of them.\footnote{ |
| Try not to get too hung up on the exact definitions of terms |
| like livelock, starvation, and unfairness. |
| Anything that causes a group of threads to fail to make adequate |
| forward progress is a bug that needs to be fixed, and debating |
| names doesn't fix bugs.} |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:locking:Conditional Locking and Exponential Backoff] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| void thread1(void) |
| { |
| unsigned int wait = 1; |
| retry: |
| spin_lock(&lock1); |
| do_one_thing(); |
| if (!spin_trylock(&lock2)) { |
| spin_unlock(&lock1); |
| sleep(wait); |
| wait = wait << 1; |
| goto retry; |
| } |
| do_another_thing(); |
| spin_unlock(&lock2); |
| spin_unlock(&lock1); |
| } |
| |
| void thread2(void) |
| { |
| unsigned int wait = 1; |
| retry: |
| spin_lock(&lock2); |
| do_a_third_thing(); |
| if (!spin_trylock(&lock1)) { |
| spin_unlock(&lock2); |
| sleep(wait); |
| wait = wait << 1; |
| goto retry; |
| } |
| do_a_fourth_thing(); |
| spin_unlock(&lock1); |
| spin_unlock(&lock2); |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Conditional Locking and Exponential Backoff} |
| \label{lst:locking:Conditional Locking and Exponential Backoff} |
| \end{listing} |
| |
| Livelock and starvation are serious issues in software transactional |
memory implementations, and so the concept of a \emph{contention
manager} has been introduced to encapsulate these issues.
| In the case of locking, simple exponential backoff can often address |
| livelock and starvation. |
| The idea is to introduce exponentially increasing delays before each |
| retry, as shown in |
| \cref{lst:locking:Conditional Locking and Exponential Backoff}. |
| |
| \QuickQuiz{ |
| What problems can you spot in the code in |
| \cref{lst:locking:Conditional Locking and Exponential Backoff}? |
| }\QuickQuizAnswer{ |
| Here are a couple: |
| \begin{enumerate} |
| \item A one-second wait is way too long for most uses. |
| Wait intervals should begin with roughly the time |
| required to execute the critical section, which will |
| normally be in the microsecond or millisecond range. |
| \item The code does not check for overflow. |
| On the other hand, this bug is nullified |
| by the previous bug: |
| 32 bits worth of seconds is more than 50 years. |
| \end{enumerate} |
| }\QuickQuizEnd |
| |
| For better results, backoffs should be bounded, and |
| even better high-contention results are obtained via queued |
| locking~\cite{Anderson90}, which is discussed more in |
| \cref{sec:locking:Other Exclusive-Locking Implementations}. |
| Of course, best of all is to use a good parallel design that avoids |
| these problems by maintaining low \IX{lock contention}. |
| |
| \subsection{Unfairness} |
| \label{sec:locking:Unfairness} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cpu/SystemArch}} |
| \caption{System Architecture and Lock Unfairness} |
| \label{fig:locking:System Architecture and Lock Unfairness} |
| \end{figure} |
| |
| \IXB{Unfairness} can be thought of as a less-severe form of starvation, |
| where a subset of threads contending for a given lock are granted |
| the lion's share of the acquisitions. |
| This can happen on machines with shared caches or \IXacr{numa} characteristics, |
| for example, as shown in |
| \cref{fig:locking:System Architecture and Lock Unfairness}. |
| If CPU~0 releases a lock that all the other CPUs are attempting |
| to acquire, the interconnect shared between CPUs~0 and~1 means that |
| CPU~1 will have an advantage over CPUs~2--7. |
| Therefore CPU~1 will likely acquire the lock. |
| If CPU~1 holds the lock long enough for CPU~0 to be requesting the |
| lock by the time CPU~1 releases it and vice versa, the lock can |
| shuttle between CPUs~0 and~1, bypassing CPUs~2--7. |
| |
| \QuickQuiz{ |
| Wouldn't it be better just to use a good parallel design |
| so that lock contention was low enough to avoid unfairness? |
| }\QuickQuizAnswer{ |
| It would be better in some sense, but there are situations |
| where it can be appropriate to use |
designs that sometimes result in high lock contention.
| |
| For example, imagine a system that is subject to a rare error |
| condition. |
| It might well be best to have a simple error-handling design |
| that has poor performance and scalability for the duration of |
| the rare error condition, as opposed to a complex and |
| difficult-to-debug design that is helpful only when one of |
| those rare error conditions is in effect. |
| |
| That said, it is usually worth putting some effort into |
attempting to produce a design that is both simple and
| efficient during error conditions, for example by partitioning |
| the problem. |
| }\QuickQuizEnd |
| |
| \subsection{Inefficiency} |
| \label{sec:locking:Inefficiency} |
| |
| Locks are implemented using atomic instructions and memory barriers, |
| and often involve cache misses. |
| As we saw in \cref{chp:Hardware and its Habits}, |
these instructions are quite expensive, incurring roughly two orders
of magnitude greater overhead than simple instructions.
| This can be a serious problem for locking: |
| If you protect a single instruction with a lock, you will increase |
| the overhead by a factor of one hundred. |
| Even assuming perfect scalability, \emph{one hundred} CPUs would |
| be required to keep up with a single CPU executing the same code |
| without locking. |
| |
| This situation underscores the synchronization\-/granularity |
| tradeoff discussed in \cref{sec:SMPdesign:Synchronization Granularity}, |
| especially \cref{fig:SMPdesign:Synchronization Efficiency}: |
| Too coarse a granularity will limit scalability, while too fine a |
| granularity will result in excessive synchronization overhead. |
| |
| Acquiring a lock might be expensive, but once held, the CPU's caches |
| are an effective performance booster, at least for large \IXpl{critical section}. |
| In addition, once a lock is held, the data protected by that lock can |
| be accessed by the lock holder without interference from other threads. |
| |
| \QuickQuiz{ |
| How might the lock holder be interfered with? |
| }\QuickQuizAnswer{ |
| If the data protected by the lock is in the same cache line |
| as the lock itself, then attempts by other CPUs to acquire |
| the lock will result in expensive cache misses on the part |
| of the CPU holding the lock. |
| This is a special case of \IX{false sharing}, which can also occur |
| if a pair of variables protected by different locks happen |
| to share a cache line. |
| In contrast, if the lock is in a different cache line than |
| the data that it protects, the CPU holding the lock will |
| usually suffer a cache miss only on first access to a given |
| variable. |
| |
| Of course, the downside of placing the lock and data into separate |
| cache lines is that the code will incur two cache misses rather |
| than only one in the uncontended case. |
| As always, choose wisely! |
| }\QuickQuizEnd |
| |
| \section{Types of Locks} |
| \label{sec:locking:Types of Locks} |
| % |
| \epigraph{Only locks in life are what you think you know, but don't. |
| Accept your ignorance and try something new.} |
| {\emph{Dennis Vickers}} |
| |
| There are a surprising number of types of locks, more than this |
| short chapter can possibly do justice to. |
| The following sections discuss |
| exclusive locks (\cref{sec:locking:Exclusive Locks}), |
| reader-writer locks (\cref{sec:locking:Reader-Writer Locks}), |
| multi-role locks (\cref{sec:locking:Beyond Reader-Writer Locks}), |
| and scoped locking (\cref{sec:locking:Scoped Locking}). |
| |
| \subsection{Exclusive Locks} |
| \label{sec:locking:Exclusive Locks} |
| |
| \IXhpl{Exclusive}{lock} are what they say they are: |
| Only one thread may hold the lock at a time. |
| The holder of such a lock thus has exclusive access to all data protected |
| by that lock, hence the name. |
| |
| Of course, this all assumes that this lock is held across all accesses |
| to data purportedly protected by the lock. |
| Although there are some tools that can help (see for example |
| \cref{sec:formal:Axiomatic Approaches and Locking}), |
| the ultimate responsibility for ensuring that the lock is always acquired |
| when needed rests with the developer. |
| |
| \QuickQuiz{ |
| Does it ever make sense to have an exclusive lock acquisition |
| immediately followed by a release of that same lock, that is, |
| an empty critical section? |
| }\QuickQuizAnswer{ |
| Empty lock-based critical sections are rarely used, but they |
| do have their uses. |
| The point is that the semantics of exclusive locks have two |
| components: |
| \begin{enumerate*}[(1)] |
| \item The familiar data-protection semantic and |
| \item A messaging semantic, where releasing a given lock notifies |
| a waiting acquisition of that same lock. |
| \end{enumerate*} |
| An empty critical section uses the messaging component without |
| the data-protection component. |
| |
The rest of this answer provides some example uses of empty
critical sections; however, these examples should be considered
``gray magic.''\footnote{
| Thanks to Alexey Roytman for this description.} |
| As such, empty critical sections are almost never used in practice. |
| Nevertheless, pressing on into this gray area \ldots |
| |
| One historical use of empty critical sections appeared in the |
| networking stack of the 2.4 Linux kernel through use of a |
| read-side-scalable reader-writer lock called \co{brlock} |
| for ``big reader lock''. |
| This use case is a way of approximating the semantics of read-copy |
| update (RCU), which is discussed in |
| \cref{sec:defer:Read-Copy Update (RCU)}. |
| And in fact this Linux-kernel use case has been replaced |
| with RCU\@. |
| |
| The empty-lock-critical-section idiom can also be used to |
| reduce lock contention in some situations. |
| For example, consider a multithreaded user-space application where |
| each thread processes units of work maintained in a per-thread |
| list, where threads are prohibited from touching each others' |
| lists~\cite{PaulEMcKenney2012EmptyLocks}. |
| There could also be updates that require that all previously |
| scheduled units of work have completed before the update can |
| progress. |
| One way to handle this is to schedule a unit of work on each |
| thread, so that when all of these units of work complete, the |
| update may proceed. |
| |
| In some applications, threads can come and go. |
| For example, each thread might correspond to one user of the |
| application, and thus be removed when that user logs out or |
| otherwise disconnects. |
| In many applications, threads cannot depart atomically: |
| They must instead explicitly unravel themselves from various |
| portions of the application using a specific sequence of actions. |
| One specific action will be refusing to accept further requests |
| from other threads, and another specific action will be disposing |
| of any remaining units of work on its list, for example, by |
| placing these units of work in a global work-item-disposal list |
| to be taken by one of the remaining threads. |
| (Why not just drain the thread's work-item list by executing |
| each item? |
| Because a given work item might generate more work items, so |
| that the list could not be drained in a timely fashion.) |
| |
| If the application is to perform and scale well, a good locking |
| design is required. |
| One common solution is to have a global lock (call it \co{G}) |
| protecting the entire |
| process of departing (and perhaps other things as well), |
| with finer-grained locks protecting the |
| individual unraveling operations. |
| |
| Now, a departing thread must clearly refuse to accept further |
| requests before disposing of the work on its list, because |
| otherwise additional work might arrive after the disposal action, |
| which would render that disposal action ineffective. |
| So simplified pseudocode for a departing thread might be as follows: |
| |
| \begin{enumerate} |
| \item Acquire lock \co{G}. |
| \item Acquire the lock guarding communications. |
| \item Refuse further communications from other threads. |
| \item Release the lock guarding communications. |
| \item Acquire the lock guarding the global work-item-disposal list. |
| \item Move all pending work items to the global |
| work-item-disposal list. |
| \item Release the lock guarding the global work-item-disposal list. |
| \item Release lock \co{G}. |
| \end{enumerate} |
| |
| Of course, a thread that needs to wait for all pre-existing work |
| items will need to take departing threads into account. |
| To see this, suppose that this thread starts waiting for all |
| pre-existing work items just after a departing thread has refused |
| further communications from other threads. |
| How can this thread wait for the departing thread's work items |
| to complete, keeping in mind that threads are not allowed to |
| access each others' lists of work items? |
| |
| One straightforward approach is for this thread to acquire \co{G} |
| and then the lock guarding the global work-item-disposal list, then |
| move the work items to its own list. |
The thread then releases both locks,
places a work item on the end of its own list,
and then waits for all of the work items that it placed on each thread's
list (including its own) to complete.
| |
| This approach does work well in many cases, but if special |
| processing is required for each work item as it is pulled in |
| from the global work-item-disposal list, the result could be |
| excessive contention on \co{G}. |
| One way to avoid that contention is to acquire \co{G} and then |
| immediately release it. |
Then the process of waiting for all prior work items looks
something like the following:
| |
| \begin{enumerate} |
| \item Set a global counter to one and initialize a condition |
| variable to zero. |
| \item Send a message to all threads to cause them to atomically |
| increment the global counter, and then to enqueue a |
| work item. |
| The work item will atomically decrement the global |
| counter, and if the result is zero, it will set a |
| condition variable to one. |
| \item Acquire \co{G}, which will wait on any currently departing |
| thread to finish departing. |
| Because only one thread may depart at a time, all the |
| remaining threads will have already received the message |
| sent in the preceding step. |
| \item Release \co{G}. |
| \item Acquire the lock guarding the global work-item-disposal list. |
| \item Move all work items from the global work-item-disposal list |
| to this thread's list, processing them as needed along the way. |
| \item Release the lock guarding the global work-item-disposal list. |
| \item Enqueue an additional work item onto this thread's list. |
| (As before, this work item will atomically decrement |
| the global counter, and if the result is zero, it will |
| set a condition variable to one.) |
| \item Wait for the condition variable to take on the value one. |
| \end{enumerate} |
| |
| Once this procedure completes, all pre-existing work items are |
| guaranteed to have completed. |
| The empty critical sections are using locking for messaging as |
| well as for protection of data. |
| }\QuickQuizEnd |
| |
| \QuickQuizLabel{\QlockingQemptycriticalsection} |
| |
| It is important to note that unconditionally acquiring an exclusive lock |
| has two effects: |
| \begin{enumerate*}[(1)] |
| \item Waiting for all prior holders of that lock to release it and |
| \item Blocking any other acquisition attempts until the lock is released. |
| \end{enumerate*} |
| As a result, at lock acquisition time, any concurrent acquisitions of |
| that lock must be partitioned into prior holders and subsequent |
| holders. |
| Different types of exclusive locks use different partitioning |
| strategies~\cite{BjoernBrandenburgPhD,Guerraoui:2019:LPA:3319851.3301501}, |
| for example: |
| |
| \begin{enumerate} |
| \item Strict FIFO, with acquisitions starting earlier acquiring |
| the lock earlier. |
| \item Approximate FIFO, with acquisitions starting sufficiently |
| earlier acquiring the lock earlier. |
| \item FIFO within priority level, with higher-priority threads |
| acquiring the lock earlier than any lower-priority threads |
attempting to acquire the lock at about the same time, but with
some FIFO ordering maintained among threads of the same priority.
| \item Random, so that the new lock holder is chosen randomly from |
| all threads attempting acquisition, regardless of timing. |
| \item |
| Unfair, so that a given acquisition might never acquire the lock |
| (see \cref{sec:locking:Unfairness}). |
| \end{enumerate} |
| |
| Unfortunately, locking implementations with stronger guarantees |
| typically incur higher overhead, motivating the wide variety of locking |
| implementations in production use. |
| For example, real-time systems often require some degree of FIFO |
| ordering within priority level, and much else besides |
| (see \cref{sec:advsync:Event-Driven Real-Time Support}), |
| while non-realtime systems subject to high contention might require |
| only enough ordering to avoid starvation, and finally, non-realtime |
| systems designed to avoid contention might not need fairness at all. |
| |
| \subsection{Reader-Writer Locks} |
| \label{sec:locking:Reader-Writer Locks} |
| |
| \IXhpl{Reader-writer}{lock}~\cite{Courtois71} |
| permit any number of readers to hold the lock |
| concurrently on the one hand or a single writer to hold the lock |
| on the other. |
| In theory, then, reader-writer locks should allow excellent scalability |
| for data that is read often and written rarely. |
| In practice, the scalability will depend on the reader-writer lock |
| implementation. |
| |
| The classic reader-writer lock implementation involves a set of |
| counters and flags that are manipulated atomically. |
| This type of implementation suffers from the same problem as does |
| exclusive locking for short critical sections: |
| The overhead of acquiring and releasing the lock |
| is about two orders of magnitude greater than the overhead |
| of a simple instruction. |
| Of course, if the critical section is long enough, the overhead of |
| acquiring and releasing the lock becomes negligible. |
| However, because only |
| one thread at a time can be manipulating the lock, the required |
| critical-section size increases with the number of CPUs. |
| |
| It is possible to design a reader-writer lock that is much more |
| favorable to readers through use of per-thread exclusive |
| locks~\cite{WilsonCHsieh92a}. |
| To read, a thread acquires only its own lock. |
| To write, a thread acquires all locks. |
| In the absence of writers, each reader incurs only atomic-instruction |
| and memory-barrier overhead, with no cache misses, which is quite |
| good for a locking primitive. |
| Unfortunately, writers must incur cache misses as well as atomic-instruction |
| and memory-barrier overhead---multiplied by the number of threads. |
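A minimal sketch of this reader-favoring scheme follows, using a
fixed-size array of pthread mutexes indexed by a caller-supplied
thread ID\@.
The names and the \co{MAX_THREADS} bound are illustrative; a
production implementation would use true per-thread storage:

```c
#include <pthread.h>

#define MAX_THREADS 8

/* One exclusive lock per thread.  Readers acquire only their own
 * lock; writers acquire all of them in index order, which prevents
 * deadlock among concurrent writers. */
static pthread_mutex_t rwtab[MAX_THREADS];

static void rwtab_init(void)
{
	for (int i = 0; i < MAX_THREADS; i++)
		pthread_mutex_init(&rwtab[i], NULL);
}

/* Readers touch only their own lock: no cache misses from other
 * readers in the absence of writers. */
static void read_lock(int tid)
{
	pthread_mutex_lock(&rwtab[tid]);
}

static void read_unlock(int tid)
{
	pthread_mutex_unlock(&rwtab[tid]);
}

/* Writers pay a cost proportional to the number of threads. */
static void write_lock(void)
{
	for (int i = 0; i < MAX_THREADS; i++)
		pthread_mutex_lock(&rwtab[i]);
}

static void write_unlock(void)
{
	for (int i = MAX_THREADS - 1; i >= 0; i--)
		pthread_mutex_unlock(&rwtab[i]);
}
```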
| |
| In short, reader-writer locks can be quite useful in a number of |
| situations, but each type of implementation does have its drawbacks. |
| The canonical use case for reader-writer locking involves very long |
| \IXhpl{read-side}{critical section}, preferably measured in hundreds of microseconds |
| or even milliseconds. |
| |
| As with exclusive locks, a reader-writer lock acquisition cannot complete |
| until all prior conflicting holders of that lock have released it. |
| If a lock is read-held, then read acquisitions can complete immediately, |
| but write acquisitions must wait until there are no longer any readers |
| holding the lock. |
| If a lock is write-held, then all acquisitions must wait until the writer |
| releases the lock. |
| Again as with exclusive locks, different reader-writer lock implementations |
| provide different degrees of FIFO ordering to readers on the one hand |
| and to writers on the other. |
| |
| But suppose a large number of readers hold the lock and a writer is waiting |
| to acquire the lock. |
| Should readers be allowed to continue to acquire the lock, possibly |
| starving the writer? |
| Similarly, suppose that a writer holds the lock and that a large number |
| of both readers and writers are waiting to acquire the lock. |
When the current writer releases the lock, should it be given to a reader
| or to another writer? |
| If it is given to a reader, how many readers should be allowed to acquire |
| the lock before the next writer is permitted to do so? |
| |
| There are many possible answers to these questions, with different |
| levels of complexity, overhead, and fairness. |
| Different implementations might have different costs, for example, |
| some types of reader-writer locks incur extremely large latencies |
| when switching from read-holder to write-holder mode. |
| Here are a few possible approaches: |
| |
| \begin{enumerate} |
| \item |
| Reader-preference implementations unconditionally favor readers |
| over writers, possibly allowing write acquisitions to be |
| indefinitely blocked. |
| \item Batch-fair implementations ensure that when both readers and writers |
| are acquiring the lock, both have reasonable access via batching. |
| For example, the lock might admit five readers per CPU, then two |
| writers, then five more readers per CPU, and so on. |
| \item Writer-preference implementations unconditionally favor |
| writers over readers, possibly allowing read acquisitions to be |
| indefinitely blocked. |
| \end{enumerate} |
| |
| Of course, these distinctions matter only under conditions of high |
| lock contention. |
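To make the reader-preference policy concrete, here is a minimal C11-atomics sketch (all names are hypothetical, and this is an illustration rather than a production-quality lock) that encodes the entire lock state in a single integer:

```c
#include <stdatomic.h>

/* state > 0: that many readers; state == 0: free; state == -1: write-held. */
typedef struct {
	atomic_int state;
} rp_rwlock_t;

static void rp_read_lock(rp_rwlock_t *l)
{
	int s = atomic_load(&l->state);

	for (;;) {
		if (s < 0) {		/* Write-held: reload and retry. */
			s = atomic_load(&l->state);
			continue;
		}
		/* On failure, s is updated to the current value. */
		if (atomic_compare_exchange_weak(&l->state, &s, s + 1))
			return;
	}
}

static void rp_read_unlock(rp_rwlock_t *l)
{
	atomic_fetch_sub(&l->state, 1);
}

static void rp_write_lock(rp_rwlock_t *l)
{
	int expected = 0;

	/* Wait until there are no readers, then swing 0 -> -1. */
	while (!atomic_compare_exchange_weak(&l->state, &expected, -1))
		expected = 0;
}

static void rp_write_unlock(rp_rwlock_t *l)
{
	atomic_store(&l->state, 0);
}
```

Because rp_read_lock() succeeds whenever the state is non-negative, a steady stream of readers keeps the state positive, so that the writer's compare-and-swap from zero can fail indefinitely, which is precisely the writer starvation called out above.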
| |
| Please keep the waiting/blocking dual nature of locks firmly in mind. |
| This will be revisited in \cref{chp:Deferred Processing}'s discussion |
| of scalable high-performance special-purpose alternatives to locking. |
| |
| \subsection{Beyond Reader-Writer Locks} |
| \label{sec:locking:Beyond Reader-Writer Locks} |
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.2} |
| \newcommand{\x}{\textcolor{gray!20}{\rule{7pt}{7pt}}} |
| \newcommand{\rothead}[1]{\begin{picture}(6,65)(0,0)\rotatebox{90}{#1}\end{picture}} |
| \small |
| \centering |
| \begin{tabular}{lcccccc} |
| \toprule |
| & \rothead{Null (Not Held)} |
| & \rothead{Concurrent Read} |
| & \rothead{Concurrent Write} |
| & \rothead{Protected Read} |
| & \rothead{Protected Write} |
| & \rothead{Exclusive} |
| \\ |
| % NL CR CW PR PW EX |
| \cmidrule(r){1-1} \cmidrule{2-7} |
| Null (Not Held) & \x & \x & \x & \x & \x & \x \\ |
| Concurrent Read & \x & \x & \x & \x & \x & X \\ |
| Concurrent Write & \x & \x & \x & X & X & X \\ |
| Protected Read & \x & \x & X & \x & X & X \\ |
| Protected Write & \x & \x & X & X & X & X \\ |
| Exclusive & \x & X & X & X & X & X \\ |
| \bottomrule |
| \end{tabular} |
| \caption{VAX/VMS Distributed Lock Manager Policy} |
| \label{tab:locking:VAX/VMS Distributed Lock Manager Policy} |
| \end{table} |
| |
| Reader-writer locks and exclusive locks differ in their admission policy: |
| Exclusive locks allow at most one holder, while reader-writer locks |
| permit an arbitrary number of read-holders (but only one write-holder). |
| There is a very large number of possible admission policies, one of |
| which is that of the VAX/VMS distributed lock |
| manager (DLM)~\cite{Snaman87}, which is shown in |
| \cref{tab:locking:VAX/VMS Distributed Lock Manager Policy}. |
| Blank cells indicate compatible modes, while cells containing ``X'' |
| indicate incompatible modes. |
| |
| The VAX/VMS DLM uses six modes. |
| For purposes of comparison, exclusive |
| locks use two modes (not held and held), while reader-writer locks |
| use three modes (not held, read held, and write held). |
| |
| The first mode is null, or not held. |
| This mode is compatible with all other modes, which is to be expected: |
| If a thread is not holding a lock, it should not prevent any |
| other thread from acquiring that lock. |
| |
| The second mode is concurrent read, which is compatible with every other |
| mode except for exclusive. |
| The concurrent-read mode might be used to accumulate approximate |
| statistics on a data structure, while permitting updates to proceed |
| concurrently. |
| |
| The third mode is concurrent write, which is compatible with null, |
| concurrent read, and concurrent write. |
| The concurrent-write mode might be used to update approximate statistics, |
| while still permitting reads and concurrent updates to proceed |
| concurrently. |
| |
| The fourth mode is protected read, which is compatible with null, |
| concurrent read, and protected read. |
| The protected-read mode might be used to obtain a consistent snapshot |
| of the data structure, while permitting reads but not updates to |
| proceed concurrently. |
| |
| The fifth mode is protected write, which is compatible with null and |
| concurrent read. |
| The protected-write mode might be used to carry out updates to a data |
| structure that could interfere with protected readers but which could |
| be tolerated by concurrent readers. |
| |
| The sixth and final mode is exclusive, which is compatible only with null. |
| The exclusive mode is used when it is necessary to exclude all other accesses. |
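The compatibility relation in the table is symmetric and can be transcribed directly into code. The following hypothetical C fragment encodes it as a lookup table in which a nonzero entry indicates that the corresponding pair of modes may be held concurrently:

```c
/* The six VAX/VMS DLM modes, in table order. */
enum dlm_mode { NL, CR, CW, PR, PW, EX };

/* Transcription of the compatibility table: 1 means compatible. */
static const int dlm_compat[6][6] = {
	/*         NL CR CW PR PW EX */
	/* NL */ {  1, 1, 1, 1, 1, 1 },
	/* CR */ {  1, 1, 1, 1, 1, 0 },
	/* CW */ {  1, 1, 1, 0, 0, 0 },
	/* PR */ {  1, 1, 0, 1, 0, 0 },
	/* PW */ {  1, 1, 0, 0, 0, 0 },
	/* EX */ {  1, 0, 0, 0, 0, 0 },
};

static int dlm_compatible(enum dlm_mode a, enum dlm_mode b)
{
	return dlm_compat[a][b];
}
```

A lock manager would consult such a table when deciding whether a requested mode can be granted given the modes already held by other threads.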
| |
| It is interesting to note that exclusive locks and reader-writer locks |
| can be emulated by the VAX/VMS DLM\@. |
| Exclusive locks would use only the null and exclusive modes, while |
| reader-writer locks might use the null, protected-read, and |
| protected-write modes. |
| |
| \QuickQuiz{ |
| Is there any other way for the VAX/VMS DLM to emulate |
| a reader-writer lock? |
| }\QuickQuizAnswer{ |
| There are in fact several. |
| One way would be to use the null, protected-read, and exclusive |
| modes. |
| Another way would be to use the null, protected-read, and |
| concurrent-write modes. |
| A third way would be to use the null, concurrent-read, and |
| exclusive modes. |
| }\QuickQuizEnd |
| |
| Although the VAX/VMS DLM policy has seen widespread production use |
| for distributed databases, it does not appear to be used much in |
| shared-memory applications. |
| One possible reason for this is that the greater communication overheads |
| of distributed databases can hide the greater overhead of the |
| VAX/VMS DLM's more-complex admission policy. |
| |
| Nevertheless, the VAX/VMS DLM is an interesting illustration of just |
| how flexible the concepts behind locking can be. |
| It also serves as a very simple introduction to the locking schemes |
| used by modern DBMSes, which can have more than thirty locking modes, |
| compared to VAX/VMS's six. |
| |
| \subsection{Scoped Locking} |
| \label{sec:locking:Scoped Locking} |
| |
| The locking primitives discussed thus far require explicit acquisition and |
| release primitives, for example, \co{spin_lock()} and \co{spin_unlock()}, |
| respectively. |
| Another approach is to use the object-oriented \IXacrmfst{raii} |
| pattern~\cite{MargaretAEllis1990Cplusplus}.\footnote{ |
| Though more clearly expressed at |
| \url{https://www.stroustrup.com/bs_faq2.html\#finally}.} |
| This pattern is often applied to auto variables in languages like C++, |
| where the corresponding \emph{constructor} is invoked upon entry to |
| the object's scope, and the corresponding \emph{destructor} is invoked |
| upon exit from that scope. |
| This can be applied to locking by having the constructor acquire the |
| lock and the destructor free it. |
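The same effect can be approximated in C using the GCC/Clang cleanup attribute, which is also the idea behind the Linux kernel's guard() macros. The following sketch is purely illustrative: the names are hypothetical, and the cleanup attribute is a compiler extension rather than standard C:

```c
#include <stdatomic.h>

typedef atomic_flag spinlock_t;
#define SPINLOCK_INITIALIZER ATOMIC_FLAG_INIT

static void spin_lock(spinlock_t *l)
{
	while (atomic_flag_test_and_set(l))
		continue;
}

static void scoped_unlock(spinlock_t **lp)
{
	atomic_flag_clear(*lp);
}

/* Acquire now, release automatically when the enclosing scope exits. */
#define SCOPED_LOCK(l) \
	spinlock_t *scoped_guard_ __attribute__((cleanup(scoped_unlock))) = \
		(spin_lock(l), (l))

static spinlock_t counter_lock = SPINLOCK_INITIALIZER;
static int counter;

static int inc_counter(void)
{
	SCOPED_LOCK(&counter_lock);
	return ++counter;	/* Lock released on every path out of scope. */
}
```

Here the lock is released when scoped_guard_ goes out of scope, no matter which return path is taken, just as with a C++ destructor.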
| |
| This approach can be quite useful; in fact, in 1990 I was convinced that it |
| was the only type of locking that was needed.\footnote{ |
| My later work with parallelism at Sequent Computer Systems very |
| quickly disabused me of this misguided notion.} |
| One very nice property of RAII locking is that you don't need to carefully |
| release the lock on each and every code path that exits that scope, |
| a property that can eliminate a troublesome set of bugs. |
| |
| However, RAII locking also has a dark side. |
| RAII makes it quite difficult to encapsulate lock acquisition and release, |
| for example, in iterators. |
| In many iterator implementations, you would like to acquire the lock in |
| the iterator's ``start'' function and release it in the iterator's ``stop'' |
| function. |
| RAII locking instead requires that the lock acquisition and release take |
| place in the same level of scoping, making such encapsulation difficult or |
| even impossible. |
| |
| Strict RAII locking also prohibits overlapping critical sections, |
| because scopes must nest. |
| This prohibition makes it difficult or impossible to express a number of |
| useful constructs, for example, locking trees |
| that mediate between multiple concurrent attempts to assert an event. |
| Of an arbitrarily large group of concurrent attempts, only one need succeed, |
| and the best strategy for the remaining attempts is for them to fail as |
| quickly and painlessly as possible. |
| Otherwise, lock contention becomes pathological on large systems |
| (where ``large'' is many hundreds of CPUs). |
| Therefore, C++17~\cite{RichardSmith2019N4800} has escapes from strict RAII |
| in its \co{unique_lock} class, which allows the scope of the critical |
| section to be controlled to roughly the same extent as can be achieved |
| with explicit lock acquisition and release primitives. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{locking/rnplock}} |
| \caption{Locking Hierarchy} |
| \label{fig:locking:Locking Hierarchy} |
| \end{figure} |
| |
| Example strict-RAII-unfriendly data structures from Linux-kernel RCU |
| are shown in |
| \cref{fig:locking:Locking Hierarchy}. |
| Here, each CPU is assigned a leaf \co{rcu_node} structure, and each |
| \co{rcu_node} structure has a pointer to its parent (named, oddly |
| enough, \co{->parent}), up to the root \co{rcu_node} structure, |
| which has a \co{NULL} \co{->parent} pointer. |
| The number of child \co{rcu_node} structures per parent can vary, |
| but is typically 32 or 64. |
| Each \co{rcu_node} structure also contains a lock named \co{->fqslock}. |
| |
| The general approach is a \emph{tournament}, where |
| a given CPU conditionally acquires its |
| leaf \co{rcu_node} structure's \co{->fqslock}, and, if successful, |
| attempts to acquire that of the parent, then releases that of the child. |
| In addition, at each level, the CPU checks a global \co{gp_flags} |
| variable, and if this variable indicates that some other CPU has |
| asserted the event, the first CPU drops out of the competition. |
| This acquire-then-release sequence continues until either the |
| \co{gp_flags} variable indicates that someone else won the tournament, |
| one of the attempts to acquire an \co{->fqslock} fails, or |
| the root \co{rcu_node} structure's \co{->fqslock} has been acquired. |
| If the root \co{rcu_node} structure's \co{->fqslock} is acquired, |
| a function named \co{do_force_quiescent_state()} is invoked. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:locking:Conditional Locking to Reduce Contention] |
| \begin{VerbatimL}[commandchars=\\\[\]] |
| void force_quiescent_state(struct rcu_node *rnp_leaf) |
| { |
| int ret; |
| struct rcu_node *rnp = rnp_leaf; |
| struct rcu_node *rnp_old = NULL; |
| |
| for (; rnp != NULL; rnp = rnp->parent) { \lnlbl[loop:b] |
| ret = (READ_ONCE(gp_flags)) || \lnlbl[flag_set] |
| !raw_spin_trylock(&rnp->fqslock);\lnlbl[trylock] |
| if (rnp_old != NULL) \lnlbl[non_NULL] |
| raw_spin_unlock(&rnp_old->fqslock);\lnlbl[rel1] |
| if (ret) \lnlbl[giveup] |
| return; \lnlbl[return] |
| rnp_old = rnp; \lnlbl[save] |
| } \lnlbl[loop:e] |
| if (!READ_ONCE(gp_flags)) { \lnlbl[flag_not_set] |
| WRITE_ONCE(gp_flags, 1); \lnlbl[set_flag] |
| do_force_quiescent_state(); \lnlbl[invoke] |
| WRITE_ONCE(gp_flags, 0); \lnlbl[clr_flag] |
| } |
| raw_spin_unlock(&rnp_old->fqslock); \lnlbl[rel2] |
| } |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Conditional Locking to Reduce Contention} |
| \label{lst:locking:Conditional Locking to Reduce Contention} |
| \end{listing} |
| |
| Simplified code to implement this is shown in |
| \cref{lst:locking:Conditional Locking to Reduce Contention}. |
| The purpose of this function is to mediate between CPUs that have concurrently |
| detected a need to invoke the \co{do_force_quiescent_state()} function. |
| At any given time, it only makes sense for one instance of |
| \co{do_force_quiescent_state()} to be active, so if there are multiple |
| concurrent callers, we need at most one of them to actually invoke |
| \co{do_force_quiescent_state()}, and we need the rest to (as quickly and |
| painlessly as possible) give up and leave. |
| |
| \begin{fcvref}[ln:locking:Conditional Locking to Reduce Contention] |
| To this end, each pass through the loop spanning \clnrefrange{loop:b}{loop:e} attempts |
| to advance up one level in the \co{rcu_node} hierarchy. |
| If the \co{gp_flags} variable is already set (\clnref{flag_set}) or if the attempt |
| to acquire the current \co{rcu_node} structure's \co{->fqslock} is |
| unsuccessful (\clnref{trylock}), then local variable \co{ret} is set to 1. |
| If \clnref{non_NULL} sees that local variable \co{rnp_old} is non-\co{NULL}, |
| meaning that we hold \co{rnp_old}'s \co{->fqslock}, |
| \clnref{rel1} releases this lock (but only after the attempt has been made |
| to acquire the parent \co{rcu_node} structure's \co{->fqslock}). |
| If \clnref{giveup} sees that either \clnref{flag_set} or~\lnref{trylock} |
| saw a reason to give up, |
| \clnref{return} returns to the caller. |
| Otherwise, we must have acquired the current \co{rcu_node} structure's |
| \co{->fqslock}, so \clnref{save} saves a pointer to this structure in local |
| variable \co{rnp_old} in preparation for the next pass through the loop. |
| |
| If control reaches \clnref{flag_not_set}, we won the tournament, and now hold the |
| root \co{rcu_node} structure's \co{->fqslock}. |
| If \clnref{flag_not_set} still sees that the global variable \co{gp_flags} is zero, |
| \clnref{set_flag} sets \co{gp_flags} to one, \clnref{invoke} invokes |
| \co{do_force_quiescent_state()}, |
| and \clnref{clr_flag} resets \co{gp_flags} back to zero. |
| Either way, \clnref{rel2} releases the root \co{rcu_node} structure's |
| \co{->fqslock}. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| The code in |
| \cref{lst:locking:Conditional Locking to Reduce Contention} |
| is ridiculously complicated! |
| Why not conditionally acquire a single global lock? |
| }\QuickQuizAnswerB{ |
| Conditionally acquiring a single global lock does work very well, |
| but only for relatively small numbers of CPUs. |
| To see why it is problematic in systems with many hundreds of |
| CPUs, look at |
| \cref{fig:count:Atomic Increment Scalability on x86}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| \begin{fcvref}[ln:locking:Conditional Locking to Reduce Contention] |
| Wait a minute! |
| If we ``win'' the tournament on \clnref{flag_not_set} of |
| \cref{lst:locking:Conditional Locking to Reduce Contention}, |
| we get to do all the work of \co{do_force_quiescent_state()}. |
| Exactly how is that a win, really? |
| \end{fcvref} |
| }\QuickQuizAnswerE{ |
| How indeed? |
| This just shows that in concurrency, just as in life, one |
| should take care to learn exactly what winning entails before |
| playing the game. |
| }\QuickQuizEndE |
| } |
| |
| This function illustrates the not-uncommon pattern of hierarchical |
| locking. |
| This pattern is difficult to implement using strict RAII locking,\footnote{ |
| Which is why many RAII locking implementations provide a way |
| 	to leak the lock out of the scope in which it was acquired and into |
| the scope in which it is to be released. |
| However, some object must mediate the scope leaking, which can |
| add complexity compared to non-RAII explicit locking primitives.} |
| just like the iterator encapsulation noted earlier, and so explicit |
| lock/unlock primitives (or C++17-style \co{unique_lock} escapes) will |
| be required for the foreseeable future. |
| |
| \section{Locking Implementation Issues} |
| \label{sec:locking:Locking Implementation Issues} |
| % |
| \epigraph{When you translate a dream into reality, it's never a full |
| implementation. |
| It is easier to dream than to do.} |
| {\emph{Shai Agassi}} |
| |
| Developers are almost always best-served by using whatever locking |
| primitives are provided by the system, for example, the POSIX |
| pthread mutex locks~\cite{OpenGroup1997pthreads,Butenhof1997pthreads}. |
| Nevertheless, studying sample implementations can be helpful, |
| as can considering the challenges posed by extreme workloads and |
| environments. |
| |
| \subsection{Sample Exclusive-Locking Implementation Based on Atomic Exchange} |
| \label{sec:locking:Sample Exclusive-Locking Implementation Based on Atomic Exchange} |
| |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock] |
| This section reviews the implementation shown in |
| \cref{lst:locking:Sample Lock Based on Atomic Exchange}. |
| The data structure for this lock is just an \co{int}, as shown on |
| \clnref{typedef}, but could be any integral type. |
| The initial value of this lock is zero, meaning ``unlocked'', |
| as shown on \clnref{initval}. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/locking/xchglock@lock_unlock.fcv} |
| \caption{Sample Lock Based on Atomic Exchange} |
| \label{lst:locking:Sample Lock Based on Atomic Exchange} |
| \end{listing} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock] |
| Why not rely on the C language's default initialization of |
| zero instead of using the explicit initializer shown on |
| \clnref{initval} of |
| \cref{lst:locking:Sample Lock Based on Atomic Exchange}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| Because this default initialization does not apply to locks |
| allocated as auto variables within the scope of a function. |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock:lock] |
| Lock acquisition is carried out by the \co{xchg_lock()} function |
| shown on \clnrefrange{b}{e}. |
| This function uses a nested loop, with the outer loop repeatedly |
| atomically exchanging the value of the lock with the value one |
| (meaning ``locked''). |
| If the old value was already the value one (in other words, someone |
| else already holds the lock), then the inner loop (\clnrefrange{inner:b}{inner:e}) |
| spins until the lock is available, at which point the outer loop |
| makes another attempt to acquire the lock. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock:lock] |
| Why bother with the inner loop on \clnrefrange{inner:b}{inner:e} of |
| \cref{lst:locking:Sample Lock Based on Atomic Exchange}? |
| Why not simply repeatedly do the atomic exchange operation |
| on \clnref{atmxchg}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock:lock] |
| Suppose that the lock is held and that several threads |
| are attempting to acquire the lock. |
| In this situation, if these threads all loop on the atomic |
| exchange operation, they will ping-pong the cache line |
| containing the lock among themselves, imposing load |
| on the interconnect. |
| In contrast, if these threads are spinning in the inner |
| loop on \clnrefrange{inner:b}{inner:e}, |
| they will each spin within their own |
| caches, placing negligible load on the interconnect. |
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock:unlock] |
| Lock release is carried out by the \co{xchg_unlock()} function |
| shown on \clnrefrange{b}{e}. |
| \Clnref{atmxchg} atomically exchanges the value zero (``unlocked'') into |
| the lock, thus marking it as having been released. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:locking:xchglock:lock_unlock:unlock] |
| Why not simply store zero into the lock word on \clnref{atmxchg} of |
| \cref{lst:locking:Sample Lock Based on Atomic Exchange}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| This can be a legitimate implementation, but only if |
| this store is preceded by a memory barrier and makes use |
| of \co{WRITE_ONCE()}. |
| The memory barrier is not required when the \co{xchg()} |
| operation is used because this operation implies a |
| full memory barrier due to the fact that it returns |
| a value. |
| }\QuickQuizEnd |
| |
| This lock is a simple example of a test-and-set lock~\cite{Segall84}, |
| but very similar |
| mechanisms have been used extensively as pure spinlocks in production. |
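For reference, the test-and-test-and-set pattern just described might be sketched in C11 atomics as follows. This is a hedged approximation of the book's listing, which instead uses the Linux-kernel-style xchg() primitive:

```c
#include <stdatomic.h>

typedef atomic_int xchglock_t;
#define XCHGLOCK_INITIALIZER 0

static void xchg_lock(xchglock_t *xp)
{
	/* Outer loop: attempt acquisition via atomic exchange. */
	while (atomic_exchange(xp, 1)) {
		/* Inner loop: spin read-only in our own cache. */
		while (atomic_load(xp))
			continue;
	}
}

static void xchg_unlock(xchglock_t *xp)
{
	/* The exchange implies full ordering, unlike a plain store. */
	(void)atomic_exchange(xp, 0);
}
```

The inner loop performs only loads, so contending threads spin on read-only copies of the cache line rather than ping-ponging it among themselves, as discussed in the earlier Quick Quiz.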
| |
| \subsection{Other Exclusive-Locking Implementations} |
| \label{sec:locking:Other Exclusive-Locking Implementations} |
| |
| There are a great many other possible implementations of locking based |
| on atomic instructions, many of which are reviewed in the classic paper |
| by Mellor-Crummey and Scott~\cite{MellorCrummey91a}. |
| These implementations represent different points in a multi-dimensional |
| design tradeoff~\cite{Guerraoui:2019:LPA:3319851.3301501,HugoGuirouxPhD,McKenney96a}. |
| For example, |
| the atomic-exchange-based test-and-set lock presented in the previous |
| section works well when contention is low and has the advantage |
| of small memory footprint. |
| It avoids giving the lock to threads that cannot use it, but as |
| a result can suffer from \IX{unfairness} or even \IX{starvation} at high |
| contention levels. |
| |
| In contrast, the ticket lock~\cite{MellorCrummey91a}, which was once used |
| in the Linux kernel, avoids unfairness at high contention levels. |
| However, as a consequence of its strict FIFO discipline, it can grant |
| the lock to a thread that is currently unable to use it, perhaps due |
| to that thread being preempted or interrupted. |
| On the other hand, it is important to avoid getting too worried about the |
| possibility of preemption and interruption. |
| After all, in many cases, this preemption and interruption could just |
| as well happen just after the lock was |
| acquired.\footnote{ |
| Besides, the best way of handling high lock contention is to avoid |
| it in the first place! |
| There are nevertheless some situations where high lock contention |
| is the lesser of the available evils, and in any case, studying |
| schemes that deal with high levels of contention is a good mental |
| exercise.} |
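A minimal ticket lock can be sketched in a few lines of C11 atomics (a hypothetical stripped-down version that ignores the padding and proportional-backoff concerns of production implementations):

```c
#include <stdatomic.h>

typedef struct {
	atomic_uint next;	/* Next ticket to hand out. */
	atomic_uint owner;	/* Ticket currently being served. */
} ticketlock_t;

static void ticket_lock(ticketlock_t *l)
{
	unsigned int my = atomic_fetch_add(&l->next, 1);

	/* All waiters spin on the same ->owner field. */
	while (atomic_load(&l->owner) != my)
		continue;
}

static void ticket_unlock(ticketlock_t *l)
{
	/* Hand the lock off to the holder of the next ticket. */
	atomic_fetch_add(&l->owner, 1);
}
```

Each acquisition takes a ticket from ->next and waits for ->owner to reach it, so the lock is granted in strict FIFO order, for better or worse, including to a ticket holder that has since been preempted.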
| |
| All locking implementations where waiters spin on a single memory |
| location, including both test-and-set locks and ticket locks, |
| suffer from performance problems at high contention levels. |
| The problem is that the thread releasing the lock must update the |
| value of the corresponding memory location. |
| At low contention, this is not a problem: |
| The corresponding cache line is very likely still local to and |
| writeable by the thread holding the lock. |
| In contrast, at high levels of contention, each thread attempting to |
| acquire the lock will have a read-only copy of the \IX{cache line}, and |
| the lock holder will need to invalidate all such copies before it |
| can carry out the update that releases the lock. |
| In general, the more CPUs and threads there are, the greater the |
| overhead incurred when releasing the lock under conditions of |
| high contention. |
| |
| This negative scalability has motivated a number of different |
| queued-lock |
| implementations~\cite{Anderson90,Graunke90,MellorCrummey91a,Wisniewski94,Craig93,Magnusson94,Takada93}, |
| some of which are used in recent versions of the Linux |
| kernel~\cite{JonathanCorbet2014qspinlocks}. |
| Queued locks avoid high cache-invalidation overhead by assigning each |
| thread a queue element. |
| These queue elements are linked together into a queue that governs the |
| order that the lock will be granted to the waiting threads. |
| The key point is that each thread spins on its own queue element, |
| so that the lock holder need only invalidate the first element from |
| the next thread's CPU's cache. |
| This arrangement greatly reduces the overhead of lock handoff at high |
| levels of contention. |
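For concreteness, here is a hedged C11 sketch of the classic MCS queued lock due to Mellor-Crummey and Scott (names hypothetical; production designs such as the Linux kernel's qspinlock add considerable refinement):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_int locked;
};

/* The lock itself points at the tail of the waiter queue. */
typedef _Atomic(struct mcs_node *) mcs_lock_t;

static void mcs_lock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *pred;

	atomic_store(&node->next, (struct mcs_node *)NULL);
	atomic_store(&node->locked, 1);
	pred = atomic_exchange(lock, node);	/* Append self to the queue. */
	if (pred) {
		atomic_store(&pred->next, node);
		while (atomic_load(&node->locked))
			continue;	/* Spin on our own queue element. */
	}
}

static void mcs_unlock(mcs_lock_t *lock, struct mcs_node *node)
{
	struct mcs_node *succ = atomic_load(&node->next);

	if (!succ) {
		struct mcs_node *expected = node;

		/* No known successor: try to swing the tail back to NULL. */
		if (atomic_compare_exchange_strong(lock, &expected,
						   (struct mcs_node *)NULL))
			return;
		while (!(succ = atomic_load(&node->next)))
			continue;	/* A successor is linking itself in. */
	}
	atomic_store(&succ->locked, 0);	/* Hand the lock off. */
}
```

Each waiter spins on the ->locked field of its own queue element, so a lock handoff need invalidate only the successor's cache line.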
| |
| More recent queued-lock implementations also take the system's architecture |
| into account, preferentially granting locks locally, while also taking |
| steps to avoid |
| starvation~\cite{McKenney02e,radovic03hierarchical,radovic02efficient,BenJackson02,McKenney02d}. |
| Many of these can be thought of as analogous to the elevator algorithms |
| traditionally used in scheduling disk I/O. |
| |
| Unfortunately, the same scheduling logic that improves the \IX{efficiency} |
| of queued locks at high contention also increases their overhead at |
| low contention. |
| Beng-Hong Lim and Anant Agarwal therefore combined a simple test-and-set |
| lock with a queued lock, using the test-and-set lock at low levels of |
| contention and switching to the queued lock at high levels of |
| contention~\cite{BengHongLim94}, thus getting low overhead at low levels |
| of contention and getting fairness and high throughput at high levels |
| of contention. |
| Browning et al.\ took a similar approach, but avoided the use of a separate |
| flag, so that the test-and-set fast path uses the same sequence of |
| instructions that would be used in a simple test-and-set |
| lock~\cite{LukeBrowning2005SimpleLockNUMAAware}. |
| This approach has been used in production. |
| |
| Another issue that arises at high levels of contention is delay of the |
| lock holder, especially delay due to preemption, which can result in |
| \emph{priority inversion}: |
| A low-priority thread holding a lock is preempted by a medium-priority |
| CPU-bound thread, which in turn blocks a high-priority process |
| attempting to acquire that lock. |
| The result is that the CPU-bound medium-priority process prevents the |
| high-priority process from running. |
| One solution is \emph{priority inheritance}~\cite{Lampson1980Mesa}, |
| which has been widely used for real-time |
| computing~\cite{LuiSha1990PriorityInheritance,JonathanCorbet2006PriorityInheritance}, |
| despite some lingering controversy over this |
| practice~\cite{Yodaiken2004FSM,DougLocke2002a}. |
| |
| Another way to avoid priority inversion is to prevent preemption |
| while a lock is held. |
| Because preventing preemption while locks are held also improves throughput, |
| most proprietary UNIX kernels offer some form of scheduler-conscious |
| synchronization mechanism~\cite{Kontothanassis97a}, |
| largely due to the efforts of a certain sizable database vendor. |
| These mechanisms usually take the form of a hint that preemption |
| should be avoided in a given region of code, with the hint frequently |
| being a bit set in a particular machine register, an encoding that |
| enables extremely low per-lock-acquisition overhead. |
| In contrast, Linux avoids these hints, instead getting |
| similar results from a mechanism called |
| \emph{futexes}~\cite{HubertusFrancke2002Futex,IngoMolnar2006RobustFutexes,StevenRostedt2006piFutexes,UlrichDrepper2011Futexes}. |
| |
| Interestingly enough, atomic instructions are not strictly needed to |
| implement locks~\cite{Dijkstra65a,Lamport74a}. |
| An excellent exposition of the issues surrounding locking implementations |
| based on simple loads and stores may be found in Herlihy's and |
| Shavit's textbook~\cite{HerlihyShavit2008Textbook,HerlihyShavit2020Textbook}. |
| The main point echoed here is that such implementations currently |
| have little practical application, although a careful study of |
| them can be both entertaining and enlightening. |
| Nevertheless, with one exception described below, such study is left |
| as an exercise for the reader. |
| |
| Gamsa et al.~\cite[Section 5.3]{Gamsa99} describe a token-based |
| mechanism in which a token circulates among the CPUs. |
| When the token reaches a given CPU, it has exclusive |
| access to anything protected by that token. |
| There are any number of schemes that may be used to implement |
| the token-based mechanism, for example: |
| |
| \begin{enumerate} |
| \item Maintain a per-CPU flag, which is initially |
| zero for all but one CPU\@. |
| When a CPU's flag is non-zero, it holds the token. |
| When it finishes with the token, it zeroes its flag and |
| sets the flag of the next CPU to one (or to any other |
| non-zero value). |
| \item Maintain a per-CPU counter, which is initially set |
| to the corresponding CPU's number, which we assume |
| to range from zero to $N-1$, where $N$ is the number |
| of CPUs in the system. |
| When a CPU's counter is greater than that of the next |
| CPU (taking counter wrap into account), the first CPU holds the token. |
| When it is finished with the token, it sets the next |
| CPU's counter to a value one greater than its own counter. |
| \end{enumerate} |
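The first, flag-based scheme might be sketched as follows (a hypothetical four-CPU illustration; a production version would place each flag in its own cache line):

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* CPU 0 starts out holding the token; all other flags are zero. */
static atomic_int token_flag[NR_CPUS] = { 1 };

static int have_token(int cpu)
{
	return atomic_load(&token_flag[cpu]);
}

static void pass_token(int cpu)
{
	atomic_store(&token_flag[cpu], 0);
	atomic_store(&token_flag[(cpu + 1) % NR_CPUS], 1);
}
```

A CPU needing the critical section simply waits until have_token() returns nonzero for it, does its work, and then calls pass_token() to send the token on its way.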
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How can you tell if one counter is greater than another, |
| while accounting for counter wrap? |
| }\QuickQuizAnswerB{ |
| In the C language, the following macro correctly handles this: |
| |
| \begin{VerbatimU} |
| #define ULONG_CMP_LT(a, b) \ |
| (ULONG_MAX / 2 < (a) - (b)) |
| \end{VerbatimU} |
| |
| Although it is tempting to simply subtract two signed integers, |
| this should be avoided because signed overflow is undefined |
| in the C language. |
| For example, if the compiler knows that one of the values is |
| positive and the other negative, it is within its rights to |
| simply assume that the positive number is greater than the |
| negative number, even though subtracting the negative number |
| from the positive number might well result in overflow and |
| thus a negative number. |
| |
| How could the compiler know the signs of the two numbers? |
| It might be able to deduce it based on prior assignments |
| and comparisons. |
| In this case, if the per-CPU counters were signed, the compiler |
| could deduce that they were always increasing in value, and |
| then might assume that they would never go negative. |
| This assumption could well lead the compiler to generate |
| unfortunate code~\cite{PaulEMcKenney2012SignedOverflow,JohnRegehr2010UndefinedBehavior}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| Which is better, the counter approach or the flag approach? |
| }\QuickQuizAnswerE{ |
| The flag approach will normally suffer fewer cache misses, |
| but a better answer is to try both and see which works best |
| for your particular workload. |
| }\QuickQuizEndE |
| } |
| |
| This lock is unusual in that a given CPU cannot necessarily acquire it |
| immediately, even if no other CPU is using it at the moment. |
| Instead, the CPU must wait until the token comes around to it. |
| This is useful in cases where CPUs need periodic access to the \IX{critical |
| section}, but can tolerate variances in token-circulation rate. |
| Gamsa et al.~\cite{Gamsa99} used it to implement a variant of |
| read-copy update (see \cref{sec:defer:Read-Copy Update (RCU)}), |
| but it could also be used to protect periodic per-CPU operations such |
| as flushing per-CPU caches used by memory allocators~\cite{McKenney93}, |
| garbage-collecting per-CPU data structures, or flushing per-CPU |
| data to shared storage (or to mass storage, for that matter). |
| |
| The Linux kernel now uses queued spinlocks~\cite{JonathanCorbet2014qspinlocks}, |
| but because of the complexity of implementations that provide good |
| performance across the range of contention levels, the path has not |
| always been smooth~\cite{CatalinMarinas2018qspinlockTLA,WillDeacon2018qspinlock}. |
| As increasing numbers of people gain familiarity with parallel hardware |
| and parallelize increasing amounts of code, we can continue to expect more |
| special-purpose locking primitives to appear; see, for example, Guerraoui et |
| al.~\cite{Guerraoui:2019:LPA:3319851.3301501,HugoGuirouxPhD}. |
| Nevertheless, you should carefully consider this important safety tip: |
| Use the standard synchronization primitives whenever humanly possible. |
| The big advantage of the standard synchronization primitives over |
| roll-your-own efforts is that the standard primitives are typically |
| \emph{much} less bug-prone.\footnote{ |
| And yes, I have done at least my share of roll-your-own |
| synchronization primitives. |
| However, you will notice that my hair is much greyer than |
| it was before I started doing that sort of work. |
| Coincidence? |
| Maybe. |
| But are you \emph{really} willing to risk your own hair turning |
| prematurely grey?} |
| |
| \input{locking/locking-existence} |
| |
| \section{Locking: |
| Hero or Villain?} |
| \label{sec:locking:Locking: Hero or Villain?} |
| % |
| \epigraph{You either die a hero or you live long enough to see yourself |
| become the villain.} |
| {\emph{Aaron Eckhart} as \emph{Harvey Dent}} |
| |
| As is often the case in real life, locking can be either hero or villain, |
| depending on how it is used and on the problem at hand. |
| In my experience, those writing whole applications are happy with |
| locking, those writing parallel libraries are less happy, and those |
| parallelizing existing sequential libraries are extremely unhappy. |
| The following sections discuss some reasons for these differences in |
| viewpoints. |
| |
| \subsection{Locking For Applications: |
| Hero!} |
| \label{sec:locking:Locking For Applications: Hero!} |
| |
| When writing an entire application (or entire kernel), developers have |
| full control of the design, including the synchronization design. |
| Assuming that the design makes good use of partitioning, as discussed in |
| \cref{cha:Partitioning and Synchronization Design}, locking |
| can be an extremely effective synchronization mechanism, as demonstrated |
| by the heavy use of locking in production-quality parallel software. |
| |
| Nevertheless, although such software usually bases most of its |
| synchronization design on locking, such software also almost always |
| makes use of other synchronization mechanisms, including |
| special counting algorithms (\cref{chp:Counting}), |
| data ownership (\cref{chp:Data Ownership}), |
| reference counting (\cref{sec:defer:Reference Counting}), |
| hazard pointers (\cref{sec:defer:Hazard Pointers}), |
| sequence locking (\cref{sec:defer:Sequence Locks}), and |
| read-copy update (\cref{sec:defer:Read-Copy Update (RCU)}). |
| In addition, practitioners use tools for deadlock |
| detection~\cite{JonathanCorbet2006lockdep}, |
| lock acquisition/release balancing~\cite{JonathanCorbet2004sparse}, |
| cache-miss analysis~\cite{ValgrindHomePage}, |
| hardware-counter-based profiling~\cite{LinuxKernelPerfWiki,OProfileHomePage}, |
| and many more besides. |
| |
| Given careful design, use of a good combination of synchronization |
| mechanisms, and good tooling, locking works quite well for applications |
| and kernels. |
| |
| \subsection{Locking For Parallel Libraries: |
| Just Another Tool} |
| \label{sec:locking:Locking For Parallel Libraries: Just Another Tool} |
| |
| Unlike applications and kernels, the designer of a library cannot |
| know the locking design of the code that the library will be interacting |
| with. |
| In fact, that code might not be written for years to come. |
| Library designers therefore have less control and must exercise more |
| care when laying out their synchronization design. |
| |
| Deadlock is of course of particular concern, and the techniques discussed |
| in \cref{sec:locking:Deadlock} need to be applied. |
One popular deadlock-avoidance strategy is therefore to ensure that
the library's locks form independent subtrees of the enclosing program's
locking hierarchy.
| However, this can be harder than it looks. |
| |
| One complication was discussed in |
| \cref{sec:locking:Local Locking Hierarchies}, namely |
| when library functions call into application code, with \co{qsort()}'s |
| comparison-function argument being a case in point. |
| Another complication is the interaction with signal handlers. |
| If an application signal handler is invoked from a signal received within |
| the library function, deadlock can ensue just as surely as |
| if the library function had called the signal handler directly. |
| A final complication occurs for those library functions that can be used |
| between a \co{fork()}/\co{exec()} pair, for example, due to use of |
| the \co{system()} function. |
| In this case, if your library function was holding a lock at the time of |
| the \co{fork()}, then the child process will begin life with that lock held. |
| Because the thread that will release the lock is running in the parent |
| but not the child, if the child calls your library function, deadlock |
| will ensue. |
| |
| The following strategies may be used to avoid deadlock problems in these cases: |
| |
| \begin{enumerate} |
| \item Don't use either callbacks or signals. |
| \item Don't acquire locks from within callbacks or signal handlers. |
| \item Let the caller control synchronization. |
\item	Parameterize the library API to delegate locking to the caller.
| \item Explicitly avoid callback deadlocks. |
| \item Explicitly avoid signal-handler deadlocks. |
| \item Avoid invoking \co{fork()}. |
| \end{enumerate} |
| |
| Each of these strategies is discussed in one of the following sections. |
| |
| \subsubsection{Use Neither Callbacks Nor Signals} |
| \label{sec:locking:Use Neither Callbacks Nor Signals} |
| |
| If a library function avoids callbacks and the application as a whole |
| avoids signals, then any locks acquired by that library function will |
| be leaves of the locking-hierarchy tree. |
| This arrangement avoids deadlock, as discussed in |
| \cref{sec:locking:Locking Hierarchies}. |
| Although this strategy works extremely well where it applies, |
| there are some applications that must use signal handlers, |
| and there are some library functions (such as the \co{qsort()} function |
| discussed in |
| \cref{sec:locking:Local Locking Hierarchies}) |
| that require callbacks. |
| |
| The strategy described in the next section can often be used in these cases. |
| |
| \subsubsection{Avoid Locking in Callbacks and Signal Handlers} |
| \label{sec:locking:Avoid Locking in Callbacks and Signal Handlers} |
| |
If neither callbacks nor signal handlers acquire locks, then they
cannot be involved in deadlock cycles, so that straightforward
locking hierarchies can once again treat library functions as
leaves of the locking-hierarchy tree.
| This strategy works very well for most uses of \co{qsort}, whose |
| callbacks usually simply compare the two values passed in to them. |
| This strategy also works wonderfully for many signal handlers, |
| especially given that acquiring locks from within signal handlers |
| is generally frowned upon~\cite{TheOpenGroup1997SUS},\footnote{ |
| But the standard's words do not stop clever coders from creating |
| their own home-brew locking primitives from atomic operations.} |
| but can fail if the application needs to manipulate complex data structures |
| from a signal handler. |
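For concreteness, here is the usual lock-free form of a \co{qsort()} comparison callback: it reads only its two arguments and acquires nothing, so it cannot appear in any deadlock cycle.

```c
#include <stdlib.h>

/* A qsort() comparison callback that acquires no locks: it touches
 * only the two elements handed to it, so any locks held by qsort()'s
 * caller remain leaves of the locking hierarchy. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a;
	int y = *(const int *)b;

	return (x > y) - (x < y);	/* avoids overflow of x - y */
}
```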
| |
| Here are some ways to avoid acquiring locks in signal handlers even |
| if complex data structures must be manipulated: |
| |
| \begin{enumerate} |
| \item Use simple data structures based on \IXacrl{nbs}, |
| as will be discussed in |
| \cref{sec:advsync:Simple NBS}. |
| \item If the data structures are too complex for reasonable use of |
| non-blocking synchronization, create a queue that allows |
| non-blocking enqueue operations. |
| In the signal handler, instead of manipulating the complex |
| data structure, add an element to the queue describing the |
| required change. |
| A separate thread can then remove elements from the queue and |
| carry out the required changes using normal locking. |
| There are a number of readily available implementations of |
| concurrent |
| queues~\cite{ChristophMKirsch2012FIFOisntTR,MathieuDesnoyers2009URCU,MichaelScott96}. |
| \end{enumerate} |
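The enqueue side of the second approach can be sketched with C11 atomics as follows. The names are hypothetical, and a LIFO push is used for brevity; the draining thread can reverse the list if ordering matters. The key property is that the signal handler performs a single lock-free operation and acquires no locks:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical deferred-update record: the signal handler describes
 * the needed change here instead of applying it directly. */
struct deferred_op {
	struct deferred_op *next;
	int opcode;			/* which change to apply later */
};

static struct deferred_op *_Atomic pending;

/* Async-signal-safe enqueue: one lock-free push, no locks taken. */
static void defer_op(struct deferred_op *op)
{
	struct deferred_op *old;

	old = atomic_load_explicit(&pending, memory_order_relaxed);
	do {
		op->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&pending, &old, op,
							memory_order_release,
							memory_order_relaxed));
}

/* Run by an ordinary thread: detach the whole list at once, then
 * apply each deferred change under normal locking. */
static struct deferred_op *defer_drain(void)
{
	return atomic_exchange_explicit(&pending, NULL,
					memory_order_acquire);
}
```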
| |
| This strategy should be enforced with occasional manual or (preferably) |
| automated inspections of callbacks and signal handlers. |
| When carrying out these inspections, be wary of clever coders who |
| might have (unwisely) created home-brew locks from atomic operations. |
| |
| \subsubsection{Caller Controls Synchronization} |
| \label{sec:locking:Caller Controls Synchronization} |
| |
| Letting the caller control synchronization |
| works extremely well when the library functions are operating |
| on independent caller-visible instances of a data structure, each |
| of which may be synchronized separately. |
| For example, if the library functions operate on a search tree, |
| and if the application needs a large number of independent search |
| trees, then the application can associate a lock with each tree. |
| The application then acquires and releases locks as needed, so |
| that the library need not be aware of parallelism at all. |
| Instead, the application controls the parallelism, so that locking |
| can work very well, as was discussed in |
| \cref{sec:locking:Locking For Applications: Hero!}. |
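This pattern can be sketched as follows, with the library's search structure stood in for by a trivially simple sorted-insertion stub (all names are hypothetical). The point is that the library code contains no synchronization whatsoever; the application wraps each independent instance with its own lock:

```c
#include <pthread.h>

/* Stand-in for a sequential library's search structure: a tiny
 * fixed-size array standing in for a real search tree. */
struct tree {
	int keys[16];
	int n;
};

static void tree_insert(struct tree *t, int key)
{
	t->keys[t->n++] = key;		/* sequential library code */
}

static int tree_lookup(struct tree *t, int key)
{
	for (int i = 0; i < t->n; i++)
		if (t->keys[i] == key)
			return 1;
	return 0;
}

/* Application-side wrapper: one lock per independent tree instance. */
struct locked_tree {
	pthread_mutex_t lock;
	struct tree t;
};

static void locked_insert(struct locked_tree *lt, int key)
{
	pthread_mutex_lock(&lt->lock);	/* application-controlled locking */
	tree_insert(&lt->t, key);	/* library stays sequential */
	pthread_mutex_unlock(&lt->lock);
}

static int locked_lookup(struct locked_tree *lt, int key)
{
	int ret;

	pthread_mutex_lock(&lt->lock);
	ret = tree_lookup(&lt->t, key);
	pthread_mutex_unlock(&lt->lock);
	return ret;
}
```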
| |
| However, this strategy fails if the |
| library implements a data structure that requires internal |
| concurrency, for example, a hash table or a parallel sort. |
| In this case, the library absolutely must control its own |
| synchronization. |
| |
| \subsubsection{Parameterize Library Synchronization} |
| \label{sec:locking:Parameterize Library Synchronization} |
| |
| The idea here is to add arguments to the library's API to specify |
| which locks to acquire, how to acquire and release them, or both. |
| This strategy allows the application to take on the global task of |
| avoiding deadlock by specifying which locks to acquire (by passing in |
| pointers to the locks in question) and how to |
| acquire them (by passing in pointers to lock acquisition and release |
| functions), |
| but also allows a given library function to control its own concurrency |
| by deciding where the locks should be acquired and released. |
| |
| In particular, this strategy allows the lock acquisition and release |
| functions to block signals as needed without the library code needing to |
| be concerned with which signals need to be blocked by which locks. |
| The separation of concerns used by this strategy can be quite effective, |
| but in some cases the strategies laid out in the following sections |
| can work better. |
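A minimal sketch of such a parameterized API follows (all names are hypothetical): the caller supplies the lock and the acquire/release functions, while the library decides where each critical section begins and ends.

```c
#include <pthread.h>

/* Hypothetical parameterized API: the caller says which lock to use
 * and how to acquire and release it; the library decides where the
 * critical sections are placed. */
struct sync_ops {
	void *lock;			/* caller's lock, opaque here */
	void (*acquire)(void *lock);
	void (*release)(void *lock);
};

static int lib_counter;			/* library-internal shared state */

static void lib_increment(struct sync_ops *ops)
{
	ops->acquire(ops->lock);	/* library places the acquisition */
	lib_counter++;
	ops->release(ops->lock);
}

static int lib_read(struct sync_ops *ops)
{
	int ret;

	ops->acquire(ops->lock);
	ret = lib_counter;
	ops->release(ops->lock);
	return ret;
}

/* Caller-side glue for a pthread_mutex_t.  The caller could instead
 * supply functions that also block signals, without the library
 * knowing anything about either mutexes or signals. */
static void mutex_acquire(void *l) { pthread_mutex_lock(l); }
static void mutex_release(void *l) { pthread_mutex_unlock(l); }
```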
| |
| That said, passing explicit pointers to locks to external APIs must |
| be very carefully considered, as discussed in |
| \cref{sec:locking:Locking Hierarchies and Pointers to Locks}. |
| Although this practice is sometimes the right thing to do, you should do |
| yourself a favor by looking into alternative designs first. |
| |
| \subsubsection{Explicitly Avoid Callback Deadlocks} |
| \label{sec:locking:Explicitly Avoid Callback Deadlocks} |
| |
| The basic rule behind this strategy was discussed in |
| \cref{sec:locking:Local Locking Hierarchies}: |
| ``Release all locks before invoking unknown code.'' |
| This is usually the best approach because it allows the application to |
| ignore the library's locking hierarchy: |
| The library remains a leaf or isolated subtree of the application's |
| overall locking hierarchy. |
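The rule can be followed by snapshotting whatever state the callback needs while holding the lock, then dropping the lock before invoking the callback. A minimal sketch, with hypothetical names and a string standing in for the protected state:

```c
#include <pthread.h>
#include <string.h>

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static char cfg_name[32] = "default";	/* hypothetical shared state */

/* Copy the needed state under the lock, then release the lock
 * before invoking the caller-supplied (unknown) function, so that
 * this library remains a leaf of any locking hierarchy. */
static void with_config(void (*cb)(const char *name))
{
	char snap[sizeof(cfg_name)];

	pthread_mutex_lock(&cfg_lock);
	strcpy(snap, cfg_name);		/* snapshot what the callback needs */
	pthread_mutex_unlock(&cfg_lock);

	cb(snap);			/* unknown code runs lock-free */
}

/* Example callback: records the name it is given. */
static char last_seen[32];
static void record_name(const char *name)
{
	strcpy(last_seen, name);
}
```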
| |
| In cases where it is not possible to release all locks before invoking |
| unknown code, the layered locking hierarchies described in |
| \cref{sec:locking:Layered Locking Hierarchies} can work well. |
For example, if the unknown code is a signal handler, this approach
requires the library function to block signals across all lock
acquisitions, which can be complex and slow.
| Therefore, in cases where signal handlers (probably unwisely) acquire |
| locks, the strategies in the next section may prove helpful. |
| |
| \subsubsection{Explicitly Avoid Signal-Handler Deadlocks} |
| \label{sec:locking:Explicitly Avoid Signal-Handler Deadlocks} |
| |
| Suppose that a given library function is known to acquire locks, |
| but does not block signals. |
| Suppose further that it is necessary to invoke that function both from |
| within and outside of a signal handler, and that it is not permissible |
| to modify this library function. |
| Of course, if no special action is taken, then if a signal arrives |
| while that library function is holding its lock, deadlock can occur |
| when the signal handler invokes that same library function, |
| which in turn attempts to re-acquire that same lock. |
| |
| Such deadlocks can be avoided as follows: |
| |
| \begin{enumerate} |
| \item If the application invokes the library function from |
| within a signal handler, then that signal must be blocked |
| every time that the library function is invoked from outside |
| of a signal handler. |
| \item If the application invokes the library function |
| while holding a lock acquired within a given signal |
| handler, then that signal must be blocked every time that the |
| library function is called outside of a signal handler. |
| \end{enumerate} |
| |
| These rules can be enforced by using tools similar to |
| the Linux kernel's lockdep lock dependency |
| checker~\cite{JonathanCorbet2006lockdep}. |
| One of the great strengths of lockdep is that it is not fooled by |
| human intuition~\cite{StevenRostedt2011locdepCryptic}. |
| |
| \subsubsection{Library Functions Used Between \tco{fork()} and \tco{exec()}} |
| \label{sec:locking:Library Functions Used Between fork() and exec()} |
| |
| As noted earlier, if a thread executing a library function is holding |
| a lock at the time that some other thread invokes \apipx{fork()}, the |
| fact that the parent's memory is copied to create the child means that |
| this lock will be born held in the child's context. |
| The thread that will release this lock is running in the parent, but not |
| in the child, which means that the child's copy of this lock will never |
| be released. |
| Therefore, any attempt on the part of the child to invoke that same |
| library function will result in deadlock. |
| |
| A pragmatic and straightforward way of solving this problem is |
| to \co{fork()} a child process while the process is still single-threaded, |
| and have this child process remain single-threaded. |
| Requests to create further child processes can then be communicated |
| to this initial child process, which can safely carry out any |
| needed \co{fork()} and \apipx{exec()} system calls on behalf of its |
| multi-threaded parent process. |
| |
Another, rather less pragmatic and straightforward, solution to this problem
is to have the library function check whether the owner of the lock
is still running and, if not, ``break'' the lock by re-initializing
and then acquiring it.
However, this approach has a couple of vulnerabilities:
| |
| \begin{enumerate} |
| \item The data structures protected by that lock are likely to |
| be in some intermediate state, so that naively breaking the lock |
| might result in arbitrary memory corruption. |
| \item If the child creates additional threads, two threads might |
| break the lock concurrently, with the result that both |
| threads believe they own the lock. |
| This could again result in arbitrary memory corruption. |
| \end{enumerate} |
| |
| The \apipx{pthread_atfork()} function is provided to help deal with these situations. |
| The idea is to register a triplet of functions, one to be called by the |
| parent before the \co{fork()}, one to be called by the parent after the |
| \co{fork()}, and one to be called by the child after the \co{fork()}. |
| Appropriate cleanups can then be carried out at these three points. |
| |
| Be warned, however, that coding of \co{pthread_atfork()} handlers is quite subtle |
| in general. |
| The cases where \co{pthread_atfork()} works best are cases where the data structure |
| in question can simply be re-initialized by the child. |
| |
| \subsubsection{Parallel Libraries: |
| Discussion} |
| \label{sec:locking:Parallel Libraries: Discussion} |
| |
| Regardless of the strategy used, the description of the library's API |
| must include a clear description of that strategy and how the caller |
| should interact with that strategy. |
| In short, constructing parallel libraries using locking is possible, |
| but not as easy as constructing a parallel application. |
| |
| \subsection{Locking For Parallelizing Sequential Libraries: |
| Villain!} |
| \label{sec:locking:Locking For Parallelizing Sequential Libraries: Villain!} |
| |
| With the advent of readily available low-cost multicore systems, |
| a common task is parallelizing an existing library that was designed |
| with only single-threaded use in mind. |
| This all-too-common disregard for parallelism can result in a library |
| API that is severely flawed from a parallel-programming viewpoint. |
| Candidate flaws include: |
| |
| \begin{enumerate} |
| \item Implicit prohibition of partitioning. |
| \item Callback functions requiring locking. |
| \item Object-oriented spaghetti code. |
| \end{enumerate} |
| |
| These flaws and the consequences for locking are discussed in the following |
| sections. |
| |
| \subsubsection{Partitioning Prohibited} |
| \label{sec:locking:Partitioning Prohibited} |
| |
| Suppose that you were writing a single-threaded hash-table implementation. |
| It is easy and fast to maintain an exact count of the total number of items |
| in the hash table, and also easy and fast to return this exact count on each |
| addition and deletion operation. |
| So why not? |
| |
| One reason is that exact counters do not perform or scale well on |
| multicore systems, as was |
| seen in \cref{chp:Counting}. |
| As a result, the parallelized implementation of the hash table will not |
| perform or scale well. |
| |
| So what can be done about this? |
| One approach is to return an approximate count, using one of the algorithms |
| from \cref{chp:Counting}. |
| Another approach is to drop the element count altogether. |
| |
| Either way, it will be necessary to inspect uses of the hash table to see |
| why the addition and deletion operations need the exact count. |
| Here are a few possibilities: |
| |
| \begin{enumerate} |
| \item Determining when to resize the hash table. |
| In this case, an approximate count should work quite well. |
| It might also be useful to trigger the resizing operation from |
| the length of the longest chain, which can be computed and |
| maintained in a nicely partitioned per-chain manner. |
| \item Producing an estimate of the time required to traverse the |
| entire hash table. |
	An approximate count also works well in this case.
| \item For diagnostic purposes, for example, to check for items being |
| lost when transferring them to and from the hash table. |
| This clearly requires an exact count. |
| However, given that this usage is diagnostic in nature, it might |
| suffice to maintain the lengths of the hash chains, then to |
| infrequently sum them |
| up while locking out addition and deletion operations. |
| \end{enumerate} |
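The first and third possibilities can be sketched with per-chain lengths, each protected by its own chain lock (hypothetical names, with element storage elided): additions and deletions stay fully partitioned, the resize trigger reads the lengths without global locking, and the infrequent diagnostic count momentarily locks out all updates.

```c
#include <pthread.h>

#define NCHAINS 8			/* hypothetical small table */

struct ht_chain {
	pthread_mutex_t lock;
	int len;			/* per-chain length: partitioned */
};

static struct ht_chain table[NCHAINS];

static void ht_init(void)
{
	for (int i = 0; i < NCHAINS; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

/* Addition touches only its own chain's lock and counter. */
static void ht_add(int key)
{
	struct ht_chain *c = &table[key % NCHAINS];

	pthread_mutex_lock(&c->lock);
	c->len++;			/* element insertion itself elided */
	pthread_mutex_unlock(&c->lock);
}

/* Resize trigger: longest chain, read without any global lock.
 * A slightly stale answer is acceptable for this purpose. */
static int ht_longest_chain(void)
{
	int max = 0;

	for (int i = 0; i < NCHAINS; i++)
		if (table[i].len > max)
			max = table[i].len;
	return max;
}

/* Diagnostic-only exact count: lock out all updates, then sum. */
static int ht_exact_count(void)
{
	int i, sum = 0;

	for (i = 0; i < NCHAINS; i++)
		pthread_mutex_lock(&table[i].lock);
	for (i = 0; i < NCHAINS; i++)
		sum += table[i].len;
	for (i = 0; i < NCHAINS; i++)
		pthread_mutex_unlock(&table[i].lock);
	return sum;
}
```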
| |
| It turns out that there is now a strong theoretical basis for some of the |
| constraints that performance and scalability place on a parallel library's |
| APIs~\cite{HagitAttiya2011LawsOfOrder,Attiya:2011:LOE:1925844.1926442,PaulEMcKenney2011SNC}. |
| Anyone designing a parallel library needs to pay close attention to |
| those constraints. |
| |
| Although it is all too easy to blame locking for what are really problems |
| due to a concurrency-unfriendly API, doing so is not helpful. |
| On the other hand, one has little choice but to sympathize with the |
| hapless developer who made this choice in (say) 1985. |
| It would have been a rare and courageous developer to anticipate the |
| need for parallelism at that time, and it would have required an |
| even more rare combination of brilliance and luck to actually arrive |
| at a good parallel-friendly API\@. |
| |
| Times change, and code must change with them. |
| That said, there might be a huge number of users of a popular library, |
| in which case an incompatible change to the API would be quite foolish. |
| Adding a parallel-friendly API to complement the existing heavily used |
| sequential-only API is usually the best course of action. |
| |
| Nevertheless, human nature being what it is, we can expect our hapless |
| developer to be more likely to complain about locking than about his |
| or her own poor (though understandable) API design choices. |
| |
| \subsubsection{Deadlock-Prone Callbacks} |
| \label{sec:locking:Deadlock-Prone Callbacks} |
| |
| \Cref{sec:locking:Local Locking Hierarchies,% |
| sec:locking:Layered Locking Hierarchies,% |
| sec:locking:Locking For Parallel Libraries: Just Another Tool} |
| described how undisciplined use of callbacks can result in locking |
| woes. |
| These sections also described how to design your library function to |
| avoid these problems, but it is unrealistic to expect a 1990s programmer |
| with no experience in parallel programming to have followed such a design. |
| Therefore, someone attempting to parallelize an existing callback-heavy |
| single-threaded library will likely have many opportunities to curse |
| locking's villainy. |
| |
| If there are a very large number of uses of a callback-heavy library, |
| it may be wise to again add a parallel-friendly API to the library in |
| order to allow existing users to convert their code incrementally. |
| Alternatively, some advocate use of transactional memory in these cases. |
| While the jury is still out on transactional memory, |
| \cref{sec:future:Transactional Memory} discusses its strengths and |
| weaknesses. |
| It is important to note that hardware transactional memory |
| (discussed in |
| \cref{sec:future:Hardware Transactional Memory}) |
| cannot help here unless the hardware transactional memory implementation |
| provides \IXpl{forward-progress guarantee}, which few do. |
| Other alternatives that appear to be quite practical (if less heavily |
| hyped) include the methods discussed in |
| \cref{sec:locking:Conditional Locking,sec:locking:Acquire Needed Locks First}, |
| as well as those that will be discussed in |
| \cref{chp:Data Ownership,chp:Deferred Processing}. |
| |
| \subsubsection{Object-Oriented Spaghetti Code} |
| \label{sec:locking:Object-Oriented Spaghetti Code} |
| |
| Object-oriented programming went mainstream sometime in the 1980s or |
| 1990s, and as a result there is a huge amount of single-threaded |
| object-oriented code in production. |
| Although object orientation can be a valuable software technique, |
| undisciplined use of objects can easily result in object-oriented |
| spaghetti code. |
In object-oriented spaghetti code, control flits from object to object
in an essentially random manner, making the code hard to understand
and making it even harder, and perhaps impossible, to impose a
locking hierarchy on it.
| |
| Although many might argue that such code should be cleaned up in any |
| case, such things are much easier to say than to do. |
| If you are tasked with parallelizing such a beast, you can reduce the |
| number of opportunities to curse locking by using the techniques |
| described in |
| \cref{sec:locking:Conditional Locking,sec:locking:Acquire Needed Locks First}, |
| as well as those that will be discussed in |
| \cref{chp:Data Ownership,chp:Deferred Processing}. |
| This situation appears to be the use case that inspired transactional |
| memory, so it might be worth a try as well. |
| That said, the choice of synchronization mechanism should be made in |
| light of the hardware habits discussed in |
| \cref{chp:Hardware and its Habits}. |
| After all, if the overhead of the synchronization mechanism is orders of |
| magnitude more than that of the operations being protected, the results |
| are not going to be pretty. |
| |
| And that leads to a question well worth asking in these situations: |
| Should the code remain sequential? |
| For example, perhaps parallelism should be introduced at the process level |
| rather than the thread level. |
| In general, if a task is proving extremely hard, it is worth some time |
| spent thinking about not only alternative ways to accomplish that |
| particular task, but also alternative tasks that might better solve |
| the problem at hand. |
| |
| \section{Summary} |
| \label{sec:locking:Summary} |
| % |
| \epigraph{Achievement unlocked.}{\emph{Unknown}} |
| |
| Locking is perhaps the most widely used and most generally useful |
| synchronization tool. |
| However, it works best when designed into an application |
| or library from the beginning. |
| Given the large quantity of pre-existing single-threaded code that might |
| need to one day run in parallel, locking should therefore not be the |
| only tool in your parallel-programming toolbox. |
| The next few chapters will discuss other tools, and how they can best |
| be used in concert with locking and with each other. |
| |
| \QuickQuizAnswersChp{qqzlocking} |