| % advsync/advsync.tex |
| |
| \QuickQuizChapter{sec:advsync:Advanced Synchronization}{Advanced Synchronization} |
| |
| \section{Avoiding Locks} |
| \label{sec:advsync:Avoiding Locks} |
| |
| Although locking is the workhorse of parallelism in production, in |
| many situations performance, scalability, and real-time response can |
| all be greatly improved though use of lockless techniques. |
| A particularly impressive example of such a lockless technique are |
| the statistical counters describe in |
| Section~\ref{sec:count:Statistical Counters}, |
| which avoids not only locks, but also atomic operations, memory barriers, |
| and even cache misses for counter increments. |
| Other examples we have covered include: |
| |
| \begin{enumerate} |
| \item The fastpaths through a number of other counting algorithms |
| in Chapter~\ref{chp:Counting}. |
| \item The fastpath through resource allocator caches in |
| Section~\ref{sec:SMPdesign:Resource Allocator Caches}. |
| \item The maze solver in Section~\ref{sec:SMPdesign:Beyond Partitioning}. |
| \item The data-ownership techniques described in |
| Section~\ref{chp:Data Ownership}. |
| \item The reference-counting and RCU techinques described in |
| Chapter~\ref{chp:Deferred Processing}. |
| \item The lookup code paths described in Chapter~\ref{chp:Data Structures}. |
| \item Many of the techniques described in |
| Chapter~\ref{chp:Putting It All Together}. |
| \end{enumerate} |
| |
| In short, lockless techniques are quite useful and are heavily used. |
| |
| However, it is best if lockless techniques are hidden behind a |
| well-defined API, such as the \co{inc_count()}, \co{memblock_alloc()}, |
| \co{rcu_read_lock()}, and so on. |
| The reason for this is that undisciplined use of lockless techniques |
| is a good way to create difficult bugs. |
| |
| A key component of many lockless techniques is the memory barrier, |
| which is described in the following section. |
| |
| % \input{advsync/rcu} % @@@ need to clean up the files referenced. |
| \input{advsync/memorybarriers} |
| |
| \section{Non-Blocking Synchronization} |
| \label{sec:advsync:Non-Blocking Synchronization} |
| |
| The term \emph{non-blocking synchronization} (NBS) describes six classes of |
| linearizable algorithms with differing \emph{forward-progress guarantees}. |
| These forward-progress guarantees are orthogonal to those that form the |
| basis of real-time programming: |
| |
| \begin{enumerate} |
| \item Real-time forward-progress guarantees usually have some |
| definite time associated with them, for example, |
| ``scheduling latency must be less than 100 microseconds.'' |
| In contrast, NBS only |
| that progress will be made in finite time, with no definite |
| bound. |
| \item Real-time forward-progress guarantees are sometimes |
| probabilistic, as in the soft-real-time guarantee that |
| ``at least 99.9\% of the time, scheduling latency must |
| be less than 100 microseconds.'' |
| In contrast, NBS's forward-progress |
| guarantees have traditionally been unconditional. |
| \item Real-time forward-progress guarantees are often conditioned on |
| environmental constraints, for example, only being honored |
| for the highest-priority tasks, when each CPU spends at least |
| a certain fraction of its time idle, or when I/O rates are |
| below some specified maximum. |
| In contrast, NBS's forward-progress |
| guarantees are usually unconditional.\footnote{ |
| As we will see below, some recent NBS work relaxes |
| this guarantee.} |
| \item Real-time forward-progress guarantees usually apply only |
| in the absence of software bugs. |
| In contrast, most NBS guarantees apply even in the face of |
| fail-stop bugs.\footnote{ |
| Again, some recent NBS work relaxes this guarantee.} |
| \item NBS forward-progress guarantee classes imply linearizability. |
| In contrast, real-time forward progress guarantees are often |
| independent of ordering constraints such as linearizability. |
| \end{enumerate} |
| |
| Despite these differences, a number of NBS algorithms are extremely |
| useful in real-time programs. |
| |
| There are currently six levels in the NBS |
| hierarchy~\cite{DanAlitarh2013PracticalProgress}, which are roughly |
| as follows: |
| |
| \begin{enumerate} |
| \item \emph{Wait-free synchronization}: Every thread will make progress |
| in finite time~\cite{Herlihy93}. |
| \item \emph{Lock-free synchronization}: At least one thread will |
| make progress in finite time~\cite{Herlihy93}. |
| \item \emph{Obstruction-free synchronization}: Every thread will |
| make progress in finite time in the absence of |
| contention~\cite{HerlihyLM03}. |
| \item \emph{Clash-free synchronization}: At least one thread will |
| make progress in finite time in the absence of |
| contention~\cite{DanAlitarh2013PracticalProgress}. |
| \item \emph{Starvation-free synchronization}: Every thread will |
| make progress in finite time in the absence of |
| failures~\cite{DanAlitarh2013PracticalProgress}. |
| \item \emph{Deadlock-free synchronization}: At least one thread will |
| make progress in finite time in the absence of |
| failures~\cite{DanAlitarh2013PracticalProgress}. |
| \end{enumerate} |
| |
| NBS classes~1 and~2 were first formulated in the early 1990s, |
| class~3 was first fomrulated in the early 2000s, |
| and class~4 was first formulated in 2013. |
| The final two classes have seen informal use for a great many decades, |
| but were reformulated in 2013. |
| |
| In theory, any parallel algorithm can be cast into wait-free form, |
| but there are a relatively small subset of NBS algorithms that are |
| in common use. |
| A few of these are listed in the following section. |
| |
| \subsection{Simple NBS} |
| \label{sec:advsync:Simple NBS} |
| |
| Perhaps the simplest NBS algorithm is atomic update of an integer |
| counter using fetch-and-add (\co{atomic_add_return()}) primitives. |
| |
| Another simple NBS algorithm implements a set of integers in an array. |
| Here the array index indicates a value that might be a member of the set |
| and the array element indicates whether or not that value actually is |
| a set member. |
| The linearizability criterion for NBS algorithms requires that reads from |
| and updates to the array either use atomic instructions or be accompanied |
| by memory barriers, but in the not-uncommon case where linearizability |
| is not important, simple volatile loads and stores suffice, for example, |
| using \co{ACCESS_ONCE()}. |
| |
| An NBS set may also be implemented using a bitmap, where each value that |
| might be a member of the set corresponds to one bit. |
| Reads and updates must normally be carried out via atomic bit-manipulation |
| instructions, although compare-and-swap (\co{cmpxchg()} or CAS) |
| instructions can also be used. |
| |
| The statistical counters algorithm discussed in |
| Section~\ref{sec:count:Statistical Counters} |
| can be considered wait-free, but only but using a cute definitional trick |
| in which the sum is considered approximate rather than exact.\footnote{ |
| Citation needed. |
| I hear of this trick verbally from Mark Moir.} |
| Given sufficiently wide error bounds that are a function of the length |
| of time that the \co{read_count()} function takes to sum the counters, |
| it is not possible to prove that any non-linearizable behavior occurred. |
| This definitely (if a bit artificially) classifies the statistical-counters |
| algorithm as wait-free. |
| This algorithm is probably the most heavily used NBS algorithm in |
| the Linux kernel. |
| |
| \begin{figure}[tbp] |
| { \scriptsize |
| \begin{verbatim} |
| 1 static inline bool |
| 2 ___cds_wfcq_append(struct cds_wfcq_head *head, |
| 3 struct cds_wfcq_tail *tail, |
| 4 struct cds_wfcq_node *new_head, |
| 5 struct cds_wfcq_node *new_tail) |
| 6 { |
| 7 struct cds_wfcq_node *old_tail; |
| 8 |
| 9 old_tail = uatomic_xchg(&tail->p, new_tail); |
| 10 CMM_STORE_SHARED(old_tail->next, new_head); |
| 11 return old_tail != &head->node; |
| 12 } |
| 13 |
| 14 static inline bool |
| 15 _cds_wfcq_enqueue(struct cds_wfcq_head *head, |
| 16 struct cds_wfcq_tail *tail, |
| 17 struct cds_wfcq_node *new_tail) |
| 18 { |
| 19 return ___cds_wfcq_append(head, tail, |
| 20 new_tail, new_tail); |
| 21 } |
| \end{verbatim} |
| } |
| \caption{NBS Enqueue Algorithm} |
| \label{fig:count:NBS Enqueue Algorithm} |
| \end{figure} |
| |
| Another common NBS algorithm is the atomic queue where elements are |
| enqueued using an atomic exchange instruction~\cite{MagedMichael1993JPDC}, |
| followed by a store into the \co{->next} pointer of the new element's |
| predecessor, as shown in |
| Figure~\ref{fig:count:NBS Enqueue Algorithm}, |
| which shows the userspace-RCU library |
| implementation~\cite{MathieuDesnoyers2009URCU}. |
| Line~9 updates the tail pointer to reference the new element while |
| returning a reference to its predecessor, which is stored in |
| local variable \co{old_tail}. |
| Line~10 then updates the predecessor's \co{->next} pointer to |
| reference the newly added element, and finally line~11 |
| returns an indication as to whether or not the queue was initially |
| empty. |
| |
| Although mutual exclusion is required to dequeue a single element |
| (so that dequeue is blocking), it is possible to carry out a non-blocking |
| removal of the entire contents of the queue. |
| What is not possible is to dequeue any given element in a non-blocking |
| manner: The enqueuer might have failed between lines~9 and~10 of the |
| figure, so that the element in question is only partially enqueued. |
| This results in a half-NBS algorithm where enqueues are NBS but |
| dequeues are blocking. |
| This algorithm is nevertheless used in practice, in part because |
| most production software is not required to tolerate arbitrary fail-stop |
| errors. |
| |
| \subsection{NBS Discussion} |
| \label{sec:advsync:NBS Discussion} |
| |
| It is possible to create fully non-blocking queues~\cite{MichaelScott96}, |
| however, such queues are much more complex than the half-NBS algorithm |
| outlined above. |
| The lesson here is to carefully consider what your requirements really are. |
| Relaxing irrelevant requirements can often result in great |
| improvements in both simplicity and performance. |
| |
| Recent research points to another important way to relax requirements. |
| It turns out that systems providing fair scheduling can enjoy most |
| of the benefits of wait-free synchronization even when running |
| algorithms that provide only non-blocking |
| synchronization, both in theory~\cite{DanAlitarh2013PracticalProgress} |
| and in practice~\cite{SamyAlBahra2013NBS}. |
| Because a great many schedulers used in production do in fact |
| provide fairness, |
| the more-complex algorithms providing wait-free synchronization |
| usually provide no practical advantages over their simpler |
| and often faster non-blocking-synchronization counterparts. |
| |
| Interestingly enough, fair scheduling is but one beneficial |
| constraint that is often respected in practice. |
| Other sets of constraints can permit blocking algorithms to |
| achieve deterministic real-time response. |
| For example, given fair locks that are granted to requesters in FIFO order at |
| a given priority level, |
| a method of avoiding priority inversion (such as priority |
| inheritance~\cite{Takada:1995:RSN:527074.828566,Cai-DongWang1996PrioInherLock} |
| or priority ceiling), a bounded number of threads, |
| bounded critical sections, |
| bounded load, |
| and avoidance of fail-stop bugs, |
| lock-based applications can provide deterministic |
| response times~\cite{BjoernBrandenburgPhD}. |
| % @@@ Check dissertation details once internet is available. |
| This approach of course blurs the distinction between blocking and wait-free |
| synchronization, which is all to the good. |
| Hopefully theoeretical frameworks continue to grow, further increasing |
| their ability to |
| describe how software is actually constructed in practice. |