% together/seqlock.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Sequence-Locking Specials}
\label{sec:together:Sequence-Locking Specials}
%
\epigraph{The girl who can't dance says the band can't play.}
{Yiddish proverb}
This section looks at some special uses of sequence locks.
\subsection{Dueling Sequence Locks}
\label{sec:together:Dueling Sequence Locks}
The classic sequence-locking use case enables a reader to see a consistent
snapshot of a small collection of variables, for example, calibration
constants for timekeeping.
This works quite well in practice because calibration constants are
rarely updated and, when updated, are updated quickly.
Readers therefore almost never need to retry.
However, if the updater is delayed during the update, readers will
also be delayed.
Such delays might be due to interrupts, NMIs, or even
virtual-CPU preemption.
One way to prevent updater delays from causing reader delays is to
maintain two sets of calibration constants.
Each set is updated in turn, but frequently enough that readers can
make good use of either set.
Each set has its own sequence lock (\co{seqlock_t} structure).
The updater alternates between the two sets, so that a delayed updater
delays readers of at most one of the sets.
Each reader attempts to access the first set, but upon retry attempts
to access the second set.
If the second set also forces a retry, the reader repeats starting
again from the first set.
If the updater is stuck, only one of the two sets will force
readers to retry, and therefore readers will succeed as soon as
they attempt to access the other set.
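
This scheme might be sketched using the sequence-lock implementation
from \cref{sec:defer:Sequence Locks}, whose \co{read_seqbegin()} does
not spin on a write in progress, so that a held write lock simply
forces a read-side retry.
The following minimal sketch assumes a hypothetical pair of structures
holding timekeeping calibration constants:

\begin{verbatim}
struct calib {
        seqlock_t lock;  /* seqlock_init() at boot. */
        unsigned long mult;
        unsigned long shift;
};

struct calib calib[2];

/* Updater: alternate between the two sets, so that
 * a delayed update forces retries on at most one. */
void calib_update(int i, unsigned long mult,
                  unsigned long shift)
{
        write_seqlock(&calib[i].lock);
        calib[i].mult = mult;
        calib[i].shift = shift;
        write_sequnlock(&calib[i].lock);
}

/* Reader: start with set 0, switch sets on retry. */
void calib_read(unsigned long *mult,
                unsigned long *shift)
{
        unsigned long seq;
        int i = 0;

        for (;;) {
                seq = read_seqbegin(&calib[i].lock);
                *mult = calib[i].mult;
                *shift = calib[i].shift;
                if (!read_seqretry(&calib[i].lock, seq))
                        return;
                i = !i; /* Stuck updater holds at most
                         * one of the two locks. */
        }
}
\end{verbatim}
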
\QuickQuiz{
Why don't all sequence-locking use cases replicate the
data in this fashion?
}\QuickQuizAnswer{
Such replication is impractical if the data is too
large, as it might be in the Schr\"odinger's-zoo example
described in
\cref{sec:together:Correlated Data Elements}.
Such replication is unnecessary if delays are prevented,
for example, when updaters disable interrupts when running
on bare-metal hardware (that is, without the use of
a vCPU-preemption-prone hypervisor).
Alternatively, if readers can tolerate the occasional delay,
then replication is again unnecessary.
Consider the example of reader-writer locking, where
writers always delay readers and vice versa.
However, if the data to be replicated is reasonably
small, if delays are possible, and if readers cannot
tolerate these delays, replicating the data is an
excellent approach.
}\QuickQuizEnd
\subsection{Correlated Data Elements}
\label{sec:together:Correlated Data Elements}
Suppose we have a hash table where we need correlated views of two or
more of the elements.
These elements are updated together, and we do not want to see an old
version of the first element along with new versions of the other
elements.
For example, Schr\"odinger decided to add his extended family to his
in-memory database along with all his animals.
Although Schr\"odinger understands that marriages and divorces do not
happen instantaneously, he is also a traditionalist.
As such, he absolutely does not want his database ever to show that the
bride is now married, but the groom is not, and vice versa.
Plus, if you think Schr\"odinger is a traditionalist, you just
try conversing with some of his family members!
In other words, Schr\"odinger wants to be able to carry out a
wedlock-consistent traversal of his database.
One approach is to use sequence locks
(see \cref{sec:defer:Sequence Locks}),
so that wedlock-related updates are carried out under the
protection of \co{write_seqlock()}, while reads requiring
wedlock consistency are carried out within
a \co{read_seqbegin()} / \co{read_seqretry()} loop.
Note that sequence locks are not a replacement for RCU protection:
Sequence locks protect against concurrent modifications, but RCU
is still needed to protect against concurrent deletions.
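
For example, assuming a hypothetical \co{struct animal} with a
\co{->married} field and again using the sequence-lock implementation
from \cref{sec:defer:Sequence Locks}, the update and read sides might
be sketched as follows:

\begin{verbatim}
seqlock_t wedlock;  /* seqlock_init() at boot. */

/* Updater: change both statuses under the lock. */
void marry(struct animal *bride, struct animal *groom)
{
        write_seqlock(&wedlock);
        WRITE_ONCE(bride->married, 1);
        WRITE_ONCE(groom->married, 1);
        write_sequnlock(&wedlock);
}

/* Reader: retry until a wedlock-consistent snapshot
 * is obtained.  RCU protects the elements themselves
 * against concurrent deletion. */
void read_couple(struct animal *a, struct animal *b,
                 int *am, int *bm)
{
        unsigned long seq;

        rcu_read_lock();
        do {
                seq = read_seqbegin(&wedlock);
                *am = READ_ONCE(a->married);
                *bm = READ_ONCE(b->married);
        } while (read_seqretry(&wedlock, seq));
        rcu_read_unlock();
}
\end{verbatim}
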
This approach works quite well when the number of correlated elements is
small, the time to read these elements is short, and the update rate is
low.
Otherwise, updates might happen so quickly that readers might never complete.
Although Schr\"odinger does not expect that even his least-sane relatives
will marry and divorce quickly enough for this to be a problem,
he does realize that this problem could well arise in other situations.
One way to avoid this reader-starvation problem is to have the readers
use the update-side primitives if there have been too many retries,
but this can degrade both performance and scalability.
Another way to avoid \IX{starvation} is to have multiple sequence locks,
in Schr\"odinger's case, perhaps one per species.
In addition, if the update-side primitives are used too frequently,
poor performance and scalability will result due to lock contention.
One way to avoid this is to maintain a per-element sequence lock,
and to hold both spouses' locks when updating their marital status.
Readers can do their retry looping on either of the spouses' locks
to gain a stable view of any change in marital status involving both
members of the pair.
This avoids contention due to high marriage and divorce rates, but
complicates gaining a stable view of all marital statuses during a
single scan of the database.
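
A sketch of this per-element approach, again assuming a hypothetical
\co{struct animal}, this time containing its own \co{seqlock_t}:

\begin{verbatim}
/* Updater: acquire both spouses' locks in address
 * order to avoid deadlock. */
void marry_pair(struct animal *a, struct animal *b)
{
        struct animal *first = a < b ? a : b;
        struct animal *last = a < b ? b : a;

        write_seqlock(&first->seqlock);
        write_seqlock(&last->seqlock);
        WRITE_ONCE(a->married, 1);
        WRITE_ONCE(b->married, 1);
        write_sequnlock(&last->seqlock);
        write_sequnlock(&first->seqlock);
}

/* Reader: either spouse's lock gives a stable view
 * of a change involving both members of the pair. */
int is_married(struct animal *a)
{
        unsigned long seq;
        int m;

        do {
                seq = read_seqbegin(&a->seqlock);
                m = READ_ONCE(a->married);
        } while (read_seqretry(&a->seqlock, seq));
        return m;
}
\end{verbatim}
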
If the element groupings are well-defined and persistent, which marital
status is hoped to be,
then one approach is to add pointers to the data elements to link
together the members of a given group.
Readers can then traverse these pointers to access all the data elements
in the same group as the first one located.
This technique is used heavily in the Linux kernel, perhaps most
notably in the dcache subsystem~\cite{NeilBrown2015RCUwalk}.
Note that it is likely that similar schemes also work with hazard
pointers.
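
For example, a hypothetical layout loosely patterned after the dcache
techniques cited above might link each group member directly to the
others, here simplified to a single \co{->spouse} pointer:

\begin{verbatim}
struct animal {
        seqlock_t seqlock;
        struct animal *spouse;  /* NULL if unmarried. */
        int married;
        /* ... */
};

/* Reader: having located one member of the group,
 * reach the other without a second hash lookup. */
void visit_couple(struct animal *a)
{
        struct animal *sp;

        rcu_read_lock();
        sp = rcu_dereference(a->spouse);
        if (sp) {
                /* Access *a and *sp, subject to the
                 * usual per-element retry discipline. */
        }
        rcu_read_unlock();
}
\end{verbatim}
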
This approach provides sequential consistency to successful readers,
each of which will either see the effects of a given update or not,
with any partial updates resulting in a read-side retry.
Sequential consistency is an extremely strong guarantee, incurring equally
strong restrictions and equally high overheads.
In this case, we saw that readers might be starved on the one hand, or
might need to acquire the update-side lock on the other.
Although this works very well in cases where updates are infrequent,
it unnecessarily forces read-side retries even when the update does not
affect any of the data that a retried reader accesses.
\Cref{sec:together:Correlated Fields} therefore covers a much weaker form
of consistency that not only avoids reader starvation, but also avoids
any form of read-side retry.
The next section instead presents a weaker form of consistency that
can be provided with much lower probabilities of reader starvation.
\subsection{Atomic Move}
\label{sec:together:Atomic Move}
Suppose that individual data elements are moved from one data structure
to another, and that readers look up only single data structures.
However, when a data element moves, readers must never see it as
being in both structures at the same time, nor as missing from both.
Furthermore, any reader seeing the element in its new location
must never subsequently see it in its old location.
In addition, the move may be implemented by inserting a new copy of the
old data element into the destination location.
For example, consider a hash table that supports an atomic-to-readers
rename operation.
Expanding on Schr\"odinger's zoo, suppose that an animal's name changes,
for example, each of the brides in Schr\"odinger's traditionalist family
might change their last name to match that of their groom.
But changing their name might change the hash value, and might also
require that the bride's element move from one hash chain to another.
The consistency requirements set forth above dictate that if a reader successfully
looks up the new name, then any subsequent lookup of the old name by
that reader must result in failure.
Similarly, if a reader's lookup of the old name results in lookup failure,
then any subsequent lookup of the new name by that reader must succeed.
In short, a given reader should not see a bride momentarily blinking out
of existence, nor should that reader look up a bride under her new name
and then later look up that bride under her old name.
This consistency guarantee could be enforced with a single global sequence
lock as described in \cref{sec:together:Correlated Data Elements}, but
this can result in reader starvation even for readers that are not looking
up a bride who is currently undergoing a name change.
This guarantee could also be enforced by requiring that readers acquire
a per-hash-chain lock, but reviewing
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}
shows that this results in poor performance and scalability, even for
single-socket systems.
Another more reader-friendly way to implement this is to use RCU and to
place a sequence lock on each element.
Readers looking up a given element act as sequence-lock readers across
their full set of accesses to that element.
Note that these sequence-lock operations will order each reader's lookups.
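
For example, given a hypothetical RCU-protected \co{hashtab_lookup()}
function and the per-element \co{seqlock_t} used earlier, a reader
might be structured as follows:

\begin{verbatim}
/* Reader: all accesses to the element are enclosed
 * in that element's read-side section. */
int lookup_animal(const char *name)
{
        struct animal *ap;
        unsigned long seq;
        int retry;

        do {
                retry = 0;
                rcu_read_lock();
                ap = hashtab_lookup(name);
                if (ap) {
                        seq = read_seqbegin(&ap->seqlock);
                        /* ... read fields of *ap ... */
                        retry = read_seqretry(&ap->seqlock,
                                              seq);
                }
                rcu_read_unlock();
        } while (retry);
        return ap != NULL;
}
\end{verbatim}
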
Renaming an element can then proceed roughly as follows:
\begin{enumerate}
\item Acquire a global lock protecting rename operations.
\item Allocate and initialize a copy of the element with the new name.
\item Write-acquire the sequence lock on the element with the old name,
which has the side effect of ordering this acquisition with the
following insertion.
Concurrent lookups of the old name will now repeatedly retry.
\item Insert the copy of the element with the new name.
Lookups of the new name will now succeed.
\item Execute \co{smp_wmb()} to order the prior insertion with the
subsequent removal.
\item Remove the element with the old name.
Concurrent lookups of the old name will now fail.
\item Write-release the sequence lock if necessary, for example, if
required by lock dependency checkers.
\item Release the global lock.
\end{enumerate}
Thus, readers looking up the old name will retry until the new name
is available, at which point their final retry will fail.
Any subsequent lookups of the new name will succeed.
Any reader succeeding in looking up the new name is guaranteed that
any subsequent lookup of the old name will fail, perhaps after a series
of retries.
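
This procedure might be encoded as follows, assuming hypothetical
\co{hashtab_add()}, \co{hashtab_del()}, and \co{copy_animal()}
functions, with the numbered comments corresponding to the steps
in the above list:

\begin{verbatim}
spinlock_t rename_lock;  /* Global rename lock. */

void rename_animal(struct animal *oldp,
                   const char *newname)
{
        struct animal *newp;

        spin_lock(&rename_lock);            /* Step 1. */
        newp = copy_animal(oldp, newname);  /* Step 2. */
        write_seqlock(&oldp->seqlock);      /* Step 3. */
        hashtab_add(newp);                  /* Step 4. */
        smp_wmb();                          /* Step 5. */
        hashtab_del(oldp);                  /* Step 6. */
        write_sequnlock(&oldp->seqlock);    /* Step 7. */
        spin_unlock(&rename_lock);          /* Step 8. */
        /* Free oldp after a grace period,
         * for example, using call_rcu(). */
}
\end{verbatim}
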
\QuickQuizSeries{%
\QuickQuizB{
Is it possible to write-acquire the sequence lock on
the new element before it is inserted instead of acquiring
that of the old element before it is removed?
}\QuickQuizAnswerB{
Yes, and the details are left as an exercise to the reader.
The term \emph{tombstone} is sometimes used to refer to the
element with the old name after its sequence lock is acquired.
Similarly, the term \emph{birthstone} is sometimes used to refer
to the element with the new name while its sequence lock is
still held.
}\QuickQuizEndB
\QuickQuizE{
Is it possible to avoid the global lock?
}\QuickQuizAnswerE{
Yes, and one way to do this would be to use per-hash-chain locks.
The updater could acquire lock(s) corresponding to both the old
and the new element, acquiring them in address order.
In this case, the insertion and removal operations would of
course need to refrain from acquiring and releasing these
same per-hash-chain locks.
This complexity can be worthwhile if rename operations are
frequent, and of course can allow rename operations to execute
concurrently.
}\QuickQuizEndE
}% End of \QuickQuizSeries
It is of course possible to implement this procedure somewhat
more efficiently using simple flags.
However, the flag-based approach can be thought of as a simplified
variant of sequence locking, one that relies on the fact that a given
element's sequence lock is never write-acquired more than once.
% @@@ Reference TBD flag-for-deletion section.
\subsection{Upgrade to Writer}
\label{sec:together:Upgrade to Writer}
As discussed in
\cref{sec:defer:Quasi Reader-Writer Lock},
RCU permits readers to upgrade to writers.
This capability can be quite useful when a reader scanning an
RCU-protected data structure notices that the current element
needs to be updated.
What happens when you try this trick with sequence locking?
It turns out that this sequence-locking trick is actually used in
the Linux kernel, for example, by the \co{sdma_flush()} function in
\path{drivers/infiniband/hw/hfi1/sdma.c}.
The effect is to doom the enclosing reader to retry.
This trick is therefore used when the reader detects some condition
that requires a retry.
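
A schematic of this trick, not to be confused with the actual
\co{sdma_flush()} code, and assuming a hypothetical
\co{needs_cleanup()} predicate:

\begin{verbatim}
/* Reader that upgrades to writer, thus dooming
 * itself to retry. */
void scan_and_maybe_fix(struct foo *fp)
{
        unsigned long seq;

        do {
                seq = read_seqbegin(&fp->seqlock);
                /* ... read-side accesses ... */
                if (needs_cleanup(fp)) {
                        write_seqlock(&fp->seqlock);
                        /* ... do the update ... */
                        write_sequnlock(&fp->seqlock);
                        /* Guarantees one more retry. */
                }
        } while (read_seqretry(&fp->seqlock, seq));
}
\end{verbatim}

On the retry following the update, \co{needs_cleanup()} presumably
no longer holds, so the loop then terminates normally.
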