\QuickQAC{chp:Introduction}{Introduction}
\QuickQ{}
Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
\QuickA{}
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming were really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be: ``Why is parallel programming {\em
perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000 --- in 1991 US dollars.
In contrast, in 2006, Paul finds himself typing these words on a
dual-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
2GB of main memory, a 60GB disk drive, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
Parallel programming is in some ways more difficult than
sequential programming; for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.}
\QuickQ{}
How could parallel programming \emph{ever} be as easy
as sequential programming?
\QuickA{}
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
{\small \tt get\_input | grep "interesting" | sort}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
\QuickQ{}
Oh, really???
What about correctness, maintainability, robustness, and so on?
\QuickA{}
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
\QuickQ{}
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
\QuickA{}
Given that parallel programming is perceived to be much harder
than is sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL have been special purpose, hence generality must
also be added to the list.
\QuickQ{}
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
\QuickA{}
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
\QuickQ{}
What about just having fun?
\QuickA{}
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
\QuickQ{}
Are there no cases where parallel programming is about something
other than performance?
\QuickA{}
There are certainly cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
\QuickQ{}
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
\QuickA{}
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
\QuickQ{}
Given how cheap parallel hardware has become, how can anyone
afford to pay people to program it?
\QuickA{}
There are a number of answers to this question:
\begin{enumerate}
\item Given a large computational cluster of parallel machines,
the aggregate cost of the cluster can easily justify
substantial developer effort, because the development
cost can be spread over the large number of machines.
\item Popular software that is run by tens of millions of users
can easily justify substantial developer effort,
as the cost of this development can be spread over the tens
of millions of users.
Note that this includes things like kernels and system
libraries.
\item If the low-cost parallel machine is controlling the operation
of a valuable piece of equipment, then the cost of this
piece of equipment might easily justify substantial
developer effort.
\item If the software for the low-cost parallel machine produces an
extremely valuable result (e.g., mineral exploration),
then the valuable result might again justify substantial
developer cost.
\item Safety-critical systems protect lives, which can clearly
justify very large developer effort.
\item Hobbyists and researchers might seek knowledge, experience,
fun, or glory rather than mere money.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
\QuickQ{}
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
\QuickA{}
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who cling to generality will therefore fail to set
the productivity bar high enough to succeed in production
environments.
\QuickQ{}
What other bottlenecks might prevent additional CPUs from
providing additional performance?
\QuickA{}
There are any number of potential bottlenecks:
\begin{enumerate}
\item Main memory. If a single thread consumes all available
memory, additional threads will simply page themselves
silly.
\item Cache. If a single thread's cache footprint completely
fills any shared CPU cache(s), then adding more threads
will simply thrash the affected caches.
\item Memory bandwidth. If a single thread consumes all available
memory bandwidth, additional threads will simply
result in additional queuing on the system interconnect.
\item I/O bandwidth. If a single thread is I/O bound,
adding more threads will simply result in them all
waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems might have any number of additional
bottlenecks.
\QuickQ{}
What besides CPU cache capacity might require limiting the
number of concurrent threads?
\QuickA{}
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item Main memory. Each thread consumes some memory
(for its stack if nothing else), so that excessive
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures.
\item I/O bandwidth. If each thread initiates a given
amount of mass-storage I/O or networking traffic,
excessive numbers of threads can result in excessive
I/O queuing delays, again degrading performance.
Some networking protocols may be subject to timeouts
or other failures if there are so many threads that
networking events cannot be responded to in a timely
fashion.
\item Synchronization overhead.
For many synchronization protocols, excessive numbers
of threads can result in excessive spinning, blocking,
or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
\QuickQ{}
Are there any other obstacles to parallel programming?
\QuickA{}
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item The only known algorithms for a given project might
be inherently sequential in nature.
In this case, either avoid parallel programming
(there being no law saying that your project \emph{has}
to run in parallel) or invent a new parallel algorithm.
\item The project allows binary-only plugins that share the same
address space, such that no one developer has access to
all of the source code for the project.
Because many parallel bugs, including deadlocks, are
global in nature, such binary-only plugins pose a severe
challenge to current software development methodologies.
This might well change, but for the time being, all
developers of parallel code sharing a given address space
need to be able to see \emph{all} of the code running in
that address space.
\item The project contains heavily used APIs that were designed
without regard to
parallelism~\cite{HagitAttiya2011LawsOfOrder}.
Some of the more ornate features of the System V
message-queue API form a case in point.
Of course, if your project has been around for a few
decades, and if its developers did not have access to
parallel hardware, your project undoubtedly has at least
its share of such APIs.
\item The project was implemented without regard to parallelism.
Given that there are a great many techniques that work
extremely well in a sequential environment, but that
fail miserably in parallel environments, if your project
ran only on sequential hardware for most of its lifetime,
then your project undoubtedly has at least its share of
parallel-unfriendly code.
\item The project was implemented without regard to good
software-development practice.
The cruel truth is that shared-memory parallel
environments are often much less forgiving of sloppy
development practices than are sequential environments.
You may be well-served to clean up the existing design
and code prior to attempting parallelization.
\item The people who originally did the development on your
project have since moved on, and the people remaining,
while well able to maintain it or add small features,
are unable to make ``big animal'' changes.
In this case, unless you can work out a very simple
way to parallelize your project, you will probably
be best off leaving it sequential.
That said, there are a number of simple approaches that
you might use
to parallelize your project, including running multiple
instances of it, using a parallel implementation of
some heavily used library function, or making use of
some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization of a large body of code
can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
\QuickQ{}
Where are the answers to the Quick Quizzes found?
\QuickA{}
In Appendix~\ref{chp:Answers to Quick Quizzes} starting on
page~\pageref{chp:Answers to Quick Quizzes}.
Hey, I thought I owed you an easy one!
\QuickQ{}
Some of the Quick Quiz questions seem to be from the viewpoint
of the reader rather than the author.
Is that really the intent?
\QuickA{}
Indeed it is!
Many are questions that Paul E. McKenney would probably have
asked if he were a novice student in a class covering this material.
It is worth noting that Paul was taught most of this material by
parallel hardware and software, not by professors.
In Paul's experience, professors are much more likely to provide
answers to verbal questions than are parallel systems,
Watson notwithstanding.
Of course, we could have a lengthy debate over which of professors
or parallel systems provide the most useful answers to these sorts
of questions,
but for the time being let's just agree that usefulness of
answers varies widely across the population both of professors
and of parallel systems.
Other quizzes are quite similar to actual questions that have been
asked during conference presentations and lectures covering the
material in this book.
A few others are from the viewpoint of the author.
\QuickQ{}
These Quick Quizzes just are not my cup of tea.
What do you recommend?
\QuickA{}
There are a number of alternatives available to you:
\begin{enumerate}
\item Just ignore the Quick Quizzes and read the rest of
the book.
You might miss out on the interesting material in
some of the Quick Quizzes, but the rest of the book
has lots of good material as well.
This is an eminently reasonable approach if your main
goal is to gain a general understanding of the material
or if you are skimming through the book to find a
solution to a specific problem.
\item If you find the Quick Quizzes distracting but impossible
to ignore, you can always clone the \LaTeX{} source for
this book from the git archive.
Then modify \co{Makefile} and \co{qqz.sty} to eliminate
the Quick Quizzes from the PDF output.
Alternatively, you could modify these two files so as
to pull the answers inline, immediately following
the questions.
\item Look at the answer immediately rather than investing
a large amount of time in coming up with your own
answer.
This approach is reasonable when a given Quick Quiz's
answer holds the key to a specific problem you are
trying to solve.
This approach is also reasonable if you want a somewhat
deeper understanding of the material, but when you do not
expect to be called upon to generate parallel solutions given
only a blank sheet of paper.
\item If you prefer a more academic and rigorous treatment of
parallel programming,
you might like Herlihy's and Shavit's
textbook~\cite{HerlihyShavit2008Textbook}.
This book starts with an interesting combination
of low-level primitives at high levels of abstraction
from the hardware, and works its way through locking
and simple data structures including lists, queues,
hash tables, and counters, culminating with transactional
memory.
Michael Scott's textbook~\cite{MichaelScott2013Textbook}
approaches similar material with more of a
software-engineering focus, and as far as I know, is
the first formally published textbook to include a
section devoted to RCU.
\item If you would like an academic treatment of parallel
programming from a programming-language-pragmatics viewpoint,
you might be interested in the concurrency chapter from Scott's
textbook~\cite{MichaelScott2006Textbook}
on programming languages.
\item If you are interested in an object-oriented patternist
treatment of parallel programming focusing on C++,
you might try Volumes~2 and 4 of Schmidt's POSA
series~\cite{SchmidtStalRohnertBuschmann2000v2Textbook,
BuschmannHenneySchmidt2007v4Textbook}.
Volume 4 in particular has some interesting chapters
applying this work to a warehouse application.
The realism of this example is attested to by
the section entitled ``Partitioning the Big Ball of Mud'',
wherein the problems inherent in parallelism often
take a back seat to the problems inherent in getting
one's head around a real-world application.
\item If your primary focus is scientific and technical computing,
and you prefer a patternist approach,
you might try Mattson et al.'s
textbook~\cite{Mattson2005Textbook}.
It covers Java, C/C++, OpenMP, and MPI.
Its patterns are admirably focused first on design,
then on implementation.
\item If you are interested in POSIX Threads, you might take
a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}.
\item If you are interested in C++, but in a Windows environment,
you might try Herb Sutter's ``Effective Concurrency''
series in
Dr.~Dobb's Journal~\cite{HerbSutter2008EffectiveConcurrency}.
This series does a reasonable job of presenting a
commonsense approach to parallelism.
\item If you want to try out Intel Threading Building Blocks,
then perhaps James Reinders's book~\cite{Reinders2007Textbook}
is what you are looking for.
\item Finally, those preferring to work in Java might be
well-served by Doug Lea's
textbooks~\cite{DougLea1997Textbook,Goetz2007Textbook}.
\end{enumerate}
In contrast, this book meshes real-world machines with real-world
algorithms.
If your sole goal is to find (say) an optimal parallel queue, you
might be better served by one of the above books.
However, if you are interested in principles of parallel design
that allow multiple such queues to operate in parallel, read on!
Coming back to the topic of Quick Quizzes, if you need a deep
understanding of the material, then you might well need to
learn to tolerate the Quick Quizzes.
Don't get me wrong, passively reading the material can be quite
valuable, but gaining full problem-solving capability really
does require that you practice solving problems.
I learned this the hard way during coursework for my late-in-life
Ph.D.
I was studying a familiar topic, and was surprised at how few of
the chapter's exercises I could solve off the top of my head.
Forcing myself to answer the questions greatly increased my
retention of the material.
So with these Quick Quizzes I am not asking you to do anything
that I have not been doing myself!
\QuickQAC{chp:Hardware and its Habits}{Hardware and its Habits}
\QuickQ{}
Why should parallel programmers bother learning low-level
properties of the hardware?
Wouldn't it be easier, better, and more general to remain at
a higher level of abstraction?
\QuickA{}
It might well be easier to ignore the detailed properties of
the hardware, but in most cases it would be quite foolish
to do so.
If you accept that the only purpose of parallelism is to
increase performance, and if you further accept that
performance depends on detailed properties of the hardware,
then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines.
Would \emph{you} want to use a bridge designed by an
engineer who did not understand the properties of
the concrete and steel making up that bridge?
If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least
\emph{some} understanding of the underlying hardware?
\QuickQ{}
What types of machines would allow atomic operations on
multiple data elements?
\QuickA{}
One answer to this question is that it is often possible to
pack multiple elements of data into a single machine word,
which can then be manipulated atomically.
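For example, here is a minimal sketch (the names \co{packed} and
\co{add_both} are made up for this illustration) that packs two
16-bit counters into a single 32-bit word and updates both
atomically using a compare-and-swap loop:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdint.h>

/* High 16 bits hold counter "a", low 16 bits hold counter "b". */
static uint32_t packed;

static void add_both(uint16_t da, uint16_t db)
{
  uint32_t old, newval;

  do {
    old = packed;
    newval = ((uint32_t)(((old >> 16) + da) & 0xffff) << 16) |
             (((old & 0xffff) + db) & 0xffff);
  } while (!__sync_bool_compare_and_swap(&packed, old, newval));
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Both fields wrap at 16 bits, and a failed compare-and-swap simply
retries with a freshly loaded value.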
A more trendy answer would be machines supporting transactional
memory~\cite{DBLomet1977SIGSOFT}.
However, such machines are still research curiosities,
although as of early 2012 it appears that commodity systems
supporting limited forms of hardware transactional memory
will be commercially available within a couple of years.
The jury is still out on the applicability of software transactional
memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT,
ChistopherJRossbach2007a,CalinCascaval2008tmtoy,
AleksandarDragovejic2011STMnotToy,AlexanderMatveev2012PessimisticTM}.
\QuickQ{}
So have CPU designers also greatly reduced the overhead of
cache misses?
\QuickA{}
Unfortunately, not so much.
There has been some reduction given constant numbers of CPUs,
but the finite speed of light and the atomic nature of
matter limit their ability to reduce cache-miss overhead
for larger systems.
Section~\ref{sec:cpu:Hardware Free Lunch?}
discusses some possible avenues for future progress.
\QuickQ{}
This is a \emph{simplified} sequence of events?
How could it \emph{possibly} be any more complex?
\QuickA{}
This sequence ignored a number of possible complications,
including:
\begin{enumerate}
\item Other CPUs might be concurrently attempting to perform
CAS operations involving this same cacheline.
\item The cacheline might have been replicated read-only in
several CPUs' caches, in which case, it would need to
be flushed from their caches.
\item CPU~7 might have been operating on the cache line when
the request for it arrived, in which case CPU~7 might
need to hold off the request until its own operation
completed.
\item CPU~7 might have ejected the cacheline from its cache
(for example, in order to make room for other data),
so that by the time that the request arrived, the
cacheline was on its way to memory.
\item A correctable error might have occurred in the cacheline,
which would then need to be corrected at some point before
the data was used.
\end{enumerate}
Production-quality cache-coherence mechanisms are extremely
complicated due to these sorts of considerations.
\QuickQ{}
Why is it necessary to flush the cacheline from CPU~7's cache?
\QuickA{}
If the cacheline was not flushed from CPU~7's cache, then
CPUs~0 and 7 might have different values for the same set
of variables in the cacheline.
This sort of incoherence would greatly complicate parallel
software, and so hardware architects have been convinced to
avoid it.
\QuickQ{}
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
\QuickA{}
The hardware designers \emph{have} been working on this
problem, and have consulted with no less a luminary than
the physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems~\cite{BryanGardiner2007}:
\begin{enumerate}
\item the finite speed of light, and
\item the atomic nature of matter.
\end{enumerate}
\begin{table}
\centering
\begin{tabular}{l||r|r}
Operation & Cost (ns) & Ratio \\
\hline
\hline
Clock period & 0.4 & 1.0 \\
\hline
``Best-case'' CAS & 12.2 & 33.8 \\
\hline
Best-case lock & 25.6 & 71.2 \\
\hline
Single cache miss & 12.9 & 35.8 \\
\hline
CAS cache miss & 7.0 & 19.4 \\
\hline
Off-Core & & \\
\hline
Single cache miss & 31.2 & 86.6 \\
\hline
CAS cache miss & 31.2 & 86.5 \\
\hline
Off-Socket & & \\
\hline
Single cache miss & 92.4 & 256.7 \\
\hline
CAS cache miss & 95.9 & 266.4 \\
\hline
Comms Fabric & 4,500 & 7,500 \\
\hline
Global Comms & 195,000,000 & 324,000,000 \\
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\end{table}
The first problem limits raw speed, and the second limits
miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently holding production frequencies to well below
10 GHz.
Nevertheless, some progress is being made, as may be seen
by comparing
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
with
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}
on
page~\pageref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
Integration of hardware threads in a single core and multiple
cores on a die has improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.
Section~\ref{sec:cpu:Hardware Free Lunch?}
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
\QuickQ{}
These numbers are insanely large!
How can I possibly get my head around them?
\QuickA{}
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around 350--500
sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single
CAS cache miss.
For the more-expensive inter-system communications latencies,
use several rolls (or multiple cases) of toilet paper to represent
the communications latency.
Important safety tip: make sure to account for the needs of
those you live with when appropriating toilet paper!
\QuickQ{}
But individual electrons don't move anywhere near that fast,
even in conductors!!!
The electron drift velocity in a conductor under the low voltages
found in semiconductors is on the order of only one \emph{millimeter}
per second.
What gives???
\QuickA{}
Electron drift velocity tracks the long-term movement of individual
electrons.
It turns out that individual electrons bounce around quite
randomly, so that their instantaneous speed is very high, but
over the long term, they don't move very far.
In this, electrons resemble long-distance commuters, who
might spend most of their time traveling at full highway
speed, but over the long term going nowhere.
These commuters' speed might be 70 miles per hour
(113 kilometers per hour), but their long-term drift velocity
relative to the planet's surface is zero.
When designing circuitry, electrons' instantaneous speed is
often more important than their drift velocity.
When a voltage is applied to a wire, more electrons enter the
wire than leave it, but the electrons entering cause the
electrons already there to move a bit further down the wire,
which causes other electrons to move down, and so on.
The result is that the electric field moves quite quickly down
the wire.
Just as the speed of sound in air is much greater than is
the typical wind speed, the electric field propagates down
the wire at a much higher velocity than the electron drift
velocity.
\QuickQ{}
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with them?
\QuickA{}
There are a number of reasons:
\begin{enumerate}
\item Shared-memory multiprocessor systems have strict size limits.
If you need more than a few thousand CPUs, you have no
choice but to use a distributed system.
\item Extremely large shared-memory systems tend to be
quite expensive and to have even longer cache-miss
latencies than does the small four-CPU system
shown in
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
\item The distributed-systems communications latencies do
not necessarily consume the CPU, which can often allow
computation to proceed in parallel with message transfer.
\item Many important problems are ``embarrassingly parallel'',
so that extremely large quantities of processing may
be enabled by a very small number of messages.
SETI@HOME~\cite{SETIatHOME2008}
is but one example of such an application.
These sorts of applications can make good use of networks
of computers despite extremely long communications
latencies.
\end{enumerate}
It is likely that continued work on parallel applications will
increase the number of embarrassingly parallel applications that
can run well on machines and/or clusters having long communications
latencies.
That said, greatly reduced hardware latencies would be an
extremely welcome development.
\QuickQ{}
OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just
always use these distributed techniques and dispense with
shared memory?
\QuickA{}
Because it is often the case that only a small fraction of
the program is performance-critical.
Shared-memory parallelism allows us to focus distributed-programming
techniques on that small fraction, allowing simpler shared-memory
techniques to be used on the non-performance-critical bulk of
the program.
\QuickQAC{chp:Tools of the Trade}{Tools of the Trade}
\QuickQ{}
But this silly shell script isn't a \emph{real} parallel program!
Why bother with such trivia???
\QuickA{}
Because you should \emph{never} forget the simple stuff!
Please keep in mind that the title of this book is
``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
One of the most effective things you can do about it is to
avoid forgetting the simple stuff!
After all, if you choose to do parallel programming the hard
way, you have no one but yourself to blame.
\QuickQ{}
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
\QuickA{}
One straightforward approach is the shell pipeline:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
grep $pattern1 | sed -e 's/a/b/' | sort
\end{verbatim}
\end{minipage}
\vspace{5pt}
For a sufficiently large input file,
\co{grep} will pattern-match in parallel with \co{sed}
editing and with the input processing of \co{sort}.
See the file \co{parallel.sh} for a demonstration of
shell-script parallelism and pipelining.
\QuickQ{}
But if script-based parallel programming is so easy, why
bother with anything else?
\QuickA{}
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations:
\begin{enumerate}
\item Creation of new processes is usually quite heavyweight,
involving the expensive \co{fork()} and \co{exec()}
system calls.
\item Sharing of data, including pipelining, typically involves
expensive file I/O.
\item The reliable synchronization primitives available to
scripts also typically involve expensive file I/O.
\end{enumerate}
These limitations require that script-based parallelism use
coarse-grained parallelism, with each unit of work having
execution time of at least tens of milliseconds, and preferably
much longer.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
\QuickQ{}
Why does this \co{wait()} primitive need to be so complicated?
Why not just make it work like the shell-script \co{wait} does?
\QuickA{}
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
it is not hard to build a \co{waitall()} function out of
the \co{wait()} function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
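For reference, a minimal sketch of such a \co{waitall()} (not
necessarily identical to the one in the figure) simply loops until
\co{wait()} reports that no children remain:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>

static void waitall(void)
{
  int status;

  /* wait() returns -1 once there are no children left. */
  while (wait(&status) > 0)
    continue;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}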
\QuickQ{}
Isn't there a lot more to \co{fork()} and \co{wait()}
than discussed here?
\QuickA{}
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as \co{mmap()} and \co{shmget()}).
In the meantime, there are any number of textbooks that cover
these primitives in great detail,
and the truly motivated can read manpages, existing parallel
applications using these primitives, as well as the
source code of the Linux-kernel implementations themselves.
\QuickQ{}
If the \co{mythread()} function in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
can simply return, why bother with \co{pthread_exit()}?
\QuickA{}
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where \co{mythread()}
invokes other functions, possibly separately compiled.
In such a case, \co{pthread_exit()} allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to \co{mythread()}.
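For example, in the following sketch (the helper name is made up
for this illustration), a nested helper terminates the thread
directly instead of propagating an error code back up to
\co{mythread()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

static void deeply_nested_helper(int arg)
{
  if (arg < 0)                  /* Unrecoverable for this thread, */
    pthread_exit((void *)-1L);  /* so end the thread right here.  */
  /* ... otherwise carry on with the real work ... */
}

static void *mythread(void *arg)
{
  deeply_nested_helper((int)(long)arg);
  return (void *)0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}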
\QuickQ{}
If the C language makes no guarantees in presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken???
\QuickA{}
Ah, but the Linux kernel is written in a carefully selected
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
\QuickQ{}
What if I want several threads to hold the same lock at the
same time?
\QuickA{}
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use \co{pthread_create()} to create the other threads.
The question of why this would ever be a good idea is left
to the reader.
\QuickQ{}
Why not simply make the argument to \co{lock_reader()}
on line~5 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
be a pointer to a \co{pthread_mutex_t}?
\QuickA{}
Because we will need to pass \co{lock_reader()} to
\co{pthread_create()}.
Although we could cast the function when passing it to
\co{pthread_create()}, function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
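In other words, the cast happens on the data pointer rather than
on the function pointer, roughly as in the following sketch (not
the exact code from the figure):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *lock_reader(void *arg)
{
  /* Simple pointer cast recovers the lock. */
  pthread_mutex_t *lockp = arg;

  pthread_mutex_lock(lockp);
  /* ... read the shared data ... */
  pthread_mutex_unlock(lockp);
  return NULL;
}

void start_reader(void)
{
  pthread_t tid;

  /* &lock converts to void * implicitly: no function cast needed. */
  pthread_create(&tid, NULL, lock_reader, &lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}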
\QuickQ{}
Writing four lines of code for each acquisition and release
of a \co{pthread_mutex_t} sure seems painful!
Isn't there a better way?
\QuickA{}
Indeed!
And for that reason, the \co{pthread_mutex_lock()} and
\co{pthread_mutex_unlock()} primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them with the Linux kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
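For example, a minimal sketch of such wrappers, here named to mimic
the Linux-kernel API, might simply abort on any error return:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static void spin_lock(pthread_mutex_t *sp)
{
  int en = pthread_mutex_lock(sp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_lock: %d\n", en);
    abort();
  }
}

static void spin_unlock(pthread_mutex_t *sp)
{
  int en = pthread_mutex_unlock(sp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_unlock: %d\n", en);
    abort();
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}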
\QuickQ{}
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
If so, why?
If not, what other output could appear, and why?
\QuickA{}
No.
The reason that ``x = 0'' was output was that \co{lock_reader()}
acquired the lock first.
Had \co{lock_writer()} instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started \co{lock_reader()} first
and because this run was performed on a multiprocessor,
one would normally expect \co{lock_reader()} to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
\QuickQ{}
Using different locks could cause quite a bit of confusion,
what with threads seeing each other's intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
\QuickA{}
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'',
which is currently a research topic.
Transactional-memory semantics can be loosely thought of as those
of a single global lock with optimizations permitted and
with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
\QuickQ{}
In the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
is \co{lock_reader()} guaranteed to see all the values produced
by \co{lock_writer()}?
Why or why not?
\QuickA{}
No.
On a busy system, \co{lock_reader()} might be preempted
for the entire duration of \co{lock_writer()}'s execution,
in which case it would not see \emph{any} of \co{lock_writer()}'s
intermediate states for \co{x}.
\QuickQ{}
Wait a minute here!!!
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
didn't initialize shared variable \co{x},
so why does it need to be initialized in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
\QuickA{}
See line~3 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Because the code in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
ran first, it could rely on the compile-time initialization of
\co{x}.
The code in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
ran next, so it had to re-initialize \co{x}.
\QuickQ{}
Instead of using \co{ACCESS_ONCE()} everywhere, why not just
declare \co{goflag} as \co{volatile} on line~10 of
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickA{}
A \co{volatile} declaration is in fact a reasonable alternative in
this particular case.
However, use of \co{ACCESS_ONCE()} has the benefit of clearly
flagging to the reader that \co{goflag} is subject to concurrent
reads and updates.
Furthermore, \co{ACCESS_ONCE()} is especially useful in cases where
most of the accesses are protected by a lock (and thus \emph{not}
subject to change), but where a few of the accesses are made outside
of the lock.
Using a volatile declaration in this case would make it harder
for the reader to note the special accesses outside of the lock,
and would also make it harder for the compiler to generate good
code under the lock.
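For reference, \co{ACCESS_ONCE()} boils down to a per-access
volatile cast, similar to the Linux kernel's definition:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Force the compiler to emit exactly one access to x,
   without caching it in a register or refetching it. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}
This confines the volatile semantics to the individual accesses that
need them, leaving the remaining (lock-protected) accesses free of
volatile overhead.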
\QuickQ{}
\co{ACCESS_ONCE()} only affects the compiler, not the CPU.
Don't we also need memory barriers to make sure
that the change in \co{goflag}'s value propagates to the
CPU in a timely fashion in
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickA{}
No, memory barriers are not needed and won't help here.
Memory barriers only enforce ordering among multiple
memory references: They do absolutely nothing to expedite
the propagation of data from one part of the system to
another.
This leads to a quick rule of thumb: You do not need
memory barriers unless you are using more than one
variable to communicate between multiple threads.
But what about \co{nreadersrunning}?
Isn't that a second variable used for communication?
Indeed it is, and there really are the needed memory-barrier
instructions buried in \co{__sync_fetch_and_add()},
which make sure that the thread proclaims its presence
before checking to see if it should start.
\QuickQ{}
Would it ever be necessary to use \co{ACCESS_ONCE()} when accessing
a per-thread variable, for example, a variable declared using
the \co{gcc} \co{__thread} storage class?
\QuickA{}
It depends.
If the per-thread variable was accessed only from its thread,
and never from a signal handler, then no.
Otherwise, it is quite possible that \co{ACCESS_ONCE()} is needed.
We will see examples of both situations in
Section~\ref{sec:count:Signal-Theft Limit Counter Implementation}.
This leads to the question of how one thread can gain access to
another thread's \co{__thread} variable, and the answer is that
the second thread must store a pointer to its \co{__thread}
pointer somewhere that the first thread has access to.
One common approach is to maintain a linked list with one
element per thread, and to store the address of each thread's
\co{__thread} variable in the corresponding element.
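The following sketch illustrates the idea with a simple global
array rather than a linked list (the names are made up for this
illustration, and thread IDs are assumed to be small integers):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128

__thread unsigned long counter;       /* per-thread variable  */
unsigned long *counterp[NR_THREADS];  /* published addresses  */

/* Each thread calls this once with its own small-integer ID. */
void register_thread(int id)
{
  counterp[id] = &counter;
}

/* Any thread may then sum up all threads' counters. */
unsigned long read_count(void)
{
  int i;
  unsigned long sum = 0;

  for (i = 0; i < NR_THREADS; i++)
    if (counterp[i] != NULL)
      sum += *counterp[i];  /* ACCESS_ONCE() may be needed here. */
  return sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}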
\QuickQ{}
Isn't comparing against single-CPU throughput a bit harsh?
\QuickA{}
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
\QuickQ{}
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
\QuickA{}
If the data being read \emph{never} changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per
thread, so that a thread read-acquires the larger aggregate
reader-writer lock by acquiring its own lock, and write-acquires
by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}.
This can work quite well for readers, but causes writers
to incur increasingly large overheads as the number of threads
increases.
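A minimal sketch of this approach follows (the names are
illustrative); note that the read-side cost is a single uncontended
lock acquisition, while the write-side cost grows linearly with the
number of threads:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128

pthread_mutex_t per_thread_lock[NR_THREADS];

void init_locks(void)  /* call once at startup */
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_init(&per_thread_lock[i], NULL);
}

/* Readers acquire only their own lock... */
void read_lock(int me)   { pthread_mutex_lock(&per_thread_lock[me]); }
void read_unlock(int me) { pthread_mutex_unlock(&per_thread_lock[me]); }

/* ...but writers must acquire every lock, always in the same order. */
void write_lock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_lock(&per_thread_lock[i]);
}

void write_unlock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_unlock(&per_thread_lock[i]);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}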
Some other ways of handling very small critical sections are
described in Section~\ref{sec:defer:Read-Copy Update (RCU)}.
\QuickQ{}
In
Figure~\ref{fig:intro:Reader-Writer Lock Scalability},
all of the traces other than the 100M trace deviate gently
from the ideal line.
In contrast, the 100M trace breaks sharply from the ideal
line at 64 CPUs.
In addition, the spacing between the 100M trace and the 10M
trace is much smaller than that between the 10M trace and the
1M trace.
Why does the 100M trace behave so much differently than the
other traces?
\QuickA{}
Your first clue is that 64 CPUs is exactly half of the 128
CPUs on the machine.
The difference is an artifact of hardware threading.
This system has 64 cores with two hardware threads per core.
As long as fewer than 64 threads are running, each can run
in its own core.
But as soon as there are more than 64 threads, some of the threads
must share cores.
Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not
quite as high as that of two threads each in their own core.
So the performance of the 100M trace is limited not by the
reader-writer lock, but rather by the sharing of hardware resources
between hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from
the ideal line up to 64 threads, then breaks sharply down, parallel
to the 100M trace.
Up to 64 threads, the 10M trace is limited primarily by reader-writer
lock scalability, and beyond that, also by sharing of hardware
resources between hardware threads in a single core.
\QuickQ{}
Power-5 is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
\QuickA{}
In general, newer hardware is improving.
However, it will need to improve more than two orders of magnitude
to permit reader-writer locks to achieve ideal performance on
128 CPUs.
Worse yet, the greater the number of CPUs, the larger the
required performance improvement.
The performance problems of reader-writer locking are therefore
very likely to be with us for quite some time to come.
\QuickQ{}
Is it really necessary to have both sets of primitives?
\QuickA{}
Strictly speaking, no.
One could implement any member of the second set using the
corresponding member of the first set.
For example, one could implement \co{__sync_nand_and_fetch()}
in terms of \co{__sync_fetch_and_nand()} as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp);
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is similarly possible to implement \co{__sync_fetch_and_add()},
\co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()}
in terms of their post-value counterparts.
However, the alternative forms can be quite convenient, both
for the programmer and for the compiler/library implementor.
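For example, \co{__sync_fetch_and_add()} can be recovered from
\co{__sync_add_and_fetch()} by subtracting the addend from the
returned (new) value:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* add_and_fetch() returns the new value; subtracting the
   addend recovers the old value that fetch_and_add() returns. */
ret = __sync_add_and_fetch(p, v) - v;
\end{verbatim}
\end{minipage}
\vspace{5pt}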
\QuickQ{}
Given that these atomic operations will often be able to
generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn't
they be the fastest possible way to get things done?
\QuickA{}
Unfortunately, no.
See Chapter~\ref{chp:Counting} for some stark counterexamples.
\QuickQ{}
What happened to the Linux-kernel equivalents to \co{fork()}
and \co{join()}?
\QuickA{}
They don't really exist.
All tasks executing within the Linux kernel share memory,
at least unless you want to do a huge amount of memory-mapping
work by hand.
\QuickQ{}
Wouldn't the shell normally use \co{vfork()} rather than
\co{fork()}?
\QuickA{}
It might well do that; however, checking is left as an exercise
for the reader.
But in the meantime, I hope that we can agree that \co{vfork()}
is a variant of \co{fork()}, so that we can use \co{fork()}
as a generic term covering both.
\QuickQAC{chp:Counting}{Counting}
\QuickQ{}
Why on earth should efficient and scalable counting be hard?
After all, computers have special hardware for the sole purpose
of doing counting,
addition, subtraction, and lots more besides, don't they???
\QuickA{}
Because the straightforward counting algorithms, for example,
atomic operations on a shared counter, either are slow and scale
badly, or are inaccurate, as will be seen in
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
\QuickQ{}
{ \bfseries Network-packet counting problem. }
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
\QuickA{}
Hint: The act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate---after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1\% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section~\ref{sec:count:Statistical Counters}.
\QuickQ{}
{ \bfseries Approximate structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
\QuickA{}
Hint: The act of updating the counter must again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it must distinguish approximately
between values below the limit and values greater than or
equal to the limit.
See Section~\ref{sec:count:Approximate Limit Counters}.
\QuickQ{}
{ \bfseries Exact structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(again, say 10,000).
Suppose further that these structures are short-lived,
and that the limit is rarely exceeded, that there is almost
always at least one structure in use, and suppose further
still that it is necessary to know exactly when this counter reaches
zero, for example, in order to free up some memory
that is not required unless there is at least one structure
in use.
\QuickA{}
Hint: The act of updating the counter must once again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section~\ref{sec:count:Exact Limit Counters}.
\QuickQ{}
{ \bfseries Removable I/O device access-count problem. }
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to remove the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
\QuickA{}
Hint: Yet again, the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when
the device is in the process of being removed.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
\QuickQ{}
But doesn't the \co{++} operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
\QuickA{}
Although the \co{++} operator \emph{could} be atomic, there
is no requirement that it be so.
And indeed, \co{gcc} often
chooses to load the value to a register, increment
the register, then store the value to memory, which is
decidedly non-atomic.
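In other words, \co{counter++} can compile to the moral equivalent
of the following non-atomic sequence, in which a concurrent
increment can be lost between the load and the store (an atomic
alternative is shown for comparison):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
tmp = counter;     /* load                                  */
tmp = tmp + 1;     /* increment the register copy           */
counter = tmp;     /* store: may overwrite another thread's */
                   /* concurrent increment                  */

/* One atomic alternative using a gcc builtin: */
__sync_fetch_and_add(&counter, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}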
\QuickQ{}
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
\QuickA{}
There are very few trivial parallel programs, and most days I am
not so sure that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested
it, it does not work.
And even if you have tested it, Murphy's Law says that there will
be at least a few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their
place, they never will replace testing, including the
\url{counttorture.h} test setup used here.
After all, proofs are only as good as the assumptions that they
are based on.
Furthermore, proofs can have bugs just as easily as programs can!
\QuickQ{}
Why doesn't the dashed line on the x~axis meet the
diagonal line at $x=1$?
\QuickA{}
Because of the overhead of the atomic operation.
The dashed line on the x~axis represents the overhead of
a single \emph{non-atomic} increment.
After all, an \emph{ideal} algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good
enough for Linus Torvalds, it is good enough for you.
\QuickQ{}
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me, after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
\QuickA{}
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessors.
In addition, as noted at the beginning of this chapter,
counting provides an excellent view of the
issues encountered in shared-memory parallel programs.
\QuickQ{}
But why can't CPU designers simply ship the addition operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
\QuickA{}
It might well be possible to do this in some cases.
However, there are a few complications:
\begin{enumerate}
\item If the value of the variable is required, then the
thread will be forced to wait for the operation
to be shipped to the data, and then for the result
to be shipped back.
\item If the atomic increment must be ordered with respect
to prior and/or subsequent operations, then the thread
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation
completed to be shipped back.
\item Shipping operations among CPUs will likely require
more lines in the system interconnect, which will consume
more die area and more electrical power.
\end{enumerate}
But what if neither of the first two conditions holds?
Then you should think carefully about the algorithms discussed
in Section~\ref{sec:count:Statistical Counters}, which achieve
near-ideal performance on commodity hardware.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}}
\end{center}
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
\end{figure}
If either or both of the first two conditions hold, there is
\emph{some} hope for improved hardware.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as $O(\log N)$,
where $N$ is the number of CPUs, as shown in
Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
And CPUs with this sort of hardware optimization are starting to
appear as of 2011.
This is a great improvement over the $O(N)$ performance
of current hardware shown in
Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
and it is possible that hardware latencies might decrease
further if innovations such as three-dimensional fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do \emph{much} better.
\QuickQ{}
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
\QuickA{}
No, because modulo addition is still commutative and associative.
At least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior, never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days.
Unfortunately, compilers frequently carry out optimizations that
assume that signed integers will not overflow, so if your code
allows signed integers to overflow, you can run into trouble
even on twos-complement hardware.
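To see why unsigned wraparound is harmless here, consider the
following minimal sketch, which shows that the difference between
two unsigned counter samples remains exact even when the counter
wraps between them:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <assert.h>
#include <limits.h>

int main(void)
{
  unsigned long before = ULONG_MAX - 1; /* just before wrap */
  unsigned long after = before + 5;     /* wraps modulo 2^N */

  assert(after - before == 5);          /* still exact */
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}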
That said, one potential source of additional complexity arises
when attempting to gather (say) a 64-bit sum from 32-bit
per-thread counters.
Dealing with this added complexity is left as
an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful.
\QuickQ{}
An array???
But doesn't that limit the number of threads?
\QuickA{}
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads,
for example, using the \co{gcc} \co{__thread} facility,
as shown in
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}.
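As a minimal sketch (the names here are illustrative rather than
taken from that section), a per-thread counter based on
\co{__thread} might look like this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
__thread unsigned long counter = 0; /* one per thread */

static inline void inc_count(void)
{
  counter++; /* updates only this thread's instance */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}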
\QuickQ{}
What other choice does gcc have, anyway???
\QuickA{}
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a \co{long}.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
Alternatively, use of volatile accesses such as those provided
by \co{ACCESS_ONCE()}~\cite{JonCorbet2012ACCESS:ONCE}
can help constrain the compiler, at least
in cases where the hardware is capable of accessing the value
with a single memory-reference instruction.
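For reference, the Linux-kernel-style definition of
\co{ACCESS_ONCE()} is simply a volatile cast, shown here as a
sketch:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Linux-kernel-style volatile-access wrapper: forces
 * the compiler to emit exactly one access to x here. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}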
\QuickQ{}
How does the per-thread \co{counter} variable in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
get initialized?
\QuickA{}
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of \co{counter}
will be zero.
Furthermore, in the common case where the user is interested only
in differences between consecutive reads
from statistical counters, the initial value is irrelevant.
\QuickQ{}
How is the code in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
supposed to permit more than one counter?
\QuickA{}
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
\QuickQ{}
The read operation takes time to sum up the per-thread values,
and during that time, the counter could well be changing.
This means that the value returned by
\co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
will not necessarily be exact.
Assume that the counter is being incremented at rate
$r$ counts per unit time, and that \co{read_count()}'s
execution consumes $\Delta$ units of time.
What is the expected error in the return value?
\QuickA{}
Let's do worst-case analysis first, followed by a less
conservative analysis.
In the worst case, the read operation completes immediately,
but is then delayed for $\Delta$ time units before returning,
in which case the worst-case error is simply $r \Delta$.
This worst-case behavior is rather unlikely, so let us instead
consider the case where the reads from each of the $N$
counters are spaced equally over the time period $\Delta$.
There will be $N+1$ intervals of duration $\frac{\Delta}{N+1}$
between the $N$ reads.
Because the total increment rate $r$ is spread across the $N$
threads, each individual counter is incremented at an average
rate of $r/N$.
The error due to the delay after the read from the last thread's
counter is therefore given by $\frac{r \Delta}{N \left( N + 1 \right)}$,
the second-to-last thread's counter by
$\frac{2 r \Delta}{N \left( N + 1 \right)}$,
the third-to-last by
$\frac{3 r \Delta}{N \left( N + 1 \right)}$,
and so on.
The total error is given by the sum of the errors due to the
reads from each thread's counter, which is:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\sum_{i = 1}^N i
\end{equation}
Expressing the summation in closed form yields:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\frac{N \left( N + 1 \right)}{2}
\end{equation}
Cancelling yields the intuitively expected result:
\begin{equation}
\frac{r \Delta}{2}
\end{equation}
It is important to remember that error continues accumulating
as the caller executes code making use of the count returned
by the read operation.
For example, if the caller spends time $t$ executing some
computation based on the result of the returned count, the
worst-case error will have increased to $r \left( \Delta + t \right)$.
The expected error will have similarly increased to:
\begin{equation}
r \left( \frac{\Delta}{2} + t \right)
\end{equation}
Of course, it is sometimes unacceptable for the counter to
continue incrementing during the read operation.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses a way to handle this situation.
All that aside, in most uses of statistical counters, the
error in the value returned by \co{read_count()} is
irrelevant.
This irrelevance is due to the fact that the time required
for \co{read_count()} to execute is normally extremely
small compared to the time interval between successive
calls to \co{read_count()}.
\QuickQ{}
Why doesn't \co{inc_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
need to use atomic instructions?
After all, we now have multiple threads accessing the per-thread
counters!
\QuickA{}
Because one of the two threads only reads, and because the
variable is aligned and machine-sized, non-atomic instructions
suffice.
That said, the \co{ACCESS_ONCE()} macro is used to prevent
compiler optimizations that might otherwise prevent the
counter updates from becoming visible to
\co{eventual()}~\cite{JonCorbet2012ACCESS:ONCE}.
An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing out that
they are in fact unnecessary.
That said, atomic instructions would be needed in cases where
the per-thread \co{counter} variables were smaller than the
global \co{global_count}.
However, note that on a 32-bit system,
the per-thread \co{counter} variables
might need to be limited to 32 bits in order to sum them accurately,
but with a 64-bit \co{global_count} variable to avoid overflow.
In this case, it is necessary to zero the per-thread
\co{counter} variables periodically in order to avoid overflow.
It is extremely important to note that this zeroing cannot
be delayed too long or overflow of the smaller per-thread
variables will result.
This approach therefore imposes real-time requirements on the
underlying system, and must in turn be used with extreme care.
In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum
will be modulo the word size.
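As a rough sketch of what such a helper thread does (the
\co{stopflag} variable and the loop structure are assumptions,
not a copy of the figure), \co{eventual()} might look like the
following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void *eventual(void *arg)
{
  int t;
  long sum;

  while (!ACCESS_ONCE(stopflag)) {
    sum = 0;
    for (t = 0; t < NR_THREADS; t++)
      sum += ACCESS_ONCE(counter[t]);
    ACCESS_ONCE(global_count) = sum;
    poll(NULL, 0, 1); /* sleep about one millisecond */
  }
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}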
\QuickQ{}
Won't the single global thread in the function \co{eventual()} of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
be just as severe a bottleneck as a global lock would be?
\QuickA{}
In this case, no.
What will happen instead is that as the number of threads increases,
the estimate of the counter
value returned by \co{read_count()} will become more inaccurate.
\QuickQ{}
Won't the estimate returned by \co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
become increasingly
inaccurate as the number of threads rises?
\QuickA{}
Yes.
If this proves problematic, one fix is to provide multiple
\co{eventual()} threads, each covering its own subset of
the other threads.
In more extreme cases, a tree-like hierarchy of
\co{eventual()} threads might be required.
\QuickQ{}
Given that in the eventually-consistent algorithm shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
both reads and updates have extremely low overhead
and are extremely scalable, why would anyone bother with the
implementation described in
Section~\ref{sec:count:Array-Based Implementation},
given its costly read-side code?
\QuickA{}
The thread executing \co{eventual()} consumes CPU time.
As more of these eventually-consistent counters are added,
the resulting \co{eventual()} threads will eventually
consume all available CPUs.
This implementation therefore suffers a different sort of
scalability limitation, with the scalability limit being in
terms of the number of eventually consistent counters rather
than in terms of the number of threads or CPUs.
\QuickQ{}
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a \co{per_thread()} interface, similar
to the Linux kernel's \co{per_cpu()} primitive, to allow
threads to more easily access each other's per-thread variables?
\QuickA{}
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel
gets to ignore.
When a user-level thread exits, its per-thread variables all
disappear, which complicates the problem of per-thread-variable
access, particularly before the advent of user-level RCU
(see Section~\ref{sec:defer:Read-Copy Update (RCU)}).
In contrast, in the Linux kernel, when a CPU goes offline,
that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its
per-thread variables suddenly come into existence.
In contrast, in the Linux kernel, all per-CPU variables are
mapped and initialized at boot time, regardless of whether
the corresponding CPU exists yet, or indeed, whether the
corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time
maximum bound on the number of CPUs, namely, \co{CONFIG_NR_CPUS},
along with a typically tighter boot-time bound of \co{nr_cpu_ids}.
In contrast, in user space, there is no hard-coded upper limit
on the number of threads.
Of course, both environments must handle dynamically loaded
code (dynamic libraries in user space, kernel modules in the
Linux kernel), which increases the complexity of per-thread
variables.
These complications make it significantly harder for user-space
environments to provide access to other threads' per-thread
variables.
Nevertheless, such access is highly useful, and it is hoped
that it will someday appear.
\QuickQ{}
Doesn't the check for \co{NULL} on line~19 of
Figure~\ref{fig:count:Per-Thread Statistical Counters}
add extra branch mispredictions?
Why not have a variable set permanently to zero, and point
unused counter-pointers to that variable rather than setting
them to \co{NULL}?
\QuickA{}
This is a reasonable strategy.
Checking for the performance difference is left as an exercise
for the reader.
However, please keep in mind that the fastpath is not
\co{read_count()}, but rather \co{inc_count()}.
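A minimal sketch of this strategy follows (the \co{zero_counter}
name is an assumption, while \co{counterp[]} and \co{count_init()}
follow the naming used elsewhere in this chapter):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
static long zero_counter; /* remains permanently zero */
long *counterp[NR_THREADS];

void count_init(void)
{
  int t;

  for (t = 0; t < NR_THREADS; t++)
    counterp[t] = &zero_counter; /* never NULL */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Exiting threads could then point their slot back at
\co{zero_counter} rather than setting it to \co{NULL}.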
\QuickQ{}
Why on earth do we need something as heavyweight as a \emph{lock}
guarding the summation in the function \co{read_count()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
\QuickA{}
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter~\ref{chp:Deferred Processing} will introduce even
lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread
variable, which, as Alexey Roytman notes, would eliminate
the tests against \co{NULL}.
However, array accesses are often slower than accesses to
per-thread variables, and use of an array would imply a
fixed upper bound on the number of threads.
Also, note that neither tests nor locks are needed on the
\co{inc_count()} fastpath.
\QuickQ{}
Why on earth do we need to acquire the lock in
\co{count_register_thread()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
\QuickA{}
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, there is no need to get fancy.
\QuickQ{}
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
\QuickA{}
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline --- even
if the corresponding CPU never existed and never will exist.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 int finalthreadcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum = 0;
15
16 for_each_thread(t)
17 if (counterp[t] != NULL)
18 sum += *counterp[t];
19 return sum;
20 }
21
22 void count_init(void)
23 {
24 }
25
26 void count_register_thread(void)
27 {
28 counterp[smp_thread_id()] = &counter;
29 }
30
31 void count_unregister_thread(int nthreadsexpected)
32 {
33 spin_lock(&final_mutex);
34 finalthreadcount++;
35 spin_unlock(&final_mutex);
36 while (finalthreadcount < nthreadsexpected)
37 poll(NULL, 0, 1);
38 }
\end{verbatim}
}
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
\end{figure}
One workaround is to ensure that each thread continues to exist
until all threads are finished, as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}
(\co{count_tstat.c}).
Analysis of this code is left as an exercise to the reader,
however, please note that it does not fit well into the
\url{counttorture.h} counter-evaluation scheme.
(Why not?)
Chapter~\ref{chp:Deferred Processing} will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
\QuickQ{}
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
\QuickA{}
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter?
Because in the increment-by-one case, the value returned will
be exact in the sense that the counter must necessarily have
taken on that value at some point in time, even if it is impossible
to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might
return values that are inconsistent with any global ordering
of operations.
To see this, suppose that thread~0 adds the value three to its
counter, thread~1 adds the value five to its counter, and
threads~2 and 3 sum the counters.
If the system is ``weakly ordered'' or if the compiler
uses aggressive optimizations, thread~2 might find the
sum to be three and thread~3 might find the sum to be five.
The only possible global orders of the sequence of values
of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone.
Michael Scott used this question to stump Paul E.~McKenney
during Paul's Ph.D. defense.
\QuickQ{}
Given that the reader must sum all the threads' counters,
this could take a long time given large numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
\QuickA{}
One approach would be to maintain a global approximation
to the value.
Updaters would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
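A minimal sketch of this approach (the threshold and the names
are assumptions):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define COUNT_FLUSH_LIM 1024 /* overhead vs. accuracy */

__thread unsigned long counter = 0;
atomic_t global_approx = ATOMIC_INIT(0);

static inline void inc_count(void)
{
  if (++counter >= COUNT_FLUSH_LIM) {
    atomic_add(counter, &global_approx);
    counter = 0;
  }
}

static inline long read_approx(void)
{
  return atomic_read(&global_approx);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}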
The reader is encouraged to think up and try out other approaches,
for example, using a combining tree.
\QuickQ{}
Why does
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
provide \co{add_count()} and \co{sub_count()} instead of the
\co{inc_count()} and \co{dec_count()} interfaces shown in
Section~\ref{sec:count:Statistical Counters}?
\QuickA{}
Because structures come in different sizes.
Of course, a limit counter corresponding to a specific size
of structure might still be able to use
\co{inc_count()} and \co{dec_count()}.
\QuickQ{}
What is with the strange form of the condition on line~3 of
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not the following more intuitive form of the fastpath?
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
3 if (counter + delta <= countermax){
4 counter += delta;
5 return 1;
6 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickA{}
Two words.
``Integer overflow.''
Try the above formulation with \co{counter} equal to 10 and
\co{delta} equal to \co{ULONG_MAX}.
Then try it again with the code shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
A good understanding of integer overflow will be required for
the rest of this example, so if you have never dealt with
integer overflow before, please try several examples to get
the hang of it.
Integer overflow can sometimes be more difficult to get right
than parallel algorithms!
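For concreteness, here is a hedged sketch of an overflow-safe form
of the test, in the spirit of the figure but not a verbatim copy
of it, along with the worked values:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Safe: counter never exceeds countermax, so the
 * subtraction cannot wrap. */
if (countermax - counter >= delta) {
  counter += delta;
  return 1;
}

/* With counter == 10, countermax == 100 (say), and
 * delta == ULONG_MAX:
 *   counter + delta wraps to 9, so the "intuitive"
 *   test wrongly succeeds;
 *   countermax - counter == 90 < ULONG_MAX, so the
 *   test above correctly fails. */
\end{verbatim}
\end{minipage}
\vspace{5pt}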
\QuickQ{}
Why does \co{globalize_count()} zero the per-thread variables,
only to later call \co{balance_count()} to refill them in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not just leave the per-thread variables non-zero?
\QuickA{}
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
\QuickQ{}
Given that \co{globalreserve} counted against us in \co{add_count()},
why doesn't it count for us in \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
\QuickA{}
The \co{globalreserve} variable tracks the sum of all threads'
\co{countermax} variables.
The sum of these threads' \co{counter} variables might be anywhere
from zero to \co{globalreserve}.
We must therefore take a conservative approach, assuming that
all threads' \co{counter} variables are full in \co{add_count()}
and that they are all empty in \co{sub_count()}.
But remember this question, as we will come back to it later.
\QuickQ{}
Suppose that one thread invokes \co{add_count()} shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read},
and then another thread invokes \co{sub_count()}.
Won't \co{sub_count()} return failure even though the value of
the counter is non-zero?
\QuickA{}
Indeed it will!
In many cases, this will be a problem, as discussed in
Section~\ref{sec:count:Simple Limit Counter Discussion}, and
in those cases the algorithms from
Section~\ref{sec:count:Exact Limit Counters}
will likely be preferable.
\QuickQ{}
Why have both \co{add_count()} and \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not simply pass a negative number to \co{add_count()}?
\QuickA{}
Given that \co{add_count()} takes an \co{unsigned} \co{long}
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
\QuickQ{}
Why set \co{counter} to \co{countermax / 2} in line~15 of
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}?
Wouldn't it be simpler to just take \co{countermax} counts?
\QuickA{}
First, it really is reserving \co{countermax} counts
(see line~14); however,
it adjusts so that only half of these are actually in use
by the thread at the moment.
This allows the thread to carry out at least \co{countermax / 2}
increments or decrements before having to refer back to
\co{globalcount} again.
Note that the accounting in \co{globalcount} remains accurate,
thanks to the adjustment in line~18.
\QuickQ{}
In Figure~\ref{fig:count:Schematic of Globalization and Balancing},
even though a quarter of the remaining count up to the limit is
assigned to thread~0, only an eighth of the remaining count is
consumed, as indicated by the uppermost dotted line connecting
the center and the rightmost configurations.
Why is that?
\QuickA{}
The reason this happened is that thread~0's \co{counter} was
set to half of its \co{countermax}.
Thus, of the quarter assigned to thread~0, half of that quarter
(one eighth) came from \co{globalcount}, leaving the other half
(again, one eighth) to come from the remaining count.
There are two purposes for taking this approach:
(1)~To allow thread~0 to use the fastpath for decrements as
well as increments, and
(2)~To reduce the inaccuracies if all threads are monotonically
incrementing up towards the limit.
To see this last point, step through the algorithm and watch
what it does.
\QuickQ{}
Why is it necessary to atomically manipulate the thread's
\co{counter} and \co{countermax} variables as a unit?
Wouldn't it be good enough to atomically manipulate them
individually?
\QuickA{}
This might well be possible, but great care is required.
Note that removing \co{counter} without first zeroing
\co{countermax} could result in the corresponding thread
increasing \co{counter} immediately after it was zeroed,
completely negating the effect of zeroing the counter.
The opposite ordering, namely zeroing \co{countermax} and then
removing \co{counter}, can also result in a non-zero
\co{counter}.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~A fetches its \co{countermax}, and finds that
it is non-zero.
\item Thread~B zeroes Thread~A's \co{countermax}.
\item Thread~B removes Thread~A's \co{counter}.
\item Thread~A, having found that its \co{countermax}
is non-zero, proceeds to add to its \co{counter},
resulting in a non-zero value for \co{counter}.
\end{enumerate}
Again, it might well be possible to atomically manipulate
\co{countermax} and \co{counter} as separate variables,
but it is clear that great care is required.
It is also quite likely that doing so will slow down the
fastpath.
Exploring these possibilities is left as an exercise for
the reader.
\QuickQ{}
In what way does line~7 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
violate the C standard?
\QuickA{}
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
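One possible standard-compliant alternative is sketched below
(the macro name is an assumption), one drawback being that it is
arguably less obvious to the casual reader:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <limits.h>

/* Half the bits of an unsigned long, without
 * assuming eight-bit bytes. */
#define CM_HALF_BITS \
        (sizeof(unsigned long) * CHAR_BIT / 2)
\end{verbatim}
\end{minipage}
\vspace{5pt}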
\QuickQ{}
Given that there is only one \co{ctrandmax} variable,
why bother passing in a pointer to it on line~18 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
\QuickA{}
There is only one \co{ctrandmax} variable \emph{per thread}.
Later, we will see code that needs to pass other threads'
\co{ctrandmax} variables to \co{split_ctrandmax()}.
\QuickQ{}
Why does \co{merge_ctrandmax()} in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
return an \co{int} rather than storing directly into an
\co{atomic_t}?
\QuickA{}
Later, we will see that we need the \co{int} return to pass
to the \co{atomic_cmpxchg()} primitive.
\QuickQ{}
Yecch!
Why the ugly \co{goto} on line~11 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
Haven't you heard of the \co{break} statement???
\QuickA{}
Replacing the \co{goto} with a \co{break} would require keeping
a flag to determine whether or not line~15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the \co{goto} that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
\QuickQ{}
Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
ever fail?
After all, we picked up its old value on line~9 and have not
changed it!
\QuickA{}
Later, we will see how the \co{flush_local_count()} function in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
might update this thread's \co{ctrandmax} variable concurrently
with the execution of the fastpath on lines~8-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
\QuickQ{}
What stops a thread from simply refilling its
\co{ctrandmax} variable immediately after
\co{flush_local_count()} on line 14 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
empties it?
\QuickA{}
This other thread cannot refill its \co{ctrandmax}
until the caller of \co{flush_local_count()} releases the
\co{gblcnt_mutex}.
By that time, the caller of \co{flush_local_count()} will have
finished making use of the counts, so there will be no problem
with this other thread refilling --- assuming that the value
of \co{globalcount} is large enough to permit a refill.
\QuickQ{}
What prevents concurrent execution of the fastpath of either
\co{atomic_add()} or \co{atomic_sub()} from interfering with
the \co{ctrandmax} variable while
\co{flush_local_count()} is accessing it on line~27 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}?
\QuickA{}
Nothing.
Consider the following three cases:
\begin{enumerate}
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
before the \co{split_ctrandmax()} of either fastpath,
then the fastpath will see a zero \co{counter} and
\co{countermax}, and will thus transfer to the slowpath
(unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{split_ctrandmax()} of either fastpath,
but before that fastpath's \co{atomic_cmpxchg()},
then the \co{atomic_cmpxchg()} will fail, causing the
fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{atomic_cmpxchg()} of either fastpath,
then the fastpath will (most likely) complete successfully
before \co{flush_local_count()} zeroes the thread's
\co{ctrandmax} variable.
\end{enumerate}
Either way, the race is resolved correctly.
\QuickQ{}
Given that the \co{atomic_set()} primitive does a simple
store to the specified \co{atomic_t}, how can line~21 of
\co{balance_count()} in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
work correctly in face of concurrent \co{flush_local_count()}
updates to this variable?
\QuickA{}
The caller of both \co{balance_count()} and
\co{flush_local_count()} hold \co{gblcnt_mutex}, so
only one may be executing at a given time.
\QuickQ{}
But signal handlers can be migrated to some other
CPU while running.
Doesn't this possibility mean that atomic instructions
and memory barriers are required to reliably communicate
between a thread and a signal handler that interrupts that
thread?
\QuickA{}
No.
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
the REQ \co{theft} state colored red?
\QuickA{}
To indicate that only the fastpath is permitted to change the
\co{theft} state, and that if the thread remains in this
state for too long, the thread running the slowpath will
resend the POSIX signal.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
the point of having separate REQ and ACK \co{theft} states?
Why not simplify the state machine by collapsing
them into a single REQACK state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
\QuickA{}
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
\begin{enumerate}
\item The slowpath uses the REQ and ACK states to determine
whether the signal should be retransmitted.
If the states were collapsed, the slowpath would have
no choice but to send redundant signals, which would
have the unhelpful effect of needlessly slowing down
the fastpath.
\item The following race would result:
\begin{enumerate}
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and
notes the REQACK state.
\item The thread receives the signal, which also notes
the REQACK state, and, because there is no fastpath
in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the
count, sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling
further fastpath execution for this thread.
\end{enumerate}
The basic problem here is that the combined REQACK state
can be referenced by both the signal handler and the
fastpath.
The clear separation maintained by the four-state
setup ensures orderly state transitions.
\end{enumerate}
That said, you might well be able to make a three-state setup
work correctly.
If you do succeed, compare carefully to the four-state setup.
Is the three-state solution really preferable, and why or why not?
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
function \co{flush_local_count_sig()}, why are there
\co{ACCESS_ONCE()} wrappers around the uses of the
\co{theft} per-thread variable?
\QuickA{}
The first one (on line~11) can be argued to be unnecessary.
The last two (lines~14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines~14-17 as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This would be fatal, as the slowpath might see the transient
value of \co{THEFT_READY}, and start stealing before the
corresponding thread was ready.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why is it safe for line~28 to directly access the other thread's
\co{countermax} variable?
\QuickA{}
Because the other thread is not permitted to change the value
of its \co{countermax} variable unless it holds the
\co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its \co{countermax} variable.
We can therefore safely access it --- but not change it.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why doesn't line~33 check for the current thread sending itself
a signal?
\QuickA{}
There is no need for an additional check.
The caller of \co{flush_local_count()} has already invoked
\co{globalize_count()}, so the check on line~28 will have
succeeded, skipping the later \co{pthread_kill()}.
\QuickQ{}
The code in
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
\QuickA{}
The \co{theft} variable must be declared \co{volatile sig_atomic_t}
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
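A minimal sketch of such a declaration (the per-thread aspects of
the real code are omitted here):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <signal.h>

/* Safe to share between a signal handler and the
 * code that it interrupts. */
static volatile sig_atomic_t theft;
\end{verbatim}
\end{minipage}
\vspace{5pt}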
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
\QuickA{}
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but
irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
\emph{Your} user application hanging!
\QuickQ{}
Not only are POSIX signals slow, sending one to each thread
simply does not scale.
What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
\QuickA{}
One approach is to use the techniques shown in
Section~\ref{sec:count:Eventually Consistent Implementation},
summarizing an approximation to the overall counter value in
a single variable.
Another approach would be to use multiple threads to carry
out the reads, with each such thread interacting with a
specific subset of the updating threads.
\QuickQ{}
What if you want an exact limit counter to be exact only for
its lower limit, but to allow the upper limit to be inexact?
\QuickA{}
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
\QuickQ{}
What else had you better have done when using a biased counter?
\QuickA{}
You had better have set the upper limit to be large enough
to accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
\QuickQ{}
This is ridiculous!
We are \emph{read}-acquiring a reader-writer lock to
\emph{update} the counter?
What are you playing at???
\QuickA{}
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
\QuickQ{}
What other issues would need to be accounted for in a real system?
\QuickA{}
A huge number!
Here are a few to start with:
\begin{enumerate}
\item There could be any number of devices, so that the
global variables are inappropriate, as is the
lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems.
In many cases, it is far better to have the last
completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely
need a return value.
\item If the device fails, the last I/O might never complete.
In such cases, there might need to be some sort of
timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can
fail, but their return values are not checked.
\item Reader-writer locks do not scale well.
One way of avoiding the high read-acquisition costs
of reader-writer locks is presented in
Chapters~\ref{chp:Locking}
and~\ref{chp:Deferred Processing}.
\item The polling loops result in poor energy efficiency.
An event-driven design is preferable.
\end{enumerate}
\QuickQ{}
On the \url{count_stat.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
we see that the read-side performance does not change with the
number of threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
\QuickA{}
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding \co{__thread} variable.
\QuickQ{}
Even on the last row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
\QuickA{}
``Use the right tool for the job.''
As can be seen from
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, an eventually consistent design
featuring a single atomically incremented
variable that can be read out using a single load,
similar to the approach used in
Section~\ref{sec:count:Eventually Consistent Implementation}.
\QuickQ{}
Given the performance data shown in
Table~\ref{tab:count:Limit Counter Performance on Power-6},
we should always prefer update-side signals over read-side
atomic operations, right?
\QuickA{}
That depends on the workload.
Note that you need almost one hundred thousand readers (with roughly
a 60-nanosecond performance gain) to make up for even one
writer (with almost a 5-\emph{millisecond} performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular
workload.
In addition, although memory barriers have historically been
expensive compared to ordinary instructions, you should
check this on the specific hardware you will be running.
The properties of computer hardware do change over time,
and algorithms must change accordingly.
\QuickQ{}
Can advanced techniques be applied to address the lock
contention for readers seen in
Table~\ref{tab:count:Limit Counter Performance on Power-6}?
\QuickA{}
One approach is to give up some update-side performance, as is
done with scalable non-zero indicators
(SNZI)~\cite{FaithEllen:2007:SNZI}.
There are a number of other ways one might go about this, and these
are left as exercises for the reader.
Any number of approaches that apply hierarchy, which replace
frequent global-lock acquisitions with local lock acquisitions
corresponding to lower levels of the hierarchy, should work quite well.
\QuickQ{}
The \co{++} operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
\QuickA{}
In the C++ language, you might well be able to use \co{++}
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
\QuickQ{}
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
\QuickA{}
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism:
\begin{enumerate}
\item Only the most performance-critical portions of the
application must be partitioned, and such portions
are usually a small fraction of the application.
\item Although cache misses are quite slow compared to
individual register-to-register instructions,
they are typically considerably faster than
inter-process-communication primitives, which in
turn are considerably faster than things like
TCP/IP networking.
\item Shared-memory multiprocessors are readily available
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of
shared-memory parallelism.
\end{enumerate}
As always, use the right tool for the job!
\QuickQAC{cha:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}
\QuickQ{}
Is there a better solution to the Dining
Philosophers Problem?
\QuickA{}
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}
\end{center}
\caption{Dining Philosophers Problem, Fully Partitioned}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis}
\end{figure}
One such improved solution is shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned},
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, this approach offers greatly improved disease control.
This solution might seem like cheating to some, but such
``cheating'' is key to finding good solutions to many
concurrency problems.
\QuickQ{}
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
\QuickA{}
Inman was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
\QuickQ{}
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
\QuickA{}
In this case, simply dequeue an item from the non-empty
queue, release both locks, and return.
\QuickQ{}
Is the hashed double-ended queue a good solution?
Why or why not?
\QuickA{}
The best way to answer this is to run \url{lockhdeq.c} on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
% Getting about 500 nanoseconds per element when used as
% a queue on a 4.2GHz Power system. This is roughly the same as
% the version covered by a single lock. Sequential (unlocked
% variant is more than an order of magnitude faster!
The first well-designed performance study will be cited.\footnote{
The studies by Dalessandro
et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are
good starting points.}
Do not forget to compare to a sequential implementation!
\QuickQ{}
Move \emph{all} the elements to the queue that became empty?
In what possible universe is this brain-dead solution in any
way optimal???
\QuickA{}
It is optimal in the case where data flow switches direction only
rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable
thing to do.
Work-stealing queues are one possible answer to this question.
\QuickQ{}
Why can't the compound parallel double-ended queue
implementation be symmetric?
\QuickA{}
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem
(see Section~\ref{sec:SMPdesign:Dining Philosophers Problem}).
\QuickQ{}
Why is it necessary to retry the right-dequeue operation
on line~28 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
\QuickA{}
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
\co{d->rlock} on line~25 and the time that it reacquired this
same lock on line~27.
\QuickQ{}
Surely the left-hand lock must \emph{sometimes} be available!!!
So why is it necessary that line~25 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
unconditionally release the right-hand lock?
\QuickA{}
It would be possible to use \co{spin_trylock()} to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
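A hedged sketch of that transformation (\co{d->llock} is an
assumed name for the left-hand lock; \co{d->rlock} appears in the
previous answer):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Opportunistically try the left-hand lock; on
 * failure, fall back to dropping the right-hand
 * lock and acquiring both in the prescribed order. */
if (!spin_trylock(&d->llock)) {
  spin_unlock(&d->rlock);
  spin_lock(&d->llock);
  spin_lock(&d->rlock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}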
\QuickQ{}
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
\QuickA{}
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple
concurrent operations at each end?
If so, how? If not, why not?
\QuickQ{}
Is there a significantly better way of handling concurrency
for double-ended queues?
\QuickA{}
One approach is to transform the problem to be solved
so that multiple double-ended queues can be used in parallel,
allowing the simpler single-lock double-ended queue to be used,
and perhaps also replace each double-ended queue with a pair of
conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can achieve very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the dequeue
simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first
in no way implies that it will process that item
first~\cite{AndreasHaas2012FIFOisnt}.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide these
guarantees.
% about twice as fast as hashed version on 4.2GHz Power.
Regardless of whether or not the problem can be transformed
to use multiple queues, it is worth asking whether work can
be batched so that each enqueue and dequeue operation corresponds
to larger units of work.
This batching approach decreases contention on the queue data
structures, which increases both performance and scalability,
as will be seen in
Section~\ref{sec:SMPdesign:Synchronization Granularity}.
After all, if you must incur high synchronization overheads,
be sure you are getting your money's worth.
Other researchers are working on other ways to take advantage
of limited ordering guarantees in
queues~\cite{ChristophMKirsch2012FIFOisntTR}.
\QuickQ{}
Don't all these problems with critical sections mean that
we should just always use
non-blocking synchronization~\cite{MauriceHerlihy90a},
which don't have critical sections?
\QuickA{}
Although non-blocking synchronization can be very useful
in some situations, it is no panacea.
Also, non-blocking synchronization really does have
critical sections, as noted by Josh Triplett.
For example, in a non-blocking algorithm based on
compare-and-swap operations, the code starting at the
initial load and continuing to the compare-and-swap
is in many ways analogous to a lock-based critical section.
\QuickQ{}
What are some ways of preventing a structure from being freed while
its lock is being acquired?
\QuickA{}
Here are a few possible solutions to this \emph{existence guarantee}
problem:
\begin{enumerate}
\item Provide a statically allocated lock that is held while
the per-structure lock is being acquired, which is an
example of hierarchical locking (see
Section~\ref{sec:SMPdesign:Hierarchical Locking}).
Of course, using a single global lock for this purpose
can result in unacceptably high levels of lock contention,
dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing
the structure's address to select the lock to be acquired,
as described in Chapter~\ref{chp:Locking}.
Given a hash function of sufficiently high quality, this
avoids the scalability limitations of the single global
lock, but in read-mostly situations, the lock-acquisition
overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing
them, so that a structure cannot be deallocated while being
referenced.
This works very well, removing the existence-guarantee
burden (and much else besides) from the developer's
shoulders, but imposes the overhead of garbage collection
on the program.
Although garbage-collection technology has advanced
considerably in the past few decades, its overhead
may be unacceptably high for some applications.
In addition, some applications require that the developer
exercise more control over the layout and placement of
data structures than is permitted by most garbage collected
environments.
\item As a special case of a garbage collector, use a global
reference counter, or a global array of reference counters.
\item Use \emph{hazard pointers}~\cite{MagedMichael04a}, which
can be thought of as an inside-out reference count.
Hazard-pointer-based algorithms maintain a per-thread list of
pointers, so that the appearance of a given pointer on
any of these lists acts as a reference to the corresponding
structure.
Hazard pointers are an interesting research direction, but
have not yet seen much use in production (written in 2008).
\item Use transactional memory
(TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95},
so that each reference and
modification to the data structure in question is
performed atomically.
Although TM has engendered much excitement in recent years,
and seems likely to be of some use in production software,
developers should exercise some
caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM},
particularly in performance-critical code.
In particular, existence guarantees require that the
transaction cover the full path from a global reference
to the data elements being updated.
\item Use RCU, which can be thought of as an extremely lightweight
approximation to a garbage collector.
Updaters are not permitted to free RCU-protected
data structures that RCU readers might still be referencing.
RCU is most heavily used for read-mostly data structures,
and is discussed at length in
Chapter~\ref{chp:Deferred Processing}.
\end{enumerate}
For more on providing existence guarantees, see
Chapters~\ref{chp:Locking} and \ref{chp:Deferred Processing}.
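As an illustration of the second item in the list above, here is a
hedged sketch of a hashed array of statically allocated locks (the
names, size, and hash function are assumptions):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdint.h>

#define LOCK_HASH_SIZE 256 /* power of two */

/* Each element initialized with spin_lock_init()
 * at program startup. */
static spinlock_t lockarray[LOCK_HASH_SIZE];

static spinlock_t *addr_to_lock(void *p)
{
  /* Drop low-order bits, which are constrained
   * by alignment. */
  return &lockarray[((uintptr_t)p >> 4) %
                    LOCK_HASH_SIZE];
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Acquiring the lock returned by \co{addr_to_lock(p)} before
dereferencing \co{p} then prevents the structure from being freed
while the lock is held, provided that the code freeing the
structure acquires that same lock.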
\QuickQ{}
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
Figure~\ref{fig:SMPdesign:Matrix Multiply Efficiency}
have efficiency of exactly 1.0 when running on only one thread?
\QuickA{}
The \texttt{matmul.c} program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
\QuickQ{}
How are data-parallel techniques going to help with matrix
multiply?
It is \emph{already} data parallel!!!
\QuickA{}
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the
size of the partitions.
For example, if you split a 64-by-64 matrix multiply across
64 threads, each thread gets only 64 floating-point multiplies.
The cost of a floating-point multiply is minuscule compared to
the overhead of thread creation.
Moral: If you have a parallel program with variable input,
always include a check for the input size being too small to
be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful
to incur the overhead required to spawn a thread, now is it?
\QuickQ{}
In what situation would hierarchical locking work well?
\QuickA{}
If the comparison on line~31 of
Figure~\ref{fig:SMPdesign:Hierarchical-Locking Hash Table Search}
were replaced by a much heavier-weight operation,
then releasing {\tt bp->bucket\_lock} \emph{might} reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of {\tt cur->node\_lock}.
\QuickQ{}
In Figure~\ref{fig:SMPdesign:Allocator Cache Performance},
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
Why?
\QuickA{}
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
\QuickQ{}
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-thread target
pool size $s$ of three, number of threads $n$ equal to two,
and assuming that the per-thread pools are initially
empty with none of the memory in use, what is the smallest allocation
run length $m$ at which failures can occur?
(Recall that each thread repeatedly allocates $m$ blocks of memory,
and then frees the $m$ blocks of memory.)
Alternatively, given $n$ threads each with pool size $s$, and
where each thread repeatedly first allocates $m$ blocks of memory
and then frees those $m$ blocks, how large must the global pool
size be?
\QuickA{}
The exact solution to this problem is left as an exercise to
the reader.
The first solution received will be credited to its submitter.
As a rough rule of thumb, the global pool size should be at least
$m+2sn$, where
$m$ is the maximum number of elements allocated at a given time,
$s$ is the per-CPU pool size,
and $n$ is the number of CPUs.
\QuickQAC{chp:Locking}{Locking}
\QuickQ{}
Just how can serving as a whipping boy be considered to be
in any way honorable???
\QuickA{}
The reason locking serves as a research-paper whipping boy is
because it is heavily used in practice.
In contrast, if no one used or cared about locking, most research
papers would not bother even mentioning it.
\QuickQ{}
But the definition of deadlock only said that each thread
was holding at least one lock and waiting on another lock
that was held by some thread.
How do you know that there is a cycle?
\QuickA{}
Suppose that there is no cycle in the graph.
We would then have a directed acyclic graph (DAG), which would
have at least one leaf node.
If this leaf node was a lock, then we would have a thread
that was waiting on a lock that wasn't held by any thread,
which violates the definition.
(And in this case the thread would immediately acquire the
lock.)
On the other hand, if this leaf node was a thread, then
we would have a thread that was not waiting on any lock,
again violating the definition.
(And in this case, the thread would either be running or
be blocked on something that is not a lock.)
Therefore, given this definition of deadlock, there must
be a cycle in the corresponding graph.
\QuickQ{}
Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and
the caller, even given that the library code never invokes
any of the caller's functions?
\QuickA{}
Indeed there are!
Here are a few of them:
\begin{enumerate}
\item If one of the library function's arguments is a pointer
to a lock that this library function acquires, and if
the library function holds one of its locks while
acquiring the caller's lock, then we could have a
deadlock cycle involving both caller and library locks.
\item If one of the library functions returns a pointer to
a lock that is acquired by the caller, and if the
caller acquires one of its locks while holding the
library's lock, we could again have a deadlock
cycle involving both caller and library locks.
\item If one of the library functions acquires a lock and
then returns while still holding it, and if the caller
acquires one of its locks, we have yet another way
to create a deadlock cycle involving both caller
and library locks.
\item If the caller has a signal handler that acquires
locks, then the deadlock cycle can involve both
caller and library locks.
In this case, however, the library's locks are
innocent bystanders in the deadlock cycle.
That said, please note that acquiring a lock from
within a signal handler is a no-no in most
environments---it is not just a bad idea, it
is unsupported.
\end{enumerate}
\QuickQ{}
But if \co{qsort()} releases all its locks before invoking
the comparison function, how can it protect against races
with other \co{qsort()} threads?
\QuickA{}
By privatizing the data elements being compared
(as discussed in Chapter~\ref{chp:Data Ownership})
or through use of deferral mechanisms such as
reference counting (as discussed in
Chapter~\ref{chp:Deferred Processing}).
\QuickQ{}
Name one common exception where it is perfectly reasonable
to pass a pointer to a lock into a function.
\QuickA{}
Locking primitives, of course!
\QuickQ{}
Doesn't the fact that \co{pthread_cond_wait()} first releases the
mutex and then re-acquires it eliminate the possibility of deadlock?
\QuickA{}
Absolutely not!
Consider a program that acquires \co{mutex_a}, and then
\co{mutex_b}, in that order, and then passes \co{mutex_a}
to \co{pthread_cond_wait()}.
Now, \co{pthread_cond_wait()} will release \co{mutex_a}, but
will re-acquire it before returning.
If some other thread acquires \co{mutex_a} in the meantime
and then blocks on \co{mutex_b}, the program will deadlock.
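For concreteness, here is a minimal sketch of that scenario (the
thread bodies and condition variable are hypothetical, and the
usual while-loop around \co{pthread_cond_wait()} is omitted for
brevity):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void *thread_1(void *arg)
{
  pthread_mutex_lock(&mutex_a);
  pthread_mutex_lock(&mutex_b);
  /* Releases mutex_a, but still holds mutex_b. */
  pthread_cond_wait(&cond, &mutex_a);
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}

void *thread_2(void *arg)
{
  pthread_mutex_lock(&mutex_a);
  /* Deadlock: thread_1 still holds mutex_b. */
  pthread_mutex_lock(&mutex_b);
  pthread_cond_signal(&cond);
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because \co{thread_2()} never reaches its \co{pthread_cond_signal()},
\co{thread_1()} waits forever, despite \co{pthread_cond_wait()}
having released \co{mutex_a}.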
\QuickQ{}
Can the transformation from
Figure~\ref{fig:locking:Protocol Layering and Deadlock} to
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
be applied universally?
\QuickA{}
Absolutely not!
This transformation assumes that the
\co{layer_2_processing()} function is idempotent, given that
it might be executed multiple times on the same packet when
the \co{layer_1()} routing decision changes.
Therefore, in real life, this transformation can become
arbitrarily complex.
\QuickQ{}
But the complexity in
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
is well worthwhile given that it avoids deadlock, right?
\QuickA{}
Maybe.
If the routing decision in \co{layer_1()} changes often enough,
the code will always retry, never making forward progress.
This is termed ``livelock'' if no thread makes any forward progress or
``starvation''
if some threads make forward progress but others do not
(see Section~\ref{sec:locking:Livelock and Starvation}).
\QuickQ{}
When using the ``acquire needed locks first'' approach described in
Section~\ref{sec:locking:Acquire Needed Locks First},
how can livelock be avoided?
\QuickA{}
Provide an additional global lock.
If a given thread has repeatedly tried and failed to acquire the needed
locks, then have that thread unconditionally acquire the new
global lock, and then unconditionally acquire any needed locks.
(Suggested by Doug Lea.)
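A hedged sketch of this approach, using the Linux-kernel-style
spinlock primitives used elsewhere in this book (the
\co{global_lock} and \co{MAX_RETRIES} names are illustrative),
might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define MAX_RETRIES 10      /* Illustrative threshold. */

spinlock_t global_lock;     /* Taken only after repeated failure. */

/* Acquire both a and b; return nonzero if global_lock is also held. */
int acquire_pair(spinlock_t *a, spinlock_t *b)
{
  int retries;

  for (retries = 0; retries < MAX_RETRIES; retries++) {
    spin_lock(a);
    if (spin_trylock(b))
      return 0;             /* Caller releases a and b. */
    spin_unlock(a);
  }
  spin_lock(&global_lock);  /* Serialize the persistent losers. */
  spin_lock(a);             /* Safe: other threads back off via */
  spin_lock(b);             /* spin_trylock(), so no deadlock.  */
  return 1;                 /* Caller also releases global_lock. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}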
\QuickQ{}
Why is it illegal to acquire a Lock~A that is acquired outside
of a signal handler without blocking signals while holding
a Lock~B that is acquired within a signal handler?
\QuickA{}
Because this would lead to deadlock.
Given that Lock~A is held outside of a signal
handler without blocking signals, a signal might be handled while
holding this lock.
The corresponding signal handler might then acquire
Lock~B, so that Lock~B is acquired while holding Lock~A.
Therefore, if we also acquire Lock~A while holding Lock~B
as called out in the question, we will have a deadlock cycle.
Therefore, it is illegal to acquire a lock that is acquired outside
of a signal handler without blocking signals while holding
another lock that is acquired within a signal handler.
\QuickQ{}
How can you legally block signals within a signal handler?
\QuickA{}
One of the simplest and fastest ways to do so is to use
the \co{sa_mask} field of the \co{struct sigaction} that
you pass to \co{sigaction()} when setting up the signal.
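For example, a hedged sketch of a setup that blocks \co{SIGUSR2}
while the (hypothetical) handler for \co{SIGUSR1} runs might look
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <signal.h>
#include <string.h>

static void my_handler(int sig)
{
  /* Hypothetical handler body; may acquire the handler's lock. */
}

void install_my_handler(void)
{
  struct sigaction sa;

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = my_handler;
  sigemptyset(&sa.sa_mask);
  sigaddset(&sa.sa_mask, SIGUSR2); /* Blocked while handler runs. */
  sigaction(SIGUSR1, &sa, NULL);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}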
\QuickQ{}
If acquiring locks in signal handlers is such a bad idea, why
even discuss ways of making it safe?
\QuickA{}
Because these same rules apply to the interrupt handlers used in
operating-system kernels and in some embedded applications.
In many application environments, acquiring locks in signal
handlers is frowned upon~\cite{OpenGroup1997pthreads}.
However, that does not stop clever developers from (usually
unwisely) fashioning home-brew locks out of atomic operations.
And atomic operations are in many cases perfectly legal in
signal handlers.
\QuickQ{}
Given an object-oriented application that passes control freely
among a group of objects such that there is no straightforward
locking hierarchy,\footnote{
Also known as ``object-oriented spaghetti code.''}
layered or otherwise, how can this
application be parallelized?
\QuickA{}
There are a number of approaches:
\begin{enumerate}
\item In the case of parametric search via simulation,
where a large number of simulations will be run
in order to converge on (for example) a good design
for a mechanical or electrical device, leave the
simulation single-threaded, but run many instances
of the simulation in parallel.
This retains the object-oriented design,
gains parallelism at a higher level, and likely
also avoids synchronization overhead.
\item Partition the objects into groups such that there
is no need to operate on objects in
more than one group at a given time.
Then associate a lock with each group.
This is an example of a single-lock-at-a-time
design, which is discussed in
Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}.
\item Partition the objects into groups such that threads
can all operate on objects in the groups in some
groupwise ordering.
Then associate a lock with each group, and impose a
locking hierarchy over the groups.
\item Impose an arbitrarily selected hierarchy on the locks,
and then use conditional locking if it is necessary
to acquire a lock out of order, as was discussed in
Section~\ref{sec:locking:Conditional Locking}.
\item Before carrying out a given group of operations, predict
which locks will be acquired, and attempt to acquire them
before actually carrying out any updates.
If the prediction turns out to be incorrect, drop
all the locks and retry with an updated prediction
that includes the benefit of experience.
This approach was discussed in
Section~\ref{sec:locking:Acquire Needed Locks First}.
\item Use transactional memory.
This approach has a number of advantages and disadvantages
which will be discussed in
Section~\ref{sec:future:Transactional Memory}.
\item Refactor the application to be more concurrency-friendly.
This would likely also have the side effect of making
the application run faster even when single-threaded, but might
also make it more difficult to modify the application.
\item Use techniques from later chapters in addition to locking.
\end{enumerate}
\QuickQ{}
How can the livelock shown in
Figure~\ref{fig:locking:Abusing Conditional Locking}
be avoided?
\QuickA{}
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
provides some good hints.
In many cases, livelocks are a hint that you should revisit your
locking design.
Or visit it in the first place if your locking design
``just grew''.
That said, one good-and-sufficient approach due to Doug Lea
is to use conditional locking as described in
Section~\ref{sec:locking:Conditional Locking}, but combine this
with acquiring all needed locks first, before modifying shared
data, as described in
Section~\ref{sec:locking:Acquire Needed Locks First}.
If a given critical section retries too many times,
unconditionally acquire
a global lock, then unconditionally acquire all the needed locks.
This avoids both deadlock and livelock, and scales reasonably
assuming that the global lock need not be acquired too often.
\QuickQ{}
What problems can you spot in the code in
Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}?
\QuickA{}
Here are a couple:
\begin{enumerate}
\item A one-second wait is way too long for most uses.
Wait intervals should begin with roughly the time
required to execute the critical section, which will
normally be in the microsecond or millisecond range.
\item The code does not check for overflow.
On the other hand, this bug is nullified
by the previous bug: 32 bits worth of seconds is
more than 50 years.
\end{enumerate}
\QuickQ{}
Wouldn't it be better just to use a good parallel design
so that lock contention was low enough to avoid unfairness?
\QuickA{}
It would be better in some sense, but there are situations
where it can be appropriate to use
designs that sometimes result in high lock contention.
For example, imagine a system that is subject to a rare error
condition.
It might well be best to have a simple error-handling design
that has poor performance and scalability for the duration of
the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of
those rare error conditions is in effect.
That said, it is usually worth putting some effort into
attempting to produce a design that is both simple and
efficient during error conditions, for example, by partitioning
the problem.
\QuickQ{}
How might the lock holder be interfered with?
\QuickA{}
If the data protected by the lock is in the same cache line
as the lock itself, then attempts by other CPUs to acquire
the lock will result in expensive cache misses on the part
of the CPU holding the lock.
This is a special case of false sharing, which can also occur
if a pair of variables protected by different locks happen
to share a cache line.
In contrast, if the lock is in a different cache line than
the data that it protects, the CPU holding the lock will
usually suffer a cache miss only on first access to a given
variable.
Of course, the downside of placing the lock and data into separate
cache lines is that the code will incur two cache misses rather
than only one in the uncontended case.
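As an illustration, assuming a 64-byte cache line and the
Linux-kernel-style \co{spinlock_t} used elsewhere in this book,
the two layouts might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define CACHE_LINE_SIZE 64  /* Assumed cache-line size. */

/* Lock and data share a cache line: CPUs spinning on ->lock
 * inflict cache misses on the lock holder's use of ->data. */
struct packed_data {
  spinlock_t lock;
  unsigned long data;
};

/* Lock and data in separate cache lines: no such interference,
 * but two cache misses in the uncontended case. */
struct padded_data {
  spinlock_t lock;
  char pad[CACHE_LINE_SIZE - sizeof(spinlock_t)];
  unsigned long data;
};
\end{verbatim}
\end{minipage}
\vspace{5pt}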
\QuickQ{}
Does it ever make sense to have an exclusive lock acquisition
immediately followed by a release of that same lock, that is,
an empty critical section?
\QuickA{}
This usage is rare, but is occasionally used.
The point is that the semantics of exclusive locks have two
components: (1)~the familiar data-protection semantic and
(2)~a messaging semantic, where releasing a given lock notifies
a waiting acquisition of that same lock.
One historical use of empty critical sections appeared in the
networking stack of the 2.4 Linux kernel.
This usage pattern can be thought of as a way of approximating
the effects of read-copy update (RCU), which is discussed in
Section~\ref{sec:defer:Read-Copy Update (RCU)}.
The empty-lock-critical-section idiom can also be used to
reduce lock contention in some situations.
For example, consider a multithreaded user-space application where
each thread processes units of work maintained in a per-thread
list, and where threads are prohibited from touching each others'
lists.
There could also be updates that require that all previously
scheduled units of work have completed before the update can
progress.
One way to handle this is to schedule a unit of work on each
thread, so that when all of these units of work complete, the
update may proceed.
In some applications, threads can come and go.
For example, each thread might correspond to one user of the
application, and thus be removed when that user logs out or
otherwise disconnects.
In many applications, threads cannot depart atomically: They must
instead explicitly unravel themselves from various portions of
the application using a specific sequence of actions.
One specific action will be refusing to accept further requests
from other threads, and another specific action will be disposing
of any remaining units of work on its list, for example, by
placing these units of work in a global work-item-disposal list
to be taken by one of the remaining threads.
(Why not just drain the thread's work-item list by executing
each item?
Because a given work item might generate more work items, so
that the list could not be drained in a timely fashion.)
If the application is to perform and scale well, a good locking
design is required.
One common solution is to have a global lock (call it \co{G})
protecting the entire
process of departing (and perhaps other things as well),
with finer-grained locks protecting the
individual unraveling operations.
Now, a departing thread must clearly refuse to accept further
requests before disposing of the work on its list, because
otherwise additional work might arrive after the disposal action,
which would render that disposal action ineffective.
So simplified pseudocode for a departing thread might be as follows:
\begin{enumerate}
\item Acquire lock \co{G}.
\item Acquire the lock guarding communications.
\item Refuse further communications from other threads.
\item Release the lock guarding communications.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all pending work items to the global
work-item-disposal list.
\item Release the lock guarding the global work-item-disposal list.
\item Release lock \co{G}.
\end{enumerate}
Of course, a thread that needs to wait for all pre-existing work
items will need to take departing threads into account.
To see this, suppose that this thread starts waiting for all
pre-existing work items just after a departing thread has refused
further communications from other threads.
How can this thread wait for the departing thread's work items
to complete, keeping in mind that threads are not allowed to
access each others' lists of work items?
One straightforward approach is for this thread to acquire \co{G}
and then the lock guarding the global work-item-disposal list, then
move the work items to its own list.
The thread then releases both locks,
places a work item on the end of its own list,
and then waits for all of the work items that it placed on each thread's
list (including its own) to complete.
This approach does work well in many cases, but if special
processing is required for each work item as it is pulled in
from the global work-item-disposal list, the result could be
excessive contention on \co{G}.
One way to avoid that contention is to acquire \co{G} and then
immediately release it.
Then the process of waiting for all prior work items looks
something like the following:
\begin{enumerate}
\item Set a global counter to one and initialize a condition
variable to zero.
\item Send a message to all threads to cause them to atomically
increment the global counter, and then to enqueue a
work item.
The work item will atomically decrement the global
counter, and if the result is zero, it will set a
condition variable to one.
\item Acquire \co{G}, which will wait on any currently departing
thread to finish departing.
Because only one thread may depart at a time, all the
remaining threads will have already received the message
sent in the preceding step.
\item Release \co{G}.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all work items from the global work-item-disposal list
to this thread's list, processing them as needed along the way.
\item Release the lock guarding the global work-item-disposal list.
\item Enqueue an additional work item onto this thread's list.
(As before, this work item will atomically decrement
the global counter, and if the result is zero, it will
set a condition variable to one.)
\item Wait for the condition variable to take on the value one.
\end{enumerate}
Once this procedure completes, all pre-existing work items are
guaranteed to have completed.
The empty critical sections are using locking for messaging as
well as for protection of data.
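For emphasis, the messaging-only portion of the above procedure,
assuming the global lock \co{G} from the pseudocode, is simply:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
spin_lock(&G);   /* Wait for any departing thread. */
spin_unlock(&G); /* No shared data touched at all.  */
\end{verbatim}
\end{minipage}
\vspace{5pt}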
\QuickQ{}
Is there any other way for the VAX/VMS DLM to emulate
a reader-writer lock?
\QuickA{}
There are in fact several.
One way would be to use the null, protected-read, and exclusive
modes.
Another way would be to use the null, protected-read, and
concurrent-write modes.
A third way would be to use the null, concurrent-read, and
exclusive modes.
\QuickQ{}
The code in
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention}
is ridiculously complicated!
Why not conditionally acquire a single global lock?
\QuickA{}
Conditionally acquiring a single global lock does work very well,
but only for relatively small numbers of CPUs.
To see why it is problematic in systems with many hundreds of
CPUs, look at
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
and extrapolate the delay from eight to 1,000 CPUs.
\QuickQ{}
Wait a minute!
If we ``win'' the tournament on line~16 of
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention},
we get to do all the work of \co{do_force_quiescent_state()}.
Exactly how is that a win, really?
\QuickA{}
How indeed?
This just shows that in concurrency, just as in life, one
should take care to learn exactly what winning entails before
playing the game.
\QuickQ{}
Why not rely on the C language's default initialization of
zero instead of using the explicit initializer shown on
line~2 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickA{}
Because this default initialization does not apply to locks
allocated as auto variables within the scope of a function.
\QuickQ{}
Why bother with the inner loop on lines~7-8 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
Why not simply repeatedly do the atomic exchange operation
on line~6?
\QuickA{}
Suppose that the lock is held and that several threads
are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic
exchange operation, they will ping-pong the cache line
containing the lock among themselves, imposing load
on the interconnect.
In contrast, if these threads are spinning in the inner
loop on lines~7-8, they will each spin within their own
caches, putting negligible load on the interconnect.
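For reference, a minimal sketch of this test-and-test-and-set
pattern, substituting GCC's \co{__sync_lock_test_and_set()}
builtin for the figure's \co{xchg()} primitive, might look as
follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void spinlock_acquire(unsigned long *lockp)
{
  /* Atomic exchange: returns nonzero if the lock was held. */
  while (__sync_lock_test_and_set(lockp, 1)) {
    /* Spin with plain loads until the lock appears free. */
    while (*(volatile unsigned long *)lockp)
      continue;
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}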
\QuickQ{}
Why not simply store zero into the lock word on line 14 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickA{}
This can be a legitimate implementation, but only if
this store is preceded by a memory barrier and makes use
of \co{ACCESS_ONCE()}.
The memory barrier is not required when the \co{xchg()}
operation is used because this operation implies a
full memory barrier due to the fact that it returns
a value.
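A hedged sketch of that alternative release, using the
\co{smp_mb()} and \co{ACCESS_ONCE()} primitives used elsewhere
in this book, might be:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
static void spinlock_release(unsigned long *lockp)
{
  smp_mb();                 /* Order critical section first. */
  ACCESS_ONCE(*lockp) = 0;  /* Then a plain store suffices.  */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}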
\QuickQ{}
How can you tell if one counter is greater than another,
while accounting for counter wrap?
\QuickA{}
In the C language, the following macro correctly handles this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
\end{verbatim}
\end{minipage}
\vspace{5pt}
Although it is tempting to simply subtract two signed integers,
this should be avoided because signed overflow is undefined
in the C language.
For example, if the compiler knows that one of the values is
positive and the other negative, it is within its rights to
simply assume that the positive number is greater than the
negative number, even though subtracting the negative number
from the positive number might well result in overflow and
thus a negative number.
How could the compiler know the signs of the two numbers?
It might be able to deduce it based on prior assignments
and comparisons.
In this case, if the per-CPU counters were signed, the compiler
could deduce that they were always increasing in value, and
then might assume that they would never go negative.
This assumption could well lead the compiler to generate
unfortunate code~\cite{PaulEMcKenney2012SignedOverflow,JohnRegehr2010UndefinedBehavior}.
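For example, the following hedged sketch shows the macro
correctly ordering two counter snapshots across a wrap:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <limits.h>
#include <stdio.h>

#define ULONG_CMP_LT(a, b) \
        (ULONG_MAX / 2 < (a) - (b))

int main(void)
{
  unsigned long snap = ULONG_MAX - 1; /* Just before the wrap.  */
  unsigned long cur = snap + 3;       /* Wraps around to 1.     */

  if (ULONG_CMP_LT(snap, cur))        /* True despite the wrap. */
    printf("snap precedes cur\n");
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}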
\QuickQ{}
Which is better, the counter approach or the flag approach?
\QuickA{}
The flag approach will normally suffer fewer cache misses,
but a better answer is to try both and see which works best
for your particular workload.
\QuickQ{}
How can relying on implicit existence guarantees result in
a bug?
\QuickA{}
Here are some bugs resulting from improper use of implicit
existence guarantees:
\begin{enumerate}
\item A program writes the address of a global variable to
a file, then a later instance of that same program
reads that address and attempts to dereference it.
This can fail due to address-space randomization,
to say nothing of recompilation of the program.
\item A module can record the address of one of its variables
in a pointer located in some other module, then attempt
to dereference that pointer after the module has
been unloaded.
\item A function can record the address of one of its on-stack
variables into a global pointer, which some other
function might attempt to dereference after that function
has returned.
\end{enumerate}
I am sure that you can come up with additional possibilities.
\QuickQ{}
What if the element we need to delete is not the first element
of the list on line~8 of
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
\QuickA{}
This is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is invited to adapt this example to a hash table with
full chaining.
\QuickQ{}
What race condition can occur in
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
\QuickA{}
Consider the following sequence of events:
\begin{enumerate}
\item Thread~0 invokes \co{delete(0)}, and reaches line~10 of
the figure, acquiring the lock.
\item Thread~1 concurrently invokes \co{delete(0)}, reaching
line~10, but spins on the lock because Thread~0 holds it.
\item Thread~0 executes lines~11-14, removing the element from
the hashtable, releasing the lock, and then freeing the
element.
\item Thread~0 continues execution, and allocates memory, getting
the exact block of memory that it just freed.
\item Thread~0 then initializes this block of memory as some
other type of structure.
\item Thread~1's \co{spin_lock()} operation fails due to the
fact that what it believes to be \co{p->lock} is no longer
a spinlock.
\end{enumerate}
Because there is no existence guarantee, the identity of the
data element can change while a thread is attempting to acquire
that element's lock on line~10!
\QuickQAC{chp:Data Ownership}{Data Ownership}
\QuickQ{}
What form of data ownership is extremely difficult
to avoid when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
\QuickA{}
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
\QuickQ{}
What synchronization remains in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickA{}
The creation of the threads via the \co{sh} \co{&} operator
and the joining of threads via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
accessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate}
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
\QuickQ{}
Is there any shared data in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickA{}
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent, scoring points
against hapless classmates or colleagues,
and (especially!) avoiding getting anything useful done.
\QuickQ{}
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
\QuickA{}
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for the
reader, as is the task of identifying other algorithms with
similar ownership patterns.
\QuickQ{}
What mechanisms other than POSIX signals may be used to ship
functions?
\QuickA{}
There is a very large number of such mechanisms, including:
\begin{enumerate}
\item System V message queues.
\item Shared-memory dequeue (see
Section~\ref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
\end{enumerate}
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
\QuickQ{}
But none of the data in the \co{eventual()} function shown on
lines~15-32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
\QuickA{}
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on line~1
of the figure.
This situation is similar to that described in
Section~\ref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
lines~17 and~18 of the figure.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
\QuickQ{}
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
\QuickA{}
Yes.
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but provides only
a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache-line bouncing for the common case of updating counters.
\QuickQAC{chp:Deferred Processing}{Deferred Processing}
\QuickQ{}
Why not implement reference-acquisition using
a simple compare-and-swap operation that only
acquires a reference if the reference counter is
non-zero?
\QuickA{}
Although this can resolve the race between the release of
the last reference and acquisition of a new reference,
it does absolutely nothing to prevent the data structure
from being freed and reallocated, possibly as some completely
different type of structure.
It is quite likely that the ``simple compare-and-swap
operation'' would give undefined results if applied to the
differently typed structure.
In short, use of atomic operations such as compare-and-swap
absolutely requires either type-safety or existence guarantees.
\QuickQ{}
Why isn't it necessary to guard against cases where one CPU
acquires a reference just after another CPU releases the last
reference?
\QuickA{}
Because a CPU must already hold a reference in order
to legally acquire another reference.
Therefore, if one CPU releases the last reference,
there cannot possibly be any CPU that is permitted
to acquire a new reference.
This same fact allows the non-atomic check in line~22
of Figure~\ref{fig:defer:Linux Kernel kref API}.
\QuickQ{}
Suppose that just after the \co{atomic_dec_and_test()}
on line~22 of
Figure~\ref{fig:defer:Linux Kernel kref API} is invoked,
that some other CPU invokes \co{kref_get()}.
Doesn't this result in that other CPU now having an illegal
reference to a released object?
\QuickA{}
This cannot happen if these functions are used correctly.
It is illegal to invoke \co{kref_get()} unless you already
hold a reference, in which case the \co{kref_sub()} could
not possibly have decremented the counter to zero.
\QuickQ{}
Suppose that \co{kref_sub()} returns zero, indicating that
the \co{release()} function was not invoked.
Under what conditions can the caller rely on the continued
existence of the enclosing object?
\QuickA{}
The caller cannot rely on the continued existence of the
object unless it knows that at least one reference will
continue to exist.
Normally, the caller will have no way of knowing this, and
must therefore carefully avoid referencing the object after
the call to \co{kref_sub()}.
\QuickQ{}
Why can't the check for a zero reference count be
made in a simple ``if'' statement with an atomic
increment in its ``then'' clause?
\QuickA{}
Suppose that the ``if'' condition completed, finding
the reference counter value equal to one.
Suppose that a release operation executes, decrementing
the reference counter to zero and therefore starting
cleanup operations.
But now the ``then'' clause can increment the counter
back to a value of one, allowing the object to be
used after it has been cleaned up.
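To make the race concrete, here is a hedged sketch contrasting
the broken check-then-increment with an atomic version in the
spirit of the Linux kernel's \co{atomic_inc_not_zero()} (the
function names are illustrative):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Broken: the count can reach zero between the check
 * and the increment, resurrecting a dying object. */
int broken_get(atomic_t *cnt)
{
  if (atomic_read(cnt) != 0) {
    atomic_inc(cnt);
    return 1;
  }
  return 0;
}

/* The check and increment must instead be atomic. */
int safe_get(atomic_t *cnt)
{
  int old;

  do {
    old = atomic_read(cnt);
    if (old == 0)
      return 0;   /* Object already being cleaned up. */
  } while (atomic_cmpxchg(cnt, old, old + 1) != old);
  return 1;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}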
\QuickQ{}
Why does \co{hp_store()} in
Figure~\ref{fig:defer:Hazard-Pointer Storage and Erasure}
take a double indirection to the data element?
Why not \co{void *} instead of \co{void **}?
\QuickA{}
Because \co{hp_store()} must check for concurrent modifications.
To do that job, it needs a pointer to a pointer to the element,
so that it can check for a modification to the pointer to the
element.
\QuickQ{}
Why does \co{hp_store()}'s caller need to restart its
traversal from the beginning in case of failure?
Isn't that inefficient for large data structures?
\QuickA{}
It might be inefficient in some sense, but the fact is that
such restarting is absolutely required for correctness.
To see this, consider a hazard-pointer-protected linked list
containing elements~A, B, and~C that is subjected to the
following sequence of events:
\begin{enumerate}
\item Thread~0 stores a hazard pointer to element~B
(having presumably traversed to element~B from element~A).
\item Thread~1 removes element~B from the list, which sets
the pointer from element~B to element~C to a special
\co{HAZPTR_POISON} value in order to mark the deletion.
Because Thread~0 has a hazard pointer to element~B,
it cannot yet be freed.
\item Thread~1 removes element~C from the list.
Because there are no hazard pointers referencing element~C,
it is immediately freed.
\item Thread~0 attempts to acquire a hazard pointer to
now-removed element~B's successor, but sees the
\co{HAZPTR_POISON} value, and thus returns zero,
forcing the caller to restart its traversal from the
beginning of the list.
\end{enumerate}
Which is a very good thing, because otherwise Thread~0 would
have attempted to access the now-freed element~C,
which might have resulted in arbitrarily horrible
memory corruption, especially if the memory for
element~C had since been re-allocated for some other
purpose.
\QuickQ{}
Given that papers on hazard pointers use the bottom bits
of each pointer to mark deleted elements, what is up with
\co{HAZPTR_POISON}?
\QuickA{}
The published implementations of hazard pointers used
non-blocking synchronization techniques for insertion
and deletion.
These techniques require that readers traversing the
data structure ``help'' updaters complete their updates,
which in turn means that readers need to look at the successor
of a deleted element.
In contrast, we will be using locking to synchronize updates,
which does away with the need for readers to help updaters
complete their updates, which in turn allows us to leave
pointers' bottom bits alone.
This approach allows read-side code to be simpler and faster.
\QuickQ{}
But don't these restrictions on hazard pointers also apply
to other forms of reference counting?
\QuickA{}
These restrictions apply only to reference-counting mechanisms whose
reference acquisition can fail.
\QuickQ{}
But hazard pointers don't write to the data structure!
\QuickA{}
Indeed, they do not.
However, they do write to the hazard pointers themselves,
and, more important, require that possible failures be
handled for all \co{hp_store()} calls, each of which
might fail.
Therefore, although hazard pointers are extremely useful,
it is still worth looking for improved mechanisms.
\QuickQ{}
Why isn't this sequence-lock discussion in Chapter~\ref{chp:Locking},
you know, the one on \emph{locking}?
\QuickA{}
The sequence-lock mechanism is really a combination of two
separate synchronization mechanisms, sequence counts and
locking.
In fact, the sequence-count mechanism is available separately
in the Linux kernel via the
\co{write_seqcount_begin()} and \co{write_seqcount_end()}
primitives.
However, the combined \co{write_seqlock()} and
\co{write_sequnlock()} primitives are used much more heavily
in the Linux kernel.
More importantly, many more people will understand what you
mean if you say ``sequence lock'' than if you say
``sequence count''.
So this section is entitled ``Sequence Locks'' so that people
will understand what it is about just from the title, and
it appears in the ``Deferred Processing'' chapter because
(1)~the emphasis is on the ``sequence count'' aspect of
``sequence locks'' and (2)~a ``sequence lock'' is much more
than merely a lock.
\QuickQ{}
Can you use sequence locks as the only synchronization
mechanism protecting a linked list supporting concurrent
addition, deletion, and search?
\QuickA{}
One trivial way of accomplishing this is to surround all
accesses, including the read-only accesses, with
\co{write_seqlock()} and \co{write_sequnlock()}.
Of course, this solution also prohibits all read-side
parallelism, and furthermore could just as easily be implemented
using simple locking.
If you do come up with a solution that uses \co{read_seqbegin()}
and \co{read_seqretry()} to protect read-side accesses, make
sure that you correctly handle the following sequence of events:
\begin{enumerate}
\item CPU~0 is traversing the linked list, and picks up a pointer
to list element~A.
\item CPU~1 removes element~A from the list and frees it.
\item CPU~2 allocates an unrelated data structure, and gets
the memory formerly occupied by element~A.
In this unrelated data structure, the memory previously
used for element~A's \co{->next} pointer is now occupied
by a floating-point number.
\item CPU~0 picks up what used to be element~A's \co{->next}
pointer, gets random bits, and therefore gets a
segmentation fault.
\end{enumerate}
One way to protect against this sort of problem requires use
of ``type-safe memory'', which will be discussed in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}.
But in that case, you would be using some other synchronization
mechanism in addition to sequence locks!
\QuickQ{}
Why bother with the check on line~19 of
\co{read_seqbegin()} in
Figure~\ref{fig:defer:Sequence-Locking Implementation}?
Given that a new writer could begin at any time, why not
simply incorporate the check into line~31 of
\co{read_seqretry()}?
\QuickA{}
That would be a legitimate implementation.
However, it would not save anything to move the check down
to \co{read_seqretry()}: There would be roughly the same number
of instructions.
Furthermore, the reader's accesses from its doomed read-side
critical section could inflict overhead on the writer in
the form of cache misses.
We can avoid these cache misses by placing the check in
\co{read_seqbegin()} as shown on line~19 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}.
\QuickQ{}
Why is the \co{smp_mb()} on line~29 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}
needed?
\QuickA{}
If it was omitted, both the compiler and the CPU would be
within their rights to move the critical section preceding
the call to \co{read_seqretry()} down below this function.
This would prevent the sequence lock from protecting the
critical section.
The \co{smp_mb()} primitive prevents such reordering.
\QuickQ{}
What prevents sequence-locking updaters from starving readers?
\QuickA{}
Nothing.
This is one of the weaknesses of sequence locking, and as a
result, you should use sequence locking only in read-mostly
situations.
Unless of course read-side starvation is acceptable in your
situation, in which case, go wild with the sequence-locking updates!
\QuickQ{}
What if something else serializes writers, so that the lock
is not needed?
\QuickA{}
In this case, the \co{->lock} field could be omitted, as it
is in \co{seqcount_t} in the Linux kernel.
\QuickQ{}
Why isn't \co{seq} on line 2 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}
\co{unsigned} rather than \co{unsigned long}?
After all, if \co{unsigned} is good enough for the Linux
kernel, shouldn't it be good enough for everyone?
\QuickA{}
Not at all.
The Linux kernel has a number of special attributes that allow
it to ignore the following sequence of events:
\begin{enumerate}
\item Thread 0 executes \co{read_seqbegin()}, picking up
\co{->seq} in line~17, noting that the value is even,
and thus returning to the caller.
\item Thread 0 starts executing its read-side critical section,
but is then preempted for a long time.
\item Other threads repeatedly invoke \co{write_seqlock()} and
\co{write_sequnlock()}, until the value of \co{->seq}
overflows back to the value that Thread~0 fetched.
\item Thread 0 resumes execution, completing its read-side
critical section with inconsistent data.
\item Thread 0 invokes \co{read_seqretry()}, which incorrectly
concludes that Thread~0 has seen a consistent view of
the data protected by the sequence lock.
\end{enumerate}
The Linux kernel uses sequence locking for things that are
updated rarely, with time-of-day information being a case
in point.
This information is updated at most once per millisecond,
so that seven weeks would be required to overflow the counter.
If a kernel thread was preempted for seven weeks, the Linux
kernel's soft-lockup code would be emitting warnings every two
minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries
would be required to overflow, even given an update every
\emph{nano}second.
Therefore, this implementation uses a type for \co{->seq}
that is 64 bits on 64-bit systems.
\QuickQ{}
But doesn't Section~\ref{sec:defer:Sequence Locks}'s seqlock
also permit readers and updaters to get work done concurrently?
\QuickA{}
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the {\tt read\_seqretry()}
primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently
with a seqlock updater will be discarded and redone.
So seqlock readers can \emph{run} concurrently with updaters,
but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence
of concurrent RCU updaters.
\QuickQ{}
What prevents the {\tt list\_for\_each\_entry\_rcu()} from
getting a segfault if it happens to execute at exactly the same
time as the {\tt list\_add\_rcu()}?
\QuickA{}
On all systems running Linux, loads from and stores
to pointers are atomic, that is, if a store to a pointer occurs at
the same time as a load from that same pointer, the load will return
either the initial value or the value stored, never some bitwise
mashup of the two.
In addition, the {\tt list\_for\_each\_entry\_rcu()} always proceeds
forward through the list, never looking back.
Therefore, the {\tt list\_for\_each\_entry\_rcu()} will either see
the element being added by {\tt list\_add\_rcu()} or it will not,
but either way, it will see a valid well-formed list.
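For reference, a hedged sketch of such a concurrent reader and
updater, using the Linux-kernel list-RCU API (the \co{struct foo}
and \co{do_something_with()} helper are illustrative), might look
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct foo {
  struct list_head list;
  int key;
};
LIST_HEAD(foo_head);
DEFINE_SPINLOCK(foo_lock);

void reader(void)  /* May run concurrently with updater(). */
{
  struct foo *p;

  rcu_read_lock();
  list_for_each_entry_rcu(p, &foo_head, list)
    do_something_with(p->key);
  rcu_read_unlock();
}

void updater(struct foo *newp)
{
  spin_lock(&foo_lock);
  list_add_rcu(&newp->list, &foo_head);
  spin_unlock(&foo_lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}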
\QuickQ{}
Why do we need to pass two pointers into
{\tt hlist\_for\_each\_entry\_rcu()}
when only one is needed for {\tt list\_for\_each\_entry\_rcu()}?
\QuickA{}
Because in an hlist it is necessary to check for
NULL rather than for encountering the head.
(Try coding up a single-pointer {\tt hlist\_for\_each\_entry\_rcu()}.
If you come up with a nice solution, it would be a very good thing!)
\QuickQ{}
How would you modify the deletion example to permit more than two
versions of the list to be active?
\QuickA{}
One way of accomplishing this is as shown in
Figure~\ref{fig:defer:Concurrent RCU Deletion}.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 spin_lock(&mylock);
2 p = search(head, key);
3 if (p == NULL)
4 spin_unlock(&mylock);
5 else {
6 list_del_rcu(&p->list);
7 spin_unlock(&mylock);
8 synchronize_rcu();
9 kfree(p);
10 }
\end{verbatim}
}
\caption{Concurrent RCU Deletion}
\label{fig:defer:Concurrent RCU Deletion}
\end{figure}
Note that this means that multiple concurrent deletions might be
waiting in {\tt synchronize\_rcu()}.
\QuickQ{}
How many RCU versions of a given list can be
active at any given time?
\QuickA{}
That depends on the synchronization design.
If a semaphore protecting the update is held across the grace period,
then there can be at most two versions, the old and the new.
However, if only the search, the update, and the
{\tt list\_replace\_rcu()} were protected by a lock, then
there could be an arbitrary number of versions active, limited only
by memory and by how many updates could be completed within a
grace period.
But please note that data structures that are updated so frequently
probably are not good candidates for RCU.
That said, RCU can handle high update rates when necessary.
\QuickQ{}
How can RCU updaters possibly delay RCU readers, given that the
{\tt rcu\_read\_lock()} and {\tt rcu\_read\_unlock()}
primitives neither spin nor block?
\QuickA{}
The modifications undertaken by a given RCU updater will cause the
corresponding CPU to invalidate cache lines containing the data,
forcing the CPUs running concurrent RCU readers to incur expensive
cache misses.
(Can you design an algorithm that changes a data structure
\emph{without}
inflicting expensive cache misses on concurrent readers?
On subsequent readers?)
\QuickQ{}
WTF?
How the heck do you expect me to believe that RCU has a
100-femtosecond overhead when the clock period at 3GHz is more than
300 \emph{picoseconds}?
\QuickA{}
First, consider that the inner loop used to
take this measurement is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 for (i = 0; i < CSCOUNT_SCALE; i++) {
2 rcu_read_lock();
3 rcu_read_unlock();
4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Next, consider the effective definitions of \co{rcu_read_lock()}
and \co{rcu_read_unlock()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define rcu_read_lock() do { } while (0)
2 #define rcu_read_unlock() do { } while (0)
\end{verbatim}
\end{minipage}
\vspace{5pt}
Consider also that the compiler does simple optimizations,
allowing it to replace the loop with:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
i = CSCOUNT_SCALE;
\end{verbatim}
\end{minipage}
\vspace{5pt}
So the "measurement" of 100 femtoseconds is simply the fixed
overhead of the timing measurements divided by the number of
passes through the inner loop containing the calls
to \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
And therefore, this measurement really is in error, in fact,
in error by an arbitrary number of orders of magnitude.
As you can see by the definition of \co{rcu_read_lock()}
and \co{rcu_read_unlock()} above, the actual overhead
is precisely zero.
It certainly is not every day that a timing measurement of
100 femtoseconds turns out to be an overestimate!
\QuickQ{}
Why does both the variability and overhead of rwlock decrease as the
critical-section overhead increases?
\QuickA{}
Because the contention on the underlying
\co{rwlock_t} decreases as the critical-section overhead
increases.
However, the rwlock overhead will not quite drop to that on a single
CPU because of cache-thrashing overhead.
\QuickQ{}
Is there an exception to this deadlock immunity, and if so,
what sequence of events could lead to deadlock?
\QuickA{}
One way to cause a deadlock cycle involving
RCU read-side primitives is via the following (illegal) sequence
of statements:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
idx = srcu_read_lock(&srcucb);
synchronize_srcu(&srcucb);
srcu_read_unlock(&srcucb, idx);
\end{verbatim}
\end{minipage}
\vspace{5pt}
The \co{synchronize_srcu()} cannot return until all
pre-existing SRCU read-side critical sections complete, but
is enclosed in an SRCU read-side critical section that cannot
complete until the \co{synchronize_srcu()} returns.
The result is a classic self-deadlock: you get the same
effect when attempting to write-acquire a reader-writer lock
while read-holding it.
Note that this self-deadlock scenario does not apply to
RCU Classic, because the context switch performed by the
\co{synchronize_rcu()} would act as a quiescent state
for this CPU, allowing a grace period to complete.
However, this is if anything even worse, because data used
by the RCU read-side critical section might be freed as a
result of the grace period completing.
In short, do not invoke synchronous RCU update-side primitives
from within an RCU read-side critical section.
\QuickQ{}
But wait!
This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking!
What gives?
\QuickA{}
This is an effect of the Law of Toy Examples:
beyond a certain point, the code fragments look the same.
The only difference is in how we think about the code.
However, this difference can be extremely important.
For but one example of the importance, consider that if we think
of RCU as a restricted reference counting scheme, we would never
be fooled into thinking that the updates would exclude the RCU
read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement
for reader-writer locking, for example, when you are replacing
reader-writer locking with RCU.
\QuickQ{}
Why the dip in refcnt overhead near 6 CPUs?
\QuickA{}
Most likely NUMA effects.
However, there is substantial variance in the values measured for the
refcnt line, as can be seen by the error bars.
In fact, standard deviations range in excess of 10\% of measured
values in some cases.
The dip in overhead therefore might well be a statistical aberration.
\QuickQ{}
What if the element we need to delete is not the first element
of the list on line~9 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}?
\QuickA{}
As with
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees},
this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is again invited to adapt this example to a hash table with
full chaining.
\QuickQ{}
Why is it OK to exit the RCU read-side critical section on
line~15 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~17?
\QuickA{}
First, please note that the second check on line~14 is
necessary because some other
CPU might have removed this element while we were waiting
to acquire the lock.
However, the fact that we were in an RCU read-side critical section
while acquiring the lock guarantees that this element could not
possibly have been re-allocated and re-inserted into this
hash table.
Furthermore, once we acquire the lock, the lock itself guarantees
the element's existence, so we no longer need to be in an
RCU read-side critical section.
The question as to whether it is necessary to re-check the
element's key is left as an exercise to the reader.
% A re-check is necessary if the key can mutate or if it is
% necessary to reject deleted entries (in cases where deletion
% is recorded by mutating the key.)
\QuickQ{}
Why not exit the RCU read-side critical section on
line~23 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~22?
\QuickA{}
Suppose we reverse the order of these two lines.
Then this code is vulnerable to the following sequence of
events:
\begin{enumerate}
\item CPU~0 invokes \co{delete()}, and finds the element
to be deleted, executing through line~15.
It has not yet actually deleted the element, but
is about to do so.
\item CPU~1 concurrently invokes \co{delete()}, attempting
to delete this same element.
However, CPU~0 still holds the lock, so CPU~1 waits
for it at line~13.
\item CPU~0 executes lines~16 and 17, and blocks at
line~18 waiting for CPU~1 to exit its RCU read-side
critical section.
\item CPU~1 now acquires the lock, but the test on line~14
fails because CPU~0 has already removed the element.
CPU~1 now executes line~22 (which we switched with line~23
for the purposes of this Quick Quiz)
and exits its RCU read-side critical section.
\item CPU~0 can now return from \co{synchronize_rcu()},
and thus executes line~19, sending the element to
the freelist.
\item CPU~1 now attempts to release a lock for an element
that has been freed, and, worse yet, possibly
reallocated as some other type of data structure.
This is a fatal memory-corruption error.
\end{enumerate}
\QuickQ{}
But what if there is an arbitrarily long series of RCU
read-side critical sections in multiple threads, so that at
any point in time there is at least one thread in the system
executing in an RCU read-side critical section?
Wouldn't that prevent any data from a \co{SLAB_DESTROY_BY_RCU}
slab ever being returned to the system, possibly resulting
in OOM events?
\QuickA{}
There could certainly be an arbitrarily long period of time
during which at least one thread is always in an RCU read-side
critical section.
However, the key words in the description in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}
are ``in-use'' and ``pre-existing''.
Keep in mind that a given RCU read-side critical section is
conceptually only permitted to gain references to data elements
that were in use at the beginning of that critical section.
Furthermore, remember that a slab cannot be returned to the
system until all of its data elements have been freed, in fact,
the RCU grace period cannot start until after they have all been
freed.
Therefore, the slab cache need only wait for those RCU read-side
critical sections that started before the freeing of the last element
of the slab.
This in turn means that any RCU grace period that begins after
the freeing of the last element will do---the slab may be returned
to the system after that grace period ends.
\QuickQ{}
Suppose that the \co{nmi_profile()} function was preemptible.
What would need to change to make this example work correctly?
\QuickA{}
One approach would be to use
\co{rcu_read_lock()} and \co{rcu_read_unlock()}
in \co{nmi_profile()}, and to replace the
\co{synchronize_sched()} with \co{synchronize_rcu()},
perhaps as shown in
Figure~\ref{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
\begin{figure}[tbp]
{ \tt \scriptsize
\begin{verbatim}
1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
6
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p;
10
11 rcu_read_lock();
12 p = rcu_dereference(buf);
13 if (p == NULL) {
14 rcu_read_unlock();
15 return;
16 }
17 if (pcvalue >= p->size) {
18 rcu_read_unlock();
19 return;
20 }
21 atomic_inc(&p->entry[pcvalue]);
22 rcu_read_unlock();
23 }
24
25 void nmi_stop(void)
26 {
27 struct profile_buffer *p = buf;
28
29 if (p == NULL)
30 return;
31 rcu_assign_pointer(buf, NULL);
32 synchronize_rcu();
33 kfree(p);
34 }
\end{verbatim}
}
\caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\label{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\end{figure}
\QuickQ{}
Why do some of the cells in
Table~\ref{tab:defer:RCU Wait-to-Finish APIs}
have exclamation marks (``!'')?
\QuickA{}
The API members with exclamation marks (\co{rcu_read_lock()},
\co{rcu_read_unlock()}, and \co{call_rcu()}) were the
only members of the Linux RCU API that Paul E. McKenney was aware
of back in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.
\QuickQ{}
How do you prevent a huge number of RCU read-side critical
sections from indefinitely blocking a \co{synchronize_rcu()}
invocation?
\QuickA{}
There is no need to do anything to prevent RCU read-side
critical sections from indefinitely blocking a
\co{synchronize_rcu()} invocation, because the
\co{synchronize_rcu()} invocation need wait only for
\emph{pre-existing} RCU read-side critical sections.
So as long as each RCU read-side critical section is
of finite duration, there should be no problem.
\QuickQ{}
The \co{synchronize_rcu()} API waits for all pre-existing
interrupt handlers to complete, right?
\QuickA{}
Absolutely not!
And especially not when using preemptible RCU!
You instead want \co{synchronize_irq()}.
Alternatively, you can place calls to \co{rcu_read_lock()}
and \co{rcu_read_unlock()} in the specific interrupt handlers that
you want \co{synchronize_rcu()} to wait for.
\QuickQ{}
What happens if you mix and match?
For example, suppose you use \co{rcu_read_lock()} and
\co{rcu_read_unlock()} to delimit RCU read-side critical
sections, but then use \co{call_rcu_bh()} to post an
RCU callback?
\QuickA{}
If there happened to be no RCU read-side critical
sections delimited by \co{rcu_read_lock_bh()} and
\co{rcu_read_unlock_bh()} at the time \co{call_rcu_bh()}
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by \co{rcu_read_lock()}
and \co{rcu_read_unlock()} is vulnerable to this failure mode.
This vulnerability disappears in -rt kernels, where
RCU Classic and RCU BH both map onto a common implementation.
\QuickQ{}
Hardware interrupt handlers can be thought of as being
under the protection of an implicit \co{rcu_read_lock_bh()},
right?
\QuickA{}
Absolutely not!
And especially not when using preemptible RCU!
If you need to access ``rcu\_bh''-protected data structures
in an interrupt handler, you need to provide explicit calls to
\co{rcu_read_lock_bh()} and \co{rcu_read_unlock_bh()}.
\QuickQ{}
What happens if you mix and match RCU Classic and RCU Sched?
\QuickA{}
In a non-\co{PREEMPT} or a \co{PREEMPT} kernel, mixing these
two works "by accident" because in those kernel builds, RCU Classic
and RCU Sched map to the same implementation.
However, this mixture is fatal in \co{PREEMPT_RT} builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
\co{synchronize_sched()} to return before the
RCU read-side critical section reached its \co{rcu_read_unlock()}
call.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired
by the need for preemptible RCU read-side critical sections.
\QuickQ{}
In general, you cannot rely on \co{synchronize_sched()} to
wait for all pre-existing interrupt handlers,
right?
\QuickA{}
That is correct!
Because -rt Linux uses threaded interrupt handlers, there can
be context switches in the middle of an interrupt handler.
Because \co{synchronize_sched()} waits only until each
CPU has passed through a context switch, it can return
before a given interrupt handler completes.
If you need to wait for a given interrupt handler to complete,
you should instead use \co{synchronize_irq()} or place
explicit RCU read-side critical sections in the interrupt
handlers that you wish to wait on.
\QuickQ{}
Why do both SRCU and QRCU lack asynchronous \co{call_srcu()}
or \co{call_qrcu()} interfaces?
\QuickA{}
Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
\co{synchronize_srcu()} and \co{synchronize_qrcu()}
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.
\QuickQ{}
Under what conditions can \co{synchronize_srcu()} be safely
used within an SRCU read-side critical section?
\QuickA{}
In principle, you can use
\co{synchronize_srcu()} with a given \co{srcu_struct}
within an SRCU read-side critical section that uses some other
\co{srcu_struct}.
In practice, however, doing this is almost certainly a bad idea.
In particular, the code shown in
Figure~\ref{fig:defer:Multistage SRCU Deadlocks}
could still result in deadlock.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 idx = srcu_read_lock(&ssa);
2 synchronize_srcu(&ssb);
3 srcu_read_unlock(&ssa, idx);
4
5 /* . . . */
6
7 idx = srcu_read_lock(&ssb);
8 synchronize_srcu(&ssa);
9 srcu_read_unlock(&ssb, idx);
\end{verbatim}
}
\caption{Multistage SRCU Deadlocks}
\label{fig:defer:Multistage SRCU Deadlocks}
\end{figure}
\QuickQ{}
Why doesn't \co{list_del_rcu()} poison both the \co{next}
and \co{prev} pointers?
\QuickA{}
Poisoning the \co{next} pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the \co{prev}
pointer, so it may safely be poisoned.
\QuickQ{}
Normally, any pointer subject to \co{rcu_dereference()} \emph{must}
always be updated using \co{rcu_assign_pointer()}.
What is an exception to this rule?
\QuickA{}
One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single \co{rcu_assign_pointer()} is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
\co{rcu_assign_pointer()}, though any such assignments that
happen after the structure is globally visible \emph{must} use
\co{rcu_assign_pointer()}.
However, unless this initialization code is on an impressively hot
code-path, it is probably wise to use \co{rcu_assign_pointer()}
anyway, even though it is in theory unnecessary.
It is all too easy for a ``minor'' change to invalidate your cherished
assumptions about the initialization happening privately.
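For concreteness, the exception described above might look
something like the following sketch, in which the structure type
\co{foo}, its fields, and the global pointer \co{gp} are
hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct foo *p;

p = kmalloc(sizeof(*p), GFP_KERNEL);
if (p == NULL)
  return -ENOMEM;
p->a = 1;  /* not yet visible to readers, so */
p->b = 2;  /* plain assignments suffice...   */
rcu_assign_pointer(gp, p);  /* ...but publication
                               needs rcu_assign_pointer() */
\end{verbatim}
\end{minipage}
\vspace{5pt}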
\QuickQ{}
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
\QuickA{}
It can sometimes be difficult for automated
code checkers such as ``sparse'' (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the code shown in
Figure~\ref{fig:defer:Diverse RCU Read-Side Nesting}.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 rcu_read_lock();
2 preempt_disable();
3 p = rcu_dereference(global_pointer);
4
5 /* . . . */
6
7 preempt_enable();
8 rcu_read_unlock();
\end{verbatim}
}
\caption{Diverse RCU Read-Side Nesting}
\label{fig:defer:Diverse RCU Read-Side Nesting}
\end{figure}
Is the \co{rcu_dereference()} primitive in an RCU Classic
or an RCU Sched critical section?
What would you have to do to figure this out?
\QuickQ{}
Why wouldn't any deadlock in the RCU implementation in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
also be a deadlock in any other RCU implementation?
\QuickA{}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 void foo(void)
2 {
3 spin_lock(&my_lock);
4 rcu_read_lock();
5 do_something();
6 rcu_read_unlock();
7 do_something_else();
8 spin_unlock(&my_lock);
9 }
10
11 void bar(void)
12 {
13 rcu_read_lock();
14 spin_lock(&my_lock);
15 do_some_other_thing();
16 spin_unlock(&my_lock);
17 do_whatever();
18 rcu_read_unlock();
19 }
\end{verbatim}
}
\caption{Deadlock in Lock-Based RCU Implementation}
\label{fig:defer:Deadlock in Lock-Based RCU Implementation}
\end{figure}
Suppose the functions \co{foo()} and \co{bar()} in
Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation}
are invoked concurrently from different CPUs.
Then \co{foo()} will acquire \co{my_lock} on line~3,
while \co{bar()} will acquire \co{rcu_gp_lock} on
line~13.
When \co{foo()} advances to line~4, it will attempt to
acquire \co{rcu_gp_lock}, which is held by \co{bar()}.
Then when \co{bar()} advances to line~14, it will attempt
to acquire \co{my_lock}, which is held by \co{foo()}.
Each function is then waiting for a lock that the other
holds, a classic deadlock.
Other RCU implementations neither spin nor block in
\co{rcu_read_lock()}, hence avoiding deadlocks.
\QuickQ{}
Why not simply use reader-writer locks in the RCU implementation
in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
in order to allow RCU readers to proceed in parallel?
\QuickA{}
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel
execution~\cite{McKenney03a}.
On the other hand, use of a reader-writer lock that is
read-acquired in \co{rcu_read_lock()} would avoid the
deadlock condition noted above.
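A minimal sketch of such a reader-writer-lock-based implementation,
using POSIX primitives and omitting error checking, might look as
follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
static pthread_rwlock_t rcu_gp_rwlock =
  PTHREAD_RWLOCK_INITIALIZER;

static void rcu_read_lock(void)
{
  pthread_rwlock_rdlock(&rcu_gp_rwlock);
}

static void rcu_read_unlock(void)
{
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}

static void synchronize_rcu(void)
{
  /* Waits for all pre-existing readers. */
  pthread_rwlock_wrlock(&rcu_gp_rwlock);
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}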
\QuickQ{}
Wouldn't it be cleaner to acquire all the locks, and then
release them all in the loop from lines~15-18 of
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}?
After all, with this change, there would be a point in time
when there were no readers, simplifying things greatly.
\QuickA{}
Making this change would re-introduce the deadlock, so
no, it would not be cleaner.
\QuickQ{}
Is the implementation shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
free from deadlocks?
Why or why not?
\QuickA{}
One deadlock is where a lock is
held across \co{synchronize_rcu()}, and that same lock is
acquired within an RCU read-side critical section.
However, this situation will deadlock any correctly designed
RCU implementation.
After all, the \co{synchronize_rcu()} primitive must wait for all
pre-existing RCU read-side critical sections to complete,
but if one of those critical sections is spinning on a lock
held by the thread executing the \co{synchronize_rcu()},
we have a deadlock inherent in the definition of RCU.
Another deadlock happens when attempting to nest RCU read-side
critical sections.
This deadlock is peculiar to this implementation, and might
be avoided by using recursive locks, or by using reader-writer
locks that are read-acquired by \co{rcu_read_lock()} and
write-acquired by \co{synchronize_rcu()}.
However, if we exclude the above two cases,
this implementation of RCU does not introduce any deadlock
situations.
This is because the only time some other thread's lock is acquired
is when executing \co{synchronize_rcu()}, and in that case, the lock
is immediately released, prohibiting a deadlock cycle that
does not involve a lock held across the \co{synchronize_rcu()},
which is the first case above.
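The first (implementation-independent) deadlock can be produced
by code along the following lines, in which the lock and the
helper function are hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void updater(void)
{
  spin_lock(&my_lock);
  synchronize_rcu();  /* waits on reader(), which
                         is waiting on my_lock */
  spin_unlock(&my_lock);
}

void reader(void)
{
  rcu_read_lock();
  spin_lock(&my_lock);  /* waits on updater() */
  do_something();
  spin_unlock(&my_lock);
  rcu_read_unlock();
}
\end{verbatim}
\end{minipage}
\vspace{5pt}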
\QuickQ{}
Isn't one advantage of the RCU algorithm shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
that it uses only primitives that are widely available,
for example, in POSIX pthreads?
\QuickA{}
This is indeed an advantage, but do not forget that
\co{rcu_dereference()} and \co{rcu_assign_pointer()}
are still required, which means \co{volatile} manipulation
for \co{rcu_dereference()} and memory barriers for
\co{rcu_assign_pointer()}.
Of course, many Alpha CPUs require memory barriers for both
primitives.
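For example, in a user-level implementation, these two primitives
might be sketched roughly as follows; the barrier-primitive names
follow Linux-kernel conventions, and the details will vary with
your compiler and hardware:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define rcu_dereference(p) \
  ({ \
    typeof(p) _p = *(volatile typeof(p) *)&(p); \
    smp_read_barrier_depends(); /* DEC Alpha */ \
    _p; \
  })

#define rcu_assign_pointer(p, v) \
  ({ \
    smp_mb(); /* order prior initialization */ \
    (p) = (v); \
  })
\end{verbatim}
\end{minipage}
\vspace{5pt}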
\QuickQ{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
\QuickA{}
Indeed, this would deadlock any legal RCU implementation.
But is \co{rcu_read_lock()} \emph{really} participating in
the deadlock cycle?
If you believe that it is, then please
ask yourself this same question when looking at the
RCU implementation in
Section~\ref{defer:RCU Based on Quiescent States}.
\QuickQ{}
How can the grace period possibly elapse in 40 nanoseconds when
\co{synchronize_rcu()} contains a 10-millisecond delay?
\QuickA{}
The update-side test was run in absence of readers, so the
\co{poll()} system call was never invoked.
In addition, the actual code has this \co{poll()}
system call commented out, the better to evaluate the
true overhead of the update-side code.
Any production uses of this code would be better served by
using the \co{poll()} system call, but then again,
production uses would be even better served by other implementations
shown later in this section.
\QuickQ{}
Why not simply make \co{rcu_read_lock()} wait when a concurrent
\co{synchronize_rcu()} has been waiting too long in
the RCU implementation in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}?
Wouldn't that prevent \co{synchronize_rcu()} from starving?
\QuickA{}
Although this would in fact eliminate the starvation, it would
also mean that \co{rcu_read_lock()} would spin or block waiting
for the writer, which is in turn waiting on readers.
If one of these readers is attempting to acquire a lock that
the spinning/blocking \co{rcu_read_lock()} holds, we again
have deadlock.
In short, the cure is worse than the disease.
See Section~\ref{defer:Starvation-Free Counter-Based RCU}
for a proper cure.
\QuickQ{}
Why the memory barrier on line~5 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
given that there is a spin-lock acquisition immediately after?
\QuickA{}
The spin-lock acquisition only guarantees that the spin-lock's
critical section will not ``bleed out'' to precede the
acquisition.
It in no way guarantees that code preceding the spin-lock
acquisition won't be reordered into the critical section.
Such reordering could cause a removal from an RCU-protected
list to be reordered to follow the complementing of
\co{rcu_idx}, which could allow a newly starting RCU
read-side critical section to see the recently removed
data element.
Exercise for the reader: use a tool such as Promela/spin
to determine which (if any) of the memory barriers in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
are really needed.
See Section~\ref{chp:formal:Formal Verification}
for information on using these tools.
The first correct and complete response will be credited.
\QuickQ{}
Why is the counter flipped twice in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}?
Shouldn't a single flip-and-wait cycle be sufficient?
\QuickA{}
Both flips are absolutely required.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Line~8 of \co{rcu_read_lock()} in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
picks up \co{rcu_idx}, finding its value to be zero.
\item Line~8 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
complements the value of \co{rcu_idx}, setting its
value to one.
\item	Lines~10-13 of \co{synchronize_rcu()} find that the
	value of \co{rcu_refcnt[0]} is zero, so that
	\co{synchronize_rcu()} returns immediately.
(Recall that the question is asking what happens if
lines~14-20 are omitted.)
\item	Lines~9 and 10 of \co{rcu_read_lock()} store the
	value zero to this thread's instance of \co{rcu_read_idx}
	and increment \co{rcu_refcnt[0]}, respectively.
Execution then proceeds into the RCU read-side critical
section.
\label{defer:rcu_rcgp:RCU Read Side Start}
\item Another instance of \co{synchronize_rcu()} again complements
\co{rcu_idx}, this time setting its value to zero.
Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()}
returns immediately.
(Recall that \co{rcu_read_lock()} incremented
\co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!)
\label{defer:rcu_rcgp:RCU Grace Period Start}
\item The grace period that started in
step~\ref{defer:rcu_rcgp:RCU Grace Period Start}
has been allowed to end, despite
the fact that the RCU read-side critical section
that started beforehand in
step~\ref{defer:rcu_rcgp:RCU Read Side Start}
has not completed.
This violates RCU semantics, and could allow the update
to free a data element that the RCU read-side critical
section was still referencing.
\end{enumerate}
Exercise for the reader: What happens if \co{rcu_read_lock()}
is preempted for a very long time (hours!) just after
line~8?
Does this implementation operate correctly in that case?
Why or why not?
The first correct and complete response will be credited.
\QuickQ{}
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line~10 and a
non-atomic decrement on line~25 of
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}?
\QuickA{}
Using non-atomic operations would cause increments and decrements
to be lost, in turn causing the implementation to fail.
See Section~\ref{defer:Scalable Counter-Based RCU}
for a safe way to use non-atomic operations in
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
\QuickQ{}
Come off it!
We can see the \co{atomic_read()} primitive in
\co{rcu_read_lock()}!!!
So why are you trying to pretend that \co{rcu_read_lock()}
contains no atomic operations???
\QuickA{}
The \co{atomic_read()} primitive does not actually execute
atomic machine instructions, but rather does a normal load
from an \co{atomic_t}.
Its sole purpose is to keep the compiler's type-checking happy.
If the Linux kernel ran on 8-bit CPUs, it would also need to
prevent ``store tearing'', which could happen due to the need
to store a 16-bit pointer with two eight-bit accesses on some
8-bit systems.
But thankfully, it seems that no one runs Linux on 8-bit systems.
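Returning to \co{atomic_read()}: one plausible definition (the
Linux kernel's exact definition has varied over time and across
architectures) is simply a volatile load of the \co{atomic_t}'s
counter field, for example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* A plain load: no atomic instructions required. */
#define atomic_read(v)  (*(volatile int *)&(v)->counter)
\end{verbatim}
\end{minipage}
\vspace{5pt}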
\QuickQ{}
Great, if we have $N$ threads, we can have $2N$ ten-millisecond
waits (one set per \co{flip_counter_and_wait()} invocation),
and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete \emph{much} more quickly?
\QuickA{}
Keep in mind that we only wait for a given thread if that thread
is still in a pre-existing RCU read-side critical section,
and that waiting for one hold-out thread gives all the other
threads a chance to complete any pre-existing RCU read-side
critical sections that they might still be executing.
So the only way that we would wait for $2N$ intervals
would be if the last thread still remained in a pre-existing
RCU read-side critical section despite all the waiting for
all the prior threads.
In short, this implementation will not wait unnecessarily.
However, if you are stress-testing code that uses RCU, you
might want to comment out the \co{poll()} statement in
order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU
read-side critical section.
\QuickQ{}
All of these toy RCU implementations have either atomic operations
in \co{rcu_read_lock()} and \co{rcu_read_unlock()},
or \co{synchronize_rcu()}
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic ($O\left(1\right)$) overheads and latencies?
\QuickA{}
Special-purpose uniprocessor implementations of RCU can attain
this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}.
\QuickQ{}
If any even value is sufficient to tell \co{synchronize_rcu()}
to ignore a given task, why doesn't line~10 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
simply assign zero to \co{rcu_reader_gp}?
\QuickA{}
Assigning zero (or any other even-numbered constant)
would in fact work, but assigning the value of
\co{rcu_gp_ctr} can provide a valuable debugging aid,
as it gives the developer an idea of when the corresponding
thread last exited an RCU read-side critical section.
\QuickQ{}
Why are the memory barriers on lines~17 and 29 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
needed?
Aren't the memory barriers inherent in the locking
primitives on lines~18 and 28 sufficient?
\QuickA{}
These memory barriers are required because the locking
primitives are only guaranteed to confine the critical
section.
The locking primitives are under absolutely no obligation
to keep other code from bleeding into the critical section.
The pair of memory barriers is therefore required to prevent
this sort of code motion, whether performed by the compiler
or by the CPU.
\QuickQ{}
Couldn't the update-side optimization described in
Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods}
be applied to the implementation shown in
Figure~\ref{fig:defer:Free-Running Counter Using RCU}?
\QuickA{}
Indeed it could, with a few modifications.
This work is left as an exercise for the reader.
\QuickQ{}
Is the possibility of readers being preempted in
line~3 of Figure~\ref{fig:defer:Free-Running Counter Using RCU}
a real problem, in other words, is there a real sequence
of events that could lead to failure?
If not, why not?
If so, what is the sequence of events, and how can the
failure be addressed?
\QuickA{}
It is a real problem, there is a sequence of events leading to
failure, and there are a number of possible ways of
addressing it.
For more details, see the Quick Quizzes near the end of
Section~\ref{defer:Nestable RCU Based on Free-Running Counter}.
The reason for locating the discussion there is (1) to give you
more time to think about it, and (2) that the nesting support
added in that section greatly reduces the time required to
overflow the counter.
\QuickQ{}
Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having
all this complicated bit manipulation?
\QuickA{}
The apparent simplicity of the separate per-thread variable
is a red herring.
This approach incurs much greater complexity in the guise
of careful ordering of operations, especially if signal
handlers are to be permitted to contain RCU read-side
critical sections.
But don't take my word for it, code it up and see what you
end up with!
\QuickQ{}
Given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
how could you double the time required to overflow the global
\co{rcu_gp_ctr}?
\QuickA{}
One way would be to replace the magnitude comparison on
lines~33 and 34 with an inequality check of the per-thread
\co{rcu_reader_gp} variable against
\co{rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT}.
\QuickQ{}
Again, given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
\QuickA{}
It can indeed be fatal.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~0 enters \co{rcu_read_lock()}, determines
that it is not nested, and therefore fetches the
value of the global \co{rcu_gp_ctr}.
Thread~0 is then preempted for an extremely long time
(before storing to its per-thread \co{rcu_reader_gp}
variable).
\item Other threads repeatedly invoke \co{synchronize_rcu()},
so that the new value of the global \co{rcu_gp_ctr}
is now \co{RCU_GP_CTR_BOTTOM_BIT}
less than it was when thread~0 fetched it.
\item Thread~0 now starts running again, and stores into
its per-thread \co{rcu_reader_gp} variable.
The value it stores is
\co{RCU_GP_CTR_BOTTOM_BIT+1}
greater than that of the global \co{rcu_gp_ctr}.
\item Thread~0 acquires a reference to RCU-protected data
element~A.
\item Thread 1 now removes the data element~A that thread~0
just acquired a reference to.
\item Thread 1 invokes \co{synchronize_rcu()}, which
increments the global \co{rcu_gp_ctr} by
\co{RCU_GP_CTR_BOTTOM_BIT}.
It then checks all of the per-thread \co{rcu_reader_gp}
variables, but thread~0's value (incorrectly) indicates
that it started after thread~1's call to
\co{synchronize_rcu()}, so thread~1 does not wait
for thread~0 to complete its RCU read-side critical
section.
\item Thread 1 then frees up data element~A, which thread~0
is still referencing.
\end{enumerate}
Note that this scenario can also occur in the implementation presented in
Section~\ref{defer:RCU Based on Free-Running Counter}.
One strategy for fixing this problem is to use 64-bit
counters so that the time required to overflow them would exceed
the useful lifetime of the computer system.
Note that non-antique members of the 32-bit x86 CPU family
allow atomic manipulation of 64-bit counters via the
\co{cmpxchg8b} instruction.
Another strategy is to limit the rate at which grace periods are
permitted to occur in order to achieve a similar effect.
For example, \co{synchronize_rcu()} could record the last time
that it was invoked, and any subsequent invocation would then
check this time and block as needed to force the desired
spacing.
For example, if the low-order four bits of the counter were
reserved for nesting, and if grace periods were permitted to
occur at most ten times per second, then it would take more
than 300 days for the counter to overflow.
However, this approach is not helpful if there is any possibility
that the system will be fully loaded with CPU-bound high-priority
real-time threads for the full 300 days.
(A remote possibility, perhaps, but best to consider it ahead
of time.)
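A rough sketch of this rate-limiting approach, assuming a
hypothetical \co{do_synchronize_rcu()} that carries out the actual
grace-period wait and ignoring the locking needed to serialize
concurrent callers, might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
static struct timeval rcu_last_gp;

void synchronize_rcu(void)
{
  struct timeval now;
  long delta_ms;

  gettimeofday(&now, NULL);
  delta_ms = (now.tv_sec - rcu_last_gp.tv_sec) * 1000 +
             (now.tv_usec - rcu_last_gp.tv_usec) / 1000;
  if (delta_ms < 100)
    poll(NULL, 0, 100 - delta_ms); /* >= 100 ms spacing */
  do_synchronize_rcu();
  gettimeofday(&rcu_last_gp, NULL);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}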
A third approach is to administratively abolish real-time threads
from the system in question.
In this case, the preempted process will age up in priority,
thus getting to run long before the counter has a chance to
overflow.
Of course, this approach is less than helpful for real-time
applications.
A final approach would be for \co{rcu_read_lock()} to recheck
the value of the global \co{rcu_gp_ctr} after storing to its
per-thread \co{rcu_reader_gp} counter, retrying if the new
value of the global \co{rcu_gp_ctr} is inappropriate.
This works, but introduces non-deterministic execution time
into \co{rcu_read_lock()}.
On the other hand, if your application is being preempted long
enough for the counter to overflow, you have no hope of
deterministic execution time in any case!
% @@@ A fourth approach is rcu_nest32.[hc].
\QuickQ{}
Doesn't the additional memory barrier shown on line~14 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side},
greatly increase the overhead of \co{rcu_quiescent_state()}?
\QuickA{}
Indeed it does!
An application using this implementation of RCU should therefore
invoke \co{rcu_quiescent_state()} sparingly, instead using
\co{rcu_read_lock()} and \co{rcu_read_unlock()} most of the
time.
However, this memory barrier is absolutely required so that
other threads will see the store on lines~12-13 before any
subsequent RCU read-side critical sections executed by the
caller.
\QuickQ{}
Why are the two memory barriers on lines~19 and 22 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side}
needed?
\QuickA{}
The memory barrier on line~19 ensures that any RCU read-side
critical sections that might precede the
call to \co{rcu_thread_offline()} won't be reordered by either
the compiler or the CPU to follow the assignment on lines~20-21.
The memory barrier on line~22 is, strictly speaking, unnecessary,
as it is illegal to have any RCU read-side critical sections
following the call to \co{rcu_thread_offline()}.
\QuickQ{}
To be sure, the clock frequencies of ca-2008 Power
systems were quite high, but even a 5GHz clock
frequency is insufficient to allow
loops to be executed in 50~picoseconds!
What is going on here?
\QuickA{}
Since the measurement loop contains a pair of empty functions,
the compiler optimizes it away.
The measurement loop takes 1,000 passes between each call to
\co{rcu_quiescent_state()}, so this measurement is roughly
one thousandth of the overhead of a single call to
\co{rcu_quiescent_state()}.
\QuickQ{}
Why would the fact that the code is in a library make
any difference for how easy it is to use the RCU
implementation shown in
Figures~\ref{fig:defer:Quiescent-State-Based RCU Read Side} and
\ref{fig:defer:RCU Update Side Using Quiescent States}?
\QuickA{}
A library function has absolutely no control over the caller,
and thus cannot force the caller to invoke \co{rcu_quiescent_state()}
periodically.
On the other hand, a library function that made many references
to a given RCU-protected data structure might be able to invoke
\co{rcu_thread_online()} upon entry,
\co{rcu_quiescent_state()} periodically, and
\co{rcu_thread_offline()} upon exit.
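For example, such a library function might be structured roughly
as follows, where the number of lookups and the lookup helper are
of course hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void my_library_function(void)
{
  int i;

  rcu_thread_online();
  for (i = 0; i < NUM_LOOKUPS; i++) {
    rcu_read_lock();
    do_one_lookup(i);
    rcu_read_unlock();
    if (i % 1000 == 999)
      rcu_quiescent_state(); /* let grace periods advance */
  }
  rcu_thread_offline();
}
\end{verbatim}
\end{minipage}
\vspace{5pt}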
\QuickQ{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
This should be a deadlock, but how can a primitive that
generates absolutely no code possibly participate in a
deadlock cycle?
\QuickA{}
Please note that the RCU read-side critical section is in
effect extended beyond the enclosing
\co{rcu_read_lock()} and \co{rcu_read_unlock()}, out to
the previous and next call to \co{rcu_quiescent_state()}.
This \co{rcu_quiescent_state()} can be thought of as a
\co{rcu_read_unlock()} immediately followed by an
\co{rcu_read_lock()}.
Even so, the actual deadlock itself will involve the lock
acquisition in the RCU read-side critical section and
the \co{synchronize_rcu()}, never the \co{rcu_quiescent_state()}.
\QuickQ{}
Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be
updated while in an RCU read-side critical section?
\QuickA{}
This situation is one reason for the existence of asynchronous
grace-period primitives such as \co{call_rcu()}.
This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked
at a later time, after a grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side
critical section can be extremely convenient, and is analogous
to a (mythical) unconditional read-to-write upgrade for
reader-writer locking.
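For example, an element might be removed and its freeing deferred
from within an RCU read-side critical section roughly as follows,
where the list, its lock, the element type, and the callback are
hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
rcu_read_lock();
list_for_each_entry_rcu(p, &my_list, list) {
  if (p->key == key) {
    spin_lock(&my_list_lock);
    list_del_rcu(&p->list);
    spin_unlock(&my_list_lock);
    call_rcu(&p->rcu, my_free_callback);
    break;
  }
}
rcu_read_unlock();
\end{verbatim}
\end{minipage}
\vspace{5pt}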
\QuickQ{}
The statistical-counter implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
(\url{count_end.c})
used a global lock to guard the summation in \co{read_count()},
which resulted in poor performance and negative scalability.
How could you use RCU to provide \co{read_count()} with
excellent performance and good scalability?
(Keep in mind that \co{read_count()}'s scalability will
necessarily be limited by its need to scan all threads'
counters.)
\QuickA{}
Hint: place the global variable \co{finalcount} and the
array \co{counterp[]} into a single RCU-protected struct.
At initialization time, this structure would be allocated
and set to all zero and \co{NULL}.
The \co{inc_count()} function would be unchanged.
The \co{read_count()} function would use \co{rcu_read_lock()}
instead of acquiring \co{final_mutex}, and would need to
use \co{rcu_dereference()} to acquire a reference to the
current structure.
The \co{count_register_thread()} function would set the
array element corresponding to the newly created thread
to reference that thread's per-thread \co{counter} variable.
The \co{count_unregister_thread()} function would need to
allocate a new structure, acquire \co{final_mutex},
copy the old structure to the new one, add the outgoing
thread's \co{counter} variable to the total, \co{NULL}
the pointer to this same \co{counter} variable,
use \co{rcu_assign_pointer()} to install the new structure
in place of the old one, release \co{final_mutex},
wait for a grace period, and finally free the old structure.
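In other words, \co{read_count()} might look something like the
following sketch, in which the structure and field names are
merely illustrative:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct countarray {
  unsigned long finalcount;
  unsigned long *counterp[NR_THREADS];
};

static struct countarray *countarrayp;

unsigned long read_count(void)
{
  struct countarray *cap;
  unsigned long sum;
  int t;

  rcu_read_lock();
  cap = rcu_dereference(countarrayp);
  sum = cap->finalcount;
  for (t = 0; t < NR_THREADS; t++)
    if (cap->counterp[t] != NULL)
      sum += *cap->counterp[t];
  rcu_read_unlock();
  return sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}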
Does this really work?
Why or why not?
See
Section~\ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
on
page~\pageref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
for more details.
\QuickQ{}
Section~\ref{sec:count:Applying Specialized Parallel Counters}
showed a fanciful pair of code fragments that dealt with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
How would you use RCU to provide excellent performance and
scalability?
(Keep in mind that the performance of the common-case first
code fragment that does I/O accesses is much more important
than that of the device-removal code fragment.)
\QuickA{}
Hint: replace the read-acquisitions of the reader-writer lock
with RCU read-side critical sections, then adjust the
device-removal code fragment to suit.
See
Section~\ref{sec:together:RCU and Counters for Removable I/O Devices}
on
Page~\pageref{sec:together:RCU and Counters for Removable I/O Devices}
for one solution to this problem.
\QuickQAC{chp:Data Structures}{Data Structures}
\QuickQ{}
But there are many types of hash tables, of which the chained
hash tables described here are but one type.
Why the focus on chained hash tables?
\QuickA{}
Chained hash tables are completely partitionable, and thus
well-suited to concurrent use.
There are other completely-partitionable hash tables, for
example, split-ordered list~\cite{OriShalev2006SplitOrderListHash},
but they are considerably more complex.
We therefore start with chained hash tables.
\QuickQ{}
But isn't the double comparison on lines~15-18 in
Figure~\ref{fig:datastruct:Hash-Table Lookup} inefficient
in the case where the key fits into an unsigned long?
\QuickA{}
Indeed it is!
However, hash tables quite frequently store information with
keys such as character strings that do not necessarily fit
into an unsigned long.
Simplifying the hash-table implementation for the case where
keys always fit into unsigned longs is left as an exercise
for the reader.
\QuickQ{}
Given the negative scalability of the Schr\"odinger's
Zoo application across sockets, why not just run multiple
copies of the application, with each copy having a subset
of the animals and confined to run on a single socket?
\QuickA{}
You can do just that!
In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster.
This practice is called ``sharding'', and is heavily used in
practice by large web-based
retailers~\cite{DeCandia:2007:DAH:1323293.1294281}.
However, if you are going to shard on a per-socket basis within
a multisocket system, why not buy separate smaller and cheaper
single-socket systems, and then run one shard of the database
on each of those systems?
\QuickQ{}
But if elements in a hash table can be deleted concurrently
with lookups, doesn't that mean that a lookup could return
a reference to a data element that was deleted immediately
after it was looked up?
\QuickA{}
Yes it can!
This is why \co{hashtab_lookup()} must be invoked within an
RCU read-side critical section, and it is why
\co{hashtab_add()} and \co{hashtab_del()} must also use
RCU-aware list-manipulation primitives.
Finally, this is why the caller of \co{hashtab_del()} must
wait for a grace period (e.g., by calling \co{synchronize_rcu()})
before freeing the deleted element.
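In other words, a deletion might be structured roughly as follows;
the exact signatures of the locking functions are assumptions made
for this sketch:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
hashtab_lock_mod(htp, hash);   /* exclude other updaters */
hashtab_del(htep);             /* readers may still hold */
hashtab_unlock_mod(htp, hash); /*   references to htep   */
synchronize_rcu();             /* wait for those readers */
free(htep);                    /* now safe to free       */
\end{verbatim}
\end{minipage}
\vspace{5pt}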
\QuickQ{}
The dangers of extrapolating from eight CPUs to 60 CPUs were
made quite clear in
Section~\ref{sec:datastruct:Hash-Table Performance}.
But why should extrapolating up from 60 CPUs be any safer?
\QuickA{}
It isn't any safer, and a useful exercise would be to run these
programs on larger systems.
That said, other testing has shown that RCU read-side primitives
offer consistent performance and scalability up to at least 1024 CPUs.
\QuickQ{}
The code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
computes the hash twice!
Why this blatant inefficiency?
\QuickA{}
The reason is that the old and new hash tables might have
completely different hash functions, so that a hash computed
for the old table might be completely irrelevant to the
new table.
\QuickQ{}
How does the code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
protect against the resizing process progressing past the
selected bucket?
\QuickA{}
It does not provide any such protection.
That is instead the job of the update-side concurrency-control
functions described next.
\QuickQ{}
The code in
Figures~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
and~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}
computes the hash and executes the bucket-selection logic twice for
updates!
Why this blatant inefficiency?
\QuickA{}
This approach allows the \co{hashtorture.h} testing infrastructure
to be reused.
That said, a production-quality resizable hash table would likely
be optimized to avoid this double computation.
Carrying out this optimization is left as an exercise for the reader.
\QuickQ{}
In the \co{hashtab_lookup()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
the code carefully finds the right bucket in the new hash table
if the element to be looked up has already been distributed
by a concurrent resize operation.
This seems wasteful for RCU-protected lookups.
Why not just stick with the old hash table in this case?
\QuickA{}
Suppose that a resize operation begins and distributes half of
the old table's buckets to the new table.
Suppose further that a thread adds a new element that goes into
one of the already-distributed buckets, and that this same thread
now looks up this newly added element.
If lookups unconditionally traversed only the old hash table,
this thread would get a lookup failure for the element that it
just added, which certainly sounds like a bug to me!
\QuickQ{}
The \co{hashtab_del()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
does not always remove the element from the old hash table.
Doesn't this mean that readers might access this newly removed
element after it has been freed?
\QuickA{}
No.
The \co{hashtab_del()} function omits removing the element
from the old hash table only if the resize operation has
already progressed beyond the bucket containing the just-deleted
element.
But this means that new \co{hashtab_lookup()} operations will
use the new hash table when looking up that element.
Therefore, only old \co{hashtab_lookup()} operations that started
before the \co{hashtab_del()} might encounter the newly
removed element.
This means that \co{hashtab_del()} need only wait for an
RCU grace period to avoid inconveniencing
\co{hashtab_lookup()} operations.
\QuickQ{}
In the \co{hashtab_resize()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
what guarantees that the update to \co{->ht_new} on line~29
will be seen as happening before the update to \co{->ht_resize_cur}
on line~36 from the perspective of \co{hashtab_lookup()},
\co{hashtab_add()}, and \co{hashtab_del()}?
\QuickA{}
The \co{synchronize_rcu()} on line~30 of
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
ensures that all pre-existing RCU readers have completed between
the time that we install the new hash-table reference on
line~29 and the time that we update \co{->ht_resize_cur} on
line~36.
This means that any reader that sees a non-negative value
of \co{->ht_resize_cur} cannot have started before the
assignment to \co{->ht_new}, and thus must be able to see
the reference to the new hash table.
\QuickQ{}
Couldn't the \co{hashtorture.h} code be modified to accommodate
a version of \co{hashtab_lock_mod()} that subsumes the
\co{ht_get_bucket()} functionality?
\QuickA{}
It probably could, and doing so would benefit all of the
per-bucket-locked hash tables presented in this chapter.
Making this modification is left as an exercise for the
reader.
\QuickQ{}
How much do these specializations really save?
Are they really worth it?
\QuickA{}
The answer to the first question is left as an exercise to
the reader.
Try specializing the resizable hash table and see how much
performance improvement results.
The second question cannot be answered in general, but must
instead be answered with respect to a specific use case.
Some use cases are extremely sensitive to performance and
scalability, while others are less so.
\QuickQAC{chp:Validation}{Validation}
\QuickQ{}
When in computing is the willingness to follow a fragmentary
plan critically important?
\QuickA{}
There are any number of situations, but perhaps the most important
situation is when no one has ever created anything resembling
the program to be developed.
In this case, the only way to create a credible plan is to
implement the program, create the plan, and implement it a
second time.
But whoever implements the program for the first time has no
choice but to follow a fragmentary plan because any detailed
plan created in ignorance cannot survive first contact with
the real world.
And perhaps this is one reason why evolution has favored insanely
optimistic human beings who are happy to follow fragmentary plans!
\QuickQ{}
Suppose that you are writing a script that processes the
output of the \co{time} command, which looks as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{verbatim}
real 0m0.132s
user 0m0.040s
sys 0m0.008s
\end{verbatim}
\end{minipage}
\vspace{5pt}
The script is required to check its input for errors, and to
give appropriate diagnostics if fed erroneous \co{time} output.
What test inputs should you provide to this program to test it
for use with \co{time} output generated by single-threaded programs?
\QuickA{}
\begin{enumerate}
\item Do you have a test case in which all the time is
consumed in user mode by a CPU-bound program?
\item Do you have a test case in which all the time is
consumed in system mode by a CPU-bound program?
\item Do you have a test case in which all three times
are zero?
\item Do you have a test case in which the ``user'' and ``sys''
times sum to more than the ``real'' time?
(This would of course be completely legitimate in
a multithreaded program.)
\item	Do you have a set of test cases in which one of the
	times uses more than one second?
\item	Do you have a set of test cases in which one of the
	times uses more than ten seconds?
\item Do you have a set of test cases in which one of the
times has non-zero minutes? (For example, ``15m36.342s''.)
\item Do you have a set of test cases in which one of the
times has a seconds value of greater than 60?
\item Do you have a set of test cases in which one of the
times overflows 32 bits of milliseconds? 64 bits of
milliseconds?
\item Do you have a set of test cases in which one of the
times is negative?
\item Do you have a set of test cases in which one of the
times has a positive minutes value but a negative
seconds value?
\item Do you have a set of test cases in which one of the
times omits the ``m'' or the ``s''?
\item Do you have a set of test cases in which one of the
times is non-numeric? (For example, ``Go Fish''.)
\item Do you have a set of test cases in which one of the
lines is omitted? (For example, where there is a
``real'' value and a ``sys'' value, but no ``user''
value.)
\item Do you have a set of test cases where one of the
lines is duplicated? Or duplicated, but with a
different time value for the duplicate?
\item Do you have a set of test cases where a given line
has more than one time value? (For example,
``real 0m0.132s 0m0.008s''.)
\item Do you have a set of test cases containing random
characters?
\item In all test cases involving invalid input, did you
generate all permutations?
\item For each test case, do you have an expected outcome
for that test?
\end{enumerate}
If you did not generate test data for a substantial number of
the above cases, you will need to cultivate a more destructive
attitude in order to have a chance of generating high-quality
tests.
Of course, one way to economize on destructiveness is to
generate the tests with the to-be-tested source code at hand,
which is called white-box testing (as opposed to black-box testing).
However, this is no panacea: You will find that it is all too
easy to find your thinking limited by what the program can handle,
thus failing to generate truly destructive inputs.
\QuickQ{}
You are asking me to do all this validation BS before
I even start coding???
That sounds like a great way to never get started!!!
\QuickA{}
If it is your project, for example, a hobby, do what you like.
Any time you waste will be your own, and you have no one else
to answer to for it.
And there is a good chance that the time will not be completely
wasted.
For example, if you are embarking on a first-of-a-kind project,
the requirements are in some sense unknowable anyway.
In this case, the best approach might be to quickly prototype
a number of rough solutions, try them out, and see what works
best.
\QuickQ{}
How can you implement \co{WARN_ON_ONCE()}?
\QuickA{}
If you don't mind having a \co{WARN_ON_ONCE()} that
will sometimes warn twice or three times, simply maintain
a static variable that is initialized to zero.
If the condition triggers, check the static variable, and
if it is non-zero, return.
Otherwise, set it to one, print the message, and return.
If you really need the message to never appear more than once,
perhaps because it is huge, you can use an atomic exchange
operation in place of ``set it to one'' above.
Print the message only if the atomic exchange operation returns
zero.
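For example, such a print-at-most-once version might be sketched
as follows; this is only a sketch, not the Linux kernel's actual
implementation:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define WARN_ON_ONCE(condition) \
  ({ \
    static atomic_t warned = ATOMIC_INIT(0); \
    int _c = !!(condition); \
 \
    if (_c && !atomic_xchg(&warned, 1)) \
      printk("WARNING at %s:%d\n", \
             __FILE__, __LINE__); \
    _c; \
  })
\end{verbatim}
\end{minipage}
\vspace{5pt}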
\QuickQ{}
Why would anyone bother copying existing code in pen on paper???
Doesn't that just increase the probability of transcription errors?
\QuickA{}
If you are worried about transcription errors, please allow me
to be the first to introduce you to a really cool tool named
\co{diff}.
In addition, carrying out the copying can be quite valuable:
\begin{enumerate}
\item If you are copying a lot of code, you are probably failing
to take advantage of an opportunity for abstraction.
The act of copying code can provide great motivation
for abstraction.
\item Copying the code gives you an opportunity to think about
whether the code really works in its new setting.
Is there some non-obvious constraint, such as the need
to disable interrupts or to hold some lock?
\item Copying the code also gives you time to consider whether
there is some better way to get the job done.
\end{enumerate}
So, yes, copy the code!
\QuickQ{}
This procedure is ridiculously over-engineered!
How can you expect to get a reasonable amount of software
written doing it this way???
\QuickA{}
Indeed, repeatedly copying code by hand is laborious and slow.
However, when combined with heavy-duty stress testing and
proofs of correctness, this approach is also extremely effective
for complex parallel code where ultimate performance and
reliability are required and where debugging is difficult.
The Linux-kernel RCU implementation is a case in point.
On the other hand, if you are writing a simple single-threaded
shell script to manipulate some data, then you would be
best-served by a different methodology.
For example, you might enter each command one at a time
into an interactive shell with a test data set to make
sure that it did what you wanted, then copy-and-paste the
successful commands into your script.
Finally, test the script as a whole.
If you have a friend or colleague who is willing to help out,
pair programming can work very well, as can any number of
formal design- and code-review processes.
And if you are writing code as a hobby, then do whatever you like.
In short, different types of software need different development
methodologies.
\QuickQ{}
Suppose that you had a very large number of systems at your
disposal.
For example, at current cloud prices, you can purchase a
huge amount of CPU time at a reasonably low cost.
Why not use this approach to get close enough to certainty
for all practical purposes?
\QuickA{}
This approach might well be a valuable addition to your
validation arsenal.
But it does have a few limitations:
\begin{enumerate}
\item Some bugs have extremely low probabilities of occurrence,
but nevertheless need to be fixed.
For example, suppose that the Linux kernel's RCU
implementation had a bug that is triggered only once
per century of machine time on average.
A century of CPU time is hugely expensive even on
the cheapest cloud platforms, but we could expect
this bug to result in more than 2,000 failures per day
on the more than 100 million Linux instances in the
world as of 2011.
\item The bug might well have zero probability of occurrence
on your test setup, which means that you won't see it
no matter how much machine time you burn testing it.
\end{enumerate}
Of course, if your code is small enough, formal validation
may be helpful, as discussed in
Section~\ref{chp:formal:Formal Verification}.
But beware: formal validation of your code will not find
errors in your assumptions, misunderstanding of the
requirements, misunderstanding of the software or hardware
primitives you use, or errors that you did not think to construct
a proof for.
\QuickQ{}
Say what???
When I plug the earlier example of five tests each with a
10\% failure rate into the formula, I get 59,050\% and that
just doesn't make sense!!!
\QuickA{}
You are right, that makes no sense at all.
Remember that a probability is a number between zero and one,
so that you need to divide a percentage by 100 to get a
probability.
So 10\% is a probability of 0.1, which when plugged into the
formula yields $1 - (1 - 0.1)^5 = 0.4095$, which rounds to 41\%,
quite sensibly matching the earlier result.
\QuickQ{}
In Equation~\ref{eq:debugging:Binomial Number of Tests Required},
are the logarithms base-10, base-2, or base-$e$?
\QuickA{}
It does not matter.
You will get the same answer no matter what base of logarithms
you use because the result is a pure ratio of logarithms.
The only constraint is that you use the same base for both
the numerator and the denominator.
\QuickQ{}
Suppose that a bug causes a test failure three times per hour
on average.
How long must the test run error-free to provide 99.9\%
confidence that the fix significantly reduced the probability
of failure?
\QuickA{}
We set $n$ to $3$ and $P$ to $99.9$ in
Equation~\ref{eq:debugging:Error-Free Test Duration}, resulting in:
\begin{equation}
T = - \frac{1}{3} \log \frac{100 - 99.9}{100} = 2.3
\end{equation}
If the test runs without failure for 2.3 hours, we can be 99.9\%
certain that the fix reduced the probability of failure.
\QuickQ{}
Doing the summation of all the factorials and exponentials
is a real pain.
Isn't there an easier way?
\QuickA{}
One approach is to use the open-source symbolic manipulation
program named ``maxima''.
Once you have installed this program, which is a part of many
Debian-based Linux distributions, you can run it and give the
\co{load(distrib);} command followed by any number of
\co{bfloat(cdf_poisson(m,l));} commands, where the \co{m}
is replaced by the desired value of $m$ and the \co{l}
is replaced by the desired value of $\lambda$.
In particular, the \co{bfloat(cdf_poisson(2,24));} command
results in \co{1.181617112359357b-8}, which matches the value
given by Equation~\ref{eq:debugging:Possion CDF}.
Alternatively, you can use the rough-and-ready method described in
Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing}.
\QuickQ{}
But wait!!!
Given that there has to be \emph{some} number of failures
(including the possibility of zero failures),
shouldn't the summation shown in
Equation~\ref{eq:debugging:Possion CDF}
approach the value $1$ as $m$ goes to infinity?
\QuickA{}
Indeed it should.
And it does.
To see this, note that $e^{-\lambda}$ does not depend on $i$,
which means that it can be pulled out of the summation as follows:
\begin{equation}
e^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^i}{i!}
\end{equation}
The remaining summation is exactly the Taylor series for
$e^\lambda$, yielding:
\begin{equation}
e^{-\lambda} e^\lambda
\end{equation}
The two exponentials are reciprocals, and therefore cancel,
resulting in exactly $1$, as required.
\QuickQ{}
How is this approach supposed to help if the corruption affected some
unrelated pointer, which then caused the corruption???
\QuickA{}
Indeed, that can happen.
Many CPUs have hardware-debugging facilities that can help you
locate that unrelated pointer.
Furthermore, if you have a core dump, you can search the core
dump for pointers referencing the corrupted region of memory.
You can also look at the data layout of the corruption, and
check pointers whose type matches that layout.
You can also step back and test the modules making up your
program more intensively, which will likely confine the corruption
to the module responsible for it.
If this makes the corruption vanish, consider adding additional
argument checking to the functions exported from each module.
Nevertheless, this is a hard problem, which is why I used the
words ``a bit of a dark art''.
\QuickQ{}
But I did the bisection, and ended up with a huge commit.
What do I do now?
\QuickA{}
A huge commit?
Shame on you!
This is but one reason why you are supposed to keep the commits small.
And that is your answer: Break up the commit into bite-sized
pieces and bisect the pieces.
In my experience, the act of breaking up the commit is often
sufficient to make the bug painfully obvious.
\QuickQ{}
Why don't existing conditional-locking primitives provide this
spurious-failure functionality?
\QuickA{}
There are locking algorithms that depend on conditional-locking
primitives telling them the truth.
For example, if conditional-lock failure signals that
some other thread is already working on a given job,
spurious failure might cause that job to never get done,
possibly resulting in a hang.
\QuickQ{}
That is ridiculous!!!
After all, isn't getting the correct answer later than one would like
\emph{better} than getting an incorrect answer???
\QuickA{}
This question fails to consider the option of choosing not to
compute the answer at all, and in doing so, also fails to consider
the costs of computing the answer.
For example, consider short-term weather forecasting, for which
accurate models exist, but which require large (and expensive)
clustered supercomputers, at least if you want to actually run
the model faster than the weather.
And in this case, any performance bug that prevents the model from
running faster than the actual weather prevents any forecasting.
Given that the whole purpose of purchasing the large clustered
supercomputer was to forecast weather, if you cannot run the
model faster than the weather, you would be better off not running
the model at all.
More severe examples may be found in the area of safety-critical
real-time computing.
\QuickQ{}
But if you are going to put in all the hard work of parallelizing
an application, why not do it right?
Why settle for anything less than optimal performance and
linear scalability?
\QuickA{}
Although I do heartily salute your spirit and aspirations,
you are forgetting that there may be high costs due to delays
in the program's completion.
For an extreme example, suppose that a 40\% performance shortfall
from a single-threaded application is causing one person to die
each day.
Suppose further that in a day you could hack together a
quick and dirty
parallel program that ran 50\% faster on an eight-CPU system
than the sequential version, but that an optimal parallel
program would require four months of painstaking design, coding,
debugging, and tuning.
It is safe to say that more than 100 people would prefer the
quick and dirty version.
\QuickQ{}
But what about other sources of error, for example, due to
interactions between caches and memory layout?
\QuickA{}
Changes in memory layout can indeed result in unrealistic
decreases in execution time.
For example, suppose that a given microbenchmark almost
always overflows the L0 cache's associativity, but with just the right
memory layout, it all fits.
If this is a real concern, consider running your microbenchmark
using huge pages (or within the kernel or on bare metal) in
order to completely control the memory layout.
\QuickQ{}
Wouldn't the techniques suggested to isolate the code under
test also affect that code's performance, particularly if
it is running within a larger application?
\QuickA{}
Indeed it might, although in most microbenchmarking efforts
you would extract the code under test from the enclosing
application.
Nevertheless, if for some reason you must keep the code under
test within the application, you will very likely need to use
the techniques discussed in
Section~\ref{sec:debugging:Detecting Interference}.
\QuickQ{}
This approach is just plain weird!
Why not use means and standard deviations, like we were taught
in our statistics classes?
\QuickA{}
Because mean and standard deviation were not designed to do this job.
To see this, try applying mean and standard deviation to the
following data set, given a 1\% relative error in measurement:
\begin{quote}
49,548.4 49,549.4 49,550.2 49,550.9 49,550.9 49,551.0
49,551.5 49,552.1 49,899.0 49,899.3 49,899.7 49,899.8
49,900.1 49,900.4 52,244.9 53,333.3 53,333.3 53,706.3
53,706.3 54,084.5
\end{quote}
The problem is that mean and standard deviation do not rest on
any sort of measurement-error assumption, and they will therefore
see the difference between the values near 49,500 and those near
49,900 as being statistically significant, when in fact they are
well within the bounds of estimated measurement error.
Of course, it is possible to create a script similar to
that in
Figure~\ref{fig:count:Statistical Elimination of Interference}
that uses standard deviation rather than absolute difference
to get a similar effect,
and this is left as an exercise for the interested reader.
Be careful to avoid divide-by-zero errors arising from strings
of identical data values!
\QuickQ{}
But what if all the y-values in the trusted group of data
are exactly zero?
Won't that cause the script to reject any non-zero value?
\QuickA{}
Indeed it will!
But if your performance measurements often produce a value of
exactly zero, perhaps you need to take a closer look at your
performance-measurement code.
Note that many approaches based on mean and standard deviation
will have similar problems with this sort of dataset.
\QuickQAC{chp:formal:Formal Verification}{Formal Verification}
\QuickQ{}
Why is there an unreached statement in
locker? After all, isn't this a \emph{full} state-space
search?
\QuickA{}
The locker process is an infinite loop, so control
never reaches the end of this process.
However, since there are no monotonically increasing variables,
Promela is able to model this infinite loop with a small
number of states.
\QuickQ{}
What are some Promela code-style issues with this example?
\QuickA{}
There are several:
\begin{enumerate}
\item The declaration of {\tt sum} should be moved to within
the init block, since it is not used anywhere else.
\item The assertion code should be moved outside of the
initialization loop. The initialization loop can
then be placed in an atomic block, greatly reducing
the state space (by how much?).
\item The atomic block covering the assertion code should
be extended to include the initialization of {\tt sum}
and {\tt j}, and also to cover the assertion.
This also reduces the state space (again, by how
much?).
\end{enumerate}
\QuickQ{}
Is there a more straightforward way to code the do-od statement?
\QuickA{}
Yes.
Replace it with {\tt if-fi} and remove the two {\tt break} statements.
\QuickQ{}
Why are there atomic blocks at lines 12-21
and lines 44-56, when the operations within those atomic
blocks have no atomic implementation on any current
production microprocessor?
\QuickA{}
Because those operations are for the benefit of the
assertion only. They are not part of the algorithm itself.
There is therefore no harm in marking them atomic, and
so marking them greatly reduces the state space that must
be searched by the Promela model.
\QuickQ{}
Is the re-summing of the counters on lines 24-27
\emph{really} necessary?
\QuickA{}
Yes. To see this, delete these lines and run the model.
Alternatively, consider the following sequence of steps:
\begin{enumerate}
\item One process is within its RCU read-side critical
section, so that the value of {\tt ctr[0]} is zero and
the value of {\tt ctr[1]} is two.
\item An updater starts executing, and sees that the sum of
the counters is two so that the fastpath cannot be
executed. It therefore acquires the lock.
\item A second updater starts executing, and fetches the value
of {\tt ctr[0]}, which is zero.
\item The first updater adds one to {\tt ctr[0]}, flips
the index (which now becomes zero), then subtracts
one from {\tt ctr[1]} (which now becomes one).
\item The second updater fetches the value of {\tt ctr[1]},
which is now one.
\item The second updater now incorrectly concludes that it
is safe to proceed on the fastpath, despite the fact
that the original reader has not yet completed.
\end{enumerate}
\QuickQ{}
Yeah, that's just great!
Now, just what am I supposed to do if I don't happen to have a
machine with 40GB of main memory???
\QuickA{}
Relax, there are a number of lawful answers to
this question:
\begin{enumerate}
\item Further optimize the model, reducing its memory consumption.
\item Work out a pencil-and-paper proof, perhaps starting with the
comments in the code in the Linux kernel.
\item Devise careful torture tests, which, though they cannot prove
the code correct, can find hidden bugs.
\item There is some movement towards tools that do model
checking on clusters of smaller machines.
However, please note that we have not actually used such
tools ourselves, courtesy of some large machines that Paul has
occasional access to.
\item Wait for memory sizes of affordable systems to expand
to fit your problem.
\item Use one of a number of cloud-computing services to rent
a large system for a short time period.
\end{enumerate}
\QuickQ{}
Why not simply increment \co{rcu_update_flag}, and then only
increment \co{dynticks_progress_counter} if the old value
of \co{rcu_update_flag} was zero???
\QuickA{}
This fails in presence of NMIs.
To see this, suppose an NMI was received just after
\co{rcu_irq_enter()} incremented \co{rcu_update_flag},
but before it incremented \co{dynticks_progress_counter}.
The instance of \co{rcu_irq_enter()} invoked by the NMI
would see that the original value of \co{rcu_update_flag}
was non-zero, and would therefore refrain from incrementing
\co{dynticks_progress_counter}.
This would leave the RCU grace-period machinery no clue that the
NMI handler was executing on this CPU, so that any RCU read-side
critical sections in the NMI handler would lose their RCU protection.
The possibility of NMI handlers, which by definition cannot
be masked, does complicate this code.
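To make the failure mode concrete, here is a sketch of the
\emph{proposed} (and broken) variant; the names follow those used in
the text, but the code is purely illustrative and elides memory barriers:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Broken sketch: increment dynticks_progress_counter
 * only if the old value of rcu_update_flag was zero. */
void rcu_irq_enter(void)
{
  if (rcu_update_flag++ == 0) {
    /*
     * If an NMI arrives right here, its nested call to
     * rcu_irq_enter() sees rcu_update_flag != 0 and thus
     * skips the increment below.  The grace-period
     * machinery therefore still believes this CPU to be
     * idle, so any RCU read-side critical sections in
     * the NMI handler lose their protection.
     */
    dynticks_progress_counter++;
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}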
\QuickQ{}
But if line~7 finds that we are the outermost interrupt,
wouldn't we \emph{always} need to increment
\co{dynticks_progress_counter}?
\QuickA{}
Not if we interrupted a running task!
In that case, \co{dynticks_progress_counter} would
have already been incremented by \co{rcu_exit_nohz()},
and there would be no need to increment it again.
\QuickQ{}
Can you spot any bugs in any of the code in this section?
\QuickA{}
Read the next section to see if you were correct.
\QuickQ{}
Why isn't the memory barrier in \co{rcu_exit_nohz()}
and \co{rcu_enter_nohz()} modeled in Promela?
\QuickA{}
Promela assumes sequential consistency, so
it is not necessary to model memory barriers.
In fact, one must instead explicitly model lack of memory barriers,
for example, as shown in
Figure~\ref{fig:analysis:QRCU Unordered Summation} on
page~\pageref{fig:analysis:QRCU Unordered Summation}.
\QuickQ{}
Isn't it a bit strange to model \co{rcu_exit_nohz()}
followed by \co{rcu_enter_nohz()}?
Wouldn't it be more natural to instead model entry before exit?
\QuickA{}
It probably would be more natural, but we will need
this particular order for the liveness checks that we will add later.
\QuickQ{}
Wait a minute!
In the Linux kernel, both \co{dynticks_progress_counter} and
\co{rcu_dyntick_snapshot} are per-CPU variables.
So why are they instead being modeled as single global variables?
\QuickA{}
Because the grace-period code processes each
CPU's \co{dynticks_progress_counter} and
\co{rcu_dyntick_snapshot} variables separately,
we can collapse the state onto a single CPU.
If the grace-period code were instead to do something special
given specific values on specific CPUs, then we would indeed need
to model multiple CPUs.
But fortunately, we can safely confine ourselves to two CPUs, the
one running the grace-period processing and the one entering and
leaving dynticks-idle mode.
\QuickQ{}
Given there are a pair of back-to-back changes to
\co{gp_state} on lines~25 and 26,
how can we be sure that line~25's changes won't be lost?
\QuickA{}
Recall that Promela and spin trace out
every possible sequence of state changes.
Therefore, timing is irrelevant: Promela/spin will be quite
happy to jam the entire rest of the model between those two
statements unless some state variable specifically prohibits
doing so.
\QuickQ{}
But what would you do if you needed the statements in a single
\co{EXECUTE_MAINLINE()} group to execute non-atomically?
\QuickA{}
The easiest thing to do would be to put
each such statement in its own \co{EXECUTE_MAINLINE()}
statement.
\QuickQ{}
But what if the \co{dynticks_nohz()} process had
``if'' or ``do'' statements with conditions,
where the statement bodies of these constructs
needed to execute non-atomically?
\QuickA{}
One approach, as we will see in a later section,
is to use explicit labels and ``goto'' statements.
For example, the construct:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
if
:: i == 0 -> a = -1;
:: else -> a = -2;
fi;
\end{verbatim}
\end{minipage}
\vspace{5pt}
could be modeled as something like:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
EXECUTE_MAINLINE(stmt1,
  if
  :: i == 0 -> goto stmt1_then;
  :: else -> goto stmt1_else;
  fi)
stmt1_then: skip;
EXECUTE_MAINLINE(stmt1_then1, a = -1; goto stmt1_end)
stmt1_else: skip;
EXECUTE_MAINLINE(stmt1_else1, a = -2)
stmt1_end: skip;
\end{verbatim}
\end{minipage}
\vspace{5pt}
However, it is not clear that the macro is helping much in the case
of the ``if'' statement, so these sorts of situations will
be open-coded in the following sections.
\QuickQ{}
Why are lines~45 and 46 (the \co{in_dyntick_irq = 0;}
and the \co{i++;}) executed atomically?
\QuickA{}
These lines of code pertain to controlling the
model, not to the code being modeled, so there is no reason to
model them non-atomically.
The motivation for modeling them atomically is to reduce the size
of the state space.
\QuickQ{}
What property of interrupts is this \co{dynticks_irq()}
process unable to model?
\QuickA{}
One such property is nested interrupts,
which are handled in the following section.
\QuickQ{}
Does Paul \emph{always} write his code in this painfully incremental
manner?
\QuickA{}
Not always, but more and more frequently.
In this case, Paul started with the smallest slice of code that
included an interrupt handler, because he was not sure how best
to model interrupts in Promela.
Once he got that working, he added other features.
(But if he was doing it again, he would start with a ``toy'' handler.
For example, he might have the handler increment a variable twice and
have the mainline code verify that the value was always even.)
Why the incremental approach?
Consider the following, attributed to Brian W. Kernighan:
\begin{quote}
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible,
you are, by definition, not smart enough to debug it.
\end{quote}
This means that any attempt to optimize the production of code should
place at least 66\% of its emphasis on optimizing the debugging process,
even at the expense of increasing the time and effort spent coding.
Incremental coding and testing is one way to optimize the debugging
process, at the expense of some increase in coding effort.
Paul uses this approach because he rarely has the luxury of
devoting full days (let alone weeks) to coding and debugging.
\QuickQ{}
But what happens if an NMI handler starts running before
an irq handler completes, and if that NMI handler continues
running until a second irq handler starts?
\QuickA{}
This cannot happen within the confines of a single CPU.
The first irq handler cannot complete until the NMI handler
returns.
Therefore, if each of the \co{dynticks} and \co{dynticks_nmi}
variables have taken on an even value during a given time
interval, the corresponding CPU really was in a quiescent
state at some time during that interval.
\QuickQ{}
This is still pretty complicated.
Why not just have a \co{cpumask_t} that has a bit set for
each CPU that is in dyntick-idle mode, clearing the bit
when entering an irq or NMI handler, and setting it upon
exit?
\QuickA{}
Although this approach would be functionally correct, it
would result in excessive irq entry/exit overhead on
large machines.
In contrast, the approach laid out in this section allows
each CPU to touch only per-CPU data on irq and NMI entry/exit,
resulting in much lower irq entry/exit overhead, especially
on large machines.
\QuickQ{}
But x86 has strong memory ordering! Why would you need to
formalize its memory model?
\QuickA{}
Actually, academics consider the x86 memory model to be weak
because it can allow prior stores to be reordered with
subsequent loads.
From an academic viewpoint, a strong memory model is one
that allows absolutely no reordering, so that all threads
agree on the order of all operations visible to them.
\QuickQ{}
Why does line~8
of Figure~\ref{fig:sec:formal:CPPMEM Litmus Test}
initialize the registers?
Why not instead initialize them on lines~4 and~5?
\QuickA{}
Either way works.
However, in general, it is better to use initialization than
explicit instructions.
The explicit instructions are used in this example to demonstrate
their use.
In addition, many of the litmus tests available on the tool's
web site (\url{http://www.cl.cam.ac.uk/~pes20/ppcmem/}) were
automatically generated, which generates explicit
initialization instructions.
\QuickQ{}
But whatever happened to line~17 of
Figure~\ref{fig:sec:formal:CPPMEM Litmus Test},
the one that is the \co{Fail:} label?
\QuickA{}
The PowerPC implementation of \co{atomic_add_return()}
loops when the \co{stwcx} instruction fails, which it communicates
by setting non-zero status in the condition-code register,
which in turn is tested by the \co{bne} instruction.
Because actually modeling the loop would result in state-space
explosion, we instead branch to the \co{Fail:} label,
terminating the model with the initial value of 2 in
thread~1's \co{r3} register, which
will not trigger the \co{exists} assertion.
There is some debate about whether this trick is universally
applicable, but I have not seen an example where it fails.
\QuickQ{}
Does the ARM Linux kernel have a similar bug?
\QuickA{}
ARM does not have this particular bug because it places
\co{smp_mb()} before and after the \co{atomic_add_return()}
function's assembly-language implementation.
PowerPC no longer has this bug; it has long since been fixed.
Finding any other bugs that the Linux kernel might have is left
as an exercise for the reader.
\QuickQAC{chp:Putting It All Together}{Putting It All Together}
\QuickQ{}
Why on earth did we need that global lock in the first place?
\QuickA{}
A given thread's \co{__thread} variables vanish when that
thread exits.
It is therefore necessary to synchronize any operation that
accesses other threads' \co{__thread} variables with
thread exit.
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
\QuickQ{}
Just what is the accuracy of \co{read_count()}, anyway?
\QuickA{}
Refer to
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
Clearly, if there are no concurrent invocations of \co{inc_count()},
\co{read_count()} will return an exact result.
However, if there \emph{are} concurrent invocations of
\co{inc_count()}, then the sum is in fact changing as
\co{read_count()} performs its summation.
That said, because thread creation and exit are excluded by
\co{final_mutex}, the pointers in \co{counterp} remain constant.
Let's imagine a mythical machine that is able to take an
instantaneous snapshot of its memory.
Suppose that this machine takes such a snapshot at the
beginning of \co{read_count()}'s execution, and another
snapshot at the end of \co{read_count()}'s execution.
Then \co{read_count()} will access each thread's counter
at some time between these two snapshots, and will therefore
obtain a result that is bounded by those of the two snapshots,
inclusive.
The overall sum will therefore be bounded by the pair of sums that
would have been obtained from each of the two snapshots (again,
inclusive).
The expected error is therefore half of the difference between
the pair of sums that would have been obtained from each of the
two snapshots, that is to say, half of the execution time of
\co{read_count()} multiplied by the number of expected calls to
\co{inc_count()} per unit time.
Or, for those who prefer equations:
\begin{equation}
\epsilon = \frac{T_r R_i}{2}
\end{equation}
where $\epsilon$ is the expected error in \co{read_count()}'s
return value,
$T_r$ is the time that \co{read_count()} takes to execute,
and $R_i$ is the rate of \co{inc_count()} calls per unit time.
(And of course, $T_r$ and $R_i$ should use the same units of
time: microseconds and calls per microsecond, seconds and calls
per second, or whatever, as long as they are the same units.)
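For example, under these (purely illustrative) assumptions, if
\co{read_count()} takes 5~microseconds to execute and
\co{inc_count()} is invoked 10 times per microsecond, the expected
error is $5 \times 10 / 2 = 25$ counts.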
\QuickQ{}
Hey!!!
Line~45 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
modifies a value in a pre-existing \co{countarray} structure!
Didn't you say that this structure, once made available to
\co{read_count()}, remained constant???
\QuickA{}
Indeed I did say that.
And it would be possible to make \co{count_register_thread()}
allocate a new structure, much as \co{count_unregister_thread()}
currently does.
But this is unnecessary.
Recall the derivation of the error bounds of \co{read_count()}
that was based on the snapshots of memory.
Because new threads start with initial \co{counter} values of
zero, the derivation holds even if we add a new thread partway
through \co{read_count()}'s execution.
So, interestingly enough, when adding a new thread, this
implementation gets the effect of allocating a new structure,
but without actually having to do the allocation.
\QuickQ{}
Wow!
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
contains 69 lines of code, compared to only 42 in
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Is this extra complexity really worth it?
\QuickA{}
This of course needs to be decided on a case-by-case basis.
If you need an implementation of \co{read_count()} that
scales linearly, then the lock-based implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
simply will not work for you.
On the other hand, if calls to \co{read_count()} are sufficiently
rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due
to the structure definition, memory allocation, and \co{NULL}
return checking.
Of course, a better question is ``Why doesn't the language
implement cross-thread access to \co{__thread} variables?''
After all, such an implementation would make both the locking
and the use of RCU unnecessary.
This would in turn enable an implementation that
was even simpler than the one shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}, but
with all the scalability and performance benefits of the
implementation shown in
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}!
\QuickQ{}
But can't the approach shown in
Figure~\ref{fig:together:Correlated Measurement Fields}
result in extra cache misses, in turn resulting in additional
read-side overhead?
\QuickA{}
Indeed it can.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 struct measurement {
 2   double meas_1;
 3   double meas_2;
 4   double meas_3;
 5 };
 6
 7 struct animal {
 8   char name[40];
 9   double age;
10   struct measurement *mp;
11   struct measurement meas;
12   char photo[0]; /* large bitmap. */
13 };
\end{verbatim}
}
\caption{Localized Correlated Measurement Fields}
\label{fig:together:Localized Correlated Measurement Fields}
\end{figure}
One way to avoid this cache-miss overhead is shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}:
Simply embed an instance of a \co{measurement} structure
named \co{meas}
into the \co{animal} structure, and point the \co{->mp}
field at this \co{->meas} field.
Measurement updates can then be carried out as follows:
\begin{enumerate}
\item Allocate a new \co{measurement} structure and place
the new measurements into it.
\item Use \co{rcu_assign_pointer()} to point \co{->mp} to
this new structure.
\item Wait for a grace period to elapse, for example using
either \co{synchronize_rcu()} or \co{call_rcu()}.
\item Copy the measurements from the new \co{measurement}
structure into the embedded \co{->meas} field.
\item Use \co{rcu_assign_pointer()} to point \co{->mp}
back to the old embedded \co{->meas} field.
\item After another grace period elapses, free up the
new \co{measurement} field.
\end{enumerate}
This approach uses a heavier weight update procedure to eliminate
the extra cache miss in the common case.
The extra cache miss will be incurred only while an update is
actually in progress.
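The following is a minimal sketch (not the book's own listing) of this
update procedure, assuming a hypothetical \co{update_measurement()}
helper, the structures shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields},
and a user-level environment providing \co{rcu_assign_pointer()},
\co{synchronize_rcu()}, and \co{free()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Sketch only: carry out the update steps listed above.
 * Assumes the caller has already filled in *new_mp. */
void update_measurement(struct animal *ap,
                        struct measurement *new_mp)
{
  rcu_assign_pointer(ap->mp, new_mp); /* Publish new values. */
  synchronize_rcu();       /* Wait for readers of ->meas. */
  ap->meas = *new_mp;      /* Copy into embedded field. */
  rcu_assign_pointer(ap->mp, &ap->meas); /* Repoint readers. */
  synchronize_rcu();       /* Wait for readers of new_mp. */
  free(new_mp);            /* Now safe to free. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}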
\QuickQ{}
But how does this scan work while a resizable hash table
is being resized?
In that case, neither the old nor the new hash table is
guaranteed to contain all the elements in the hash table!
\QuickA{}
True, resizable hash tables as described in
Section~\ref{sec:datastruct:Non-Partitionable Data Structures}
cannot be fully scanned while being resized.
One simple way around this is to acquire the
\co{hashtab} structure's \co{->ht_lock} while scanning,
but this prevents more than one scan from proceeding
concurrently.
Another approach is for updates to mutate the old hash
table as well as the new one while resizing is in
progress.
This would allow scans to find all elements in the old
hash table.
Implementing this is left as an exercise for the reader.
\QuickQAC{sec:advsync:Advanced Synchronization}{Advanced Synchronization}
\QuickQ{}
How on earth could the assertion on line~21 of the code in
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on
page~\pageref{fig:advsync:Parallel Hardware is Non-Causal}
\emph{possibly} fail?
\QuickA{}
The key point that the intuitive analysis missed is that
nothing prevents the assignment to C from overtaking
the assignment to A as both race to reach {\tt thread2()}.
This is explained in the remainder of this section.
\QuickQ{}
Great... So how do I fix it?
\QuickA{}
The easiest fix is to replace the \co{barrier()} on
line~12 with an \co{smp_mb()}.
\QuickQ{}
What assumption is the code fragment
in Figure~\ref{fig:advsync:Software Logic Analyzer}
making that might not be valid on real hardware?
\QuickA{}
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
\QuickQ{}
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
\QuickA{}
Many CPUs have write buffers that record the values of
recent writes, which are applied once the corresponding
cache line makes its way to the CPU.
Therefore, it is quite possible for each CPU to see a
different value for a given variable at a single point
in time --- and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
\QuickQ{}
Why do CPUs~2 and 3 come to agreement so quickly, when it
takes so long for CPUs~1 and 4 to come to the party?
\QuickA{}
CPUs~2 and 3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a NUMA, or, more accurately, a NUCA effect.
This leads to the question of why CPUs~2 and 3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory barriers in the code fragment.
\QuickQ{}
But if the memory barriers do not unconditionally force
ordering, how the heck can a device driver reliably execute
sequences of loads and stores to MMIO registers?
\QuickA{}
MMIO registers are a special case because they appear
in uncached regions of physical memory.
Memory barriers \emph{do} unconditionally force ordering
of loads and stores to uncached memory.
See Section~@@@ for more information on memory barriers
and MMIO regions.
\QuickQ{}
How do we know that modern hardware guarantees that at least
one of the loads will see the value stored by the other thread
in the ears-to-mouths scenario?
\QuickA{}
The scenario is as follows, with A and B both initially zero: \\
{\tt ~~~~CPU~0: A=1; smp\_mb(); r1=B;} \\
{\tt ~~~~CPU~1: B=1; smp\_mb(); r2=A;}
If neither of the loads see the corresponding store, when both
CPUs finish, both \co{r1} and \co{r2} will be equal to zero.
Let's suppose that \co{r1} is equal to zero.
Then we know that CPU~0's load from B happened before CPU~1's
store to B: After all, we would have had \co{r1} equal to one
otherwise.
But given that CPU~0's load from B happened before CPU~1's store
to B, memory-barrier pairing guarantees that CPU~0's store to A
happens before CPU~1's load from A, which in turn guarantees that
\co{r2} will be equal to one, not zero.
Therefore, at least one of \co{r1} and \co{r2} must be nonzero,
which means that at least one of the loads saw the value from
the corresponding store, as claimed.
\QuickQ{}
How can the other ``Only one store'' entries in
Table~\ref{tab:advsync:Memory-Barrier Combinations}
be used?
\QuickA{}
For combination~2, if CPU~1's load from B sees a value prior
to CPU~2's store to B, then we know that CPU~2's load from A
will return the same value as CPU~1's load from A, or some later
value.
For combination~4, if CPU~2's load from B sees the value from
CPU~1's store to B, then we know that CPU~2's load from A
will return the same value as CPU~1's load from A, or some later
value.
For combination~8, if CPU~2's load from A sees CPU~1's store
to A, then we know that CPU~1's load from B will return the same
value as CPU~2's load from A, or some later value.
\QuickQ{}
How could the assertion {\tt b==2} on
page~\pageref{codesample:advsync:What Can You Count On? 1}
possibly fail?
\QuickA{}
If the CPU is not required to see all of its loads and
stores in order, then the {\tt b=1+a} might well see an
old version of the variable ``a''.
This is why it is so very important that each CPU or thread
see all of its own loads and stores in program order.
\QuickQ{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 2}
possibly leak memory?
\QuickA{}
Only the first execution of the critical section should
see {\tt p==NULL}.
However, if there is no global ordering of critical sections for
{\tt mylock}, then how can you say that a particular one was
first?
If several different executions of that critical section thought
that they were first, they would all see {\tt p==NULL}, and
they would all allocate memory.
All but one of those allocations would be leaked.
This is why it is so very important that all the critical sections
for a given exclusive lock appear to execute in some well-defined
order.
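A critical section of the following general form (a sketch only,
using the \co{acquire_lock()}/\co{release_lock()} pseudocode
convention used elsewhere in this book rather than the referenced
code sample itself) illustrates the problem:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Sketch: one-time allocation guarded by mylock.  If several
 * executions of this critical section could each believe
 * themselves to be first, each would see p == NULL and
 * allocate, leaking all but one of the allocations. */
acquire_lock(&mylock);
if (p == NULL)
  p = malloc(sizeof(*p));
release_lock(&mylock);
\end{verbatim}
\end{minipage}
\vspace{5pt}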
\QuickQ{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 2}
possibly count backwards?
\QuickA{}
Suppose that the counter started out with the value zero,
and that three executions of the critical section had therefore
brought its value to three.
If the fourth execution of the critical section is not constrained
to see the most recent store to this variable, it might well see
the original value of zero, and therefore set the counter to
one, which would be going backwards.
This is why it is so very important that loads from a given variable
in a given critical
section see the last store from the last prior critical section to
store to that variable.
\QuickQ{}
What effect does the following sequence have on the
order of stores to variables ``a'' and ``b''? \\
{\tt ~~~~a = 1;} \\
{\tt ~~~~b = 1;} \\
{\tt ~~~~<write barrier>}
\QuickA{}
Absolutely none. This barrier {\em would} ensure that the
assignments to ``a'' and ``b'' happened before any subsequent
assignments, but it does nothing to enforce any order of
assignments to ``a'' and ``b'' themselves.
\QuickQ{}
What sequence of LOCK-UNLOCK operations \emph{would}
act as a full memory barrier?
\QuickA{}
A series of two back-to-back LOCK-UNLOCK operations, or, somewhat
less conventionally, an UNLOCK operation followed by a LOCK
operation.
\QuickQ{}
What (if any) CPUs have memory-barrier instructions
from which these semi-permeable locking primitives might
be constructed?
\QuickA{}
Itanium is one example.
The identification of any others is left as an
exercise for the reader.
\QuickQ{}
Given that operations grouped in curly braces are executed
concurrently, which of the rows of
Table~\ref{tab:advsync:Lock-Based Critical Sections}
are legitimate reorderings of the assignments to variables
``A'' through ``F'' and the LOCK/UNLOCK operations?
(The order in the code is A, B, LOCK, C, D, UNLOCK, E, F.)
Why or why not?
\QuickA{}
\begin{enumerate}
\item Legitimate, executed in order.
\item Legitimate, the lock acquisition was executed concurrently
with the last assignment preceding the critical section.
\item Illegitimate, the assignment to ``F'' must follow the LOCK
operation.
\item Illegitimate, the LOCK must complete before any operation in
the critical section. However, the UNLOCK may legitimately
be executed concurrently with subsequent operations.
\item Legitimate, the assignment to ``A'' precedes the UNLOCK,
as required, and all other operations are in order.
\item Illegitimate, the assignment to ``C'' must follow the LOCK.
\item Illegitimate, the assignment to ``D'' must precede the UNLOCK.
\item Legitimate, all assignments are ordered with respect to the
LOCK and UNLOCK operations.
\item Illegitimate, the assignment to ``A'' must precede the UNLOCK.
\end{enumerate}
\QuickQ{}
What are the constraints for
Table~\ref{tab:advsync:Ordering With Multiple Locks}?
\QuickA{}
All CPUs must see the following ordering constraints:
\begin{enumerate}
\item LOCK M precedes B, C, and D.
\item UNLOCK M follows A, B, and C.
\item LOCK Q precedes F, G, and H.
\item UNLOCK Q follows E, F, and G.
\end{enumerate}
\QuickQAC{chp:Ease of Use}{Ease of Use}
\QuickQ{}
Can a similar algorithm be used when deleting elements?
\QuickA{}
Yes.
However, since each thread must hold the locks of three
consecutive elements to delete the middle one, if there
are $N$ threads, there must be $2N+1$ elements (rather than
just $N+1$) in order to avoid deadlock.
\QuickQ{}
Yetch!
What ever possessed someone to come up with an algorithm
that deserves to be shaved as much as this one does???
\QuickA{}
That would be Paul.
He was considering the \emph{Dining Philosopher's Problem}, which
involves a rather unsanitary spaghetti dinner attended by
five philosophers.
Given that there are five plates and but five forks on the table, and
given that each philosopher requires two forks at a time to eat,
one is supposed to come up with a fork-allocation algorithm that
avoids deadlock.
Paul's response was ``Sheesh! Just get five more forks!''.
This in itself was OK, but Paul then applied this same solution to
circular linked lists.
This would not have been so bad either, but he had to go and tell
someone about it!
\QuickQ{}
Give an exception to this rule.
\QuickA{}
One exception would be a difficult and complex algorithm that
was the only one known to work in a given situation.
Another exception would be a difficult and complex algorithm
that was nonetheless the simplest of the set known to work in
a given situation.
However, even in these cases, it may be very worthwhile to spend
a little time trying to come up with a simpler algorithm!
After all, if you managed to invent the first algorithm
to do some task, it shouldn't be that hard to go on to
invent a simpler one.
\QuickQAC{chp:Conflicting Visions of the Future}{Conflicting Visions of the Future}
\QuickQ{}
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
\QuickA{}
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
\QuickQ{}
Why would it matter that oft-written variables shared the cache
line with the lock variable?
\QuickA{}
If the lock is in the same cacheline as some of the variables
that it is protecting, then writes to those variables by one CPU
will invalidate that cache line for all the other CPUs.
These invalidations will
generate large numbers of conflicts and retries, perhaps even
degrading performance and scalability compared to locking.
\QuickQ{}
Why are relatively small updates important to HTM performance
and scalability?
\QuickA{}
The larger the updates, the greater the probability of conflict,
and thus the greater probability of retries, which degrade
performance.
\QuickQ{}
How could a red-black tree possibly efficiently enumerate all
elements of the tree regardless of choice of synchronization
mechanism???
\QuickA{}
In many cases, the enumeration need not be exact.
In these cases, hazard pointers or RCU may be used to protect
readers with low probability of conflict with any given insertion
or deletion.
\QuickQ{}
But why can't a debugger emulate single stepping by setting
breakpoints at successive lines of the transaction, relying
on the retry to retrace the steps of the earlier instances
of the transaction?
\QuickA{}
This scheme might work with reasonably high probability, but it
can fail in ways that would be quite surprising to most users.
To see this, consider the following transaction:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 begin_trans();
2 if (a) {
3   do_one_thing();
4   do_another_thing();
5 } else {
6   do_a_third_thing();
7   do_a_fourth_thing();
8 }
9 end_trans();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Suppose that the user sets a breakpoint at line 3, which triggers,
aborting the transaction and entering the debugger.
Suppose that between the time that the breakpoint triggers
and the debugger gets around to stopping all the threads, some
other thread sets the value of \co{a} to zero.
When the poor user attempts to single-step the program, surprise!
The program is now in the else-clause instead of the then-clause.
This is \emph{not} what I call an easy-to-use debugger.
\QuickQ{}
But why would \emph{anyone} need an empty lock-based critical
section???
\QuickA{}
See the answer to the Quick Quiz in
Section~\ref{sec:locking:Exclusive Locks}.
However, it is claimed that given a strongly atomic HTM
implementation without forward-progress guarantees, any
memory-based locking design based on empty critical sections
will operate correctly in the presence of transactional
lock elision.
Although I have not seen a proof of this statement, there
is a straightforward rationale for this claim.
The main idea is that in a strongly atomic HTM implementation,
the results of a given transaction are not visible until
after the transaction completes successfully.
Therefore, if you can see that a transaction has started,
it is guaranteed to have already completed, which means
that a subsequent empty lock-based critical section will
successfully ``wait'' on it---after all, there is no waiting
required.
This line of reasoning does not apply to weakly atomic
systems (including many STM implementations), and it also
does not apply to lock-based programs that use means other
than memory to communicate.
One such means is the passage of time (for example, in
hard real-time systems) or flow of priority (for example,
in soft real-time systems).
Locking designs that rely on priority boosting are of particular
interest.
\QuickQ{}
Can't transactional lock elision trivially handle locking's
time-based messaging semantics
by simply choosing not to elide empty lock-based critical sections?
\QuickA{}
It could do so, but this would be both unnecessary and
insufficient.
It would be unnecessary in cases where the empty critical section
was due to conditional compilation.
Here, it might well be that the only purpose of the lock was to
protect data, so eliding it completely would be the right thing
to do.
In fact, leaving the empty lock-based critical section would
degrade performance and scalability.
On the other hand, it is possible for a non-empty lock-based
critical section to be relying on both the data-protection
and the time-based messaging semantics of locking.
Using transactional lock elision in such a case would be
incorrect, and would result in bugs.
\QuickQ{}
Given modern hardware~\cite{PeterOkech2009InherentRandomness},
how can anyone possibly expect parallel software relying
on timing to work?
\QuickA{}
The short answer is that on commonplace commodity hardware,
synchronization designs based on any sort of fine-grained
timing are foolhardy and cannot be expected to operate correctly
under all conditions.
That said, there are systems designed for hard real-time use
that are much more deterministic.
In the (very unlikely) event that you are using such a system,
here is a toy example showing how time-based synchronization can
work.
Again, do \emph{not} try this on commodity microprocessors,
as they have highly nondeterministic performance characteristics.
This example uses multiple worker threads along with a control
thread.
Each worker thread corresponds to an outbound data feed, and
records the current time (for example, from the
\co{clock_gettime()} system call) in a per-thread
\co{my_timestamp} variable after executing each unit
of work.
The real-time nature of this example results in the following
set of constraints:
\begin{enumerate}
\item It is a fatal error for a given worker thread to fail
to update its timestamp for a time period of more than
\co{MAX_LOOP_TIME}.
\item Locks are used sparingly to access and update global
state.
\item Locks are granted in strict FIFO order within
a given thread priority.
\end{enumerate}
When worker threads complete their feed, they must disentangle
themselves from the rest of the application and place a status
value in a per-thread \co{my_status} variable that is initialized
to -1.
Threads do not exit; they instead are placed on a thread pool
to accommodate later processing requirements.
The control thread assigns (and re-assigns) worker threads as
needed, and also maintains a histogram of thread statuses.
The control thread runs at a real-time priority no higher than
that of the worker threads.
Worker threads' code is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
 1 int my_status = -1; /* Thread local. */
 2
 3 while (continue_working()) {
 4   enqueue_any_new_work();
 5   wp = dequeue_work();
 6   do_work(wp);
 7   my_timestamp = clock_gettime(...);
 8 }
 9
10 acquire_lock(&departing_thread_lock);
11
12 /*
13  * Disentangle from application, might
14  * acquire other locks, can take much longer
15  * than MAX_LOOP_TIME, especially if many
16  * threads exit concurrently.
17  */
18 my_status = get_return_status();
19 release_lock(&departing_thread_lock);
20
21 /* thread awaits repurposing. */
\end{verbatim}
\end{minipage}
\vspace{5pt}
The control thread's code is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
 1 for (;;) {
 2   for_each_thread(t) {
 3     ct = clock_gettime(...);
 4     d = ct - per_thread(my_timestamp, t);
 5     if (d >= MAX_LOOP_TIME) {
 6       /* thread departing. */
 7       acquire_lock(&departing_thread_lock);
 8       release_lock(&departing_thread_lock);
 9       i = per_thread(my_status, t);
10       status_hist[i]++; /* Bug if TLE! */
11     }
12   }
13   /* Repurpose threads as needed. */
14 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~5 uses the passage of time to deduce that the thread
has exited, executing lines~6-10 if so.
The empty lock-based critical section on lines~7 and~8
guarantees that any thread in the process of exiting
completes (remember that locks are granted in FIFO order!).
Once again, do not try this sort of thing on commodity
microprocessors.
After all, it is difficult enough to get right on systems
specifically designed for hard real-time use!
\QuickQ{}
But the \co{boostee()} function in
Figure~\ref{fig:future:Exploiting Priority Boosting}
alternatively acquires its locks in reverse order!
Won't this result in deadlock?
\QuickA{}
No deadlock will result.
To arrive at deadlock, two different threads must each
acquire the two locks in opposite orders, which does not
happen in this example.
However, deadlock detectors such as
lockdep~\cite{JonathanCorbet2006lockdep}
will flag this as a false positive.
\QuickQAC{cha:app:Important Questions}{Important Questions}
\QuickQ{}
What SMP coding errors can you see in these examples?
See \url{time.c} for full code.
\QuickA{}
\begin{enumerate}
\item Missing \co{barrier()} or \co{volatile} casts on tight loops.
\item Missing memory barriers on the update side.
\item Lack of synchronization between producer and consumer.
\end{enumerate}
\QuickQ{}
How could there be such a large gap between successive
consumer reads?
See \url{timelocked.c} for full code.
\QuickA{}
\begin{enumerate}
\item The consumer might be preempted for long time periods.
\item A long-running interrupt might delay the consumer.
\item The producer might also be running on a faster CPU than is the
consumer (for example, one of the CPUs might have had to
decrease its
clock frequency due to heat-dissipation or power-consumption
constraints).
\end{enumerate}
\QuickQAC{app:primitives:Synchronization Primitives}{Synchronization Primitives}
\QuickQ{}
Give an example of a parallel program that could be written
without synchronization primitives.
\QuickA{}
There are many examples.
One of the simplest would be a parametric study using a
single independent variable.
If the program {\tt run\_study} took a single argument,
then we could use the following bash script to run two
instances in parallel, as might be appropriate on a
two-CPU system:
{ \scriptsize \tt run\_study 1 > 1.out\& run\_study 2 > 2.out; wait}
One could of course argue that the bash ampersand operator and
the ``wait'' primitive are in fact synchronization primitives.
If so, then consider that
this script could be run manually in two separate
command windows, so that the only synchronization would be
supplied by the user himself or herself.
\QuickQ{}
What problems could occur if the variable {\tt counter} were
incremented without the protection of {\tt mutex}?
\QuickA{}
On CPUs with load-store architectures, incrementing {\tt counter}
might compile into something like the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
LOAD counter,r0
INC r0
STORE r0,counter
\end{verbatim}
\end{minipage}
\vspace{5pt}
On such machines, two threads might simultaneously load the
value of {\tt counter}, each increment it, and each store the
result.
The new value of {\tt counter} will then only be one greater
than before, despite two threads each incrementing it.
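To see this effect in action, consider the following minimal program
(a sketch using POSIX threads rather than this book's API), which may
well print a final count noticeably smaller than 2,000,000 when run
on multicore hardware:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Sketch only: demonstrates the lost-update problem. */
#include <pthread.h>
#include <stdio.h>

static long counter;

static void *incrementer(void *arg)
{
  int i;

  for (i = 0; i < 1000000; i++)
    counter++;  /* Unprotected read-modify-write. */
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;

  pthread_create(&t1, NULL, incrementer, NULL);
  pthread_create(&t2, NULL, incrementer, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %ld (expected 2000000)\n", counter);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}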
\QuickQ{}
How could you work around the lack of a per-thread-variable
API on systems that do not provide it?
\QuickA{}
One approach would be to create an array indexed by
{\tt smp\_thread\_id()}, and another would be to use a hash
table to map from {\tt smp\_thread\_id()} to an array
index --- which is in fact what this
set of APIs does in pthread environments.
Another approach would be for the parent to allocate a structure
containing fields for each desired per-thread variable, then
pass this to the child during thread creation.
However, this approach can impose large software-engineering
costs in large systems.
To see this, imagine if all global variables in a large system
had to be declared in a single file, regardless of whether or
not they were C static variables!
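For example, on POSIX systems one could fall back on thread-specific
data, roughly as in the following sketch (the \co{my_counter()}
helper is hypothetical and is not part of this book's API):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Sketch only: per-thread storage via POSIX
 * thread-specific data. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void)
{
  (void)pthread_key_create(&counter_key, free);
}

/* Return a pointer to the calling thread's private counter,
 * creating it on first use. */
static long *my_counter(void)
{
  long *p;

  (void)pthread_once(&counter_once, counter_key_init);
  p = pthread_getspecific(counter_key);
  if (p == NULL) {
    p = calloc(1, sizeof(*p));
    (void)pthread_setspecific(counter_key, p);
  }
  return p;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}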
\QuickQAC{chp:app:whymb:Why Memory Barriers?}{Why Memory Barriers?}
\QuickQ{}
Where does a writeback message originate from and where does
it go to?
\QuickA{}
The writeback message originates from a given CPU, or in some
designs from a given level of a given CPU's cache---or even
from a cache that might be shared among several CPUs.
The key point is that a given cache does not have room for
a given data item, so some other piece of data must be ejected
from the cache to make room.
If there is some other piece of data that is duplicated in some
other cache or in memory, then that piece of data may be simply
discarded, with no writeback message required.
On the other hand, if every piece of data that might be ejected
has been modified so that the only up-to-date copy is in this
cache, then one of those data items must be copied somewhere
else.
This copy operation is undertaken using a ``writeback message''.
The destination of the writeback message has to be something
that is able to store the new value.
This might be main memory, but it also might be some other cache.
If it is a cache, it is normally a higher-level cache for the
same CPU, for example, a level-1 cache might write back to a
level-2 cache.
However, some hardware designs permit cross-CPU writebacks,
so that CPU~0's cache might send a writeback message to CPU~1.
This would normally be done if CPU~1 had somehow indicated
an interest in the data, for example, by having recently
issued a read request.
In short, a writeback message is sent from some part of the
system that is short of space, and is received by some other
part of the system that can accommodate the data.
\QuickQ{}
What happens if two CPUs attempt to invalidate the
same cache line concurrently?
\QuickA{}
One of the CPUs gains access
to the shared bus first,
and that CPU ``wins''. The other CPU must invalidate its copy of the
cache line and transmit an ``invalidate acknowledge'' message
to the other CPU. \\
Of course, the losing CPU can be expected to immediately issue a
``read invalidate'' transaction, so the winning CPU's victory will
be quite ephemeral.
\QuickQ{}
When an ``invalidate'' message appears in a large multiprocessor,
every CPU must give an ``invalidate acknowledge'' response.
Wouldn't the resulting ``storm'' of ``invalidate acknowledge''
responses totally saturate the system bus?
\QuickA{}
It might, if large-scale multiprocessors were in fact implemented
that way. Larger multiprocessors, particularly NUMA machines,
tend to use so-called ``directory-based'' cache-coherence
protocols to avoid this and other problems.
\QuickQ{}
If SMP machines are really using message passing
anyway, why bother with SMP at all?
\QuickA{}
There has been quite a bit of controversy on this topic over
the past few decades. One answer is that the cache-coherence
protocols are quite simple, and therefore can be implemented
directly in hardware, gaining bandwidths and latencies
unattainable by software message passing. Another answer is that
the real truth is to be found in economics due to the relative
prices of large SMP machines and that of clusters of smaller
SMP machines. A third answer is that the SMP programming
model is easier to use than that of distributed systems, but
a rebuttal might note the appearance of HPC clusters and MPI.
And so the argument continues.
\QuickQ{}
How does the hardware handle the delayed transitions
described above?
\QuickA{}
Usually by adding additional states, though these additional
states need not be actually stored with the cache line, due to
the fact that only a few lines at a time will be transitioning.
The need to delay transitions is but one issue that results in
real-world cache coherence protocols being much more complex than
the over-simplified MESI protocol described in this appendix.
Hennessy and Patterson's classic introduction to computer
architecture~\cite{Hennessy95a} covers many of these issues.
\QuickQ{}
What sequence of operations would put the CPUs' caches
all back into the ``invalid'' state?
\QuickA{}
There is no such sequence, at least in absence of special
``flush my cache'' instructions in the CPU's instruction set.
Most CPUs do have such instructions.
\QuickQ{}
But if the main purpose of store buffers is to hide acknowledgment
latencies in multiprocessor cache-coherence protocols, why
do uniprocessors also have store buffers?
\QuickA{}
Because the purpose of store buffers is not just to hide
acknowledgement latencies in multiprocessor cache-coherence protocols,
but to hide memory latencies in general.
Because memory is much slower than is cache on uniprocessors,
store buffers on uniprocessors can help to hide write-miss
latencies.
\QuickQ{}
In step~1 above, why does CPU~0 need to issue a ``read invalidate''
rather than a simple ``invalidate''?
\QuickA{}
Because the cache line in question contains more than just the
variable \co{a}.
\QuickQ{}
In step~1 of the first scenario in
Section~\ref{sec:app:whymb:Invalidate Queues and Memory Barriers},
why is an ``invalidate'' sent instead of a ``read invalidate''
message?
Doesn't CPU~0 need the values of the other variables that share
this cache line with ``a''?
\QuickA{}
CPU~0 already has the values of these variables, given that it
has a read-only copy of the cache line containing ``a''.
Therefore, all CPU~0 need do is to cause the other CPUs to discard
their copies of this cache line.
An ``invalidate'' message therefore suffices.
\QuickQ{}
Say what???
Why do we need a memory barrier here, given that the CPU cannot
possibly execute the \co{assert()} until after the
\co{while} loop completes?
\QuickA{}
CPUs are free to speculatively execute, which can have the effect
of executing the assertion before the \co{while} loop completes.
Furthermore, compilers normally assume that only the currently
executing thread is updating the variables, and this assumption
allows the compiler to hoist the load of \co{a} to precede the
loop.
In fact, some compilers would transform the loop to a branch
around an infinite loop as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 void foo(void)
 2 {
 3   a = 1;
 4   smp_mb();
 5   b = 1;
 6 }
 7
 8 void bar(void)
 9 {
10   if (b == 0)
11     for (;;)
12       continue;
13   smp_mb();
14   assert(a == 1);
15 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Given this optimization, the assertion could clearly fire.
You should use volatile casts or (where available) C++
relaxed atomics to prevent the compiler from optimizing
your parallel code into oblivion.
In short, both compilers and CPUs are quite aggressive about
optimizing, so you must clearly communicate your constraints
to them, using compiler directives and memory barriers.
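As a rough illustration of the volatile-cast approach (a sketch only,
using simplified macros in the style of the Linux kernel rather than
its actual definitions, and assuming the declarations of \co{a},
\co{b}, and \co{smp_mb()} from the example above), \co{foo()} and
\co{bar()} might be coded as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Simplified volatile-access macros (gcc typeof extension). */
#define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

void foo(void)
{
  WRITE_ONCE(a, 1);
  smp_mb();
  WRITE_ONCE(b, 1);
}

void bar(void)
{
  while (READ_ONCE(b) == 0)
    continue;  /* Load of b can no longer be hoisted. */
  smp_mb();
  assert(READ_ONCE(a) == 1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}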
\QuickQ{}
Does the guarantee that each CPU sees its own memory accesses
in order also guarantee that each user-level thread will see
its own memory accesses in order?
Why or why not?
\QuickA{}
No. Consider the case where a thread migrates from one CPU to
another, and where the destination CPU perceives the source
CPU's recent memory operations out of order. To preserve
user-mode sanity, kernel hackers must use memory barriers in
the context-switch path. However, the locking already required
to safely do a context switch should automatically provide
the memory barriers needed to cause the user-level task to see
its own accesses in order. That said, if you are designing a
super-optimized scheduler, either in the kernel or at user level,
please keep this scenario in mind!
\QuickQ{}
Could this code be fixed by inserting a memory barrier
between CPU~1's ``while'' and assignment to ``c''?
Why or why not?
\QuickA{}
No. Such a memory barrier would only force ordering local to CPU~1.
It would have no effect on the relative ordering of CPU~0's and
CPU~1's accesses, so the assertion could still fail.
However, all mainstream computer systems provide one mechanism
or another to provide ``transitivity'', which provides
intuitive causal ordering: if B saw the effects of A's accesses,
and C saw the effects of B's accesses, then C must also see
the effects of A's accesses.
In short, hardware designers have taken at least a little pity
on software developers.
\QuickQ{}
Suppose that lines~3-5 for CPUs~1 and 2 in
Table~\ref{tab:app:whymb:Memory Barrier Example 3}
are in an interrupt
handler, and that the CPU~2's line~9 is run at process level.
What changes, if any, are required to enable the code to work
correctly, in other words, to prevent the assertion from firing?
\QuickA{}
The assertion will need to be written to ensure that the load of
``e'' precedes that of ``a''.
In the Linux kernel, the barrier() primitive may be used to accomplish
this in much the same way that the memory barrier was used in the
assertions in the previous examples.
\QuickQ{}
If CPU~2 executed an \co{assert(e==0||c==1)} in the example in
Table~\ref{tab:app:whymb:Memory Barrier Example 3},
would this assert ever trigger?
\QuickA{}
The result depends on whether the CPU supports ``transitivity.''
In other words, CPU~0 stored to ``e'' after seeing CPU~1's
store to ``c'', with a memory barrier between CPU~0's load
from ``c'' and store to ``e''.
If some other CPU sees CPU~0's store to ``e'', is it also
guaranteed to see CPU~1's store?
All CPUs I am aware of claim to provide transitivity.
\QuickQ{}
Why is Alpha's \co{smp_read_barrier_depends()} an
\co{smp_mb()} rather than \co{smp_rmb()}?
\QuickA{}
First, Alpha has only \co{mb} and \co{wmb} instructions,
so \co{smp_rmb()} would be implemented by the Alpha \co{mb}
instruction in either case.
More importantly, \co{smp_read_barrier_depends()} must
order subsequent stores.
For example, consider the following code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 p = global_pointer;
2 smp_read_barrier_depends();
3 if (do_something_with(p->a, p->b) == 0)
4   p->hey_look = 1;
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here the store to \co{p->hey_look} must be ordered,
not just the loads from \co{p->a} and \co{p->b}.
\QuickQAC{app:rcuimpl:Read-Copy Update Implementations}{Read-Copy Update Implementations}
\QuickQ{}
Why is sleeping prohibited within Classic RCU read-side
critical sections?
\QuickA{}
Because sleeping implies a context switch, which in Classic RCU is
a quiescent state, and RCU's grace-period detection requires that
quiescent states never appear in RCU read-side critical sections.
\QuickQ{}
Why not permit sleeping in Classic RCU read-side critical sections
by eliminating context switch as a quiescent state, leaving user-mode
execution and idle loop as the remaining quiescent states?
\QuickA{}
This would mean that a system undergoing heavy kernel-mode
execution load (e.g., due to kernel threads) might never
complete a grace period, which
would cause it to exhaust memory sooner or later.
\QuickQ{}
Why is it OK to assume that updates separated by
{\tt synchronize\_sched()} will be performed in order?
\QuickA{}
Because this property is required for the {\tt synchronize\_sched()}
aspect of RCU to work at all.
For example, consider a code sequence that removes an object
from a list, invokes {\tt synchronize\_sched()}, then frees
the object.
If this property did not hold, then that object might appear
to be freed before it was
removed from the list, which is precisely the situation that
{\tt synchronize\_sched()} is supposed to prevent!
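For example, the usual removal pattern (a sketch with illustrative
names, not a specific kernel code sequence) relies on exactly this
ordering:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
list_del_rcu(&p->list);  /* Unlink the element. */
synchronize_sched();     /* Wait for pre-existing readers. */
kfree(p);                /* Safe: no reader still references p. */
\end{verbatim}
\end{minipage}
\vspace{5pt}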
\QuickQ{}
Why must line~17 in {\tt synchronize\_srcu()}
(Figure~\ref{fig:app:rcuimpl:Update-Side Implementation})
precede the release of the mutex on line~18?
What would have to change to permit these two lines to be
interchanged?
Would such a change be worthwhile?
Why or why not?
\QuickA{}
Suppose that the order was reversed, and that CPU~0
has just reached line~13 of
{\tt synchronize\_srcu()}, while both CPU~1 and CPU~2 start executing
another {\tt synchronize\_srcu()} each, and CPU~3 starts executing a
{\tt srcu\_read\_lock()}.
Suppose that CPU~1 reaches line~6 of {\tt synchronize\_srcu()}
just before CPU~0 increments the counter on line~13.
Most importantly, suppose that
CPU~3 executes {\tt srcu\_read\_lock()}
out of order with the following SRCU read-side critical section,
so that it acquires a reference to some SRCU-protected data
structure \emph{before} CPU~0 increments {\tt sp->completed}, but
executes the {\tt srcu\_read\_lock()} \emph{after} CPU~0 does
this increment.
Then CPU~0 will \emph{not} wait for CPU~3 to complete its
SRCU read-side critical section before exiting the ``while''
loop on lines~15-16 and releasing the mutex (remember, the
CPU could be reordering the code).
Now suppose that CPU~2 acquires the mutex next,
and again increments {\tt sp->completed}.
This CPU will then have to wait for CPU~3 to exit its SRCU
read-side critical section before exiting the loop on
lines~15-16 and releasing the mutex.
But suppose that CPU~3 again executes out of order,
completing the {\tt srcu\_read\_unlock()} prior to
executing a final reference to the pointer it obtained
when entering the SRCU read-side critical section.
CPU~1 will then acquire the mutex, but see that the
{\tt sp->completed} counter has incremented twice, and
therefore take the early exit.
The caller might well free up the element that CPU~3 is
still referencing (due to CPU~3's out-of-order execution).
To prevent this perhaps improbable, but entirely possible,
scenario, the final {\tt synchronize\_sched()} must precede
the mutex release in {\tt synchronize\_srcu()}.
Another approach would be to change the comparison on
line~7 of {\tt synchronize\_srcu()} to check for at
least three increments of the counter.
However, such a change would increase the latency of a
``bulk update'' scenario, where a hash table is being updated
or unloaded using multiple threads.
In the current code, the latency of the resulting concurrent
{\tt synchronize\_srcu()} calls would take at most two SRCU
grace periods, while with this change, three would be required.
More experience will be required to determine which approach
is really better.
For one thing, there must first be some use of SRCU with
multiple concurrent updaters.
\QuickQ{}
Wait a minute!
With all those new locks, how do you avoid deadlock?
\QuickA{}
Deadlock is avoided by never holding more than one of the
\co{rcu_node} structures' locks at a given time.
This algorithm uses two more locks, one to prevent CPU hotplug
operations from running concurrently with grace-period advancement
(\co{onofflock}) and another
to permit only one CPU at a time from forcing a quiescent state
to end quickly (\co{fqslock}).
These are subject to a locking hierarchy, so that
\co{fqslock} must be acquired before
\co{onofflock}, which in turn must be acquired before
any of the \co{rcu_node} structures' locks.
Also, as a practical matter, refusing to ever hold more than
one of the \co{rcu_node} locks means that it is unnecessary
to track which ones are held.
Such tracking would be painful as well as unnecessary.
\QuickQ{}
Why stop at a 64-times reduction?
Why not go for a few orders of magnitude instead?
\QuickA{}
RCU works with no problems on
systems with a few hundred CPUs, so allowing 64 CPUs to contend on
a single lock leaves plenty of headroom.
Keep in mind that these locks are acquired quite rarely, as each
CPU will check in about one time per grace period, and grace periods
extend for milliseconds.
\QuickQ{}
But I don't care about McKenney's lame excuses in the answer to
Quick Quiz 2!!!
I want to get the number of CPUs contending on a single lock down
to something reasonable, like sixteen or so!!!
\QuickA{}
OK, have it your way, then!
Set \co{CONFIG_RCU_FANOUT=16} and (for \co{NR_CPUS=4096})
you will get a
three-level hierarchy with 256 \co{rcu_node} structures
at the lowest level, 16 \co{rcu_node} structures as intermediate
nodes, and a single root-level \co{rcu_node}.
The penalty you will pay is that more \co{rcu_node} structures
will need to be scanned when checking to see which CPUs need help
completing their quiescent states (256 instead of only 64).
\QuickQ{}
OK, so what is the story with the colors?
\QuickA{}
Data structures analogous to \co{rcu_state} (including
\co{rcu_ctrlblk}) are yellow,
those containing the bitmaps used to determine when CPUs have checked
in are pink,
and the per-CPU \co{rcu_data} structures are blue.
The data structures used to conserve energy
(such as \co{rcu_dynticks}) will be colored green.
\QuickQ{}
Given such an egregious bug, why does Linux run at all?
\QuickA{}
Because the Linux kernel contains device drivers that are (relatively)
well behaved.
Few if any of them spin in RCU read-side critical sections for the
many milliseconds that would be required to provoke this bug.
The bug nevertheless does need to be fixed, and this variant of
RCU does fix it.
\QuickQ{}
But doesn't this state diagram indicate that dyntick-idle CPUs will
get hit with reschedule IPIs? Won't that wake them up?
\QuickA{}
No.
Keep in mind that RCU is handling groups of CPUs.
One particular group might contain both dyntick-idle CPUs and
CPUs in normal mode that have somehow managed to avoid passing through
a quiescent state.
Only the latter group will be sent a reschedule IPI; the dyntick-idle
CPUs will merely be marked as being in an extended quiescent state.
\QuickQ{}
But what happens if a CPU tries to report going through a quiescent
state (by clearing its bit) before the bit-setting CPU has finished?
\QuickA{}
There are three cases to consider here:
\begin{enumerate}
\item A CPU corresponding to a not-yet-initialized leaf
\co{rcu_node} structure tries to report a quiescent state.
This CPU will see its bit already cleared, so will give up on
reporting its quiescent state.
Some later quiescent state will serve for the new grace period.
\item A CPU corresponding to a leaf \co{rcu_node} structure that
is currently being initialized tries to report a quiescent
state.
This CPU will see that the \co{rcu_node} structure's
\co{->lock} is held, so will spin until it is
released.
But once the lock is released, the \co{rcu_node}
structure will have been initialized, reducing to the
following case.
\item A CPU corresponding to a leaf \co{rcu_node} that has
already been initialized tries to report a quiescent state.
This CPU will find its bit set, and will therefore clear it.
If it is the last CPU for that leaf node, it will
move up to the next level of the hierarchy.
However, this CPU cannot possibly be the last CPU in the
system to report a quiescent state, given that the CPU
doing the initialization cannot yet have checked in.
\end{enumerate}
So, in all three cases, the potential race is resolved correctly.
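The three cases above can be summarized by the following hedged
sketch of per-leaf quiescent-state reporting.
The \co{sketch_rcu_node} structure and \co{sketch_report_qs()}
function are illustrative, loosely modeled on the text rather than
copied from the Linux kernel:
{\small
\begin{verbatim}
struct sketch_rcu_node {
    spinlock_t lock;
    unsigned long qsmask; /* bit per CPU still owing a QS */
};

static void sketch_report_qs(struct sketch_rcu_node *rnp,
                             unsigned long mask)
{
    unsigned long flags;

    /* Case 2: spins here until initialization drops the lock. */
    spin_lock_irqsave(&rnp->lock, flags);
    if (!(rnp->qsmask & mask)) {
        /* Case 1: bit already clear, give up on reporting. */
        spin_unlock_irqrestore(&rnp->lock, flags);
        return;
    }
    /* Case 3: record this CPU's quiescent state. */
    rnp->qsmask &= ~mask;
    if (rnp->qsmask == 0) {
        /* Last CPU for this leaf: propagate upwards (not shown). */
    }
    spin_unlock_irqrestore(&rnp->lock, flags);
}
\end{verbatim}
}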
\QuickQ{}
And what happens if \emph{all} CPUs try to report going
through a quiescent
state before the bit-setting CPU has finished, thus ending the new
grace period before it starts?
\QuickA{}
The bit-setting CPU cannot pass through a
quiescent state during initialization, as it has irqs disabled.
Its bits therefore remain non-zero, preventing the grace period from
ending until the data structure has been fully initialized.
\QuickQ{}
And what happens if one CPU comes out of dyntick-idle mode and then
passed through a quiescent state just as another CPU notices that the
first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same
time, resulting in confusion?
\QuickA{}
They will both attempt to acquire the lock on the same leaf
\co{rcu_node} structure.
The first one to acquire the lock will report the quiescent state
and clear the appropriate bit, and the second one to acquire the
lock will see that this bit has already been cleared.
\QuickQ{}
But what if \emph{all} the CPUs end up in dyntick-idle mode?
Wouldn't that prevent the current RCU grace period from ever ending?
\QuickA{}
Indeed it will!
However, CPUs that have RCU callbacks are not permitted to enter
dyntick-idle mode, so the only way that \emph{all} the CPUs could
possibly end up in dyntick-idle mode would be if there were
absolutely no RCU callbacks in the system.
And if there are no RCU callbacks in the system, then there is no
need for the RCU grace period to end.
In fact, there is no need for the RCU grace period to even
\emph{start}.
RCU will restart if some irq handler does a \co{call_rcu()},
which will cause an RCU callback to appear on the corresponding CPU,
which will force that CPU out of dyntick-idle mode, which will in turn
permit the current RCU grace period to come to an end.
\QuickQ{}
Given that \co{force_quiescent_state()} is a three-phase state
machine, don't we have triple the scheduling latency due to scanning
all the CPUs?
\QuickA{}
Ah, but the three phases will not execute back-to-back on the same CPU,
and, furthermore, the first (initialization) phase doesn't do any
scanning.
Therefore, the scheduling-latency hit of the three-phase algorithm
is no different than that of a single-phase algorithm.
If the scheduling latency becomes a problem, one approach would be to
recode the state machine to scan the CPUs incrementally, most likely
by keeping state on a per-leaf-\co{rcu_node} basis.
But first show me a problem in the real world, \emph{then}
I will consider fixing it!
\QuickQ{}
But the other reason to hold \co{->onofflock} is to prevent
multiple concurrent online/offline operations, right?
\QuickA{}
Actually, no!
The CPU-hotplug code's synchronization design prevents multiple
concurrent CPU online/offline operations, so only one CPU
online/offline operation can be executing at any given time.
Therefore, the only purpose of \co{->onofflock} is to prevent a CPU
online or offline operation from running concurrently with grace-period
initialization.
\QuickQ{}
Given all these acquisitions of the global \co{->onofflock},
won't there
be horrible lock contention when running with thousands of CPUs?
\QuickA{}
Actually, there can be only three acquisitions of this lock per grace
period, and each grace period lasts many milliseconds.
One of the acquisitions is by the CPU performing initialization for the
current grace period, and the other two are by CPUs onlining or
offlining some CPU.
These latter two cannot run concurrently due to the CPU-hotplug
locking, so at most two CPUs can be contending for this lock at any
given time.
Lock contention on \co{->onofflock} should therefore
be no problem, even on systems with thousands of CPUs.
\QuickQ{}
Why not simplify the code by merging the detection of dyntick-idle
CPUs with that of offline CPUs?
\QuickA{}
Such merging might well eventually prove to be the right thing to do.
In the meantime, however, there are some challenges:
\begin{enumerate}
\item CPUs are not allowed to go into dyntick-idle mode while they
have RCU callbacks pending, but CPUs \emph{are} allowed to go
offline with callbacks pending.
This means that CPUs going offline need to have their callbacks
migrated to some other CPU; thus, we cannot allow CPUs to
simply go quietly offline.
\item Present-day Linux systems run with \co{NR_CPUS}
much larger than the actual number of CPUs.
A unified approach could thus end up uselessly waiting on
CPUs that are not just offline, but which never existed in
the first place.
\item RCU is already operational when CPUs get onlined one
at a time during boot, and therefore must handle the online
process.
This onlining must exclude grace-period initialization, so
the \co{->onofflock} must still be used.
\item CPUs often switch into and out of dyntick-idle mode
extremely frequently, so it is not reasonable to use the
heavyweight online/offline code path for entering and exiting
dyntick-idle mode.
\end{enumerate}
\QuickQ{}
Why not simply disable bottom halves (softirq) when acquiring
the \co{rcu_data} structure's \co{lock}?
Wouldn't this be faster?
\QuickA{}
Because this lock can be acquired from functions
called by \co{call_rcu()}, which in turn can be
invoked from irq handlers.
Therefore, irqs \emph{must} be disabled when
holding this lock.
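A hedged sketch of the resulting locking pattern follows;
\co{sketch_enqueue_callback()} is a hypothetical wrapper, \co{rdp}
is assumed to point to the chosen \co{rcu_data} structure, and the
callback-enqueue step itself is elided:
{\small
\begin{verbatim}
/* Kernel-context sketch, not the actual call_rcu() path. */
static void sketch_enqueue_callback(struct rcu_data *rdp)
{
    unsigned long flags;

    /*
     * Hardirqs must be off while holding ->lock: call_rcu()
     * can run from a hardirq handler, so disabling only
     * softirqs would permit self-deadlock on this same lock.
     */
    spin_lock_irqsave(&rdp->lock, flags);
    /* ... enqueue the new callback on rdp's list ... */
    spin_unlock_irqrestore(&rdp->lock, flags);
}
\end{verbatim}
}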
\QuickQ{}
How about the \co{qsmask} and \co{qsmaskinit}
fields for the leaf \co{rcu_node} structures?
Doesn't there have to be some way to work out
which of the bits in these fields corresponds
to each CPU covered by the \co{rcu_node} structure
in question?
\QuickA{}
Indeed there does!
The \co{grpmask} field in each CPU's \co{rcu_data}
structure does this job.
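One plausible encoding, consistent with the later discussion of the
unsigned-long \co{mask} argument, assigns each CPU a single bit within
its leaf's bitmask.
In this sketch, \co{grplo} is assumed to be the lowest-numbered CPU
covered by the leaf in question:
{\small
\begin{verbatim}
/* Hedged sketch: map a CPU to its bit within its leaf. */
rdp->grpmask = 1UL << (cpu - rnp->grplo);

/* Reporting a quiescent state then clears exactly that bit. */
rnp->qsmask &= ~rdp->grpmask;
\end{verbatim}
}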
\QuickQ{}
But why bother setting \co{qs_pending} to one when a CPU
is coming online, given that being offline is an extended
quiescent state that should cover any ongoing grace period?
\QuickA{}
Because this helps to resolve the race that occurs when a CPU
comes online just as a new grace period is starting.
\QuickQ{}
Why record the last completed grace period number in
\co{passed_quiesc_completed}?
Doesn't that cause this RCU implementation to be vulnerable
to quiescent states seen while no grace period was in progress
being incorrectly applied to the next grace period that starts?
\QuickA{}
We record the last completed grace period number in order
to avoid races where a quiescent state noted near the end of
one grace period is incorrectly applied to the next grace
period, especially for dyntick and CPU-offline grace periods.
Therefore, \co{force_quiescent_state()} and friends all
check the last completed grace period number to avoid such races.
Now these dyntick and CPU-offline grace periods are only checked
for when a grace period is actually active.
The only quiescent states that can be recorded when no grace
period is in progress are self-detected quiescent states,
which are recorded in the \co{passed_quiesc_completed},
\co{passed_quiesc}, and \co{qs_pending} fields.
These variables are initialized every time the corresponding
CPU notices that a new grace period has started, preventing
any obsolete quiescent states from being applied to the
new grace period.
All that said, optimizing grace-period latency may require that
\co{gpnum} be tracked in addition to \co{completed}.
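A hedged sketch of this filter appears below.
The field names follow the text, but the helper function itself is
illustrative, and its \co{completed} argument is assumed to be the
grace-period-completion number against which the current grace period
is measured:
{\small
\begin{verbatim}
/*
 * A self-detected quiescent state counts only if it was
 * recorded after the previous grace period completed, that
 * is, while the current grace period was in effect.
 */
static int sketch_qs_still_valid(struct rcu_data *rdp,
                                 long completed)
{
    return rdp->passed_quiesc &&
           rdp->passed_quiesc_completed == completed;
}
\end{verbatim}
}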
\QuickQ{}
What is the point of running a system with \co{NR_CPUS}
way bigger than the actual number of CPUs?
\QuickA{}
Because this allows producing a single binary of the Linux kernel
that runs on a wide variety of systems, greatly easing administration
and validation.
\QuickQ{}
Why not simply have multiple lists rather than this funny
multi-tailed list?
\QuickA{}
Because this multi-tailed approach, due to Lai Jiangshan,
simplifies callback processing.
\QuickQ{}
So some poor CPU has to note quiescent states on behalf of
each and every offline CPU?
Yecch!
Won't that result in excessive overheads in the not-uncommon
case of a system with a small number of CPUs but a large value
for \co{NR_CPUS}?
\QuickA{}
Actually, no it will not!
Offline CPUs are excluded from both the \co{qsmask} and
\co{qsmaskinit} bit masks, so RCU normally ignores them.
However, there are races with online/offline operations that
can result in an offline CPU having its \co{qsmask} bit set.
These races must of course be handled correctly, and the way
they are handled is to permit other CPUs to note that RCU
is waiting on a quiescent state from an offline CPU.
\QuickQ{}
So what guards the earlier fields in this structure?
\QuickA{}
Nothing does, as they are constants set at compile time
or boot time.
Of course, the fields internal to each \co{rcu_node}
in the \co{->node} array may change, but they are
guarded separately.
\QuickQ{}
I thought that RCU read-side processing was supposed to
be \emph{fast}!
The functions shown in
Figure~\ref{fig:app:rcuimpl:rcutreewt:RCU Read-Side Critical Sections}
have so much junk in them that they just \emph{have} to be slow!
What gives here?
\QuickA{}
Appearances can be deceiving.
The \co{preempt_disable()}, \co{preempt_enable()},
\co{local_bh_disable()}, and \co{local_bh_enable()} each
do a single non-atomic manipulation of local data.
Even that assumes \co{CONFIG_PREEMPT}; otherwise,
the \co{preempt_disable()} and \co{preempt_enable()}
functions emit no code, not even compiler directives.
The \co{__acquire()} and \co{__release()} functions
emit no code (not even compiler directives), but are instead
used by the \co{sparse} semantic-parsing bug-finding program.
Finally, \co{rcu_read_acquire()} and \co{rcu_read_release()}
emit no code (not even compiler directives) unless the
``lockdep'' lock-order debugging facility is enabled, in
which case they can indeed be somewhat expensive.
In short, unless you are a kernel hacker who has enabled
debugging options, these functions are extremely cheap,
and in some cases, absolutely free of overhead.
And, in the words of a Portland-area furniture retailer,
``free is a \emph{very} good price''.
\QuickQ{}
Why not simply use \co{__get_cpu_var()} to pick up a
reference to the
current CPU's \co{rcu_data} structure on line~13 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree call-rcu}?
\QuickA{}
Because we might be called either from \co{call_rcu()}
(in which case we would need \co{__get_cpu_var(rcu_data)})
or from \co{call_rcu_bh()} (in which case we would need
\co{__get_cpu_var(rcu_bh_data)}).
Using the \co{->rda[]} array of whichever
\co{rcu_state} structure we were passed works correctly
regardless of which API \co{__call_rcu()} was invoked from
(suggested by Lai Jiangshan~\cite{LaiJiangshan2008NewClassicAlgorithm}).
\QuickQ{}
Given that \co{rcu_pending()} is always called twice
on lines~29-32 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-check-callbacks},
shouldn't there be some way to combine the checks of the
two structures?
\QuickA{}
Sorry, but this was a trick question.
The C language's short-circuit boolean expression evaluation
means that \co{__rcu_pending()} is invoked on
\co{rcu_bh_state} only if the prior invocation on
\co{rcu_state} returns zero.
The reason the two calls are in this order is that
``rcu'' is used more heavily than is ``rcu\_bh'', so
the first call is more likely to return non-zero than
is the second.
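The following hedged sketch shows the shape of this short-circuit
evaluation; it follows the code described in the figure, but should
not be taken as an exact copy of any particular kernel version:
{\small
\begin{verbatim}
int rcu_pending(int cpu)
{
    /* The "rcu_bh" check runs only if the "rcu" check
     * finds nothing to do. */
    return __rcu_pending(&rcu_state,
                         &per_cpu(rcu_data, cpu)) ||
           __rcu_pending(&rcu_bh_state,
                         &per_cpu(rcu_bh_data, cpu));
}
\end{verbatim}
}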
\QuickQ{}
Shouldn't line~42 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-check-callbacks}
also check for \co{in_hardirq()}?
\QuickA{}
No.
The \co{rcu_read_lock_bh()} primitive disables
softirq, not hardirq.
Because \co{call_rcu_bh()} need only wait for pre-existing
``rcu\_bh'' read-side critical sections to complete,
we need only check \co{in_softirq()}.
\QuickQ{}
But don't we also need to check that a grace period is
actually in progress in \co{__rcu_process_callbacks()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-process-callbacks}?
\QuickA{}
Indeed we do!
And the first thing that \co{force_quiescent_state()} does
is to perform exactly that check.
\QuickQ{}
What happens if two CPUs attempt to start a new grace
period concurrently in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-process-callbacks}?
\QuickA{}
One of the CPUs will be the first to acquire the root
\co{rcu_node} structure's lock, and that CPU will start
the grace period.
The other CPU will then acquire the lock and invoke
\co{rcu_start_gp()}, which, seeing that a grace period
is already in progress, will immediately release the
lock and return.
\QuickQ{}
How does the code traverse a given path through
the \co{rcu_node} hierarchy from root to leaves?
\QuickA{}
It turns out that the code never needs to do such a traversal,
so there is nothing special in place to handle this.
\QuickQ{}
C-preprocessor macros are \emph{so} 1990s!
Why not get with the times and convert \co{RCU_DATA_PTR_INIT()}
in Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init}
to be a function?
\QuickA{}
Because, although it is possible to pass a reference to
a particular CPU's instance of a per-CPU variable to a function,
there does not appear to be a good way to pass a reference to
the full set of instances of a given per-CPU variable to
a function.
One could of course build an array of pointers, then pass a
reference to the array in, but that is part of what
the \co{RCU_DATA_PTR_INIT()} macro is doing in the first place.
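A hedged, simplified sketch of such a macro is shown below; the
\co{SKETCH_RCU_DATA_PTR_INIT()} name is hypothetical.
The key point is that the name of the per-CPU variable is substituted
textually, something no C function can do, because a function can
receive a pointer to one CPU's instance but not a handle on the entire
set of instances:
{\small
\begin{verbatim}
#define SKETCH_RCU_DATA_PTR_INIT(rsp, rcu_data_var) \
do { \
    int __i; \
    /* Record a pointer to each CPU's instance. */ \
    for_each_possible_cpu(__i) \
        (rsp)->rda[__i] = &per_cpu(rcu_data_var, __i); \
} while (0)
\end{verbatim}
}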
\QuickQ{}
What happens if a CPU comes online between the time
that the last online CPU is notified on lines~25-26 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init}
and the time that \co{register_cpu_notifier()} is invoked
on line~27?
\QuickA{}
Only one CPU is online at this point, so the only way another
CPU can come online is if this CPU puts it online, which it
is not doing.
\QuickQ{}
Why call \co{cpu_quiet()} on line~41 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init-percpu-data},
given that we are excluding grace periods with various
locks, and given that any earlier grace periods would not have
been waiting on this previously-offlined CPU?
\QuickA{}
A new grace period might have started just after the
\co{->onofflock} was released on line~40.
The \co{cpu_quiet()} will help expedite such a grace period.
\QuickQ{}
But what if the \co{rcu_node} hierarchy has only a single
structure, as it would on a small system?
What prevents concurrent grace-period initialization in that
case, given the code in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}?
\QuickA{}
The later acquisition of the sole \co{rcu_node} structure's
\co{->lock} on line~16 excludes grace-period initialization,
which must acquire this same lock in order to initialize this
sole \co{rcu_node} structure for the new grace period.
The \co{->onofflock} is needed only for multi-node hierarchies,
and is used in that case as an alternative to acquiring and
holding \emph{all} of the \co{rcu_node} structures'
\co{->lock} fields, which would be incredibly painful on
large systems.
\QuickQ{}
But does line~25 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
ever really exit the loop?
Why or why not?
\QuickA{}
The only way that line~25 could exit the loop is if \emph{all}
CPUs were to be put offline.
This cannot happen in the Linux kernel as of 2.6.28, though
other environments have been designed to offline all CPUs
during the normal shutdown procedure.
\QuickQ{}
Suppose that line~26 got executed seriously out of order in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu},
so that \co{lastcomp} is set to some prior grace period, but
so that the current grace period is still waiting on the
now-offline CPU?
In this case, won't the call to \co{cpu_quiet()} fail to
report the quiescent state, thus causing the grace period
to wait forever for this now-offline CPU?
\QuickA{}
First, the lock acquisitions on lines~16 and 12 would prevent
the execution of line~26 from being pushed that far out of
order.
Nevertheless, even if line~26 managed to be misordered that
dramatically, what would happen is that \co{force_quiescent_state()}
would eventually be invoked, and would notice that the current
grace period was waiting for a quiescent state from an offline
CPU.
Then \co{force_quiescent_state()} would report the extended
quiescent state on behalf of the offlined CPU.
\QuickQ{}
Given that an offline CPU is in an extended quiescent state,
why does line~28 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
need to care which grace period it is
dealing with?
\QuickA{}
It really does not need to care in this case.
However, because it \emph{does} need to care in many other
cases, the \co{cpu_quiet()} function does take the
grace-period number as an argument, so some value must be
supplied.
\QuickQ{}
But this list movement in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
makes all of the going-offline CPU's callbacks go through
another grace period, even if they were ready to invoke.
Isn't that inefficient?
Furthermore, couldn't an unfortunate pattern of CPUs going
offline then coming back online prevent a given callback from
ever being invoked?
\QuickA{}
It is inefficient, but it is simple.
Given that this is not a commonly executed code path, this
is the right tradeoff.
The starvation case would be a concern, except that the
online and offline process involves multiple grace periods.
\QuickQ{}
Why not just expand \co{note_new_gpnum()} inline into
\co{check_for_new_grace_period()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Noting New Grace Periods}?
\QuickA{}
Because \co{note_new_gpnum()} must be called for each new
grace period, including both those started by this CPU and
those started by other CPUs.
In contrast, \co{check_for_new_grace_period()} is called only
for the case where some other CPU started the grace period.
\QuickQ{}
But there has been no initialization yet at line~15 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}!
What happens if a CPU notices the new grace period and
immediately attempts to report a quiescent state?
Won't it get confused?
\QuickA{}
There are two cases of interest.
In the first case, there is only a single \co{rcu_node}
structure in the hierarchy.
Since the CPU executing in \co{rcu_start_gp()} is currently
holding that \co{rcu_node} structure's lock, the CPU
attempting to report the quiescent state will not be able
to acquire this lock until initialization is complete,
at which point the quiescent state will be reported
normally.
In the second case, there are multiple \co{rcu_node} structures,
and the leaf \co{rcu_node} structure corresponding to the
CPU that is attempting to report the quiescent state already
has that CPU's \co{->qsmask} bit cleared.
Therefore, the CPU attempting to report the quiescent state
will give up, and some later quiescent state for that CPU
will be applied to the new grace period.
\QuickQ{}
Hey!
Shouldn't we hold the non-leaf \co{rcu_node} structures'
locks when munging their state in line~37 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}???
\QuickA{}
There is no need to hold their locks.
The reasoning is as follows:
\begin{enumerate}
\item The new grace period cannot end, because the running CPU
(which is initializing it) won't pass through a
quiescent state.
Therefore, there is no race with another invocation
of \co{rcu_start_gp()}.
\item The running CPU holds \co{->onofflock}, so there
is no race with CPU-hotplug operations.
\item The leaf \co{rcu_node} structures are not yet initialized,
so they have all of their \co{->qsmask} bits cleared.
This means that any other CPU attempting to report
a quiescent state will stop at the leaf level,
and thus cannot race with the current CPU for non-leaf
\co{rcu_node} structures.
\item The RCU tracing functions access, but do not modify,
the \co{rcu_node} structures' fields.
Races with these functions are therefore harmless.
\end{enumerate}
\QuickQ{}
Why can't we merge the loop spanning lines~36-37 with
the loop spanning lines~40-44 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}?
\QuickA{}
If we were to do so, we would either be needlessly acquiring locks
for the non-leaf \co{rcu_node} structures or would need
ugly checks for a given node being a leaf node on each pass
through the loop.
(Recall that we must acquire the locks for the leaf
\co{rcu_node} structures due to races with CPUs attempting
to report quiescent states.)
Nevertheless, it is quite possible that experience on very large
systems will show that such merging is in fact the right thing
to do.
\QuickQ{}
What prevents lines~11-12 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-check-quiescent-state}
from reporting a quiescent state from a prior
grace period against the current grace period?
\QuickA{}
If this could occur, it would be a serious bug, since the
CPU in question might be in an RCU read-side critical section
that started before the beginning of the current grace period.
There are several cases to consider for the CPU in question:
\begin{enumerate}
\item It remained online and active throughout.
\item It was in dynticks-idle mode for at least part of the current
grace period.
\item It was offline for at least part of the current grace period.
\end{enumerate}
In the first case, the prior grace period could not have
ended without this CPU explicitly reporting a quiescent
state, which would leave \co{->qs_pending} zero.
This in turn would mean that lines~7-8 would return, so
that control would not reach \co{cpu_quiet()} unless
\co{check_for_new_grace_period()} had noted the new grace
period.
However, if the current grace period had been noted, it would
also have set \co{->passed_quiesc} to zero, in which case
lines~9-10 would have returned, again meaning that \co{cpu_quiet()}
would not be invoked.
Finally, the only way that \co{->passed_quiesc} could be set to one
would be if \co{rcu_check_callbacks()} was invoked by
a scheduling-clock interrupt that occurred somewhere between
lines~5 and 9 of \co{rcu_check_quiescent_state()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-check-quiescent-state}.
However, this would be a case of a quiescent state occurring
in the \emph{current} grace period, which would be totally
legitimate to report against the current grace period.
So this case is correctly covered.
In the second case, where the CPU in question spent part of
the current grace period in dynticks-idle mode, note that
dynticks-idle mode is an extended quiescent state, hence
it is again permissible to report this quiescent state against
the current grace period.
In the third case, where the CPU in question spent part of the
current grace period offline, note that offline CPUs are in
an extended quiescent state, which is again permissible to
report against the current grace period.
So quiescent states from prior grace periods are never reported
against the current grace period.
\QuickQ{}
How do lines~22-23 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for cpu-quiet}
know that it is safe to promote the running CPU's RCU
callbacks?
\QuickA{}
Because the specified CPU has not yet passed through a quiescent
state, and because we hold the corresponding leaf node's lock,
we know that the current grace period cannot possibly have
ended yet.
Therefore, there is no danger that any of the callbacks currently
queued were registered after the next grace period started, given
that they have already been queued and the next grace period
has not yet started.
\QuickQ{}
Given that argument \co{mask} on line 2 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for cpu-quiet-msk}
is an unsigned long, how can it possibly deal with systems
with more than 64 CPUs?
\QuickA{}
Because \co{mask} is specific to the specified leaf \co{rcu_node}
structure, it need only be large enough to represent the
CPUs corresponding to that particular \co{rcu_node} structure.
Since at most 64 CPUs may be associated with a given
\co{rcu_node} structure (32 CPUs on 32-bit systems),
the unsigned long \co{mask} argument suffices.
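A hedged sketch of the resulting bit manipulation and upward
propagation follows.
It is modeled on the description above rather than copied from the
kernel, the \co{->parent} pointer is assumed, and per-node locking is
elided for clarity:
{\small
\begin{verbatim}
/* Clear this CPU's (or child's) bit in the current leaf.
 * (Per-node locking elided for clarity.) */
rnp->qsmask &= ~mask;
while (rnp->qsmask == 0 && rnp->parent != NULL) {
    /* Everyone reported: move one level up the hierarchy. */
    mask = rnp->grpmask;   /* our bit within the parent */
    rnp = rnp->parent;
    rnp->qsmask &= ~mask;  /* an unsigned long again suffices */
}
\end{verbatim}
}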
\QuickQ{}
How do RCU callbacks on dynticks-idle or offline CPUs
get invoked?
\QuickA{}
They don't.
CPUs with RCU callbacks are not permitted to enter dynticks-idle
mode, so dynticks-idle CPUs never have RCU callbacks.
When CPUs go offline, their RCU callbacks are migrated to
an online CPU, so offline CPUs never have RCU callbacks, either.
Thus, there is no need to invoke callbacks on dynticks-idle
or offline CPUs.
\QuickQ{}
Why would lines~14-17 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-do-batch}
need to adjust the tail pointers?
\QuickA{}
If any of the tail pointers reference the last callback
in the sublist that was ready to invoke, they must be
changed to instead reference the \co{->nxtlist} pointer.
This situation occurs when the sublists
immediately following the ready-to-invoke sublist are empty.
\QuickQ{}
But how does the code in
Figure~\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode}
handle nested NMIs?
\QuickA{}
It does not have to handle nested NMIs, because NMIs do not nest.
\QuickQ{}
Why isn't there a memory barrier between lines~8 and 9 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for dyntick-save-progress-counter}?
Couldn't this cause the code to fetch even-numbered values
from both the \co{->dynticks} and \co{->dynticks_nmi} fields,
even though these two fields never were zero at the same time?
\QuickA{}
First, review the code in
Figures~\ref{fig:app:rcuimpl:rcutreewt:Entering and Exiting Dyntick-Idle Mode},
\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode}, and
\ref{fig:app:rcuimpl:rcutreewt:Interrupts from Dyntick-Idle Mode},
and note that \co{dynticks} and \co{dynticks_nmi} will never
have odd values simultaneously (see especially lines~6 and 17 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode},
and recall that interrupts cannot happen from NMIs).
Of course, given the placement of the memory barriers in these
functions, it might \emph{appear} to another CPU that both
counters were odd at the same time, but logically this cannot
happen, and would indicate that the CPU had in fact passed
through dynticks-idle mode.
Now, let's suppose that at the time line~8 fetches \co{->dynticks},
the value of \co{->dynticks_nmi} was an odd number, and that at the
time line~9 fetches \co{->dynticks_nmi}, the value of
\co{->dynticks} was an odd number.
Given that both counters cannot be odd simultaneously, there must
have been a time between these two fetches when both counters
were even, and thus a time when the CPU was in dynticks-idle
mode, which is a quiescent state, as required.
So, why can't the \co{&&} on line~13 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for dyntick-save-progress-counter}
be replaced with an \co{==}?
Well, it could be, but this would likely be more confusing
than helpful.
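The check under discussion can be sketched as follows; \co{rdtp} is
assumed to point to the CPU's dynticks state, and only the even/even
test is shown:
{\small
\begin{verbatim}
int snap, snap_nmi, in_dyntick_qs;

snap = rdtp->dynticks;
/* No barrier needed here, per the argument above. */
snap_nmi = rdtp->dynticks_nmi;

/* Both even: the CPU was (or is) in dyntick-idle mode. */
in_dyntick_qs = ((snap & 0x1) == 0) &&
                ((snap_nmi & 0x1) == 0);
\end{verbatim}
}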
\QuickQ{}
Why wait the extra couple jiffies on lines~12-13 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for check-cpu-stall}?
\QuickA{}
This added delay gives the offending CPU a better chance of
reporting on itself, thus getting a decent stack trace of
the stalled code.
Of course, if the offending CPU is spinning with interrupts
disabled, it will never report on itself, so other CPUs
do so after a short delay.
\QuickQ{}
What prevents the grace period from ending before the
stall warning is printed in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for print-cpu-stall}?
\QuickA{}
The caller checked that this CPU still had not reported a
quiescent state, and because preemption is disabled, there is
no way that a quiescent state could have been reported in
the meantime.
\QuickQ{}
Why does \co{print_other_cpu_stall()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for print-other-cpu-stall}
need to check for the grace period ending when
\co{print_cpu_stall()} did not?
\QuickA{}
The other CPUs might pass through a quiescent state at any time,
so the grace period might well have ended in the meantime.
\QuickQ{}
Why is it important that blocking primitives
called from within a preemptible-RCU read-side critical section be
subject to priority inheritance?
\QuickA{}
Because blocked readers stall RCU grace periods,
which can result in OOM.
For example, if a reader did a \co{wait_event()} within
an RCU read-side critical section, and that event never occurred,
then RCU grace periods would stall indefinitely, guaranteeing that
the system would OOM sooner or later.
There must therefore be some way to cause these readers to progress
through their read-side critical sections in order to avoid such OOMs.
Priority boosting is one way to force such progress, but only if
readers are restricted to blocking such that they can be awakened via
priority boosting.
Of course, there are other methods besides priority inheritance
that handle the priority inversion problem, including priority ceiling,
preemption disabling, and so on.
However, there are good reasons why priority inheritance is the approach
used in the Linux kernel, so this is what is used for RCU.
\QuickQ{}
Could the prohibition against using primitives
that would block in a non-\co{CONFIG_PREEMPT} kernel be lifted,
and if so, under what conditions?
\QuickA{}
If testing and benchmarking demonstrated that the
preemptible RCU worked well enough that classic RCU could be dispensed
with entirely, and if priority inheritance was implemented for blocking
synchronization primitives
such as \co{semaphore}s, then those primitives could be
used in RCU read-side critical sections.
\QuickQ{}
How is it possible for lines~38-43 of
\co{__rcu_advance_callbacks()} to be executed when
lines~7-37 have not?
Won't they both be executed just after a counter flip, and
never at any other time?
\QuickA{}
Consider the following sequence of events:
\begin{enumerate}
\item CPU 0 executes lines~5-12 of
\co{rcu_try_flip_idle()}.
\item CPU 1 executes \co{__rcu_advance_callbacks()}.
Because \co{rcu_ctrlblk.completed} has been
incremented, lines~7-37 execute.
However, none of the \co{rcu_flip_flag} variables
have been set, so lines~38-43 do \emph{not} execute.
\item CPU 0 executes lines~13-15 of
\co{rcu_try_flip_idle()}.
\item Later, CPU 1 again executes \co{__rcu_advance_callbacks()}.
The counter has not been incremented since the earlier
execution, but the \co{rcu_flip_flag} variables have
all been set, so only lines~38-43 are executed.
\end{enumerate}
\QuickQ{}
What problems could arise if the lines containing
\co{ACCESS_ONCE()} in \co{rcu_read_unlock()}
were reordered by the compiler?
\QuickA{}
\begin{enumerate}
\item If the \co{ACCESS_ONCE()} were omitted from the
fetch of \co{rcu_flipctr_idx} (line~14), then the compiler
would be within its rights to eliminate \co{idx}.
It would also be free to compile the \co{rcu_flipctr}
decrement as a fetch-increment-store sequence, separately
fetching \co{rcu_flipctr_idx} for both the fetch and
the store.
If an NMI were to occur between the fetch and the store, and
if the NMI handler contained an \co{rcu_read_lock()},
then the value of \co{rcu_flipctr_idx} would change
in the meantime, resulting in corruption of the
\co{rcu_flipctr} values, destroying the ability
to correctly identify grace periods.
\item Another failure that could result from omitting the
\co{ACCESS_ONCE()} from line~14 is due to
the compiler reordering this statement to follow the
decrement of \co{rcu_read_lock_nesting}
(line~16).
In this case, if an NMI were to occur between these two
statements, then any \co{rcu_read_lock()} in the
NMI handler could corrupt \co{rcu_flipctr_idx},
causing the wrong \co{rcu_flipctr} to be
decremented.
As with the analogous situation in \co{rcu_read_lock()},
this could result in premature grace-period termination,
an indefinite grace period, or even both.
\item If \co{ACCESS_ONCE()} macros were omitted such that
the update of \co{rcu_read_lock_nesting} could be
interchanged by the compiler with the decrement of
\co{rcu_flipctr}, and if an NMI occurred in between,
any \co{rcu_read_lock()} in the NMI handler would
incorrectly conclude that it was protected by an enclosing
\co{rcu_read_lock()}, and fail to increment the
\co{rcu_flipctr} variables.
\end{enumerate}
It is not clear that the \co{ACCESS_ONCE()} on the
fetch of \co{rcu_read_lock_nesting} (line~7) is required.
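For reference, the classical kernel definition of \co{ACCESS_ONCE()}
is a volatile cast, which forbids the compiler from refetching,
omitting, or reordering the marked access relative to other volatile
accesses.
The usage line below is illustrative rather than an exact quote of
\co{rcu_read_unlock()}:
{\small
\begin{verbatim}
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/* Illustrative use: a single, non-refetchable load. */
idx = ACCESS_ONCE(rcu_flipctr_idx);
\end{verbatim}
}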
\QuickQ{}
What problems could arise if the lines containing
\co{ACCESS_ONCE()} in \co{rcu_read_unlock()}
were reordered by the CPU?
\QuickA{}
Absolutely none! The code in \co{rcu_read_unlock()}
interacts with the scheduling-clock interrupt handler
running on the same CPU, and is thus insensitive to reorderings
because CPUs always see their own accesses as if they occurred
in program order.
Other CPUs do access the \co{rcu_flipctr}, but because these
other CPUs don't access any of the other variables, ordering is
irrelevant.
\QuickQ{}
What problems could arise in
\co{rcu_read_unlock()} if irqs were not disabled?
\QuickA{}
\begin{enumerate}
\item Disabling irqs has the side effect of disabling preemption.
Suppose that this code were to be preempted in the midst
of line~17 between selecting the current CPU's copy
of the \co{rcu_flipctr} array and the decrement of
the element indicated by \co{rcu_flipctr_idx}.
Execution might well resume on some other CPU.
If this resumption happened concurrently with an
\co{rcu_read_lock()} or \co{rcu_read_unlock()}
running on the original CPU,
an increment or decrement might be lost, resulting in either
premature termination of a grace period, indefinite extension
of a grace period, or even both.
\item Failing to disable preemption can also defeat RCU priority
boosting, which relies on \co{rcu_read_lock_nesting}
to determine which tasks to boost.
If preemption occurred between the update of
\co{rcu_read_lock_nesting} (line~16) and of
\co{rcu_flipctr} (line~17), then a grace
period might be stalled until this task resumed.
But because the RCU priority booster has no way of knowing
that this particular task is stalling grace periods, needed
boosting will never occur.
Therefore, if there are CPU-bound realtime tasks running,
the preempted task might never resume, stalling grace periods
indefinitely, and eventually resulting in OOM.
\end{enumerate}
Of course, both of these situations could be handled by disabling
preemption rather than disabling irqs.
(The CPUs I have access to do not show much difference between these
two alternatives, but others might.)
\QuickQ{}
Suppose that the irq disabling in
\co{rcu_read_lock()} was replaced by preemption disabling.
What effect would that have on \co{GP_STAGES}?
\QuickA{}
No finite value of \co{GP_STAGES} suffices.
The following scenario, courtesy of Oleg Nesterov, demonstrates this:
Suppose that low-priority Task~A has executed
\co{rcu_read_lock()} on CPU 0,
and thus has incremented \co{per_cpu(rcu_flipctr, 0)[0]},
which thus has a value of one.
Suppose further that Task~A is now preempted indefinitely.
Given this situation, consider the following sequence of events:
\begin{enumerate}
\item Task~B starts executing \co{rcu_read_lock()}, also on
CPU 0, picking up the low-order bit of
\co{rcu_ctrlblk.completed}, which is still equal to zero.
\item Task~B is interrupted by a sufficient number of scheduling-clock
interrupts to allow the current grace-period stage to complete,
and also by sufficiently long-running interrupts to allow the
RCU grace-period state machine to advance the
\co{rcu_ctrlblk.completed} counter so that its bottom bit
is now equal to one and all CPUs have acknowledged this
increment operation.
\item CPU 1 starts summing the index==0 counters, starting with
\co{per_cpu(rcu_flipctr, 0)[0]}, which is equal to one
due to Task~A's increment.
CPU 1's local variable \co{sum} is therefore equal to one.
\item Task~B returns from interrupt, resuming its execution of
\co{rcu_read_lock()}, incrementing
\co{per_cpu(rcu_flipctr, 0)[0]}, which now has a value
of two.
\item Task~B is migrated to CPU 2.
\item Task~B completes its RCU read-side critical section, and
executes \co{rcu_read_unlock()}, which decrements
\co{per_cpu(rcu_flipctr, 2)[0]}, which is now -1.
\item CPU 1 now adds \co{per_cpu(rcu_flipctr, 1)[0]} and
\co{per_cpu(rcu_flipctr, 2)[0]} to its
local variable \co{sum}, obtaining the value zero.
\item CPU 1 then incorrectly concludes that all prior RCU read-side
critical sections have completed, and advances to the next
RCU grace-period stage.
This means that some other task might well free up data
structures that Task~A is still using!
\end{enumerate}
This sequence of events could repeat indefinitely, so that no finite
value of \co{GP_STAGES} could prevent disrupting Task~A.
This sequence of events demonstrates the importance of the promise
made by CPUs that acknowledge an increment of
\co{rcu_ctrlblk.completed}, as the problem illustrated by the
above sequence of events is caused by Task~B's repeated failure
to honor this promise.
Therefore, more-pervasive changes to the grace-period state will be
required in order for \co{rcu_read_lock()} to be able to safely
dispense with irq disabling.
\QuickQ{}
Why can't the \co{rcu_dereference()}
precede the memory barrier?
\QuickA{}
Because the memory barrier is being executed in
an interrupt handler, and interrupts are exact in the sense that
a single value of the PC is saved upon interrupt, so that the
interrupt occurs at a definite place in the code.
Therefore, if the
\co{rcu_dereference()} were to precede the memory barrier,
the interrupt would have had to have occurred after the
\co{rcu_dereference()}, and therefore
the interrupt would also have had to have occurred after the
\co{rcu_read_lock()} that begins the RCU read-side critical
section.
This would have forced the \co{rcu_read_lock()} to use
the earlier value of the grace-period counter, which would in turn
have meant that the corresponding \co{rcu_read_unlock()}
would have had to precede the first ``Old counters zero [0]'' rather
than the second one.
This in turn would have meant that the read-side critical section
would have been much shorter --- which would have been
counter-productive,
given that the point of this exercise was to identify the longest
possible RCU read-side critical section.
\QuickQ{}
What is a more precise way to say ``CPU~0
might see CPU~1's increment as early as CPU~1's last previous
memory barrier''?
\QuickA{}
First, it is important to note that the problem with
the less-precise statement is that it gives the impression that there
might be a single global timeline, which there is not, at least not for
popular microprocessors.
Second, it is important to note that memory barriers are all about
perceived ordering, not about time.
Finally, a more precise way of stating the above statement would be as
follows: ``If CPU~0 loads the value resulting from CPU~1's
increment, then any subsequent load by CPU~0 will see the
values from any relevant stores by CPU~1 if these stores
preceded CPU~1's last prior memory barrier.''
Even this more-precise version leaves some wiggle room.
The word ``subsequent'' must be understood to mean ``ordered after'',
either by an explicit memory barrier or by the CPU's underlying
memory ordering.
In addition, the memory barriers must be strong enough to order
the relevant operations.
For example, CPU~1's last prior memory barrier must order stores
(for example, \co{smp_wmb()} or \co{smp_mb()}).
Similarly, if CPU~0 needs an explicit memory barrier to
ensure that its later load follows the one that saw the increment,
then this memory barrier needs to be an \co{smp_rmb()}
or \co{smp_mb()}.
In general, much care is required when proving parallel algorithms.
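The guarantee can also be expressed as the following hedged two-CPU
sketch in the style of a message-passing test; \co{x}, \co{counter},
\co{r1}, and \co{r2} are illustrative variables, not kernel symbols:
{\small
\begin{verbatim}
int x, counter;  /* shared */
int r1, r2;      /* CPU 0's local results */

void cpu1(void)  /* the incrementing CPU */
{
    x = 1;       /* relevant store */
    smp_wmb();   /* CPU 1's last prior memory barrier */
    counter++;   /* the increment */
}

void cpu0(void)  /* the observing CPU */
{
    r1 = counter; /* suppose this sees CPU 1's increment */
    smp_rmb();    /* orders CPU 0's two loads */
    r2 = x;       /* then r2 is guaranteed to equal 1 */
}
\end{verbatim}
}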