\QuickQAC{chp:Introduction}{Introduction}
\QuickQ{}
Come on now!!!
Parallel programming has been known to be exceedingly
hard for many decades.
You seem to be hinting that it is not so hard.
What sort of game are you playing?
\QuickA{}
If you really believe that parallel programming is exceedingly
hard, then you should have a ready answer to the question
``Why is parallel programming hard?''
One could list any number of reasons, ranging from deadlocks to
race conditions to testing coverage, but the real answer is that
{\em it is not really all that hard}.
After all, if parallel programming were really so horribly difficult,
how could a large number of open-source projects, ranging from Apache
to MySQL to the Linux kernel, have managed to master it?
A better question might be: ``Why is parallel programming {\em
perceived} to be so difficult?''
To see the answer, let's go back to the year 1991.
Paul McKenney was walking across the parking lot to Sequent's
benchmarking center carrying six dual-80486 Sequent Symmetry CPU
boards, when he suddenly realized that he was carrying several
times the price of the house he had just purchased.\footnote{
Yes, this sudden realization {\em did} cause him to walk quite
a bit more carefully.
Why do you ask?}
This high cost of parallel systems meant that
parallel programming was restricted to a privileged few who
worked for an employer who either manufactured or could afford to
purchase machines costing upwards of \$100,000 --- in 1991 US dollars.
In contrast, in 2006, Paul finds himself typing these words on a
dual-core x86 laptop.
Unlike the dual-80486 CPU boards, this laptop also contains
2GB of main memory, a 60GB disk drive, a display, Ethernet,
USB ports, wireless, and Bluetooth.
And the laptop is more than an order of magnitude cheaper than
even one of those dual-80486 CPU boards, even before taking inflation
into account.
Parallel systems have truly arrived.
They are no longer the sole domain of a privileged few, but something
available to almost everyone.
The earlier restricted availability of parallel hardware is
the \emph{real} reason that parallel programming is considered
so difficult.
After all, it is quite difficult to learn to program even the simplest
machine if you have no access to it.
Since the age of rare and expensive parallel machines is for the most
part behind us, the age during which
parallel programming is perceived to be mind-crushingly difficult is
coming to a close.\footnote{
Parallel programming is in some ways more difficult than
sequential programming; for example, parallel validation
is more difficult.
But no longer mind-crushingly difficult.}
\QuickQ{}
How could parallel programming \emph{ever} be as easy
as sequential programming?
\QuickA{}
It depends on the programming environment.
SQL~\cite{DIS9075SQL92} is an underappreciated success
story, as it permits programmers who know nothing about parallelism
to keep a large parallel system productively busy.
We can expect more variations on this theme as parallel
computers continue to become cheaper and more readily available.
For example, one possible contender in the scientific and
technical computing arena is MATLAB*P,
which is an attempt to automatically parallelize common
matrix operations.
Finally, on Linux and UNIX systems, consider the following
shell command:
{\small \tt get\_input | grep "interesting" | sort}
This shell pipeline runs the \co{get_input}, \co{grep},
and \co{sort} processes in parallel.
There, that wasn't so hard, now was it?
\QuickQ{}
Oh, really???
What about correctness, maintainability, robustness, and so on?
\QuickA{}
These are important goals, but they are just as important for
sequential programs as they are for parallel programs.
Therefore, important though they are, they do not belong on
a list specific to parallel programming.
\QuickQ{}
And if correctness, maintainability, and robustness don't
make the list, why do productivity and generality?
\QuickA{}
Given that parallel programming is perceived to be much harder
than is sequential programming, productivity is paramount and
therefore must not be omitted.
Furthermore, high-productivity parallel-programming environments
such as SQL have been special purpose, hence generality must
also be added to the list.
\QuickQ{}
Given that parallel programs are much harder to prove
correct than are sequential programs, again, shouldn't
correctness \emph{really} be on the list?
\QuickA{}
From an engineering standpoint, the difficulty in proving
correctness, either formally or informally, would be important
insofar as it impacts the primary goal of productivity.
So, in cases where correctness proofs are important, they
are subsumed under the ``productivity'' rubric.
\QuickQ{}
What about just having fun?
\QuickA{}
Having fun is important as well, but, unless you are a hobbyist,
would not normally be a \emph{primary} goal.
On the other hand, if you \emph{are} a hobbyist, go wild!
\QuickQ{}
Are there no cases where parallel programming is about something
other than performance?
\QuickA{}
There are certainly cases where the problem to be solved is
inherently parallel, for example, Monte Carlo methods and
some numerical computations.
Even in these cases, however, there will be some amount of
extra work managing the parallelism.
\QuickQ{}
Why all this prattling on about non-technical issues???
And not just \emph{any} non-technical issue, but \emph{productivity}
of all things?
Who cares?
\QuickA{}
If you are a pure hobbyist, perhaps you don't need to care.
But even pure hobbyists will often care about how much they
can get done, and how quickly.
After all, the most popular hobbyist tools are usually those
that are the best suited for the job, and an important part of
the definition of ``best suited'' involves productivity.
And if someone is paying you to write parallel code, they will
very likely care deeply about your productivity.
And if the person paying you cares about something, you would
be most wise to pay at least some attention to it!
Besides, if you \emph{really} didn't care about productivity,
you would be doing it by hand rather than using a computer!
\QuickQ{}
Given how cheap parallel hardware has become, how can anyone
afford to pay people to program it?
\QuickA{}
There are a number of answers to this question:
\begin{enumerate}
\item Given a large computational cluster of parallel machines,
the aggregate cost of the cluster can easily justify
substantial developer effort, because the development
cost can be spread over the large number of machines.
\item Popular software that is run by tens of millions of users
can easily justify substantial developer effort,
as the cost of this development can be spread over the tens
of millions of users.
Note that this includes things like kernels and system
libraries.
\item If the low-cost parallel machine is controlling the operation
of a valuable piece of equipment, then the cost of this
piece of equipment might easily justify substantial
developer effort.
\item If the software for the low-cost parallel machine produces an
extremely valuable result (e.g., mineral exploration),
then the valuable result might again justify substantial
developer cost.
\item Safety-critical systems protect lives, which can clearly
justify very large developer effort.
\item Hobbyists and researchers might seek knowledge, experience,
fun, or glory rather than mere money.
\end{enumerate}
So it is not the case that the decreasing cost of hardware renders
software worthless, but rather that it is no longer possible to
``hide'' the cost of software development within the cost of
the hardware, at least not unless there are extremely large
quantities of hardware.
\QuickQ{}
This is a ridiculously unachievable ideal!
Why not focus on something that is achievable in practice?
\QuickA{}
This is eminently achievable.
The cellphone is a computer that can be used to make phone
calls and to send and receive text messages with little or
no programming or configuration on the part of the end user.
This might seem to be a trivial example at first glance,
but if you consider it carefully you will see that it is
both simple and profound.
When we are willing to sacrifice generality, we can achieve
truly astounding increases in productivity.
Those who cling to generality will therefore fail to set
the productivity bar high enough to succeed in production
environments.
\QuickQ{}
What other bottlenecks might prevent additional CPUs from
providing additional performance?
\QuickA{}
There are any number of potential bottlenecks:
\begin{enumerate}
\item Main memory. If a single thread consumes all available
memory, additional threads will simply page themselves
silly.
\item Cache. If a single thread's cache footprint completely
fills any shared CPU cache(s), then adding more threads
will simply thrash the affected caches.
\item Memory bandwidth. If a single thread consumes all available
memory bandwidth, additional threads will simply
result in additional queuing on the system interconnect.
\item I/O bandwidth. If a single thread is I/O bound,
adding more threads will simply result in them all
waiting in line for the affected I/O resource.
\end{enumerate}
Specific hardware systems might have any number of additional
bottlenecks.
\QuickQ{}
What besides CPU cache capacity might require limiting the
number of concurrent threads?
\QuickA{}
There are any number of potential limits on the number of
threads:
\begin{enumerate}
\item Main memory. Each thread consumes some memory
(for its stack if nothing else), so that excessive
numbers of threads can exhaust memory, resulting
in excessive paging or memory-allocation failures.
\item I/O bandwidth. If each thread initiates a given
amount of mass-storage I/O or networking traffic,
excessive numbers of threads can result in excessive
I/O queuing delays, again degrading performance.
Some networking protocols may be subject to timeouts
or other failures if there are so many threads that
networking events cannot be responded to in a timely
fashion.
\item Synchronization overhead.
For many synchronization protocols, excessive numbers
of threads can result in excessive spinning, blocking,
or rollbacks, thus degrading performance.
\end{enumerate}
Specific applications and platforms may have any number of additional
limiting factors.
\QuickQ{}
Are there any other obstacles to parallel programming?
\QuickA{}
There are a great many other potential obstacles to parallel
programming.
Here are a few of them:
\begin{enumerate}
\item The only known algorithms for a given project might
be inherently sequential in nature.
In this case, either avoid parallel programming
(there being no law saying that your project \emph{has}
to run in parallel) or invent a new parallel algorithm.
\item The project allows binary-only plugins that share the same
address space, such that no one developer has access to
all of the source code for the project.
Because many parallel bugs, including deadlocks, are
global in nature, such binary-only plugins pose a severe
challenge to current software development methodologies.
This might well change, but for the time being, all
developers of parallel code sharing a given address space
need to be able to see \emph{all} of the code running in
that address space.
\item The project contains heavily used APIs that were designed
without regard to
parallelism~\cite{HagitAttiya2011LawsOfOrder}.
Some of the more ornate features of the System V
message-queue API form a case in point.
Of course, if your project has been around for a few
decades, and if its developers did not have access to
parallel hardware, your project undoubtedly has at least
its share of such APIs.
\item The project was implemented without regard to parallelism.
Given that there are a great many techniques that work
extremely well in a sequential environment, but that
fail miserably in parallel environments, if your project
ran only on sequential hardware for most of its lifetime,
then your project undoubtedly has at least its share of
parallel-unfriendly code.
\item The project was implemented without regard to good
software-development practice.
The cruel truth is that shared-memory parallel
environments are often much less forgiving of sloppy
development practices than are sequential environments.
You may be well-served to clean up the existing design
and code prior to attempting parallelization.
\item The people who originally did the development on your
project have since moved on, and the people remaining,
while well able to maintain it or add small features,
are unable to make ``big animal'' changes.
In this case, unless you can work out a very simple
way to parallelize your project, you will probably
be best off leaving it sequential.
That said, there are a number of simple approaches that
you might use
to parallelize your project, including running multiple
instances of it, using a parallel implementation of
some heavily used library function, or making use of
some other parallel project, such as a database.
\end{enumerate}
One can argue that many of these obstacles are non-technical
in nature, but that does not make them any less real.
In short, parallelization of a large body of code
can be a large and complex effort.
As with any large and complex effort, it makes sense to
do your homework beforehand.
\QuickQ{}
Where are the answers to the Quick Quizzes found?
\QuickA{}
In Appendix~\ref{chp:Answers to Quick Quizzes} starting on
page~\pageref{chp:Answers to Quick Quizzes}.
Hey, I thought I owed you an easy one!
\QuickQ{}
Some of the Quick Quiz questions seem to be from the viewpoint
of the reader rather than the author.
Is that really the intent?
\QuickA{}
Indeed it is!
Many are questions that Paul E. McKenney would probably have
asked if he were a novice student in a class covering this material.
It is worth noting that Paul was taught most of this material by
parallel hardware and software, not by professors.
In Paul's experience, professors are much more likely to provide
answers to verbal questions than are parallel systems,
Watson notwithstanding.
Of course, we could have a lengthy debate over which of professors
or parallel systems provide the most useful answers to these sorts
of questions,
but for the time being let's just agree that usefulness of
answers varies widely across the population both of professors
and of parallel systems.
Other quizzes are quite similar to actual questions that have been
asked during conference presentations and lectures covering the
material in this book.
A few others are from the viewpoint of the author.
\QuickQ{}
These Quick Quizzes just are not my cup of tea.
What do you recommend?
\QuickA{}
There are a number of alternatives available to you:
\begin{enumerate}
\item Just ignore the Quick Quizzes and read the rest of
the book.
You might miss out on the interesting material in
some of the Quick Quizzes, but the rest of the book
has lots of good material as well.
This is an eminently reasonable approach if your main
goal is to gain a general understanding of the material
or if you are skimming through the book to find a
solution to a specific problem.
\item If you find the Quick Quizzes distracting but impossible
to ignore, you can always clone the \LaTeX{} source for
this book from the git archive.
Then modify \co{Makefile} and \co{qqz.sty} to eliminate
the Quick Quizzes from the PDF output.
Alternatively, you could modify these two files so as
to pull the answers inline, immediately following
the questions.
\item Look at the answer immediately rather than investing
a large amount of time in coming up with your own
answer.
This approach is reasonable when a given Quick Quiz's
answer holds the key to a specific problem you are
trying to solve.
This approach is also reasonable if you want a somewhat
deeper understanding of the material, but when you do not
expect to be called upon to generate parallel solutions given
only a blank sheet of paper.
\item If you prefer a more academic and rigorous treatment of
parallel programming,
you might like Herlihy's and Shavit's
textbook~\cite{HerlihyShavit2008Textbook}.
This book starts with an interesting combination
of low-level primitives at high levels of abstraction
from the hardware, and works its way through locking
and simple data structures including lists, queues,
hash tables, and counters, culminating with transactional
memory.
Michael Scott's textbook~\cite{MichaelScott2013Textbook}
approaches similar material with more of a
software-engineering focus, and as far as I know, is
the first formally published textbook to include a
section devoted to RCU.
\item If you would like an academic treatment of parallel
programming from a programming-language-pragmatics viewpoint,
you might be interested in the concurrency chapter from Scott's
textbook~\cite{MichaelScott2006Textbook}
on programming languages.
\item If you are interested in an object-oriented patternist
treatment of parallel programming focusing on C++,
you might try Volumes~2 and 4 of Schmidt's POSA
series~\cite{SchmidtStalRohnertBuschmann2000v2Textbook,
BuschmannHenneySchmidt2007v4Textbook}.
Volume 4 in particular has some interesting chapters
applying this work to a warehouse application.
The realism of this example is attested to by
the section entitled ``Partitioning the Big Ball of Mud'',
wherein the problems inherent in parallelism often
take a back seat to the problems inherent in getting
one's head around a real-world application.
\item If your primary focus is scientific and technical computing,
and you prefer a patternist approach,
you might try Mattson et al.'s
textbook~\cite{Mattson2005Textbook}.
It covers Java, C/C++, OpenMP, and MPI.
Its patterns are admirably focused first on design,
then on implementation.
\item If you are interested in POSIX Threads, you might take
a look at David R. Butenhof's book~\cite{Butenhof1997pthreads}.
\item If you are interested in C++, but in a Windows environment,
you might try Herb Sutter's ``Effective Concurrency''
series in
Dr.~Dobb's Journal~\cite{HerbSutter2008EffectiveConcurrency}.
This series does a reasonable job of presenting a
commonsense approach to parallelism.
\item If you want to try out Intel Threading Building Blocks,
then perhaps James Reinders's book~\cite{Reinders2007Textbook}
is what you are looking for.
\item Finally, those preferring to work in Java might be
well-served by Doug Lea's
textbooks~\cite{DougLea1997Textbook,Goetz2007Textbook}.
\end{enumerate}
In contrast, this book meshes real-world machines with real-world
algorithms.
If your sole goal is to find (say) an optimal parallel queue, you
might be better served by one of the above books.
However, if you are interested in principles of parallel design
that allow multiple such queues to operate in parallel, read on!
Coming back to the topic of Quick Quizzes, if you need a deep
understanding of the material, then you might well need to
learn to tolerate the Quick Quizzes.
Don't get me wrong, passively reading the material can be quite
valuable, but gaining full problem-solving capability really
does require that you practice solving problems.
I learned this the hard way during coursework for my late-in-life
Ph.D.
I was studying a familiar topic, and was surprised at how few of
the chapter's exercises I could solve off the top of my head.
Forcing myself to answer the questions greatly increased my
retention of the material.
So with these Quick Quizzes I am not asking you to do anything
that I have not been doing myself!
\QuickQAC{chp:Hardware and its Habits}{Hardware and its Habits}
\QuickQ{}
Why should parallel programmers bother learning low-level
properties of the hardware?
Wouldn't it be easier, better, and more general to remain at
a higher level of abstraction?
\QuickA{}
It might well be easier to ignore the detailed properties of
the hardware, but in most cases it would be quite foolish
to do so.
If you accept that the only purpose of parallelism is to
increase performance, and if you further accept that
performance depends on detailed properties of the hardware,
then it logically follows that parallel programmers are going
to need to know at least a few hardware properties.
This is the case in most engineering disciplines.
Would \emph{you} want to use a bridge designed by an
engineer who did not understand the properties of
the concrete and steel making up that bridge?
If not, why would you expect a parallel programmer to be
able to develop competent parallel software without at least
\emph{some} understanding of the underlying hardware?
\QuickQ{}
What types of machines would allow atomic operations on
multiple data elements?
\QuickA{}
One answer to this question is that it is often possible to
pack multiple elements of data into a single machine word,
which can then be manipulated atomically.
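For example, here is a minimal sketch (the names \co{packed} and
\co{add_both} are made up for this illustration) that packs two
16-bit counters into a single 32-bit word and updates both
atomically using a compare-and-swap loop:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdint.h>

/* High 16 bits hold counter "a", low 16 bits hold counter "b". */
static uint32_t packed;

static void add_both(uint16_t da, uint16_t db)
{
  uint32_t old, newval;

  do {
    old = packed;
    newval = ((uint32_t)(((old >> 16) + da) & 0xffff) << 16) |
             (((old & 0xffff) + db) & 0xffff);
  } while (!__sync_bool_compare_and_swap(&packed, old, newval));
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Both fields wrap at 16 bits, and a failed compare-and-swap simply
retries with a freshly loaded value.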
A more trendy answer would be machines supporting transactional
memory~\cite{DBLomet1977SIGSOFT}.
However, such machines are still research curiosities,
although as of early 2012 it appears that commodity systems
supporting limited forms of hardware transactional memory
will be commercially available within a couple of years.
The jury is still out on the applicability of software transactional
memory~\cite{McKenney2007PLOSTM,DonaldEPorter2007TRANSACT,
ChistopherJRossbach2007a,CalinCascaval2008tmtoy,
AleksandarDragovejic2011STMnotToy,AlexanderMatveev2012PessimisticTM}.
\QuickQ{}
So have CPU designers also greatly reduced the overhead of
cache misses?
\QuickA{}
Unfortunately, not so much.
There has been some reduction given constant numbers of CPUs,
but the finite speed of light and the atomic nature of
matter limit their ability to reduce cache-miss overhead
for larger systems.
Section~\ref{sec:cpu:Hardware Free Lunch?}
discusses some possible avenues for future progress.
\QuickQ{}
This is a \emph{simplified} sequence of events?
How could it \emph{possibly} be any more complex?
\QuickA{}
This sequence ignored a number of possible complications,
including:
\begin{enumerate}
\item Other CPUs might be concurrently attempting to perform
CAS operations involving this same cacheline.
\item The cacheline might have been replicated read-only in
several CPUs' caches, in which case, it would need to
be flushed from their caches.
\item CPU~7 might have been operating on the cache line when
the request for it arrived, in which case CPU~7 might
need to hold off the request until its own operation
completed.
\item CPU~7 might have ejected the cacheline from its cache
(for example, in order to make room for other data),
so that by the time that the request arrived, the
cacheline was on its way to memory.
\item A correctable error might have occurred in the cacheline,
which would then need to be corrected at some point before
the data was used.
\end{enumerate}
Production-quality cache-coherence mechanisms are extremely
complicated due to these sorts of considerations.
\QuickQ{}
Why is it necessary to flush the cacheline from CPU~7's cache?
\QuickA{}
If the cacheline was not flushed from CPU~7's cache, then
CPUs~0 and 7 might have different values for the same set
of variables in the cacheline.
This sort of incoherence would greatly complicate parallel
software, and so hardware architects have been convinced to
avoid it.
\QuickQ{}
Surely the hardware designers could be persuaded to improve
this situation!
Why have they been content with such abysmal performance
for these single-instruction operations?
\QuickA{}
The hardware designers \emph{have} been working on this
problem, and have consulted with no less a luminary than
the physicist Stephen Hawking.
Hawking's observation was that the hardware designers have
two basic problems~\cite{BryanGardiner2007}:
\begin{enumerate}
\item the finite speed of light, and
\item the atomic nature of matter.
\end{enumerate}
\begin{table}
\centering
\begin{tabular}{l||r|r}
Operation & Cost (ns) & Ratio \\
\hline
\hline
Clock period & 0.4 & 1.0 \\
\hline
``Best-case'' CAS & 12.2 & 33.8 \\
\hline
Best-case lock & 25.6 & 71.2 \\
\hline
Single cache miss & 12.9 & 35.8 \\
\hline
CAS cache miss & 7.0 & 19.4 \\
\hline
Off-Core & & \\
\hline
Single cache miss & 31.2 & 86.6 \\
\hline
CAS cache miss & 31.2 & 86.5 \\
\hline
Off-Socket & & \\
\hline
Single cache miss & 92.4 & 256.7 \\
\hline
CAS cache miss & 95.9 & 266.4 \\
\hline
Comms Fabric & 4,500 & 7,500 \\
\hline
Global Comms & 195,000,000 & 324,000,000 \\
\end{tabular}
\caption{Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\label{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
\end{table}
The first problem limits raw speed, and the second limits
miniaturization, which in turn limits frequency.
And even this sidesteps the power-consumption issue that
is currently holding production frequencies to well below
10 GHz.
Nevertheless, some progress is being made, as may be seen
by comparing
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 16-CPU 2.8GHz Intel X5550 (Nehalem) System}
with
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}
on
page~\pageref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
Integration of hardware threads in a single core and multiple
cores on a die has improved latencies greatly, at least within the
confines of a single core or single die.
There has been some improvement in overall system latency,
but only by about a factor of two.
Unfortunately, neither the speed of light nor the atomic nature
of matter has changed much in the past few years.
Section~\ref{sec:cpu:Hardware Free Lunch?}
looks at what else hardware designers might be
able to do to ease the plight of parallel programmers.
\QuickQ{}
These numbers are insanely large!
How can I possibly get my head around them?
\QuickA{}
Get a roll of toilet paper.
In the USA, each roll will normally have somewhere around 350--500
sheets.
Tear off one sheet to represent a single clock cycle, setting it aside.
Now unroll the rest of the roll.
The resulting pile of toilet paper will likely represent a single
CAS cache miss.
For the more-expensive inter-system communications latencies,
use several rolls (or multiple cases) of toilet paper to represent
the communications latency.
Important safety tip: make sure to account for the needs of
those you live with when appropriating toilet paper!
\QuickQ{}
But individual electrons don't move anywhere near that fast,
even in conductors!!!
The electron drift velocity in a conductor under the low voltages
found in semiconductors is on the order of only one \emph{millimeter}
per second.
What gives???
\QuickA{}
Electron drift velocity tracks the long-term movement of individual
electrons.
It turns out that individual electrons bounce around quite
randomly, so that their instantaneous speed is very high, but
over the long term, they don't move very far.
In this, electrons resemble long-distance commuters, who
might spend most of their time traveling at full highway
speed, but over the long term going nowhere.
These commuters' speed might be 70 miles per hour
(113 kilometers per hour), but their long-term drift velocity
relative to the planet's surface is zero.
When designing circuitry, electrons' instantaneous speed is
often more important than their drift velocity.
When a voltage is applied to a wire, more electrons enter the
wire than leave it, but the electrons entering cause the
electrons already there to move a bit further down the wire,
which causes other electrons to move down, and so on.
The result is that the electric field moves quite quickly down
the wire.
Just as the speed of sound in air is much greater than is
the typical wind speed, the electric field propagates down
the wire at a much higher velocity than the electron drift
velocity.
\QuickQ{}
Given that distributed-systems communication is so horribly
expensive, why does anyone bother with them?
\QuickA{}
There are a number of reasons:
\begin{enumerate}
\item Shared-memory multiprocessor systems have strict size limits.
If you need more than a few thousand CPUs, you have no
choice but to use a distributed system.
\item Extremely large shared-memory systems tend to be
quite expensive and to have even longer cache-miss
latencies than does the small four-CPU system
shown in
Table~\ref{tab:cpu:Performance of Synchronization Mechanisms on 4-CPU 1.8GHz AMD Opteron 844 System}.
\item The distributed-systems communications latencies do
not necessarily consume the CPU, which can often allow
computation to proceed in parallel with message transfer.
\item Many important problems are ``embarrassingly parallel'',
so that extremely large quantities of processing may
be enabled by a very small number of messages.
SETI@HOME~\cite{SETIatHOME2008}
is but one example of such an application.
These sorts of applications can make good use of networks
of computers despite extremely long communications
latencies.
\end{enumerate}
It is likely that continued work on parallel applications will
increase the number of embarrassingly parallel applications that
can run well on machines and/or clusters having long communications
latencies.
That said, greatly reduced hardware latencies would be an
extremely welcome development.
\QuickQ{}
OK, if we are going to have to apply distributed-programming
techniques to shared-memory parallel programs, why not just
always use these distributed techniques and dispense with
shared memory?
\QuickA{}
Because it is often the case that only a small fraction of
the program is performance-critical.
Shared-memory parallelism allows us to focus distributed-programming
techniques on that small fraction, allowing simpler shared-memory
techniques to be used on the non-performance-critical bulk of
the program.
\QuickQAC{chp:Tools of the Trade}{Tools of the Trade}
\QuickQ{}
But this silly shell script isn't a \emph{real} parallel program!
Why bother with such trivia???
\QuickA{}
Because you should \emph{never} forget the simple stuff!
Please keep in mind that the title of this book is
``Is Parallel Programming Hard, And, If So, What Can You Do About It?''.
One of the most effective things you can do about it is to
avoid forgetting the simple stuff!
After all, if you choose to do parallel programming the hard
way, you have no one but yourself to blame.
\QuickQ{}
Is there a simpler way to create a parallel shell script?
If so, how? If not, why not?
\QuickA{}
One straightforward approach is the shell pipeline:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
grep $pattern1 | sed -e 's/a/b/' | sort
\end{verbatim}
\end{minipage}
\vspace{5pt}
For a sufficiently large input file,
\co{grep} will pattern-match in parallel with \co{sed}
editing and with the input processing of \co{sort}.
See the file \co{parallel.sh} for a demonstration of
shell-script parallelism and pipelining.
\QuickQ{}
But if script-based parallel programming is so easy, why
bother with anything else?
\QuickA{}
In fact, it is quite likely that a very large fraction of
parallel programs in use today are script-based.
However, script-based parallelism does have its limitations:
\begin{enumerate}
\item Creation of new processes is usually quite heavyweight,
involving the expensive \co{fork()} and \co{exec()}
system calls.
\item Sharing of data, including pipelining, typically involves
expensive file I/O.
\item The reliable synchronization primitives available to
scripts also typically involve expensive file I/O.
\end{enumerate}
These limitations require that script-based parallelism use
coarse-grained parallelism, with each unit of work having
execution time of at least tens of milliseconds, and preferably
much longer.
Those requiring finer-grained parallelism are well advised to
think hard about their problem to see if it can be expressed
in a coarse-grained form.
If not, they should consider using other parallel-programming
environments, such as those discussed in
Section~\ref{sec:toolsoftrade:POSIX Multiprocessing}.
\QuickQ{}
Why does this \co{wait()} primitive need to be so complicated?
Why not just make it work like the shell-script \co{wait} does?
\QuickA{}
Some parallel applications need to take special action when
specific children exit, and therefore need to wait for each
child individually.
In addition, some parallel applications need to detect the
reason that the child died.
As we saw in Figure~\ref{fig:toolsoftrade:Using the wait() Primitive},
it is not hard to build a \co{waitall()} function out of
the \co{wait()} function, but it would be impossible to
do the reverse.
Once the information about a specific child is lost, it is lost.
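For reference, a minimal sketch of such a \co{waitall()} (not
necessarily identical to the one in the figure) simply loops until
\co{wait()} reports that no children remain:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>

static void waitall(void)
{
  int status;

  /* wait() returns -1 once there are no children left. */
  while (wait(&status) > 0)
    continue;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}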
\QuickQ{}
Isn't there a lot more to \co{fork()} and \co{wait()}
than discussed here?
\QuickA{}
Indeed there is, and
it is quite possible that this section will be expanded in
future versions to include messaging features (such as UNIX
pipes, TCP/IP, and shared file I/O) and memory mapping
(such as \co{mmap()} and \co{shmget()}).
In the meantime, there are any number of textbooks that cover
these primitives in great detail,
and the truly motivated can read manpages, existing parallel
applications using these primitives, as well as the
source code of the Linux-kernel implementations themselves.
\QuickQ{}
If the \co{mythread()} function in
Figure~\ref{fig:toolsoftrade:Threads Created Via pthread-create() Share Memory}
can simply return, why bother with \co{pthread_exit()}?
\QuickA{}
In this simple example, there is no reason whatsoever.
However, imagine a more complex example, where \co{mythread()}
invokes other functions, possibly separately compiled.
In such a case, \co{pthread_exit()} allows these other functions
to end the thread's execution without having to pass some sort
of error return all the way back up to \co{mythread()}.
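For example, in the following sketch (the helper name is made up
for this illustration), a nested helper terminates the thread
directly instead of propagating an error code back up to
\co{mythread()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

static void deeply_nested_helper(int arg)
{
  if (arg < 0)                  /* Unrecoverable for this thread, */
    pthread_exit((void *)-1L);  /* so end the thread right here.  */
  /* ... otherwise carry on with the real work ... */
}

static void *mythread(void *arg)
{
  deeply_nested_helper((int)(long)arg);
  return (void *)0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}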
\QuickQ{}
If the C language makes no guarantees in presence of a data
race, then why does the Linux kernel have so many data races?
Are you trying to tell me that the Linux kernel is completely
broken???
\QuickA{}
Ah, but the Linux kernel is written in a carefully selected
superset of the C language that includes special gcc
extensions, such as asms, that permit safe execution even
in presence of data races.
In addition, the Linux kernel does not run on a number of
platforms where data races would be especially problematic.
For an example, consider embedded systems with 32-bit pointers
and 16-bit busses.
On such a system, a data race involving a store to and a load
from a given pointer might well result in the load returning the
low-order 16 bits of the old value of the pointer concatenated
with the high-order 16 bits of the new value of the pointer.
\QuickQ{}
What if I want several threads to hold the same lock at the
same time?
\QuickA{}
The first thing you should do is to ask yourself why you would
want to do such a thing.
If the answer is ``because I have a lot of data that is read
by many threads, and only occasionally updated'', then
POSIX reader-writer locks might be what you are looking for.
These are introduced in
Section~\ref{sec:toolsoftrade:POSIX Reader-Writer Locking}.
Another way to get the effect of multiple threads holding
the same lock is for one thread to acquire the lock, and
then use \co{pthread_create()} to create the other threads.
The question of why this would ever be a good idea is left
to the reader.
\QuickQ{}
Why not simply make the argument to \co{lock_reader()}
on line~5 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}
be a pointer to a \co{pthread_mutex_t}?
\QuickA{}
Because we will need to pass \co{lock_reader()} to
\co{pthread_create()}.
Although we could cast the function when passing it to
\co{pthread_create()}, function casts are quite a bit
uglier and harder to get right than are simple pointer casts.
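In other words, the cast happens on the data pointer rather than
on the function pointer, roughly as in the following sketch (not
the exact code from the figure):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *lock_reader(void *arg)
{
  /* Simple pointer cast recovers the lock. */
  pthread_mutex_t *lockp = arg;

  pthread_mutex_lock(lockp);
  /* ... read the shared data ... */
  pthread_mutex_unlock(lockp);
  return NULL;
}

void start_reader(void)
{
  pthread_t tid;

  /* &lock converts to void * implicitly: no function cast needed. */
  pthread_create(&tid, NULL, lock_reader, &lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}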
\QuickQ{}
Writing four lines of code for each acquisition and release
of a \co{pthread_mutex_t} sure seems painful!
Isn't there a better way?
\QuickA{}
Indeed!
And for that reason, the \co{pthread_mutex_lock()} and
\co{pthread_mutex_unlock()} primitives are normally wrapped
in functions that do this error checking.
Later on, we will wrap them with the Linux kernel
\co{spin_lock()} and \co{spin_unlock()} APIs.
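For example, a minimal sketch of such wrappers, here named to mimic
the Linux-kernel API, might simply abort on any error return:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

static void spin_lock(pthread_mutex_t *sp)
{
  int en = pthread_mutex_lock(sp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_lock: %d\n", en);
    abort();
  }
}

static void spin_unlock(pthread_mutex_t *sp)
{
  int en = pthread_mutex_unlock(sp);

  if (en != 0) {
    fprintf(stderr, "pthread_mutex_unlock: %d\n", en);
    abort();
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}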
\QuickQ{}
Is ``x = 0'' the only possible output from the code fragment
shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}?
If so, why?
If not, what other output could appear, and why?
\QuickA{}
No.
The reason that ``x = 0'' was output was that \co{lock_reader()}
acquired the lock first.
Had \co{lock_writer()} instead acquired the lock first, then
the output would have been ``x = 3''.
However, because the code fragment started \co{lock_reader()} first
and because this run was performed on a multiprocessor,
one would normally expect \co{lock_reader()} to acquire the
lock first.
However, there are no guarantees, especially on a busy system.
\QuickQ{}
Using different locks could cause quite a bit of confusion,
what with threads seeing each other's intermediate states.
So should well-written parallel programs restrict themselves
to using a single lock in order to avoid this kind of confusion?
\QuickA{}
Although it is sometimes possible to write a program using a
single global lock that both performs and scales well, such
programs are exceptions to the rule.
You will normally need to use multiple locks to attain good
performance and scalability.
One possible exception to this rule is ``transactional memory'',
which is currently a research topic.
Transactional-memory semantics can be loosely thought of as those
of a single global lock with optimizations permitted and
with the addition of rollback~\cite{HansJBoehm2009HOTPAR}.
\QuickQ{}
In the code shown in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks},
is \co{lock_reader()} guaranteed to see all the values produced
by \co{lock_writer()}?
Why or why not?
\QuickA{}
No.
On a busy system, \co{lock_reader()} might be preempted
for the entire duration of \co{lock_writer()}'s execution,
in which case it would not see \emph{any} of \co{lock_writer()}'s
intermediate states for \co{x}.
\QuickQ{}
Wait a minute here!!!
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
didn't initialize shared variable \co{x},
so why does it need to be initialized in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}?
\QuickA{}
See line~3 of
Figure~\ref{fig:toolsoftrade:Demonstration of Exclusive Locks}.
Because the code in
Figure~\ref{fig:toolsoftrade:Demonstration of Same Exclusive Lock}
ran first, it could rely on the compile-time initialization of
\co{x}.
The code in
Figure~\ref{fig:toolsoftrade:Demonstration of Different Exclusive Locks}
ran next, so it had to re-initialize \co{x}.
\QuickQ{}
Instead of using \co{ACCESS_ONCE()} everywhere, why not just
declare \co{goflag} as \co{volatile} on line~10 of
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickA{}
A \co{volatile} declaration is in fact a reasonable alternative in
this particular case.
However, use of \co{ACCESS_ONCE()} has the benefit of clearly
flagging to the reader that \co{goflag} is subject to concurrent
reads and updates.
Furthermore, \co{ACCESS_ONCE()} is especially useful in cases where
most of the accesses are protected by a lock (and thus \emph{not}
subject to change), but where a few of the accesses are made outside
of the lock.
Using a volatile declaration in this case would make it harder
for the reader to note the special accesses outside of the lock,
and would also make it harder for the compiler to generate good
code under the lock.
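For reference, \co{ACCESS_ONCE()} boils down to a per-access
volatile cast, similar to the Linux kernel's definition:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Force the compiler to emit exactly one access to x,
   without caching it in a register or refetching it. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}
This confines the volatile semantics to the individual accesses that
need them, leaving the remaining (lock-protected) accesses free of
volatile overhead.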
\QuickQ{}
\co{ACCESS_ONCE()} only affects the compiler, not the CPU.
Don't we also need memory barriers to make sure
that the change in \co{goflag}'s value propagates to the
CPU in a timely fashion in
Figure~\ref{fig:toolsoftrade:Measuring Reader-Writer Lock Scalability}?
\QuickA{}
No, memory barriers are not needed and won't help here.
Memory barriers only enforce ordering among multiple
memory references: They do absolutely nothing to expedite
the propagation of data from one part of the system to
another.
This leads to a quick rule of thumb: You do not need
memory barriers unless you are using more than one
variable to communicate between multiple threads.
But what about \co{nreadersrunning}?
Isn't that a second variable used for communication?
Indeed it is, and there really are the needed memory-barrier
instructions buried in \co{__sync_fetch_and_add()},
which make sure that the thread proclaims its presence
before checking to see if it should start.
\QuickQ{}
Would it ever be necessary to use \co{ACCESS_ONCE()} when accessing
a per-thread variable, for example, a variable declared using
the \co{gcc} \co{__thread} storage class?
\QuickA{}
It depends.
If the per-thread variable was accessed only from its thread,
and never from a signal handler, then no.
Otherwise, it is quite possible that \co{ACCESS_ONCE()} is needed.
We will see examples of both situations in
Section~\ref{sec:count:Signal-Theft Limit Counter Implementation}.
This leads to the question of how one thread can gain access to
another thread's \co{__thread} variable, and the answer is that
the second thread must store a pointer to its \co{__thread}
pointer somewhere that the first thread has access to.
One common approach is to maintain a linked list with one
element per thread, and to store the address of each thread's
\co{__thread} variable in the corresponding element.
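The following sketch illustrates the idea with a simple global
array rather than a linked list (the names are made up for this
illustration, and thread IDs are assumed to be small integers):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128

__thread unsigned long counter;       /* per-thread variable  */
unsigned long *counterp[NR_THREADS];  /* published addresses  */

/* Each thread calls this once with its own small-integer ID. */
void register_thread(int id)
{
  counterp[id] = &counter;
}

/* Any thread may then sum up all threads' counters. */
unsigned long read_count(void)
{
  int i;
  unsigned long sum = 0;

  for (i = 0; i < NR_THREADS; i++)
    if (counterp[i] != NULL)
      sum += *counterp[i];  /* ACCESS_ONCE() may be needed here. */
  return sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}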
\QuickQ{}
Isn't comparing against single-CPU throughput a bit harsh?
\QuickA{}
Not at all.
In fact, this comparison was, if anything, overly lenient.
A more balanced comparison would be against single-CPU
throughput with the locking primitives commented out.
\QuickQ{}
But 1,000 instructions is not a particularly small size for
a critical section.
What do I do if I need a much smaller critical section, for
example, one containing only a few tens of instructions?
\QuickA{}
If the data being read \emph{never} changes, then you do not
need to hold any locks while accessing it.
If the data changes sufficiently infrequently, you might be
able to checkpoint execution, terminate all threads, change
the data, then restart at the checkpoint.
Another approach is to keep a single exclusive lock per
thread, so that a thread read-acquires the larger aggregate
reader-writer lock by acquiring its own lock, and write-acquires
by acquiring all the per-thread locks~\cite{WilsonCHsieh92a}.
This can work quite well for readers, but causes writers
to incur increasingly large overheads as the number of threads
increases.
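A minimal sketch of this approach follows (the names are
illustrative); note that the read-side cost is a single uncontended
lock acquisition, while the write-side cost grows linearly with the
number of threads:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <pthread.h>

#define NR_THREADS 128

pthread_mutex_t per_thread_lock[NR_THREADS];

void init_locks(void)  /* call once at startup */
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_init(&per_thread_lock[i], NULL);
}

/* Readers acquire only their own lock... */
void read_lock(int me)   { pthread_mutex_lock(&per_thread_lock[me]); }
void read_unlock(int me) { pthread_mutex_unlock(&per_thread_lock[me]); }

/* ...but writers must acquire every lock, always in the same order. */
void write_lock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_lock(&per_thread_lock[i]);
}

void write_unlock(void)
{
  int i;

  for (i = 0; i < NR_THREADS; i++)
    pthread_mutex_unlock(&per_thread_lock[i]);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}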
Some other ways of handling very small critical sections are
described in Section~\ref{sec:defer:Read-Copy Update (RCU)}.
\QuickQ{}
In
Figure~\ref{fig:intro:Reader-Writer Lock Scalability},
all of the traces other than the 100M trace deviate gently
from the ideal line.
In contrast, the 100M trace breaks sharply from the ideal
line at 64 CPUs.
In addition, the spacing between the 100M trace and the 10M
trace is much smaller than that between the 10M trace and the
1M trace.
Why does the 100M trace behave so much differently than the
other traces?
\QuickA{}
Your first clue is that 64 CPUs is exactly half of the 128
CPUs on the machine.
The difference is an artifact of hardware threading.
This system has 64 cores with two hardware threads per core.
As long as fewer than 64 threads are running, each can run
in its own core.
But as soon as there are more than 64 threads, some of the threads
must share cores.
Because the pair of threads in any given core share some hardware
resources, the throughput of two threads sharing a core is not
quite as high as that of two threads each in their own core.
So the performance of the 100M trace is limited not by the
reader-writer lock, but rather by the sharing of hardware resources
between hardware threads in a single core.
This can also be seen in the 10M trace, which deviates gently from
the ideal line up to 64 threads, then breaks sharply down, parallel
to the 100M trace.
Up to 64 threads, the 10M trace is limited primarily by reader-writer
lock scalability, and beyond that, also by sharing of hardware
resources between hardware threads in a single core.
\QuickQ{}
Power-5 is several years old, and new hardware should
be faster.
So why should anyone worry about reader-writer locks being slow?
\QuickA{}
In general, newer hardware is improving.
However, it will need to improve more than two orders of magnitude
to permit reader-writer locks to achieve ideal performance on
128 CPUs.
Worse yet, the greater the number of CPUs, the larger the
required performance improvement.
The performance problems of reader-writer locking are therefore
very likely to be with us for quite some time to come.
\QuickQ{}
Is it really necessary to have both sets of primitives?
\QuickA{}
Strictly speaking, no.
One could implement any member of the second set using the
corresponding member of the first set.
For example, one could implement \co{__sync_nand_and_fetch()}
in terms of \co{__sync_fetch_and_nand()} as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
tmp = v;
ret = __sync_fetch_and_nand(p, tmp);
ret = ~(ret & tmp);
\end{verbatim}
\end{minipage}
\vspace{5pt}
It is similarly possible to implement \co{__sync_fetch_and_add()},
\co{__sync_fetch_and_sub()}, and \co{__sync_fetch_and_xor()}
in terms of their post-value counterparts.
However, the alternative forms can be quite convenient, both
for the programmer and for the compiler/library implementor.
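For example, \co{__sync_fetch_and_add()} can be recovered from
\co{__sync_add_and_fetch()} by subtracting the addend from the
returned (new) value:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* add_and_fetch() returns the new value; subtracting the
   addend recovers the old value that fetch_and_add() returns. */
ret = __sync_add_and_fetch(p, v) - v;
\end{verbatim}
\end{minipage}
\vspace{5pt}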
\QuickQ{}
Given that these atomic operations will often be able to
generate single atomic instructions that are directly
supported by the underlying instruction set, shouldn't
they be the fastest possible way to get things done?
\QuickA{}
Unfortunately, no.
See Chapter~\ref{chp:Counting} for some stark counterexamples.
\QuickQ{}
What happened to the Linux-kernel equivalents to \co{fork()}
and \co{join()}?
\QuickA{}
They don't really exist.
All tasks executing within the Linux kernel share memory,
at least unless you want to do a huge amount of memory-mapping
work by hand.
\QuickQ{}
Wouldn't the shell normally use \co{vfork()} rather than
\co{fork()}?
\QuickA{}
It might well do that; however, checking is left as an exercise
for the reader.
But in the meantime, I hope that we can agree that \co{vfork()}
is a variant of \co{fork()}, so that we can use \co{fork()}
as a generic term covering both.
\QuickQAC{chp:Counting}{Counting}
\QuickQ{}
Why on earth should efficient and scalable counting be hard?
After all, computers have special hardware for the sole purpose
of doing counting,
addition, subtraction, and lots more besides, don't they???
\QuickA{}
Because the straightforward counting algorithms, for example,
atomic operations on a shared counter, either are slow and scale
badly, or are inaccurate, as will be seen in
Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}.
\QuickQ{}
{ \bfseries Network-packet counting problem. }
Suppose that you need to collect statistics on the number
of networking packets (or total number of bytes) transmitted
and/or received.
Packets might be transmitted or received by any CPU on
the system.
Suppose further that this large machine is capable of
handling a million packets per second, and that there
is a systems-monitoring package that reads out the count
every five seconds.
How would you implement this statistical counter?
\QuickA{}
Hint: The act of updating the counter must be blazingly
fast, but because the counter is read out only about once
in five million updates, the act of reading out the counter can be
quite slow.
In addition, the value read out normally need not be all that
accurate---after all, since the counter is updated a thousand
times per millisecond, we should be able to work with a value
that is within a few thousand counts of the ``true value'',
whatever ``true value'' might mean in this context.
However, the value read out should maintain roughly the same
absolute error over time.
For example, a 1\% error might be just fine when the count
is on the order of a million or so, but might be absolutely
unacceptable once the count reaches a trillion.
See Section~\ref{sec:count:Statistical Counters}.
\QuickQ{}
{ \bfseries Approximate structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds a limit
(say, 10,000).
Suppose further that these structures are short-lived,
that the limit is rarely exceeded, and that a ``sloppy''
approximate limit is acceptable.
\QuickA{}
Hint: The act of updating the counter must again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it must distinguish approximately
between values below the limit and values greater than or
equal to the limit.
See Section~\ref{sec:count:Approximate Limit Counters}.
\QuickQ{}
{ \bfseries Exact structure-allocation limit problem. }
Suppose that you need to maintain a count of the number of
structures allocated in order to fail any allocations
once the number of structures in use exceeds an exact limit
(again, say 10,000).
Suppose further that these structures are short-lived,
and that the limit is rarely exceeded, that there is almost
always at least one structure in use, and suppose further
still that it is necessary to know exactly when this counter reaches
zero, for example, in order to free up some memory
that is not required unless there is at least one structure
in use.
\QuickA{}
Hint: The act of updating the counter must once again be blazingly
fast, but the counter is read out each time that the
counter is increased.
However, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between values between the limit and zero on the one hand,
and values that either are less than or equal to zero or
are greater than or equal to the limit on the other hand.
See Section~\ref{sec:count:Exact Limit Counters}.
\QuickQ{}
{ \bfseries Removable I/O device access-count problem. }
Suppose that you need to maintain a reference count on a
heavily used removable mass-storage device, so that you
can tell the user when it is safe to remove the device.
This device follows the usual removal procedure where
the user indicates a desire to remove the device, and
the system tells the user when it is safe to do so.
\QuickA{}
Hint: Yet again, the act of updating the counter must be blazingly
fast and scalable in order to avoid slowing down I/O operations,
but because the counter is read out only when the
user wishes to remove the device, the counter read-out
operation can be extremely slow.
Furthermore, there is no need to be able to read out
the counter at all unless the user has already indicated
a desire to remove the device.
In addition, the value read out need not be accurate
\emph{except} that it absolutely must distinguish perfectly
between non-zero and zero values, and even then only when
the device is in the process of being removed.
However, once it has read out a zero value, it must act
to keep the value at zero until it has taken some action
to prevent subsequent threads from gaining access to the
device being removed.
See Section~\ref{sec:count:Applying Specialized Parallel Counters}.
\QuickQ{}
But doesn't the \co{++} operator produce an x86 add-to-memory
instruction?
And won't the CPU cache cause this to be atomic?
\QuickA{}
Although the \co{++} operator \emph{could} be atomic, there
is no requirement that it be so.
And indeed, \co{gcc} often
chooses to load the value to a register, increment
the register, then store the value to memory, which is
decidedly non-atomic.
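In other words, \co{counter++} can compile to the moral equivalent
of the following non-atomic sequence, in which a concurrent
increment can be lost between the load and the store (an atomic
alternative is shown for comparison):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
tmp = counter;     /* load                                  */
tmp = tmp + 1;     /* increment the register copy           */
counter = tmp;     /* store: may overwrite another thread's */
                   /* concurrent increment                  */

/* One atomic alternative using a gcc builtin: */
__sync_fetch_and_add(&counter, 1);
\end{verbatim}
\end{minipage}
\vspace{5pt}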
\QuickQ{}
The 8-figure accuracy on the number of failures indicates
that you really did test this.
Why would it be necessary to test such a trivial program,
especially when the bug is easily seen by inspection?
\QuickA{}
There are very few trivial parallel programs, and most days I am
not so sure that there are many trivial sequential programs, either.
No matter how small or simple the program, if you haven't tested
it, it does not work.
And even if you have tested it, Murphy's Law says that there will
be at least a few bugs still lurking.
Furthermore, while proofs of correctness certainly do have their
place, they never will replace testing, including the
\url{counttorture.h} test setup used here.
After all, proofs are only as good as the assumptions that they
are based on.
Furthermore, proofs can have bugs just as easily as programs can!
\QuickQ{}
Why doesn't the dashed line on the x~axis meet the
diagonal line at $x=1$?
\QuickA{}
Because of the overhead of the atomic operation.
The dashed line on the x~axis represents the overhead of
a single \emph{non-atomic} increment.
After all, an \emph{ideal} algorithm would not only scale
linearly, it would also incur no performance penalty compared
to single-threaded code.
This level of idealism may seem severe, but if it is good
enough for Linus Torvalds, it is good enough for you.
\QuickQ{}
But atomic increment is still pretty fast.
And incrementing a single variable in a tight loop sounds
pretty unrealistic to me, after all, most of the program's
execution should be devoted to actually doing work, not accounting
for the work it has done!
Why should I care about making this go faster?
\QuickA{}
In many cases, atomic increment will in fact be fast enough
for you.
In those cases, you should by all means use atomic increment.
That said, there are many real-world situations where
more elaborate counting algorithms are required.
The canonical example of such a situation is counting packets
and bytes in highly optimized networking stacks, where it is
all too easy to find much of the execution time going into
these sorts of accounting tasks, especially on large
multiprocessors.
In addition, as noted at the beginning of this chapter,
counting provides an excellent view of the
issues encountered in shared-memory parallel programs.
\QuickQ{}
But why can't CPU designers simply ship the addition operation to the
data, avoiding the need to circulate the cache line containing
the global variable being incremented?
\QuickA{}
It might well be possible to do this in some cases.
However, there are a few complications:
\begin{enumerate}
\item If the value of the variable is required, then the
thread will be forced to wait for the operation
to be shipped to the data, and then for the result
to be shipped back.
\item If the atomic increment must be ordered with respect
to prior and/or subsequent operations, then the thread
will be forced to wait for the operation to be shipped
to the data, and for an indication that the operation
completed to be shipped back.
\item Shipping operations among CPUs will likely require
more lines in the system interconnect, which will consume
more die area and more electrical power.
\end{enumerate}
But what if neither of the first two conditions holds?
Then you should think carefully about the algorithms discussed
in Section~\ref{sec:count:Statistical Counters}, which achieve
near-ideal performance on commodity hardware.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{count/GlobalTreeInc}}
\end{center}
\caption{Data Flow For Global Combining-Tree Atomic Increment}
\label{fig:count:Data Flow For Global Combining-Tree Atomic Increment}
\end{figure}
If either or both of the first two conditions hold, there is
\emph{some} hope for improved hardware.
One could imagine the hardware implementing a combining tree,
so that the increment requests from multiple CPUs are combined
by the hardware into a single addition when the combined request
reaches the hardware.
The hardware could also apply an order to the requests, thus
returning to each CPU the return value corresponding to its
particular atomic increment.
This results in instruction latency that varies as $O(\log N)$,
where $N$ is the number of CPUs, as shown in
Figure~\ref{fig:count:Data Flow For Global Combining-Tree Atomic Increment}.
And CPUs with this sort of hardware optimization are starting to
appear as of 2011.
This is a great improvement over the $O(N)$ performance
of current hardware shown in
Figure~\ref{fig:count:Data Flow For Global Atomic Increment},
and it is possible that hardware latencies might decrease
further if innovations such as three-dimensional fabrication prove
practical.
Nevertheless, we will see that in some important special cases,
software can do \emph{much} better.
\QuickQ{}
But doesn't the fact that C's ``integers'' are limited in size
complicate things?
\QuickA{}
No, because modulo addition is still commutative and associative.
At least as long as you use unsigned integers.
Recall that in the C standard, overflow of signed integers results
in undefined behavior, never mind the fact that machines that
do anything other than wrap on overflow are quite rare these days.
Unfortunately, compilers frequently carry out optimizations that
assume that signed integers will not overflow, so if your code
allows signed integers to overflow, you can run into trouble
even on twos-complement hardware.
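To see why unsigned wraparound is harmless here, consider the
following minimal sketch, which shows that the difference between
two unsigned counter samples remains exact even when the counter
wraps between them:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <assert.h>
#include <limits.h>

int main(void)
{
  unsigned long before = ULONG_MAX - 1; /* just before wrap */
  unsigned long after = before + 5;     /* wraps modulo 2^N */

  assert(after - before == 5);          /* still exact */
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}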
That said, one potential source of additional complexity arises
when attempting to gather (say) a 64-bit sum from 32-bit
per-thread counters.
Dealing with this added complexity is left as
an exercise for the reader, for whom some of the techniques
introduced later in this chapter could be quite helpful.
\QuickQ{}
An array???
But doesn't that limit the number of threads?
\QuickA{}
It can, and in this toy implementation, it does.
But it is not that hard to come up with an alternative
implementation that permits an arbitrary number of threads,
for example, using the \co{gcc} \co{__thread} facility,
as shown in
Section~\ref{sec:count:Per-Thread-Variable-Based Implementation}.
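As a minimal sketch (the names here are illustrative rather than
taken from that section), a per-thread counter based on
\co{__thread} might look like this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
__thread unsigned long counter = 0; /* one per thread */

static inline void inc_count(void)
{
  counter++; /* updates only this thread's instance */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}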
\QuickQ{}
What other choice does gcc have, anyway???
\QuickA{}
According to the C standard, the effects of fetching a variable
that might be concurrently modified by some other thread are
undefined.
It turns out that the C standard really has no other choice,
given that C must support (for example) eight-bit architectures
which are incapable of atomically loading a \co{long}.
An upcoming version of the C standard aims to fill this gap,
but until then, we depend on the kindness of the gcc developers.
Alternatively, use of volatile accesses such as those provided
by \co{ACCESS_ONCE()}~\cite{JonCorbet2012ACCESS:ONCE}
can help constrain the compiler, at least
in cases where the hardware is capable of accessing the value
with a single memory-reference instruction.
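For reference, the Linux-kernel-style definition of
\co{ACCESS_ONCE()} is simply a volatile cast, shown here as a
sketch:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Linux-kernel-style volatile-access wrapper: forces
 * the compiler to emit exactly one access to x here. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
\end{verbatim}
\end{minipage}
\vspace{5pt}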
\QuickQ{}
How does the per-thread \co{counter} variable in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
get initialized?
\QuickA{}
The C standard specifies that the initial value of
global variables is zero, unless they are explicitly initialized.
So the initial value of all the instances of \co{counter}
will be zero.
Furthermore, in the common case where the user is interested only
in differences between consecutive reads
from statistical counters, the initial value is irrelevant.
\QuickQ{}
How is the code in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
supposed to permit more than one counter?
\QuickA{}
Indeed, this toy example does not support more than one counter.
Modifying it so that it can provide multiple counters is left
as an exercise to the reader.
\QuickQ{}
The read operation takes time to sum up the per-thread values,
and during that time, the counter could well be changing.
This means that the value returned by
\co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Statistical Counters}
will not necessarily be exact.
Assume that the counter is being incremented at rate
$r$ counts per unit time, and that \co{read_count()}'s
execution consumes $\Delta$ units of time.
What is the expected error in the return value?
\QuickA{}
Let's do worst-case analysis first, followed by a less
conservative analysis.
In the worst case, the read operation completes immediately,
but is then delayed for $\Delta$ time units before returning,
in which case the worst-case error is simply $r \Delta$.
This worst-case behavior is rather unlikely, so let us instead
consider the case where the reads from each of the $N$
counters are spaced equally over the time period $\Delta$.
There will be $N+1$ intervals of duration $\frac{\Delta}{N+1}$
between the $N$ reads.
Because the total increment rate $r$ is spread across the $N$
threads, each individual counter is incremented at an average
rate of $r/N$.
The error due to the delay after the read from the last thread's
counter is therefore given by $\frac{r \Delta}{N \left( N + 1 \right)}$,
the second-to-last thread's counter by
$\frac{2 r \Delta}{N \left( N + 1 \right)}$,
the third-to-last by
$\frac{3 r \Delta}{N \left( N + 1 \right)}$,
and so on.
The total error is given by the sum of the errors due to the
reads from each thread's counter, which is:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\sum_{i = 1}^N i
\end{equation}
Expressing the summation in closed form yields:
\begin{equation}
\frac{r \Delta}{N \left( N + 1 \right)}
\frac{N \left( N + 1 \right)}{2}
\end{equation}
Cancelling yields the intuitively expected result:
\begin{equation}
\frac{r \Delta}{2}
\end{equation}
It is important to remember that error continues accumulating
as the caller executes code making use of the count returned
by the read operation.
For example, if the caller spends time $t$ executing some
computation based on the result of the returned count, the
worst-case error will have increased to $r \left( \Delta + t \right)$.
The expected error will have similarly increased to:
\begin{equation}
r \left( \frac{\Delta}{2} + t \right)
\end{equation}
Of course, it is sometimes unacceptable for the counter to
continue incrementing during the read operation.
Section~\ref{sec:count:Applying Specialized Parallel Counters}
discusses a way to handle this situation.
All that aside, in most uses of statistical counters, the
error in the value returned by \co{read_count()} is
irrelevant.
This irrelevance is due to the fact that the time required
for \co{read_count()} to execute is normally extremely
small compared to the time interval between successive
calls to \co{read_count()}.
\QuickQ{}
Why doesn't \co{inc_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
need to use atomic instructions?
After all, we now have multiple threads accessing the per-thread
counters!
\QuickA{}
Because one of the two threads only reads, and because the
variable is aligned and machine-sized, non-atomic instructions
suffice.
That said, the \co{ACCESS_ONCE()} macro is used to prevent
compiler optimizations that might otherwise prevent the
counter updates from becoming visible to
\co{eventual()}~\cite{JonCorbet2012ACCESS:ONCE}.
An older version of this algorithm did in fact use atomic
instructions, kudos to Ersoy Bayramoglu for pointing out that
they are in fact unnecessary.
That said, atomic instructions would be needed in cases where
the per-thread \co{counter} variables were smaller than the
global \co{global_count}.
However, note that on a 32-bit system,
the per-thread \co{counter} variables
might need to be limited to 32 bits in order to sum them accurately,
but with a 64-bit \co{global_count} variable to avoid overflow.
In this case, it is necessary to zero the per-thread
\co{counter} variables periodically in order to avoid overflow.
It is extremely important to note that this zeroing cannot
be delayed too long or overflow of the smaller per-thread
variables will result.
This approach therefore imposes real-time requirements on the
underlying system, and must in turn be used with extreme care.
In contrast, if all variables are the same size, overflow
of any variable is harmless because the eventual sum
will be modulo the word size.
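As a rough sketch of what such a helper thread does (the
\co{stopflag} variable and the loop structure are assumptions,
not a copy of the figure), \co{eventual()} might look like the
following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
void *eventual(void *arg)
{
  int t;
  long sum;

  while (!ACCESS_ONCE(stopflag)) {
    sum = 0;
    for (t = 0; t < NR_THREADS; t++)
      sum += ACCESS_ONCE(counter[t]);
    ACCESS_ONCE(global_count) = sum;
    poll(NULL, 0, 1); /* sleep about one millisecond */
  }
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}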
\QuickQ{}
Won't the single global thread in the function \co{eventual()} of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
be just as severe a bottleneck as a global lock would be?
\QuickA{}
In this case, no.
What will happen instead is that as the number of threads increases,
the estimate of the counter
value returned by \co{read_count()} will become more inaccurate.
\QuickQ{}
Won't the estimate returned by \co{read_count()} in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
become increasingly
inaccurate as the number of threads rises?
\QuickA{}
Yes.
If this proves problematic, one fix is to provide multiple
\co{eventual()} threads, each covering its own subset of
the other threads.
In more extreme cases, a tree-like hierarchy of
\co{eventual()} threads might be required.
\QuickQ{}
Given that in the eventually-consistent algorithm shown in
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
both reads and updates have extremely low overhead
and are extremely scalable, why would anyone bother with the
implementation described in
Section~\ref{sec:count:Array-Based Implementation},
given its costly read-side code?
\QuickA{}
The thread executing \co{eventual()} consumes CPU time.
As more of these eventually-consistent counters are added,
the resulting \co{eventual()} threads will eventually
consume all available CPUs.
This implementation therefore suffers a different sort of
scalability limitation, with the scalability limit being in
terms of the number of eventually consistent counters rather
than in terms of the number of threads or CPUs.
\QuickQ{}
Why do we need an explicit array to find the other threads'
counters?
Why doesn't gcc provide a \co{per_thread()} interface, similar
to the Linux kernel's \co{per_cpu()} primitive, to allow
threads to more easily access each other's per-thread variables?
\QuickA{}
Why indeed?
To be fair, gcc faces some challenges that the Linux kernel
gets to ignore.
When a user-level thread exits, its per-thread variables all
disappear, which complicates the problem of per-thread-variable
access, particularly before the advent of user-level RCU
(see Section~\ref{sec:defer:Read-Copy Update (RCU)}).
In contrast, in the Linux kernel, when a CPU goes offline,
that CPU's per-CPU variables remain mapped and accessible.
Similarly, when a new user-level thread is created, its
per-thread variables suddenly come into existence.
In contrast, in the Linux kernel, all per-CPU variables are
mapped and initialized at boot time, regardless of whether
the corresponding CPU exists yet, or indeed, whether the
corresponding CPU will ever exist.
A key limitation that the Linux kernel imposes is a compile-time
maximum bound on the number of CPUs, namely, \co{CONFIG_NR_CPUS},
along with a typically tighter boot-time bound of \co{nr_cpu_ids}.
In contrast, in user space, there is no hard-coded upper limit
on the number of threads.
Of course, both environments must handle dynamically loaded
code (dynamic libraries in user space, kernel modules in the
Linux kernel), which increases the complexity of per-thread
variables.
These complications make it significantly harder for user-space
environments to provide access to other threads' per-thread
variables.
Nevertheless, such access is highly useful, and it is hoped
that it will someday appear.
\QuickQ{}
Doesn't the check for \co{NULL} on line~19 of
Figure~\ref{fig:count:Per-Thread Statistical Counters}
add extra branch mispredictions?
Why not have a variable set permanently to zero, and point
unused counter-pointers to that variable rather than setting
them to \co{NULL}?
\QuickA{}
This is a reasonable strategy.
Checking for the performance difference is left as an exercise
for the reader.
However, please keep in mind that the fastpath is not
\co{read_count()}, but rather \co{inc_count()}.
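A minimal sketch of this strategy follows (the \co{zero_counter}
name is an assumption, while \co{counterp[]} and \co{count_init()}
follow the naming used elsewhere in this chapter):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
static long zero_counter; /* remains permanently zero */
long *counterp[NR_THREADS];

void count_init(void)
{
  int t;

  for (t = 0; t < NR_THREADS; t++)
    counterp[t] = &zero_counter; /* never NULL */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Exiting threads could then point their slot back at
\co{zero_counter} rather than setting it to \co{NULL}.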
\QuickQ{}
Why on earth do we need something as heavyweight as a \emph{lock}
guarding the summation in the function \co{read_count()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
\QuickA{}
Remember, when a thread exits, its per-thread variables disappear.
Therefore, if we attempt to access a given thread's per-thread
variables after that thread exits, we will get a segmentation
fault.
The lock coordinates summation and thread exit, preventing this
scenario.
Of course, we could instead read-acquire a reader-writer lock,
but Chapter~\ref{chp:Deferred Processing} will introduce even
lighter-weight mechanisms for implementing the required coordination.
Another approach would be to use an array instead of a per-thread
variable, which, as Alexey Roytman notes, would eliminate
the tests against \co{NULL}.
However, array accesses are often slower than accesses to
per-thread variables, and use of an array would imply a
fixed upper bound on the number of threads.
Also, note that neither tests nor locks are needed on the
\co{inc_count()} fastpath.
\QuickQ{}
Why on earth do we need to acquire the lock in
\co{count_register_thread()} in
Figure~\ref{fig:count:Per-Thread Statistical Counters}?
It is a single properly aligned machine-word store to a location
that no other thread is modifying, so it should be atomic anyway,
right?
\QuickA{}
This lock could in fact be omitted, but better safe than
sorry, especially given that this function is executed only at
thread startup, and is therefore not on any critical path.
Now, if we were testing on machines with thousands of CPUs,
we might need to omit the lock, but on machines with ``only''
a hundred or so CPUs, there is no need to get fancy.
\QuickQ{}
Fine, but the Linux kernel doesn't have to acquire a lock
when reading out the aggregate value of per-CPU counters.
So why should user-space code need to do this???
\QuickA{}
Remember, the Linux kernel's per-CPU variables are always
accessible, even if the corresponding CPU is offline --- even
if the corresponding CPU never existed and never will exist.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 long __thread counter = 0;
2 long *counterp[NR_THREADS] = { NULL };
3 int finalthreadcount = 0;
4 DEFINE_SPINLOCK(final_mutex);
5
6 void inc_count(void)
7 {
8 counter++;
9 }
10
11 long read_count(void)
12 {
13 int t;
14 long sum = 0;
15
16 for_each_thread(t)
17 if (counterp[t] != NULL)
18 sum += *counterp[t];
19 return sum;
20 }
21
22 void count_init(void)
23 {
24 }
25
26 void count_register_thread(void)
27 {
28 counterp[smp_thread_id()] = &counter;
29 }
30
31 void count_unregister_thread(int nthreadsexpected)
32 {
33 spin_lock(&final_mutex);
34 finalthreadcount++;
35 spin_unlock(&final_mutex);
36 while (finalthreadcount < nthreadsexpected)
37 poll(NULL, 0, 1);
38 }
\end{verbatim}
}
\caption{Per-Thread Statistical Counters With Lockless Summation}
\label{fig:count:Per-Thread Statistical Counters With Lockless Summation}
\end{figure}
One workaround is to ensure that each thread continues to exist
until all threads are finished, as shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters With Lockless Summation}
(\co{count_tstat.c}).
Analysis of this code is left as an exercise to the reader,
however, please note that it does not fit well into the
\url{counttorture.h} counter-evaluation scheme.
(Why not?)
Chapter~\ref{chp:Deferred Processing} will introduce
synchronization mechanisms that handle this situation in a much
more graceful manner.
\QuickQ{}
What fundamental difference is there between counting packets
and counting the total number of bytes in the packets, given
that the packets vary in size?
\QuickA{}
When counting packets, the counter is only incremented by the
value one.
On the other hand, when counting bytes, the counter might
be incremented by largish numbers.
Why does this matter?
Because in the increment-by-one case, the value returned will
be exact in the sense that the counter must necessarily have
taken on that value at some point in time, even if it is impossible
to say precisely when that point occurred.
In contrast, when counting bytes, two different threads might
return values that are inconsistent with any global ordering
of operations.
To see this, suppose that thread~0 adds the value three to its
counter, thread~1 adds the value five to its counter, and
threads~2 and 3 sum the counters.
If the system is ``weakly ordered'' or if the compiler
uses aggressive optimizations, thread~2 might find the
sum to be three and thread~3 might find the sum to be five.
The only possible global orders of the sequence of values
of the counter are 0,3,8 and 0,5,8, and neither order is
consistent with the results obtained.
If you missed this one, you are not alone.
Michael Scott used this question to stump Paul E.~McKenney
during Paul's Ph.D. defense.
\QuickQ{}
Given that the reader must sum all the threads' counters,
this could take a long time given large numbers of threads.
Is there any way that the increment operation can remain
fast and scalable while allowing readers to also enjoy
reasonable performance and scalability?
\QuickA{}
One approach would be to maintain a global approximation
to the value.
Updaters would increment their per-thread variable, but when it
reached some predefined limit, atomically add it to a global
variable, then zero their per-thread variable.
This would permit a tradeoff between average increment overhead
and accuracy of the value read out.
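A minimal sketch of this approach (the threshold and the names
are assumptions):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define COUNT_FLUSH_LIM 1024 /* overhead vs. accuracy */

__thread unsigned long counter = 0;
atomic_t global_approx = ATOMIC_INIT(0);

static inline void inc_count(void)
{
  if (++counter >= COUNT_FLUSH_LIM) {
    atomic_add(counter, &global_approx);
    counter = 0;
  }
}

static inline long read_approx(void)
{
  return atomic_read(&global_approx);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}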
The reader is encouraged to think up and try out other approaches,
for example, using a combining tree.
\QuickQ{}
Why does
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}
provide \co{add_count()} and \co{sub_count()} instead of the
\co{inc_count()} and \co{dec_count()} interfaces shown in
Section~\ref{sec:count:Statistical Counters}?
\QuickA{}
Because structures come in different sizes.
Of course, a limit counter corresponding to a specific size
of structure might still be able to use
\co{inc_count()} and \co{dec_count()}.
\QuickQ{}
What is with the strange form of the condition on line~3 of
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not the following more intuitive form of the fastpath?
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
3 if (counter + delta <= countermax){
4 counter += delta;
5 return 1;
6 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
\QuickA{}
Two words.
``Integer overflow.''
Try the above formulation with \co{counter} equal to 10 and
\co{delta} equal to \co{ULONG_MAX}.
Then try it again with the code shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}.
A good understanding of integer overflow will be required for
the rest of this example, so if you have never dealt with
integer overflow before, please try several examples to get
the hang of it.
Integer overflow can sometimes be more difficult to get right
than parallel algorithms!
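For concreteness, here is a hedged sketch of an overflow-safe form
of the test, in the spirit of the figure but not a verbatim copy
of it, along with the worked values:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Safe: counter never exceeds countermax, so the
 * subtraction cannot wrap. */
if (countermax - counter >= delta) {
  counter += delta;
  return 1;
}

/* With counter == 10, countermax == 100 (say), and
 * delta == ULONG_MAX:
 *   counter + delta wraps to 9, so the "intuitive"
 *   test wrongly succeeds;
 *   countermax - counter == 90 < ULONG_MAX, so the
 *   test above correctly fails. */
\end{verbatim}
\end{minipage}
\vspace{5pt}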
\QuickQ{}
Why does \co{globalize_count()} zero the per-thread variables,
only to later call \co{balance_count()} to refill them in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not just leave the per-thread variables non-zero?
\QuickA{}
That is in fact what an earlier version of this code did.
But addition and subtraction are extremely cheap, and handling
all of the special cases that arise is quite complex.
Again, feel free to try it yourself, but beware of integer
overflow!
\QuickQ{}
Given that \co{globalreserve} counted against us in \co{add_count()},
why doesn't it count for us in \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
\QuickA{}
The \co{globalreserve} variable tracks the sum of all threads'
\co{countermax} variables.
The sum of these threads' \co{counter} variables might be anywhere
from zero to \co{globalreserve}.
We must therefore take a conservative approach, assuming that
all threads' \co{counter} variables are full in \co{add_count()}
and that they are all empty in \co{sub_count()}.
But remember this question, as we will come back to it later.
\QuickQ{}
Suppose that one thread invokes \co{add_count()} shown in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read},
and then another thread invokes \co{sub_count()}.
Won't \co{sub_count()} return failure even though the value of
the counter is non-zero?
\QuickA{}
Indeed it will!
In many cases, this will be a problem, as discussed in
Section~\ref{sec:count:Simple Limit Counter Discussion}, and
in those cases the algorithms from
Section~\ref{sec:count:Exact Limit Counters}
will likely be preferable.
\QuickQ{}
Why have both \co{add_count()} and \co{sub_count()} in
Figure~\ref{fig:count:Simple Limit Counter Add, Subtract, and Read}?
Why not simply pass a negative number to \co{add_count()}?
\QuickA{}
Given that \co{add_count()} takes an \co{unsigned} \co{long}
as its argument, it is going to be a bit tough to pass it a
negative number.
And unless you have some anti-matter memory, there is little
point in allowing negative numbers when counting the number
of structures in use!
\QuickQ{}
Why set \co{counter} to \co{countermax / 2} in line~15 of
Figure~\ref{fig:count:Simple Limit Counter Utility Functions}?
Wouldn't it be simpler to just take \co{countermax} counts?
\QuickA{}
First, it really is reserving \co{countermax} counts
(see line~14); however,
it adjusts so that only half of these are actually in use
by the thread at the moment.
This allows the thread to carry out at least \co{countermax / 2}
increments or decrements before having to refer back to
\co{globalcount} again.
Note that the accounting in \co{globalcount} remains accurate,
thanks to the adjustment in line~18.
\QuickQ{}
In Figure~\ref{fig:count:Schematic of Globalization and Balancing},
even though a quarter of the remaining count up to the limit is
assigned to thread~0, only an eighth of the remaining count is
consumed, as indicated by the uppermost dotted line connecting
the center and the rightmost configurations.
Why is that?
\QuickA{}
The reason this happened is that thread~0's \co{counter} was
set to half of its \co{countermax}.
Thus, of the quarter assigned to thread~0, half of that quarter
(one eighth) came from \co{globalcount}, leaving the other half
(again, one eighth) to come from the remaining count.
There are two purposes for taking this approach:
(1)~To allow thread~0 to use the fastpath for decrements as
well as increments, and
(2)~To reduce the inaccuracies if all threads are monotonically
incrementing up towards the limit.
To see this last point, step through the algorithm and watch
what it does.
\QuickQ{}
Why is it necessary to atomically manipulate the thread's
\co{counter} and \co{countermax} variables as a unit?
Wouldn't it be good enough to atomically manipulate them
individually?
\QuickA{}
This might well be possible, but great care is required.
Note that removing \co{counter} without first zeroing
\co{countermax} could result in the corresponding thread
increasing \co{counter} immediately after it was zeroed,
completely negating the effect of zeroing the counter.
The opposite ordering, namely zeroing \co{countermax} and then
removing \co{counter}, can also result in a non-zero
\co{counter}.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~A fetches its \co{countermax}, and finds that
it is non-zero.
\item Thread~B zeroes Thread~A's \co{countermax}.
\item Thread~B removes Thread~A's \co{counter}.
\item Thread~A, having found that its \co{countermax}
is non-zero, proceeds to add to its \co{counter},
resulting in a non-zero value for \co{counter}.
\end{enumerate}
Again, it might well be possible to atomically manipulate
\co{countermax} and \co{counter} as separate variables,
but it is clear that great care is required.
It is also quite likely that doing so will slow down the
fastpath.
Exploring these possibilities is left as an exercise for
the reader.
\QuickQ{}
In what way does line~7 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
violate the C standard?
\QuickA{}
It assumes eight bits per byte.
This assumption does hold for all current commodity microprocessors
that can be easily assembled into shared-memory multiprocessors,
but certainly does not hold for all computer systems that have
ever run C code.
(What could you do instead in order to comply with the C
standard? What drawbacks would it have?)
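One possible standard-compliant alternative is sketched below
(the macro name is an assumption), one drawback being that it is
arguably less obvious to the casual reader:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <limits.h>

/* Half the bits of an unsigned long, without
 * assuming eight-bit bytes. */
#define CM_HALF_BITS \
        (sizeof(unsigned long) * CHAR_BIT / 2)
\end{verbatim}
\end{minipage}
\vspace{5pt}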
\QuickQ{}
Given that there is only one \co{ctrandmax} variable,
why bother passing in a pointer to it on line~18 of
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}?
\QuickA{}
There is only one \co{ctrandmax} variable \emph{per thread}.
Later, we will see code that needs to pass other threads'
\co{ctrandmax} variables to \co{split_ctrandmax()}.
\QuickQ{}
Why does \co{merge_ctrandmax()} in
Figure~\ref{fig:count:Atomic Limit Counter Variables and Access Functions}
return an \co{int} rather than storing directly into an
\co{atomic_t}?
\QuickA{}
Later, we will see that we need the \co{int} return to pass
to the \co{atomic_cmpxchg()} primitive.
\QuickQ{}
Yecch!
Why the ugly \co{goto} on line~11 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}?
Haven't you heard of the \co{break} statement???
\QuickA{}
Replacing the \co{goto} with a \co{break} would require keeping
a flag to determine whether or not line~15 should return, which
is not the sort of thing you want on a fastpath.
If you really hate the \co{goto} that much, your best bet would
be to pull the fastpath into a separate function that returned
success or failure, with ``failure'' indicating a need for the
slowpath.
This is left as an exercise for goto-hating readers.
\QuickQ{}
Why would the \co{atomic_cmpxchg()} primitive at lines~13-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}
ever fail?
After all, we picked up its old value on line~9 and have not
changed it!
\QuickA{}
Later, we will see how the \co{flush_local_count()} function in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
might update this thread's \co{ctrandmax} variable concurrently
with the execution of the fastpath on lines~8-14 of
Figure~\ref{fig:count:Atomic Limit Counter Add and Subtract}.
\QuickQ{}
What stops a thread from simply refilling its
\co{ctrandmax} variable immediately after
\co{flush_local_count()} on line 14 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}
empties it?
\QuickA{}
This other thread cannot refill its \co{ctrandmax}
until the caller of \co{flush_local_count()} releases the
\co{gblcnt_mutex}.
By that time, the caller of \co{flush_local_count()} will have
finished making use of the counts, so there will be no problem
with this other thread refilling --- assuming that the value
of \co{globalcount} is large enough to permit a refill.
\QuickQ{}
What prevents concurrent execution of the fastpath of either
\co{atomic_add()} or \co{atomic_sub()} from interfering with
the \co{ctrandmax} variable while
\co{flush_local_count()} is accessing it on line~27 of
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 1}?
\QuickA{}
Nothing.
Consider the following three cases:
\begin{enumerate}
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
before the \co{split_ctrandmax()} of either fastpath,
then the fastpath will see a zero \co{counter} and
\co{countermax}, and will thus transfer to the slowpath
(unless of course \co{delta} is zero).
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{split_ctrandmax()} of either fastpath,
but before that fastpath's \co{atomic_cmpxchg()},
then the \co{atomic_cmpxchg()} will fail, causing the
fastpath to restart, which reduces to case~1 above.
\item If \co{flush_local_count()}'s \co{atomic_xchg()} executes
after the \co{atomic_cmpxchg()} of either fastpath,
then the fastpath will (most likely) complete successfully
before \co{flush_local_count()} zeroes the thread's
\co{ctrandmax} variable.
\end{enumerate}
Either way, the race is resolved correctly.
\QuickQ{}
Given that the \co{atomic_set()} primitive does a simple
store to the specified \co{atomic_t}, how can line~21 of
\co{balance_count()} in
Figure~\ref{fig:count:Atomic Limit Counter Utility Functions 2}
work correctly in face of concurrent \co{flush_local_count()}
updates to this variable?
\QuickA{}
The caller of both \co{balance_count()} and
\co{flush_local_count()} hold \co{gblcnt_mutex}, so
only one may be executing at a given time.
\QuickQ{}
But signal handlers can be migrated to some other
CPU while running.
Doesn't this possibility mean that atomic instructions
and memory barriers are required to reliably communicate
between a thread and a signal handler that interrupts that
thread?
\QuickA{}
No.
If the signal handler is migrated to another CPU, then the
interrupted thread is also migrated along with it.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, why is
the REQ \co{theft} state colored red?
\QuickA{}
To indicate that only the fastpath is permitted to change the
\co{theft} state, and that if the thread remains in this
state for too long, the thread running the slowpath will
resend the POSIX signal.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft State Machine}, what is
the point of having separate REQ and ACK \co{theft} states?
Why not simplify the state machine by collapsing
them into a single REQACK state?
Then whichever of the signal handler or the fastpath gets there
first could set the state to READY.
\QuickA{}
Reasons why collapsing the REQ and ACK states would be a very
bad idea include:
\begin{enumerate}
\item The slowpath uses the REQ and ACK states to determine
whether the signal should be retransmitted.
If the states were collapsed, the slowpath would have
no choice but to send redundant signals, which would
have the unhelpful effect of needlessly slowing down
the fastpath.
\item The following race would result:
\begin{enumerate}
\item The slowpath sets a given thread's state to REQACK.
\item That thread has just finished its fastpath, and
notes the REQACK state.
\item The thread receives the signal, which also notes
the REQACK state, and, because there is no fastpath
in effect, sets the state to READY.
\item The slowpath notes the READY state, steals the
count, sets the state to IDLE, and completes.
\item The fastpath sets the state to READY, disabling
further fastpath execution for this thread.
\end{enumerate}
The basic problem here is that the combined REQACK state
can be referenced by both the signal handler and the
fastpath.
The clear separation maintained by the four-state
setup ensures orderly state transitions.
\end{enumerate}
That said, you might well be able to make a three-state setup
work correctly.
If you do succeed, compare carefully to the four-state setup.
Is the three-state solution really preferable, and why or why not?
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}
function \co{flush_local_count_sig()}, why are there
\co{ACCESS_ONCE()} wrappers around the uses of the
\co{theft} per-thread variable?
\QuickA{}
The first one (on line~11) can be argued to be unnecessary.
The last two (lines~14 and 16) are important.
If these are removed, the compiler would be within its rights
to rewrite lines~14-17 as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
14 theft = THEFT_READY;
15 if (counting) {
16 theft = THEFT_ACK;
17 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
This would be fatal, as the slowpath might see the transient
value of \co{THEFT_READY}, and start stealing before the
corresponding thread was ready.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why is it safe for line~28 to directly access the other thread's
\co{countermax} variable?
\QuickA{}
Because the other thread is not permitted to change the value
of its \co{countermax} variable unless it holds the
\co{gblcnt_mutex} lock.
But the caller has acquired this lock, so it is not possible
for the other thread to hold it, and therefore the other thread
is not permitted to change its \co{countermax} variable.
We can therefore safely access it --- but not change it.
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
why doesn't line~33 check for the current thread sending itself
a signal?
\QuickA{}
There is no need for an additional check.
The caller of \co{flush_local_count()} has already invoked
\co{globalize_count()}, so the check on line~28 will have
succeeded, skipping the later \co{pthread_kill()}.
\QuickQ{}
The code in
Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions},
works with gcc and POSIX.
What would be required to make it also conform to the ISO C standard?
\QuickA{}
The \co{theft} variable must be declared \co{volatile sig_atomic_t}
to guarantee that it can be safely shared between the signal
handler and the code interrupted by the signal.
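A minimal sketch of such a declaration (the per-thread aspects of
the real code are omitted here):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <signal.h>

/* Safe to share between a signal handler and the
 * code that it interrupts. */
static volatile sig_atomic_t theft;
\end{verbatim}
\end{minipage}
\vspace{5pt}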
\QuickQ{}
In Figure~\ref{fig:count:Signal-Theft Limit Counter Value-Migration Functions}, why does line~41 resend the signal?
\QuickA{}
Because many operating systems over several decades have
had the property of losing the occasional signal.
Whether this is a feature or a bug is debatable, but
irrelevant.
The obvious symptom from the user's viewpoint will not be
a kernel bug, but rather a user application hanging.
\emph{Your} user application hanging!
\QuickQ{}
Not only are POSIX signals slow, sending one to each thread
simply does not scale.
What would you do if you had (say) 10,000 threads and needed
the read side to be fast?
\QuickA{}
One approach is to use the techniques shown in
Section~\ref{sec:count:Eventually Consistent Implementation},
summarizing an approximation to the overall counter value in
a single variable.
Another approach would be to use multiple threads to carry
out the reads, with each such thread interacting with a
specific subset of the updating threads.
\QuickQ{}
What if you want an exact limit counter to be exact only for
its lower limit, but to allow the upper limit to be inexact?
\QuickA{}
One simple solution is to overstate the upper limit by the
desired amount.
The limiting case of such overstatement results in the
upper limit being set to the largest value that the counter is
capable of representing.
\QuickQ{}
What else had you better have done when using a biased counter?
\QuickA{}
You had better have set the upper limit to be large enough
to accommodate the bias, the expected maximum number of accesses,
and enough ``slop'' to allow the counter to work efficiently
even when the number of accesses is at its maximum.
\QuickQ{}
This is ridiculous!
We are \emph{read}-acquiring a reader-writer lock to
\emph{update} the counter?
What are you playing at???
\QuickA{}
Strange, perhaps, but true!
Almost enough to make you think that the name
``reader-writer lock'' was poorly chosen, isn't it?
\QuickQ{}
What other issues would need to be accounted for in a real system?
\QuickA{}
A huge number!
Here are a few to start with:
\begin{enumerate}
\item There could be any number of devices, so that the
global variables are inappropriate, as is the
lack of arguments to functions like \co{do_io()}.
\item Polling loops can be problematic in real systems.
In many cases, it is far better to have the last
completing I/O wake up the device-removal thread.
\item The I/O might fail, and so \co{do_io()} will likely
need a return value.
\item If the device fails, the last I/O might never complete.
In such cases, there might need to be some sort of
timeout to allow error recovery.
\item Both \co{add_count()} and \co{sub_count()} can
fail, but their return values are not checked.
\item Reader-writer locks do not scale well.
One way of avoiding the high read-acquisition costs
of reader-writer locks is presented in
Chapters~\ref{chp:Locking}
and~\ref{chp:Deferred Processing}.
\item The polling loops result in poor energy efficiency.
An event-driven design is preferable.
\end{enumerate}
\QuickQ{}
On the \url{count_stat.c} row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
we see that the read-side performance does not change with the
number of threads.
How is that possible given that the more threads there are,
the more per-thread counters must be summed up?
\QuickA{}
The read-side code must scan the entire fixed-size array, regardless
of the number of threads, so there is no difference in performance.
In contrast, in the last two algorithms, readers must do more
work when there are more threads.
In addition, the last two algorithms interpose an additional
level of indirection because they map from integer thread ID
to the corresponding \co{__thread} variable.
\QuickQ{}
Even on the last row of
Table~\ref{tab:count:Statistical Counter Performance on Power-6},
the read-side performance of these statistical counter
implementations is pretty horrible.
So why bother with them?
\QuickA{}
``Use the right tool for the job.''
As can be seen from
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem},
single-variable atomic increment need not apply for any job
involving heavy use of parallel updates.
In contrast, the algorithms shown in
Table~\ref{tab:count:Statistical Counter Performance on Power-6}
do an excellent job of handling update-heavy situations.
Of course, if you have a read-mostly situation, you should
use something else, for example, an eventually consistent design
featuring a single atomically incremented
variable that can be read out using a single load,
similar to the approach used in
Section~\ref{sec:count:Eventually Consistent Implementation}.
\QuickQ{}
Given the performance data shown in
Table~\ref{tab:count:Limit Counter Performance on Power-6},
we should always prefer update-side signals over read-side
atomic operations, right?
\QuickA{}
That depends on the workload.
Note that you need almost one hundred thousand readers (with roughly
a 60-nanosecond performance gain) to make up for even one
writer (with almost a 5-\emph{millisecond} performance loss).
Although there is no shortage of workloads with far greater
read intensity, you will need to consider your particular
workload.
In addition, although memory barriers have historically been
expensive compared to ordinary instructions, you should
check this on the specific hardware you will be running.
The properties of computer hardware do change over time,
and algorithms must change accordingly.
\QuickQ{}
Can advanced techniques be applied to address the lock
contention for readers seen in
Table~\ref{tab:count:Limit Counter Performance on Power-6}?
\QuickA{}
One approach is to give up some update-side performance, as is
done with scalable non-zero indicators
(SNZI)~\cite{FaithEllen:2007:SNZI}.
There are a number of other ways one might go about this, and these
are left as exercises for the reader.
Any number of approaches that apply hierarchy, which replace
frequent global-lock acquisitions with local lock acquisitions
corresponding to lower levels of the hierarchy, should work quite well.
\QuickQ{}
The \co{++} operator works just fine for 1,000-digit numbers!
Haven't you heard of operator overloading???
\QuickA{}
In the C++ language, you might well be able to use \co{++}
on a 1,000-digit number, assuming that you had access to a
class implementing such numbers.
But as of 2010, the C language does not permit operator overloading.
\QuickQ{}
But if we are going to have to partition everything, why bother
with shared-memory multithreading?
Why not just partition the problem completely and run as
multiple processes, each in its own address space?
\QuickA{}
Indeed, multiple processes with separate address spaces can be
an excellent way to exploit parallelism, as the proponents of
the fork-join methodology and the Erlang language would be very
quick to tell you.
However, there are also some advantages to shared-memory parallelism:
\begin{enumerate}
\item Only the most performance-critical portions of the
application must be partitioned, and such portions
are usually a small fraction of the application.
\item Although cache misses are quite slow compared to
individual register-to-register instructions,
they are typically considerably faster than
inter-process-communication primitives, which in
turn are considerably faster than things like
TCP/IP networking.
\item Shared-memory multiprocessors are readily available
and quite inexpensive, so, in stark contrast to the
1990s, there is little cost penalty for use of
shared-memory parallelism.
\end{enumerate}
As always, use the right tool for the job!
\QuickQAC{cha:Partitioning and Synchronization Design}{Partitioning and Synchronization Design}
\QuickQ{}
Is there a better solution to the Dining
Philosophers Problem?
\QuickA{}
\begin{figure}[tb]
\begin{center}
\includegraphics[scale=.7]{SMPdesign/DiningPhilosopher5PEM}
\end{center}
\caption{Dining Philosophers Problem, Fully Partitioned}
\ContributedBy{Figure}{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned}{Kornilios Kourtis}
\end{figure}
One such improved solution is shown in
Figure~\ref{fig:SMPdesign:Dining Philosophers Problem, Fully Partitioned},
where the philosophers are simply provided with an additional
five forks.
All five philosophers may now eat simultaneously, and there
is never any need for philosophers to wait on one another.
In addition, this approach offers greatly improved disease control.
This solution might seem like cheating to some, but such
``cheating'' is key to finding good solutions to many
concurrency problems.
\QuickQ{}
And in just what sense can this ``horizontal parallelism'' be
said to be ``horizontal''?
\QuickA{}
Inman was working with protocol stacks, which are normally
depicted vertically, with the application on top and the
hardware interconnect on the bottom.
Data flows up and down this stack.
``Horizontal parallelism'' processes packets from different network
connections in parallel, while ``vertical parallelism''
handles different protocol-processing steps for a given
packet in parallel.
``Vertical parallelism'' is also called ``pipelining''.
\QuickQ{}
In this compound double-ended queue implementation, what should
be done if the queue has become non-empty while releasing
and reacquiring the lock?
\QuickA{}
In this case, simply dequeue an item from the non-empty
queue, release both locks, and return.
\QuickQ{}
Is the hashed double-ended queue a good solution?
Why or why not?
\QuickA{}
The best way to answer this is to run \url{lockhdeq.c} on
a number of different multiprocessor systems, and you are
encouraged to do so in the strongest possible terms.
One reason for concern is that each operation on this
implementation must acquire not one but two locks.
% Getting about 500 nanoseconds per element when used as
% a queue on a 4.2GHz Power system. This is roughly the same as
% the version covered by a single lock. Sequential (unlocked
% variant is more than an order of magnitude faster!
The first well-designed performance study will be cited.\footnote{
The studies by Dalessandro
et al.~\cite{LukeDalessandro:2011:ASPLOS:HybridNOrecSTM:deque}
and Dice et al.~\cite{DavidDice:2010:SCA:HTM:deque} are
good starting points.}
Do not forget to compare to a sequential implementation!
\QuickQ{}
Move \emph{all} the elements to the queue that became empty?
In what possible universe is this brain-dead solution in any
way optimal???
\QuickA{}
It is optimal in the case where data flow switches direction only
rarely.
It would of course be an extremely poor choice if the double-ended
queue was being emptied from both ends concurrently.
This of course raises another question, namely, in what possible
universe emptying from both ends concurrently would be a reasonable
thing to do.
Work-stealing queues are one possible answer to this question.
\QuickQ{}
Why can't the compound parallel double-ended queue
implementation be symmetric?
\QuickA{}
The need to avoid deadlock by imposing a lock hierarchy
forces the asymmetry, just as it does in the fork-numbering
solution to the Dining Philosophers Problem
(see Section~\ref{sec:SMPdesign:Dining Philosophers Problem}).
\QuickQ{}
Why is it necessary to retry the right-dequeue operation
on line~28 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}?
\QuickA{}
This retry is necessary because some other thread might have
enqueued an element between the time that this thread dropped
\co{d->rlock} on line~25 and the time that it reacquired this
same lock on line~27.
\QuickQ{}
Surely the left-hand lock must \emph{sometimes} be available!!!
So why is it necessary that line~25 of
Figure~\ref{fig:SMPdesign:Compound Parallel Double-Ended Queue Implementation}
unconditionally release the right-hand lock?
\QuickA{}
It would be possible to use \co{spin_trylock()} to attempt
to acquire the left-hand lock when it was available.
However, the failure case would still need to drop the
right-hand lock and then re-acquire the two locks in order.
Making this transformation (and determining whether or not
it is worthwhile) is left as an exercise for the reader.
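A hedged sketch of that transformation (\co{d->llock} is an
assumed name for the left-hand lock; \co{d->rlock} appears in the
previous answer):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Opportunistically try the left-hand lock; on
 * failure, fall back to dropping the right-hand
 * lock and acquiring both in the prescribed order. */
if (!spin_trylock(&d->llock)) {
  spin_unlock(&d->rlock);
  spin_lock(&d->llock);
  spin_lock(&d->rlock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}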
\QuickQ{}
The tandem double-ended queue runs about twice as fast as
the hashed double-ended queue, even when I increase the
size of the hash table to an insanely large number.
Why is that?
\QuickA{}
The hashed double-ended queue's locking design only permits
one thread at a time at each end, and further requires
two lock acquisitions for each operation.
The tandem double-ended queue also permits one thread at a time
at each end, and in the common case requires only one lock
acquisition per operation.
Therefore, the tandem double-ended queue should be expected to
outperform the hashed double-ended queue.
Can you create a double-ended queue that allows multiple
concurrent operations at each end?
If so, how? If not, why not?
\QuickQ{}
Is there a significantly better way of handling concurrency
for double-ended queues?
\QuickA{}
One approach is to transform the problem to be solved
so that multiple double-ended queues can be used in parallel,
allowing the simpler single-lock double-ended queue to be used,
and perhaps also replace each double-ended queue with a pair of
conventional single-ended queues.
Without such ``horizontal scaling'', the speedup is limited
to 2.0.
In contrast, horizontal-scaling designs can achieve very large
speedups, and are especially attractive if there are multiple threads
working either end of the queue, because in this
multiple-thread case the dequeue
simply cannot provide strong ordering guarantees.
After all, the fact that a given thread removed an item first
in no way implies that it will process that item
first~\cite{AndreasHaas2012FIFOisnt}.
And if there are no guarantees, we may as well obtain the
performance benefits that come with refusing to provide these
guarantees.
% about twice as fast as hashed version on 4.2GHz Power.
Regardless of whether or not the problem can be transformed
to use multiple queues, it is worth asking whether work can
be batched so that each enqueue and dequeue operation corresponds
to larger units of work.
This batching approach decreases contention on the queue data
structures, which increases both performance and scalability,
as will be seen in
Section~\ref{sec:SMPdesign:Synchronization Granularity}.
After all, if you must incur high synchronization overheads,
be sure you are getting your money's worth.
Other researchers are working on other ways to take advantage
of limited ordering guarantees in
queues~\cite{ChristophMKirsch2012FIFOisntTR}.
\QuickQ{}
Don't all these problems with critical sections mean that
we should just always use
non-blocking synchronization~\cite{MauriceHerlihy90a},
which don't have critical sections?
\QuickA{}
Although non-blocking synchronization can be very useful
in some situations, it is no panacea.
Also, non-blocking synchronization really does have
critical sections, as noted by Josh Triplett.
For example, in a non-blocking algorithm based on
compare-and-swap operations, the code starting at the
initial load and continuing to the compare-and-swap
is in many ways analogous to a lock-based critical section.
\QuickQ{}
What are some ways of preventing a structure from being freed while
its lock is being acquired?
\QuickA{}
Here are a few possible solutions to this \emph{existence guarantee}
problem:
\begin{enumerate}
\item Provide a statically allocated lock that is held while
the per-structure lock is being acquired, which is an
example of hierarchical locking (see
Section~\ref{sec:SMPdesign:Hierarchical Locking}).
Of course, using a single global lock for this purpose
can result in unacceptably high levels of lock contention,
dramatically reducing performance and scalability.
\item Provide an array of statically allocated locks, hashing
the structure's address to select the lock to be acquired,
as described in Chapter~\ref{chp:Locking}.
Given a hash function of sufficiently high quality, this
avoids the scalability limitations of the single global
lock, but in read-mostly situations, the lock-acquisition
overhead can result in unacceptably degraded performance.
\item Use a garbage collector, in software environments providing
them, so that a structure cannot be deallocated while being
referenced.
This works very well, removing the existence-guarantee
burden (and much else besides) from the developer's
shoulders, but imposes the overhead of garbage collection
on the program.
Although garbage-collection technology has advanced
considerably in the past few decades, its overhead
may be unacceptably high for some applications.
In addition, some applications require that the developer
exercise more control over the layout and placement of
data structures than is permitted by most garbage collected
environments.
\item As a special case of a garbage collector, use a global
reference counter, or a global array of reference counters.
\item Use \emph{hazard pointers}~\cite{MagedMichael04a}, which
can be thought of as an inside-out reference count.
Hazard-pointer-based algorithms maintain a per-thread list of
pointers, so that the appearance of a given pointer on
any of these lists acts as a reference to the corresponding
structure.
Hazard pointers are an interesting research direction, but
have not yet seen much use in production (written in 2008).
\item Use transactional memory
(TM)~\cite{Herlihy93a,DBLomet1977SIGSOFT,Shavit95},
so that each reference and
modification to the data structure in question is
performed atomically.
Although TM has engendered much excitement in recent years,
and seems likely to be of some use in production software,
developers should exercise some
caution~\cite{Blundell2005DebunkTM,Blundell2006TMdeadlock,McKenney2007PLOSTM},
particularly in performance-critical code.
In particular, existence guarantees require that the
transaction cover the full path from a global reference
to the data elements being updated.
\item Use RCU, which can be thought of as an extremely lightweight
approximation to a garbage collector.
Updaters are not permitted to free RCU-protected
data structures that RCU readers might still be referencing.
RCU is most heavily used for read-mostly data structures,
and is discussed at length in
Chapter~\ref{chp:Deferred Processing}.
\end{enumerate}
For more on providing existence guarantees, see
Chapters~\ref{chp:Locking} and \ref{chp:Deferred Processing}.
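As an illustration of the second item in the list above, here is a
hedged sketch of a hashed array of statically allocated locks (the
names, size, and hash function are assumptions):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#include <stdint.h>

#define LOCK_HASH_SIZE 256 /* power of two */

/* Each element initialized with spin_lock_init()
 * at program startup. */
static spinlock_t lockarray[LOCK_HASH_SIZE];

static spinlock_t *addr_to_lock(void *p)
{
  /* Drop low-order bits, which are constrained
   * by alignment. */
  return &lockarray[((uintptr_t)p >> 4) %
                    LOCK_HASH_SIZE];
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Acquiring the lock returned by \co{addr_to_lock(p)} before
dereferencing \co{p} then prevents the structure from being freed
while the lock is held, provided that the code freeing the
structure acquires that same lock.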
\QuickQ{}
How can a single-threaded 64-by-64 matrix multiply possibly
have an efficiency of less than 1.0?
Shouldn't all of the traces in
Figure~\ref{fig:SMPdesign:Matrix Multiply Efficiency}
have efficiency of exactly 1.0 when running on only one thread?
\QuickA{}
The \texttt{matmul.c} program creates the specified number of
worker threads, so even the single-worker-thread case incurs
thread-creation overhead.
Making the changes required to optimize away thread-creation
overhead in the single-worker-thread case is left as an
exercise to the reader.
\QuickQ{}
How are data-parallel techniques going to help with matrix
multiply?
It is \emph{already} data parallel!!!
\QuickA{}
I am glad that you are paying attention!
This example serves to show that although data parallelism can
be a very good thing, it is not some magic wand that automatically
wards off any and all sources of inefficiency.
Linear scaling at full performance, even to ``only'' 64 threads,
requires care at all phases of design and implementation.
In particular, you need to pay careful attention to the
size of the partitions.
For example, if you split a 64-by-64 matrix multiply across
64 threads, each thread gets only 64 floating-point multiplies.
The cost of a floating-point multiply is minuscule compared to
the overhead of thread creation.
Moral: If you have a parallel program with variable input,
always include a check for the input size being too small to
be worth parallelizing.
And when it is not helpful to parallelize, it is not helpful
to incur the overhead required to spawn a thread, now is it?
\QuickQ{}
In what situation would hierarchical locking work well?
\QuickA{}
If the comparison on line~31 of
Figure~\ref{fig:SMPdesign:Hierarchical-Locking Hash Table Search}
were replaced by a much heavier-weight operation,
then releasing {\tt bp->bucket\_lock} \emph{might} reduce lock
contention enough to outweigh the overhead of the extra
acquisition and release of {\tt cur->node\_lock}.
\QuickQ{}
In Figure~\ref{fig:SMPdesign:Allocator Cache Performance},
there is a pattern of performance rising with increasing run
length in groups of three samples, for example, for run lengths
10, 11, and 12.
Why?
\QuickA{}
This is due to the per-CPU target value being three.
A run length of 12 must acquire the global-pool lock twice,
while a run length of 13 must acquire the global-pool lock
three times.
\QuickQ{}
Allocation failures were observed in the two-thread
tests at run lengths of 19 and greater.
Given the global-pool size of 40 and the per-thread target
pool size $s$ of three, number of threads $n$ equal to two,
and assuming that the per-thread pools are initially
empty with none of the memory in use, what is the smallest allocation
run length $m$ at which failures can occur?
(Recall that each thread repeatedly allocates $m$ blocks of memory,
and then frees the $m$ blocks of memory.)
Alternatively, given $n$ threads each with pool size $s$, and
where each thread repeatedly first allocates $m$ blocks of memory
and then frees those $m$ blocks, how large must the global pool
size be?
\QuickA{}
The exact solution to this problem is left as an exercise to
the reader.
The first solution received will be credited to its submitter.
As a rough rule of thumb, the global pool size should be at least
$m+2sn$, where
$m$ is the maximum number of elements allocated at a given time,
$s$ is the per-CPU pool size,
and $n$ is the number of CPUs.
\QuickQAC{chp:Locking}{Locking}
\QuickQ{}
Just how can serving as a whipping boy be considered to be
in any way honorable???
\QuickA{}
The reason locking serves as a research-paper whipping boy is
because it is heavily used in practice.
In contrast, if no one used or cared about locking, most research
papers would not bother even mentioning it.
\QuickQ{}
But the definition of deadlock only said that each thread
was holding at least one lock and waiting on another lock
that was held by some thread.
How do you know that there is a cycle?
\QuickA{}
Suppose that there is no cycle in the graph.
We would then have a directed acyclic graph (DAG), which would
have at least one leaf node.
If this leaf node was a lock, then we would have a thread
that was waiting on a lock that wasn't held by any thread,
which violates the definition.
(And in this case the thread would immediately acquire the
lock.)
On the other hand, if this leaf node was a thread, then
we would have a thread that was not waiting on any lock,
again violating the definition.
(And in this case, the thread would either be running or
be blocked on something that is not a lock.)
Therefore, given this definition of deadlock, there must
be a cycle in the corresponding graph.
\QuickQ{}
Are there any exceptions to this rule, so that there really could be
a deadlock cycle containing locks from both the library and
the caller, even given that the library code never invokes
any of the caller's functions?
\QuickA{}
Indeed there are!
Here are a few of them:
\begin{enumerate}
\item If one of the library function's arguments is a pointer
to a lock that this library function acquires, and if
the library function holds one of its locks while
acquiring the caller's lock, then we could have a
deadlock cycle involving both caller and library locks.
\item If one of the library functions returns a pointer to
a lock that is acquired by the caller, and if the
caller acquires one of its locks while holding the
library's lock, we could again have a deadlock
cycle involving both caller and library locks.
\item If one of the library functions acquires a lock and
then returns while still holding it, and if the caller
acquires one of its locks, we have yet another way
to create a deadlock cycle involving both caller
and library locks.
\item If the caller has a signal handler that acquires
locks, then the deadlock cycle can involve both
caller and library locks.
In this case, however, the library's locks are
innocent bystanders in the deadlock cycle.
That said, please note that acquiring a lock from
within a signal handler is a no-no in most
environments---it is not just a bad idea, it
is unsupported.
\end{enumerate}
\QuickQ{}
But if \co{qsort()} releases all its locks before invoking
the comparison function, how can it protect against races
with other \co{qsort()} threads?
\QuickA{}
By privatizing the data elements being compared
(as discussed in Chapter~\ref{chp:Data Ownership})
or through use of deferral mechanisms such as
reference counting (as discussed in
Chapter~\ref{chp:Deferred Processing}).
\QuickQ{}
Name one common exception where it is perfectly reasonable
to pass a pointer to a lock into a function.
\QuickA{}
Locking primitives, of course!
\QuickQ{}
Doesn't the fact that \co{pthread_cond_wait()} first releases the
mutex and then re-acquires it eliminate the possibility of deadlock?
\QuickA{}
Absolutely not!
Consider a program that acquires \co{mutex_a}, and then
\co{mutex_b}, in that order, and then passes \co{mutex_a}
to \co{pthread_cond_wait()}.
Now, \co{pthread_cond_wait()} will release \co{mutex_a}, but
will re-acquire it before returning.
If some other thread acquires \co{mutex_a} in the meantime
and then blocks on \co{mutex_b}, the program will deadlock.
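For concreteness, here is a minimal sketch of that scenario (the
thread bodies and condition variable are hypothetical, and the
usual while-loop around \co{pthread_cond_wait()} is omitted for
brevity):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t mutex_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t mutex_b = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

void *thread_1(void *arg)
{
  pthread_mutex_lock(&mutex_a);
  pthread_mutex_lock(&mutex_b);
  /* Releases mutex_a, but still holds mutex_b. */
  pthread_cond_wait(&cond, &mutex_a);
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}

void *thread_2(void *arg)
{
  pthread_mutex_lock(&mutex_a);
  /* Deadlock: thread_1 still holds mutex_b. */
  pthread_mutex_lock(&mutex_b);
  pthread_cond_signal(&cond);
  pthread_mutex_unlock(&mutex_b);
  pthread_mutex_unlock(&mutex_a);
  return NULL;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}
Because \co{thread_2()} never reaches its \co{pthread_cond_signal()},
\co{thread_1()} waits forever, despite \co{pthread_cond_wait()}
having released \co{mutex_a}.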
\QuickQ{}
Can the transformation from
Figure~\ref{fig:locking:Protocol Layering and Deadlock} to
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
be applied universally?
\QuickA{}
Absolutely not!
This transformation assumes that the
\co{layer_2_processing()} function is idempotent, given that
it might be executed multiple times on the same packet when
the \co{layer_1()} routing decision changes.
Therefore, in real life, this transformation can become
arbitrarily complex.
\QuickQ{}
But the complexity in
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
is well worthwhile given that it avoids deadlock, right?
\QuickA{}
Maybe.
If the routing decision in \co{layer_1()} changes often enough,
the code will always retry, never making forward progress.
This is termed ``livelock'' if no thread makes any forward progress or
``starvation''
if some threads make forward progress but others do not
(see Section~\ref{sec:locking:Livelock and Starvation}).
\QuickQ{}
When using the ``acquire needed locks first'' approach described in
Section~\ref{sec:locking:Acquire Needed Locks First},
how can livelock be avoided?
\QuickA{}
Provide an additional global lock.
If a given thread has repeatedly tried and failed to acquire the needed
locks, then have that thread unconditionally acquire the new
global lock, and then unconditionally acquire any needed locks.
(Suggested by Doug Lea.)
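A hedged sketch of this approach, using the Linux-kernel-style
spinlock primitives used elsewhere in this book (the
\co{global_lock} and \co{MAX_RETRIES} names are illustrative),
might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define MAX_RETRIES 10      /* Illustrative threshold. */

spinlock_t global_lock;     /* Taken only after repeated failure. */

/* Acquire both a and b; return nonzero if global_lock is also held. */
int acquire_pair(spinlock_t *a, spinlock_t *b)
{
  int retries;

  for (retries = 0; retries < MAX_RETRIES; retries++) {
    spin_lock(a);
    if (spin_trylock(b))
      return 0;             /* Caller releases a and b. */
    spin_unlock(a);
  }
  spin_lock(&global_lock);  /* Serialize the persistent losers. */
  spin_lock(a);             /* Safe: other threads back off via */
  spin_lock(b);             /* spin_trylock(), so no deadlock.  */
  return 1;                 /* Caller also releases global_lock. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}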
\QuickQ{}
Why is it illegal to acquire a Lock~A that is acquired outside
of a signal handler without blocking signals while holding
a Lock~B that is acquired within a signal handler?
\QuickA{}
Because this would lead to deadlock.
Given that Lock~A is held outside of a signal
handler without blocking signals, a signal might be handled while
holding this lock.
The corresponding signal handler might then acquire
Lock~B, so that Lock~B is acquired while holding Lock~A.
Therefore, if we also acquire Lock~A while holding Lock~B
as called out in the question, we will have a deadlock cycle.
Therefore, it is illegal to acquire a lock that is acquired outside
of a signal handler without blocking signals while holding
another lock that is acquired within a signal handler.
\QuickQ{}
How can you legally block signals within a signal handler?
\QuickA{}
One of the simplest and fastest ways to do so is to use
the \co{sa_mask} field of the \co{struct sigaction} that
you pass to \co{sigaction()} when setting up the signal.
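For example, a hedged sketch of a setup that blocks \co{SIGUSR2}
while the (hypothetical) handler for \co{SIGUSR1} runs might look
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <signal.h>
#include <string.h>

static void my_handler(int sig)
{
  /* Hypothetical handler body; may acquire the handler's lock. */
}

void install_my_handler(void)
{
  struct sigaction sa;

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = my_handler;
  sigemptyset(&sa.sa_mask);
  sigaddset(&sa.sa_mask, SIGUSR2); /* Blocked while handler runs. */
  sigaction(SIGUSR1, &sa, NULL);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}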
\QuickQ{}
If acquiring locks in signal handlers is such a bad idea, why
even discuss ways of making it safe?
\QuickA{}
Because these same rules apply to the interrupt handlers used in
operating-system kernels and in some embedded applications.
In many application environments, acquiring locks in signal
handlers is frowned upon~\cite{OpenGroup1997pthreads}.
However, that does not stop clever developers from (usually
unwisely) fashioning home-brew locks out of atomic operations.
And atomic operations are in many cases perfectly legal in
signal handlers.
\QuickQ{}
Given an object-oriented application that passes control freely
among a group of objects such that there is no straightforward
locking hierarchy,\footnote{
Also known as ``object-oriented spaghetti code.''}
layered or otherwise, how can this
application be parallelized?
\QuickA{}
There are a number of approaches:
\begin{enumerate}
\item In the case of parametric search via simulation,
where a large number of simulations will be run
in order to converge on (for example) a good design
for a mechanical or electrical device, leave the
simulation single-threaded, but run many instances
of the simulation in parallel.
This retains the object-oriented design,
gains parallelism at a higher level, and likely
also avoids synchronization overhead.
\item Partition the objects into groups such that there
is no need to operate on objects in
more than one group at a given time.
Then associate a lock with each group.
This is an example of a single-lock-at-a-time
design, which is discussed in
Section~\ref{sec:locking:Single-Lock-at-a-Time Designs}.
\item Partition the objects into groups such that threads
can all operate on objects in the groups in some
groupwise ordering.
Then associate a lock with each group, and impose a
locking hierarchy over the groups.
\item Impose an arbitrarily selected hierarchy on the locks,
and then use conditional locking if it is necessary
to acquire a lock out of order, as was discussed in
Section~\ref{sec:locking:Conditional Locking}.
\item Before carrying out a given group of operations, predict
which locks will be acquired, and attempt to acquire them
before actually carrying out any updates.
If the prediction turns out to be incorrect, drop
all the locks and retry with an updated prediction
that includes the benefit of experience.
This approach was discussed in
Section~\ref{sec:locking:Acquire Needed Locks First}.
\item Use transactional memory.
This approach has a number of advantages and disadvantages
which will be discussed in
Section~\ref{sec:future:Transactional Memory}.
\item Refactor the application to be more concurrency-friendly.
This would likely also have the side effect of making
the application run faster even when single-threaded, but might
also make it more difficult to modify the application.
\item Use techniques from later chapters in addition to locking.
\end{enumerate}
\QuickQ{}
How can the livelock shown in
Figure~\ref{fig:locking:Abusing Conditional Locking}
be avoided?
\QuickA{}
Figure~\ref{fig:locking:Avoiding Deadlock Via Conditional Locking}
provides some good hints.
In many cases, livelocks are a hint that you should revisit your
locking design.
Or visit it in the first place if your locking design
``just grew''.
That said, one good-and-sufficient approach due to Doug Lea
is to use conditional locking as described in
Section~\ref{sec:locking:Conditional Locking}, but combine this
with acquiring all needed locks first, before modifying shared
data, as described in
Section~\ref{sec:locking:Acquire Needed Locks First}.
If a given critical section retries too many times,
unconditionally acquire
a global lock, then unconditionally acquire all the needed locks.
This avoids both deadlock and livelock, and scales reasonably
assuming that the global lock need not be acquired too often.
\QuickQ{}
What problems can you spot in the code in
Figure~\ref{fig:locking:Conditional Locking and Exponential Backoff}?
\QuickA{}
Here are a couple:
\begin{enumerate}
\item A one-second wait is way too long for most uses.
Wait intervals should begin with roughly the time
required to execute the critical section, which will
normally be in the microsecond or millisecond range.
\item The code does not check for overflow.
On the other hand, this bug is nullified
by the previous bug: 32 bits worth of seconds is
more than 50 years.
\end{enumerate}
\QuickQ{}
Wouldn't it be better just to use a good parallel design
so that lock contention was low enough to avoid unfairness?
\QuickA{}
It would be better in some sense, but there are situations
where it can be appropriate to use
designs that sometimes result in high lock contention.
For example, imagine a system that is subject to a rare error
condition.
It might well be best to have a simple error-handling design
that has poor performance and scalability for the duration of
the rare error condition, as opposed to a complex and
difficult-to-debug design that is helpful only when one of
those rare error conditions is in effect.
That said, it is usually worth putting some effort into
attempting to produce a design that is both simple and
efficient during error conditions, for example, by partitioning
the problem.
\QuickQ{}
How might the lock holder be interfered with?
\QuickA{}
If the data protected by the lock is in the same cache line
as the lock itself, then attempts by other CPUs to acquire
the lock will result in expensive cache misses on the part
of the CPU holding the lock.
This is a special case of false sharing, which can also occur
if a pair of variables protected by different locks happen
to share a cache line.
In contrast, if the lock is in a different cache line than
the data that it protects, the CPU holding the lock will
usually suffer a cache miss only on first access to a given
variable.
Of course, the downside of placing the lock and data into separate
cache lines is that the code will incur two cache misses rather
than only one in the uncontended case.
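As an illustration, assuming a 64-byte cache line and the
Linux-kernel-style \co{spinlock_t} used elsewhere in this book,
the two layouts might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define CACHE_LINE_SIZE 64  /* Assumed cache-line size. */

/* Lock and data share a cache line: CPUs spinning on ->lock
 * inflict cache misses on the lock holder's use of ->data. */
struct packed_data {
  spinlock_t lock;
  unsigned long data;
};

/* Lock and data in separate cache lines: no such interference,
 * but two cache misses in the uncontended case. */
struct padded_data {
  spinlock_t lock;
  char pad[CACHE_LINE_SIZE - sizeof(spinlock_t)];
  unsigned long data;
};
\end{verbatim}
\end{minipage}
\vspace{5pt}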
\QuickQ{}
Does it ever make sense to have an exclusive lock acquisition
immediately followed by a release of that same lock, that is,
an empty critical section?
\QuickA{}
This usage is rare, but is occasionally used.
The point is that the semantics of exclusive locks have two
components: (1)~the familiar data-protection semantic and
(2)~a messaging semantic, where releasing a given lock notifies
a waiting acquisition of that same lock.
One historical use of empty critical sections appeared in the
networking stack of the 2.4 Linux kernel.
This usage pattern can be thought of as a way of approximating
the effects of read-copy update (RCU), which is discussed in
Section~\ref{sec:defer:Read-Copy Update (RCU)}.
The empty-lock-critical-section idiom can also be used to
reduce lock contention in some situations.
For example, consider a multithreaded user-space application where
each thread processes units of work maintained in a per-thread
list, and where threads are prohibited from touching each others'
lists.
There could also be updates that require that all previously
scheduled units of work have completed before the update can
progress.
One way to handle this is to schedule a unit of work on each
thread, so that when all of these units of work complete, the
update may proceed.
In some applications, threads can come and go.
For example, each thread might correspond to one user of the
application, and thus be removed when that user logs out or
otherwise disconnects.
In many applications, threads cannot depart atomically: They must
instead explicitly unravel themselves from various portions of
the application using a specific sequence of actions.
One specific action will be refusing to accept further requests
from other threads, and another specific action will be disposing
of any remaining units of work on its list, for example, by
placing these units of work in a global work-item-disposal list
to be taken by one of the remaining threads.
(Why not just drain the thread's work-item list by executing
each item?
Because a given work item might generate more work items, so
that the list could not be drained in a timely fashion.)
If the application is to perform and scale well, a good locking
design is required.
One common solution is to have a global lock (call it \co{G})
protecting the entire
process of departing (and perhaps other things as well),
with finer-grained locks protecting the
individual unraveling operations.
Now, a departing thread must clearly refuse to accept further
requests before disposing of the work on its list, because
otherwise additional work might arrive after the disposal action,
which would render that disposal action ineffective.
So simplified pseudocode for a departing thread might be as follows:
\begin{enumerate}
\item Acquire lock \co{G}.
\item Acquire the lock guarding communications.
\item Refuse further communications from other threads.
\item Release the lock guarding communications.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all pending work items to the global
work-item-disposal list.
\item Release the lock guarding the global work-item-disposal list.
\item Release lock \co{G}.
\end{enumerate}
Of course, a thread that needs to wait for all pre-existing work
items will need to take departing threads into account.
To see this, suppose that this thread starts waiting for all
pre-existing work items just after a departing thread has refused
further communications from other threads.
How can this thread wait for the departing thread's work items
to complete, keeping in mind that threads are not allowed to
access each others' lists of work items?
One straightforward approach is for this thread to acquire \co{G}
and then the lock guarding the global work-item-disposal list, then
move the work items to its own list.
The thread then releases both locks,
places a work item on the end of its own list,
and then waits for all of the work items that it placed on each thread's
list (including its own) to complete.
This approach does work well in many cases, but if special
processing is required for each work item as it is pulled in
from the global work-item-disposal list, the result could be
excessive contention on \co{G}.
One way to avoid that contention is to acquire \co{G} and then
immediately release it.
Then the process of waiting for all prior work items looks
something like the following:
\begin{enumerate}
\item Set a global counter to one and initialize a condition
variable to zero.
\item Send a message to all threads to cause them to atomically
increment the global counter, and then to enqueue a
work item.
The work item will atomically decrement the global
counter, and if the result is zero, it will set a
condition variable to one.
\item Acquire \co{G}, which will wait on any currently departing
thread to finish departing.
Because only one thread may depart at a time, all the
remaining threads will have already received the message
sent in the preceding step.
\item Release \co{G}.
\item Acquire the lock guarding the global work-item-disposal list.
\item Move all work items from the global work-item-disposal list
to this thread's list, processing them as needed along the way.
\item Release the lock guarding the global work-item-disposal list.
\item Enqueue an additional work item onto this thread's list.
(As before, this work item will atomically decrement
the global counter, and if the result is zero, it will
set a condition variable to one.)
\item Wait for the condition variable to take on the value one.
\end{enumerate}
Once this procedure completes, all pre-existing work items are
guaranteed to have completed.
The empty critical sections are using locking for messaging as
well as for protection of data.
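For emphasis, the messaging-only portion of the above procedure,
assuming the global lock \co{G} from the pseudocode, is simply:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
spin_lock(&G);   /* Wait for any departing thread. */
spin_unlock(&G); /* No shared data touched at all.  */
\end{verbatim}
\end{minipage}
\vspace{5pt}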
\QuickQ{}
Is there any other way for the VAX/VMS DLM to emulate
a reader-writer lock?
\QuickA{}
There are in fact several.
One way would be to use the null, protected-read, and exclusive
modes.
Another way would be to use the null, protected-read, and
concurrent-write modes.
A third way would be to use the null, concurrent-read, and
exclusive modes.
\QuickQ{}
The code in
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention}
is ridiculously complicated!
Why not conditionally acquire a single global lock?
\QuickA{}
Conditionally acquiring a single global lock does work very well,
but only for relatively small numbers of CPUs.
To see why it is problematic in systems with many hundreds of
CPUs, look at
Figure~\ref{fig:count:Atomic Increment Scalability on Nehalem}
and extrapolate the delay from eight to 1,000 CPUs.
\QuickQ{}
Wait a minute!
If we ``win'' the tournament on line~16 of
Figure~\ref{fig:locking:Conditional Locking to Reduce Contention},
we get to do all the work of \co{do_force_quiescent_state()}.
Exactly how is that a win, really?
\QuickA{}
How indeed?
This just shows that in concurrency, just as in life, one
should take care to learn exactly what winning entails before
playing the game.
\QuickQ{}
Why not rely on the C language's default initialization of
zero instead of using the explicit initializer shown on
line~2 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickA{}
Because this default initialization does not apply to locks
allocated as auto variables within the scope of a function.
\QuickQ{}
Why bother with the inner loop on lines~7-8 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
Why not simply repeatedly do the atomic exchange operation
on line~6?
\QuickA{}
Suppose that the lock is held and that several threads
are attempting to acquire the lock.
In this situation, if these threads all loop on the atomic
exchange operation, they will ping-pong the cache line
containing the lock among themselves, imposing load
on the interconnect.
In contrast, if these threads are spinning in the inner
loop on lines~7-8, they will each spin within their own
caches, putting negligible load on the interconnect.
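For reference, a minimal sketch of this test-and-test-and-set
pattern, substituting GCC's \co{__sync_lock_test_and_set()}
builtin for the figure's \co{xchg()} primitive, might look as
follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void spinlock_acquire(unsigned long *lockp)
{
  /* Atomic exchange: returns nonzero if the lock was held. */
  while (__sync_lock_test_and_set(lockp, 1)) {
    /* Spin with plain loads until the lock appears free. */
    while (*(volatile unsigned long *)lockp)
      continue;
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}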
\QuickQ{}
Why not simply store zero into the lock word on line 14 of
Figure~\ref{fig:locking:Sample Lock Based on Atomic Exchange}?
\QuickA{}
This can be a legitimate implementation, but only if
this store is preceded by a memory barrier and makes use
of \co{ACCESS_ONCE()}.
The memory barrier is not required when the \co{xchg()}
operation is used because this operation implies a
full memory barrier due to the fact that it returns
a value.
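A hedged sketch of that alternative release, using the
\co{smp_mb()} and \co{ACCESS_ONCE()} primitives used elsewhere
in this book, might be:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
static void spinlock_release(unsigned long *lockp)
{
  smp_mb();                 /* Order critical section first. */
  ACCESS_ONCE(*lockp) = 0;  /* Then a plain store suffices.  */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}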
\QuickQ{}
How can you tell if one counter is greater than another,
while accounting for counter wrap?
\QuickA{}
In the C language, the following macro correctly handles this:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
#define ULONG_CMP_LT(a, b) \
(ULONG_MAX / 2 < (a) - (b))
\end{verbatim}
\end{minipage}
\vspace{5pt}
Although it is tempting to simply subtract two signed integers,
this should be avoided because signed overflow is undefined
in the C language.
For example, if the compiler knows that one of the values is
positive and the other negative, it is within its rights to
simply assume that the positive number is greater than the
negative number, even though subtracting the negative number
from the positive number might well result in overflow and
thus a negative number.
How could the compiler know the signs of the two numbers?
It might be able to deduce it based on prior assignments
and comparisons.
In this case, if the per-CPU counters were signed, the compiler
could deduce that they were always increasing in value, and
then might assume that they would never go negative.
This assumption could well lead the compiler to generate
unfortunate code~\cite{PaulEMcKenney2012SignedOverflow,JohnRegehr2010UndefinedBehavior}.
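For example, the following hedged sketch shows the macro
correctly ordering two counter snapshots across a wrap:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <limits.h>
#include <stdio.h>

#define ULONG_CMP_LT(a, b) \
        (ULONG_MAX / 2 < (a) - (b))

int main(void)
{
  unsigned long snap = ULONG_MAX - 1; /* Just before the wrap.  */
  unsigned long cur = snap + 3;       /* Wraps around to 1.     */

  if (ULONG_CMP_LT(snap, cur))        /* True despite the wrap. */
    printf("snap precedes cur\n");
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}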
\QuickQ{}
Which is better, the counter approach or the flag approach?
\QuickA{}
The flag approach will normally suffer fewer cache misses,
but a better answer is to try both and see which works best
for your particular workload.
\QuickQ{}
How can relying on implicit existence guarantees result in
a bug?
\QuickA{}
Here are some bugs resulting from improper use of implicit
existence guarantees:
\begin{enumerate}
\item A program writes the address of a global variable to
a file, then a later instance of that same program
reads that address and attempts to dereference it.
This can fail due to address-space randomization,
to say nothing of recompilation of the program.
\item A module can record the address of one of its variables
in a pointer located in some other module, then attempt
to dereference that pointer after the module has
been unloaded.
\item A function can record the address of one of its on-stack
variables into a global pointer, which some other
function might attempt to dereference after that function
has returned.
\end{enumerate}
I am sure that you can come up with additional possibilities.
\QuickQ{}
What if the element we need to delete is not the first element
of the list on line~8 of
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
\QuickA{}
This is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is invited to adapt this example to a hash table with
full chaining.
\QuickQ{}
What race condition can occur in
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees}?
\QuickA{}
Consider the following sequence of events:
\begin{enumerate}
\item Thread~0 invokes \co{delete(0)}, and reaches line~10 of
the figure, acquiring the lock.
\item Thread~1 concurrently invokes \co{delete(0)}, reaching
line~10, but spins on the lock because Thread~0 holds it.
\item Thread~0 executes lines~11-14, removing the element from
the hashtable, releasing the lock, and then freeing the
element.
\item Thread~0 continues execution, and allocates memory, getting
the exact block of memory that it just freed.
\item Thread~0 then initializes this block of memory as some
other type of structure.
\item Thread~1's \co{spin_lock()} operation fails due to the
fact that what it believes to be \co{p->lock} is no longer
a spinlock.
\end{enumerate}
Because there is no existence guarantee, the identity of the
data element can change while a thread is attempting to acquire
that element's lock on line~10!
\QuickQAC{chp:Data Ownership}{Data Ownership}
\QuickQ{}
What form of data ownership is extremely difficult
to avoid when creating shared-memory parallel programs
(for example, using pthreads) in C or C++?
\QuickA{}
Use of auto variables in functions.
By default, these are private to the thread executing the
current function.
\QuickQ{}
What synchronization remains in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickA{}
The creation of the threads via the \co{sh} \co{&} operator
and the joining of threads via the \co{sh} \co{wait}
command.
Of course, if the processes explicitly share memory, for
example, using the \co{shmget()} or \co{mmap()} system
calls, explicit synchronization might well be needed when
accessing or updating the shared memory.
The processes might also synchronize using any of the following
interprocess communications mechanisms:
\begin{enumerate}
\item System V semaphores.
\item System V message queues.
\item UNIX-domain sockets.
\item Networking protocols, including TCP/IP, UDP, and a whole
host of others.
\item File locking.
\item Use of the \co{open()} system call with the
\co{O_CREAT} and \co{O_EXCL} flags.
\item Use of the \co{rename()} system call.
\end{enumerate}
A complete list of possible synchronization mechanisms is left
as an exercise to the reader, who is warned that it will be
an extremely long list.
A surprising number of unassuming system calls can be pressed
into service as synchronization mechanisms.
\QuickQ{}
Is there any shared data in the example shown in
Section~\ref{sec:owned:Multiple Processes}?
\QuickA{}
That is a philosophical question.
Those wishing the answer ``no'' might argue that processes by
definition do not share memory.
Those wishing to answer ``yes'' might list a large number of
synchronization mechanisms that do not require shared memory,
note that the kernel will have some shared state, and perhaps
even argue that the assignment of process IDs (PIDs) constitutes
shared data.
Such arguments are excellent intellectual exercise, and are
also a wonderful way of feeling intelligent, scoring points
against hapless classmates or colleagues,
and (especially!) avoiding getting anything useful done.
\QuickQ{}
Does it ever make sense to have partial data ownership where
each thread reads only its own instance of a per-thread variable,
but writes to other threads' instances?
\QuickA{}
Amazingly enough, yes.
One example is a simple message-passing system where threads post
messages to other threads' mailboxes, and where each thread
is responsible for removing any message it sent once that message
has been acted on.
Implementation of such an algorithm is left as an exercise for the
reader, as is the task of identifying other algorithms with
similar ownership patterns.
\QuickQ{}
What mechanisms other than POSIX signals may be used to ship
functions?
\QuickA{}
There is a very large number of such mechanisms, including:
\begin{enumerate}
\item System V message queues.
\item Shared-memory dequeue (see
Section~\ref{sec:SMPdesign:Double-Ended Queue}).
\item Shared-memory mailboxes.
\item UNIX-domain sockets.
\item TCP/IP or UDP, possibly augmented by any number of
higher-level protocols, including RPC, HTTP,
XML, SOAP, and so on.
\end{enumerate}
Compilation of a complete list is left as an exercise to
sufficiently single-minded readers, who are warned that the
list will be extremely long.
\QuickQ{}
But none of the data in the \co{eventual()} function shown on
lines~15-32 of
Figure~\ref{fig:count:Array-Based Per-Thread Eventually Consistent Counters}
is actually owned by the \co{eventual()} thread!
In just what way is this data ownership???
\QuickA{}
The key phrase is ``owns the rights to the data''.
In this case, the rights in question are the rights to access
the per-thread \co{counter} variable defined on line~1
of the figure.
This situation is similar to that described in
Section~\ref{sec:owned:Partial Data Ownership and pthreads}.
However, there really is data that is owned by the \co{eventual()}
thread, namely the \co{t} and \co{sum} variables defined on
lines~17 and~18 of the figure.
For other examples of designated threads, look at the kernel
threads in the Linux kernel, for example, those created by
\co{kthread_create()} and \co{kthread_run()}.
\QuickQ{}
Is it possible to obtain greater accuracy while still
maintaining full privacy of the per-thread data?
\QuickA{}
Yes.
One approach is for \co{read_count()} to add the value
of its own per-thread variable.
This maintains full ownership and performance, but provides only
a slight improvement in accuracy, particularly on systems
with very large numbers of threads.
Another approach is for \co{read_count()} to use function
shipping, for example, in the form of per-thread signals.
This greatly improves accuracy, but at a significant performance
cost for \co{read_count()}.
However, both of these methods have the advantage of eliminating
cache-line bouncing for the common case of updating counters.
\QuickQAC{chp:Deferred Processing}{Deferred Processing}
\QuickQ{}
Why not implement reference-acquisition using
a simple compare-and-swap operation that only
acquires a reference if the reference counter is
non-zero?
\QuickA{}
Although this can resolve the race between the release of
the last reference and acquisition of a new reference,
it does absolutely nothing to prevent the data structure
from being freed and reallocated, possibly as some completely
different type of structure.
It is quite likely that the ``simple compare-and-swap
operation'' would give undefined results if applied to the
differently typed structure.
In short, use of atomic operations such as compare-and-swap
absolutely requires either type-safety or existence guarantees.
\QuickQ{}
Why isn't it necessary to guard against cases where one CPU
acquires a reference just after another CPU releases the last
reference?
\QuickA{}
Because a CPU must already hold a reference in order
to legally acquire another reference.
Therefore, if one CPU releases the last reference,
there cannot possibly be any CPU that is permitted
to acquire a new reference.
This same fact allows the non-atomic check in line~22
of Figure~\ref{fig:defer:Linux Kernel kref API}.
\QuickQ{}
Suppose that just after the \co{atomic_dec_and_test()}
on line~22 of
Figure~\ref{fig:defer:Linux Kernel kref API} is invoked,
that some other CPU invokes \co{kref_get()}.
Doesn't this result in that other CPU now having an illegal
reference to a released object?
\QuickA{}
This cannot happen if these functions are used correctly.
It is illegal to invoke \co{kref_get()} unless you already
hold a reference, in which case the \co{kref_sub()} could
not possibly have decremented the counter to zero.
\QuickQ{}
Suppose that \co{kref_sub()} returns zero, indicating that
the \co{release()} function was not invoked.
Under what conditions can the caller rely on the continued
existence of the enclosing object?
\QuickA{}
The caller cannot rely on the continued existence of the
object unless it knows that at least one reference will
continue to exist.
Normally, the caller will have no way of knowing this, and
must therefore carefully avoid referencing the object after
the call to \co{kref_sub()}.
\QuickQ{}
Why can't the check for a zero reference count be
made in a simple ``if'' statement with an atomic
increment in its ``then'' clause?
\QuickA{}
Suppose that the ``if'' condition completed, finding
the reference counter value equal to one.
Suppose that a release operation executes, decrementing
the reference counter to zero and therefore starting
cleanup operations.
But now the ``then'' clause can increment the counter
back to a value of one, allowing the object to be
used after it has been cleaned up.
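To make the race concrete, here is a hedged sketch contrasting
the broken check-then-increment with an atomic version in the
spirit of the Linux kernel's \co{atomic_inc_not_zero()} (the
function names are illustrative):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Broken: the count can reach zero between the check
 * and the increment, resurrecting a dying object. */
int broken_get(atomic_t *cnt)
{
  if (atomic_read(cnt) != 0) {
    atomic_inc(cnt);
    return 1;
  }
  return 0;
}

/* The check and increment must instead be atomic. */
int safe_get(atomic_t *cnt)
{
  int old;

  do {
    old = atomic_read(cnt);
    if (old == 0)
      return 0;   /* Object already being cleaned up. */
  } while (atomic_cmpxchg(cnt, old, old + 1) != old);
  return 1;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}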
\QuickQ{}
Why does \co{hp_store()} in
Figure~\ref{fig:defer:Hazard-Pointer Storage and Erasure}
take a double indirection to the data element?
Why not \co{void *} instead of \co{void **}?
\QuickA{}
Because \co{hp_store()} must check for concurrent modifications.
To do that job, it needs a pointer to a pointer to the element,
so that it can check for a modification to the pointer to the
element.
\QuickQ{}
Why does \co{hp_store()}'s caller need to restart its
traversal from the beginning in case of failure?
Isn't that inefficient for large data structures?
\QuickA{}
It might be inefficient in some sense, but the fact is that
such restarting is absolutely required for correctness.
To see this, consider a hazard-pointer-protected linked list
containing elements~A, B, and~C that is subjected to the
following sequence of events:
\begin{enumerate}
\item Thread~0 stores a hazard pointer to element~B
(having presumably traversed to element~B from element~A).
\item Thread~1 removes element~B from the list, which sets
the pointer from element~B to element~C to a special
\co{HAZPTR_POISON} value in order to mark the deletion.
Because Thread~0 has a hazard pointer to element~B,
it cannot yet be freed.
\item Thread~1 removes element~C from the list.
Because there are no hazard pointers referencing element~C,
it is immediately freed.
\item Thread~0 attempts to acquire a hazard pointer to
now-removed element~B's successor, but sees the
\co{HAZPTR_POISON} value, and thus returns zero,
forcing the caller to restart its traversal from the
beginning of the list.
\end{enumerate}
Which is a very good thing, because otherwise Thread~0 would
have attempted to access the now-freed element~C,
which might have resulted in arbitrarily horrible
memory corruption, especially if the memory for
element~C had since been re-allocated for some other
purpose.
\QuickQ{}
Given that papers on hazard pointers use the bottom bits
of each pointer to mark deleted elements, what is up with
\co{HAZPTR_POISON}?
\QuickA{}
The published implementations of hazard pointers used
non-blocking synchronization techniques for insertion
and deletion.
These techniques require that readers traversing the
data structure ``help'' updaters complete their updates,
which in turn means that readers need to look at the successor
of a deleted element.
In contrast, we will be using locking to synchronize updates,
which does away with the need for readers to help updaters
complete their updates, which in turn allows us to leave
pointers' bottom bits alone.
This approach allows read-side code to be simpler and faster.
\QuickQ{}
But don't these restrictions on hazard pointers also apply
to other forms of reference counting?
\QuickA{}
These restrictions apply only to reference-counting mechanisms whose
reference acquisition can fail.
\QuickQ{}
But hazard pointers don't write to the data structure!
\QuickA{}
Indeed, they do not.
However, they do write to the hazard pointers themselves,
and, more important, require that possible failures be
handled for all \co{hp_store()} calls, each of which
might fail.
Therefore, although hazard pointers are extremely useful,
it is still worth looking for improved mechanisms.
\QuickQ{}
Why isn't this sequence-lock discussion in Chapter~\ref{chp:Locking},
you know, the one on \emph{locking}?
\QuickA{}
The sequence-lock mechanism is really a combination of two
separate synchronization mechanisms, sequence counts and
locking.
In fact, the sequence-count mechanism is available separately
in the Linux kernel via the
\co{write_seqcount_begin()} and \co{write_seqcount_end()}
primitives.
However, the combined \co{write_seqlock()} and
\co{write_sequnlock()} primitives are used much more heavily
in the Linux kernel.
More importantly, many more people will understand what you
mean if you say ``sequence lock'' than if you say
``sequence count''.
So this section is entitled ``Sequence Locks'' so that people
will understand what it is about just from the title, and
it appears in the ``Deferred Processing'' chapter because
(1)~the emphasis is on the ``sequence count'' aspect of
``sequence locks'' and (2)~a ``sequence lock'' is much more
than merely a lock.
\QuickQ{}
Can you use sequence locks as the only synchronization
mechanism protecting a linked list supporting concurrent
addition, deletion, and search?
\QuickA{}
One trivial way of accomplishing this is to surround all
accesses, including the read-only accesses, with
\co{write_seqlock()} and \co{write_sequnlock()}.
Of course, this solution also prohibits all read-side
parallelism, and furthermore could just as easily be implemented
using simple locking.
If you do come up with a solution that uses \co{read_seqbegin()}
and \co{read_seqretry()} to protect read-side accesses, make
sure that you correctly handle the following sequence of events:
\begin{enumerate}
\item CPU~0 is traversing the linked list, and picks up a pointer
to list element~A.
\item CPU~1 removes element~A from the list and frees it.
\item CPU~2 allocates an unrelated data structure, and gets
the memory formerly occupied by element~A.
In this unrelated data structure, the memory previously
used for element~A's \co{->next} pointer is now occupied
by a floating-point number.
\item CPU~0 picks up what used to be element~A's \co{->next}
pointer, gets random bits, and therefore gets a
segmentation fault.
\end{enumerate}
One way to protect against this sort of problem requires use
of ``type-safe memory'', which will be discussed in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}.
But in that case, you would be using some other synchronization
mechanism in addition to sequence locks!
\QuickQ{}
Why bother with the check on line~19 of
\co{read_seqbegin()} in
Figure~\ref{fig:defer:Sequence-Locking Implementation}?
Given that a new writer could begin at any time, why not
simply incorporate the check into line~31 of
\co{read_seqretry()}?
\QuickA{}
That would be a legitimate implementation.
However, it would not save anything to move the check down
to \co{read_seqretry()}: There would be roughly the same number
of instructions.
Furthermore, the reader's accesses from its doomed read-side
critical section could inflict overhead on the writer in
the form of cache misses.
We can avoid these cache misses by placing the check in
\co{read_seqbegin()} as shown on line~19 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}.
\QuickQ{}
Why is the \co{smp_mb()} on line~29 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}
needed?
\QuickA{}
If it was omitted, both the compiler and the CPU would be
within their rights to move the critical section preceding
the call to \co{read_seqretry()} down below this function.
This would prevent the sequence lock from protecting the
critical section.
The \co{smp_mb()} primitive prevents such reordering.
\QuickQ{}
What prevents sequence-locking updaters from starving readers?
\QuickA{}
Nothing.
This is one of the weaknesses of sequence locking, and as a
result, you should use sequence locking only in read-mostly
situations.
Unless of course read-side starvation is acceptable in your
situation, in which case, go wild with the sequence-locking updates!
\QuickQ{}
What if something else serializes writers, so that the lock
is not needed?
\QuickA{}
In this case, the \co{->lock} field could be omitted, as it
is in \co{seqcount_t} in the Linux kernel.
\QuickQ{}
Why isn't \co{seq} on line 2 of
Figure~\ref{fig:defer:Sequence-Locking Implementation}
\co{unsigned} rather than \co{unsigned long}?
After all, if \co{unsigned} is good enough for the Linux
kernel, shouldn't it be good enough for everyone?
\QuickA{}
Not at all.
The Linux kernel has a number of special attributes that allow
it to ignore the following sequence of events:
\begin{enumerate}
\item Thread 0 executes \co{read_seqbegin()}, picking up
\co{->seq} in line~17, noting that the value is even,
and thus returning to the caller.
\item Thread 0 starts executing its read-side critical section,
but is then preempted for a long time.
\item Other threads repeatedly invoke \co{write_seqlock()} and
\co{write_sequnlock()}, until the value of \co{->seq}
overflows back to the value that Thread~0 fetched.
\item Thread 0 resumes execution, completing its read-side
critical section with inconsistent data.
\item Thread 0 invokes \co{read_seqretry()}, which incorrectly
concludes that Thread~0 has seen a consistent view of
the data protected by the sequence lock.
\end{enumerate}
The Linux kernel uses sequence locking for things that are
updated rarely, with time-of-day information being a case
in point.
This information is updated at most once per millisecond,
so that seven weeks would be required to overflow the counter.
If a kernel thread was preempted for seven weeks, the Linux
kernel's soft-lockup code would be emitting warnings every two
minutes for that entire time.
In contrast, with a 64-bit counter, more than five centuries
would be required to overflow, even given an update every
\emph{nano}second.
Therefore, this implementation uses a type for \co{->seq}
that is 64 bits on 64-bit systems.
\QuickQ{}
But doesn't Section~\ref{sec:defer:Sequence Locks}'s seqlock
also permit readers and updaters to get work done concurrently?
\QuickA{}
Yes and no.
Although seqlock readers can run concurrently with
seqlock writers, whenever this happens, the {\tt read\_seqretry()}
primitive will force the reader to retry.
This means that any work done by a seqlock reader running concurrently
with a seqlock updater will be discarded and redone.
So seqlock readers can \emph{run} concurrently with updaters,
but they cannot actually get any work done in this case.
In contrast, RCU readers can perform useful work even in presence
of concurrent RCU updaters.
\QuickQ{}
What prevents the {\tt list\_for\_each\_entry\_rcu()} from
getting a segfault if it happens to execute at exactly the same
time as the {\tt list\_add\_rcu()}?
\QuickA{}
On all systems running Linux, loads from and stores
to pointers are atomic, that is, if a store to a pointer occurs at
the same time as a load from that same pointer, the load will return
either the initial value or the value stored, never some bitwise
mashup of the two.
In addition, the {\tt list\_for\_each\_entry\_rcu()} always proceeds
forward through the list, never looking back.
Therefore, the {\tt list\_for\_each\_entry\_rcu()} will either see
the element being added by {\tt list\_add\_rcu()} or it will not,
but either way, it will see a valid well-formed list.
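For reference, a hedged sketch of such a concurrent reader and
updater, using the Linux-kernel list-RCU API (the \co{struct foo}
and \co{do_something_with()} helper are illustrative), might look
as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct foo {
  struct list_head list;
  int key;
};
LIST_HEAD(foo_head);
DEFINE_SPINLOCK(foo_lock);

void reader(void)  /* May run concurrently with updater(). */
{
  struct foo *p;

  rcu_read_lock();
  list_for_each_entry_rcu(p, &foo_head, list)
    do_something_with(p->key);
  rcu_read_unlock();
}

void updater(struct foo *newp)
{
  spin_lock(&foo_lock);
  list_add_rcu(&newp->list, &foo_head);
  spin_unlock(&foo_lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}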
\QuickQ{}
Why do we need to pass two pointers into
{\tt hlist\_for\_each\_entry\_rcu()}
when only one is needed for {\tt list\_for\_each\_entry\_rcu()}?
\QuickA{}
Because in an hlist it is necessary to check for
NULL rather than for encountering the head.
(Try coding up a single-pointer {\tt hlist\_for\_each\_entry\_rcu()}.
If you come up with a nice solution, it would be a very good thing!)
\QuickQ{}
How would you modify the deletion example to permit more than two
versions of the list to be active?
\QuickA{}
One way of accomplishing this is as shown in
Figure~\ref{fig:defer:Concurrent RCU Deletion}.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 spin_lock(&mylock);
2 p = search(head, key);
3 if (p == NULL)
4 spin_unlock(&mylock);
5 else {
6 list_del_rcu(&p->list);
7 spin_unlock(&mylock);
8 synchronize_rcu();
9 kfree(p);
10 }
\end{verbatim}
}
\caption{Concurrent RCU Deletion}
\label{fig:defer:Concurrent RCU Deletion}
\end{figure}
Note that this means that multiple concurrent deletions might be
waiting in {\tt synchronize\_rcu()}.
\QuickQ{}
How many RCU versions of a given list can be
active at any given time?
\QuickA{}
That depends on the synchronization design.
If a semaphore protecting the update is held across the grace period,
then there can be at most two versions, the old and the new.
However, if only the search, the update, and the
{\tt list\_replace\_rcu()} were protected by a lock, then
there could be an arbitrary number of versions active, limited only
by memory and by how many updates could be completed within a
grace period.
But please note that data structures that are updated so frequently
probably are not good candidates for RCU.
That said, RCU can handle high update rates when necessary.
\QuickQ{}
How can RCU updaters possibly delay RCU readers, given that the
{\tt rcu\_read\_lock()} and {\tt rcu\_read\_unlock()}
primitives neither spin nor block?
\QuickA{}
The modifications undertaken by a given RCU updater will cause the
corresponding CPU to invalidate cache lines containing the data,
forcing the CPUs running concurrent RCU readers to incur expensive
cache misses.
(Can you design an algorithm that changes a data structure
\emph{without}
inflicting expensive cache misses on concurrent readers?
On subsequent readers?)
\QuickQ{}
WTF?
How the heck do you expect me to believe that RCU has a
100-femtosecond overhead when the clock period at 3GHz is more than
300 \emph{picoseconds}?
\QuickA{}
First, consider that the inner loop used to
take this measurement is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 for (i = 0; i < CSCOUNT_SCALE; i++) {
2 rcu_read_lock();
3 rcu_read_unlock();
4 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Next, consider the effective definitions of \co{rcu_read_lock()}
and \co{rcu_read_unlock()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 #define rcu_read_lock() do { } while (0)
2 #define rcu_read_unlock() do { } while (0)
\end{verbatim}
\end{minipage}
\vspace{5pt}
Consider also that the compiler does simple optimizations,
allowing it to replace the loop with:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
i = CSCOUNT_SCALE;
\end{verbatim}
\end{minipage}
\vspace{5pt}
So the "measurement" of 100 femtoseconds is simply the fixed
overhead of the timing measurements divided by the number of
passes through the inner loop containing the calls
to \co{rcu_read_lock()} and \co{rcu_read_unlock()}.
And therefore, this measurement really is in error, in fact,
in error by an arbitrary number of orders of magnitude.
As you can see by the definition of \co{rcu_read_lock()}
and \co{rcu_read_unlock()} above, the actual overhead
is precisely zero.
It certainly is not every day that a timing measurement of
100 femtoseconds turns out to be an overestimate!
\QuickQ{}
Why does both the variability and overhead of rwlock decrease as the
critical-section overhead increases?
\QuickA{}
Because the contention on the underlying
\co{rwlock_t} decreases as the critical-section overhead
increases.
However, the rwlock overhead will not quite drop to that on a single
CPU because of cache-thrashing overhead.
\QuickQ{}
Is there an exception to this deadlock immunity, and if so,
what sequence of events could lead to deadlock?
\QuickA{}
One way to cause a deadlock cycle involving
RCU read-side primitives is via the following (illegal) sequence
of statements:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
idx = srcu_read_lock(&srcucb);
synchronize_srcu(&srcucb);
srcu_read_unlock(&srcucb, idx);
\end{verbatim}
\end{minipage}
\vspace{5pt}
The \co{synchronize_srcu()} cannot return until all
pre-existing SRCU read-side critical sections complete, but
is enclosed in an SRCU read-side critical section that cannot
complete until the \co{synchronize_srcu()} returns.
The result is a classic self-deadlock: you get the same
effect when attempting to write-acquire a reader-writer lock
while read-holding it.
Note that this self-deadlock scenario does not apply to
RCU Classic, because the context switch performed by the
\co{synchronize_rcu()} would act as a quiescent state
for this CPU, allowing a grace period to complete.
However, this is if anything even worse, because data used
by the RCU read-side critical section might be freed as a
result of the grace period completing.
In short, do not invoke synchronous RCU update-side primitives
from within an RCU read-side critical section.
\QuickQ{}
But wait!
This is exactly the same code that might be used when thinking
of RCU as a replacement for reader-writer locking!
What gives?
\QuickA{}
This is an effect of the Law of Toy Examples:
beyond a certain point, the code fragments look the same.
The only difference is in how we think about the code.
However, this difference can be extremely important.
For but one example of the importance, consider that if we think
of RCU as a restricted reference counting scheme, we would never
be fooled into thinking that the updates would exclude the RCU
read-side critical sections.
It nevertheless is often useful to think of RCU as a replacement
for reader-writer locking, for example, when you are replacing
reader-writer locking with RCU.
\QuickQ{}
Why the dip in refcnt overhead near 6 CPUs?
\QuickA{}
Most likely NUMA effects.
However, there is substantial variance in the values measured for the
refcnt line, as can be seen by the error bars.
In fact, standard deviations range in excess of 10\% of measured
values in some cases.
The dip in overhead therefore might well be a statistical aberration.
\QuickQ{}
What if the element we need to delete is not the first element
of the list on line~9 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}?
\QuickA{}
As with
Figure~\ref{fig:locking:Per-Element Locking Without Existence Guarantees},
this is a very simple hash table with no chaining, so the only
element in a given bucket is the first element.
The reader is again invited to adapt this example to a hash table with
full chaining.
\QuickQ{}
Why is it OK to exit the RCU read-side critical section on
line~15 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~17?
\QuickA{}
First, please note that the second check on line~14 is
necessary because some other
CPU might have removed this element while we were waiting
to acquire the lock.
However, the fact that we were in an RCU read-side critical section
while acquiring the lock guarantees that this element could not
possibly have been re-allocated and re-inserted into this
hash table.
Furthermore, once we acquire the lock, the lock itself guarantees
the element's existence, so we no longer need to be in an
RCU read-side critical section.
The question as to whether it is necessary to re-check the
element's key is left as an exercise to the reader.
% A re-check is necessary if the key can mutate or if it is
% necessary to reject deleted entries (in cases where deletion
% is recorded by mutating the key.)
\QuickQ{}
Why not exit the RCU read-side critical section on
line~23 of
Figure~\ref{fig:defer:Existence Guarantees Enable Per-Element Locking}
before releasing the lock on line~22?
\QuickA{}
Suppose we reverse the order of these two lines.
Then this code is vulnerable to the following sequence of
events:
\begin{enumerate}
\item CPU~0 invokes \co{delete()}, and finds the element
to be deleted, executing through line~15.
It has not yet actually deleted the element, but
is about to do so.
\item CPU~1 concurrently invokes \co{delete()}, attempting
to delete this same element.
However, CPU~0 still holds the lock, so CPU~1 waits
for it at line~13.
\item CPU~0 executes lines~16 and 17, and blocks at
line~18 waiting for CPU~1 to exit its RCU read-side
critical section.
\item CPU~1 now acquires the lock, but the test on line~14
fails because CPU~0 has already removed the element.
CPU~1 now executes line~22 (which we switched with line~23
for the purposes of this Quick Quiz)
and exits its RCU read-side critical section.
\item CPU~0 can now return from \co{synchronize_rcu()},
and thus executes line~19, sending the element to
the freelist.
\item CPU~1 now attempts to release a lock for an element
that has been freed, and, worse yet, possibly
reallocated as some other type of data structure.
This is a fatal memory-corruption error.
\end{enumerate}
\QuickQ{}
But what if there is an arbitrarily long series of RCU
read-side critical sections in multiple threads, so that at
any point in time there is at least one thread in the system
executing in an RCU read-side critical section?
Wouldn't that prevent any data from a \co{SLAB_DESTROY_BY_RCU}
slab ever being returned to the system, possibly resulting
in OOM events?
\QuickA{}
There could certainly be an arbitrarily long period of time
during which at least one thread is always in an RCU read-side
critical section.
However, the key words in the description in
Section~\ref{sec:defer:RCU is a Way of Providing Type-Safe Memory}
are ``in-use'' and ``pre-existing''.
Keep in mind that a given RCU read-side critical section is
conceptually only permitted to gain references to data elements
that were in use at the beginning of that critical section.
Furthermore, remember that a slab cannot be returned to the
system until all of its data elements have been freed, in fact,
the RCU grace period cannot start until after they have all been
freed.
Therefore, the slab cache need only wait for those RCU read-side
critical sections that started before the freeing of the last element
of the slab.
This in turn means that any RCU grace period that begins after
the freeing of the last element will do---the slab may be returned
to the system after that grace period ends.
\QuickQ{}
Suppose that the \co{nmi_profile()} function was preemptible.
What would need to change to make this example work correctly?
\QuickA{}
One approach would be to use
\co{rcu_read_lock()} and \co{rcu_read_unlock()}
in \co{nmi_profile()}, and to replace the
\co{synchronize_sched()} with \co{synchronize_rcu()},
perhaps as shown in
Figure~\ref{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}.
\begin{figure}[tbp]
{ \tt \scriptsize
\begin{verbatim}
1 struct profile_buffer {
2 long size;
3 atomic_t entry[0];
4 };
5 static struct profile_buffer *buf = NULL;
6
7 void nmi_profile(unsigned long pcvalue)
8 {
9 struct profile_buffer *p;
10
11 rcu_read_lock();
12 p = rcu_dereference(buf);
13 if (p == NULL) {
14 rcu_read_unlock();
15 return;
16 }
17 if (pcvalue >= p->size) {
18 rcu_read_unlock();
19 return;
20 }
21 atomic_inc(&p->entry[pcvalue]);
22 rcu_read_unlock();
23 }
24
25 void nmi_stop(void)
26 {
27 struct profile_buffer *p = buf;
28
29 if (p == NULL)
30 return;
31 rcu_assign_pointer(buf, NULL);
32 synchronize_rcu();
33 kfree(p);
34 }
\end{verbatim}
}
\caption{Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\label{fig:defer:Using RCU to Wait for Mythical Preemptible NMIs to Finish}
\end{figure}
\QuickQ{}
Why do some of the cells in
Table~\ref{tab:defer:RCU Wait-to-Finish APIs}
have exclamation marks (``!'')?
\QuickA{}
The API members with exclamation marks (\co{rcu_read_lock()},
\co{rcu_read_unlock()}, and \co{call_rcu()}) were the
only members of the Linux RCU API that Paul E. McKenney was aware
of back in the mid-90s.
During this timeframe, he was under the mistaken impression that
he knew all that there is to know about RCU.
\QuickQ{}
How do you prevent a huge number of RCU read-side critical
sections from indefinitely blocking a \co{synchronize_rcu()}
invocation?
\QuickA{}
There is no need to do anything to prevent RCU read-side
critical sections from indefinitely blocking a
\co{synchronize_rcu()} invocation, because the
\co{synchronize_rcu()} invocation need wait only for
\emph{pre-existing} RCU read-side critical sections.
So as long as each RCU read-side critical section is
of finite duration, there should be no problem.
\QuickQ{}
The \co{synchronize_rcu()} API waits for all pre-existing
interrupt handlers to complete, right?
\QuickA{}
Absolutely not!
And especially not when using preemptible RCU!
You instead want \co{synchronize_irq()}.
Alternatively, you can place calls to \co{rcu_read_lock()}
and \co{rcu_read_unlock()} in the specific interrupt handlers that
you want \co{synchronize_rcu()} to wait for.
\QuickQ{}
What happens if you mix and match?
For example, suppose you use \co{rcu_read_lock()} and
\co{rcu_read_unlock()} to delimit RCU read-side critical
sections, but then use \co{call_rcu_bh()} to post an
RCU callback?
\QuickA{}
If there happened to be no RCU read-side critical
sections delimited by \co{rcu_read_lock_bh()} and
\co{rcu_read_unlock_bh()} at the time \co{call_rcu_bh()}
was invoked, RCU would be within its rights to invoke the callback
immediately, possibly freeing a data structure still being used by
the RCU read-side critical section!
This is not merely a theoretical possibility: a long-running RCU
read-side critical section delimited by \co{rcu_read_lock()}
and \co{rcu_read_unlock()} is vulnerable to this failure mode.
This vulnerability disappears in -rt kernels, where
RCU Classic and RCU BH both map onto a common implementation.
\QuickQ{}
Hardware interrupt handlers can be thought of as being
under the protection of an implicit \co{rcu_read_lock_bh()},
right?
\QuickA{}
Absolutely not!
And especially not when using preemptible RCU!
If you need to access ``rcu\_bh''-protected data structures
in an interrupt handler, you need to provide explicit calls to
\co{rcu_read_lock_bh()} and \co{rcu_read_unlock_bh()}.
\QuickQ{}
What happens if you mix and match RCU Classic and RCU Sched?
\QuickA{}
In a non-\co{PREEMPT} or a \co{PREEMPT} kernel, mixing these
two works "by accident" because in those kernel builds, RCU Classic
and RCU Sched map to the same implementation.
However, this mixture is fatal in \co{PREEMPT_RT} builds using the -rt
patchset, due to the fact that Realtime RCU's read-side critical
sections can be preempted, which would permit
\co{synchronize_sched()} to return before the
RCU read-side critical section reached its \co{rcu_read_unlock()}
call.
This could in turn result in a data structure being freed before the
read-side critical section was finished with it,
which could in turn greatly increase the actuarial risk experienced
by your kernel.
In fact, the split between RCU Classic and RCU Sched was inspired
by the need for preemptible RCU read-side critical sections.
\QuickQ{}
In general, you cannot rely on \co{synchronize_sched()} to
wait for all pre-existing interrupt handlers,
right?
\QuickA{}
That is correct!
Because -rt Linux uses threaded interrupt handlers, there can
be context switches in the middle of an interrupt handler.
Because \co{synchronize_sched()} waits only until each
CPU has passed through a context switch, it can return
before a given interrupt handler completes.
If you need to wait for a given interrupt handler to complete,
you should instead use \co{synchronize_irq()} or place
explicit RCU read-side critical sections in the interrupt
handlers that you wish to wait on.
\QuickQ{}
Why do both SRCU and QRCU lack asynchronous \co{call_srcu()}
or \co{call_qrcu()} interfaces?
\QuickA{}
Given an asynchronous interface, a single task
could register an arbitrarily large number of SRCU or QRCU callbacks,
thereby consuming an arbitrarily large quantity of memory.
In contrast, given the current synchronous
\co{synchronize_srcu()} and \co{synchronize_qrcu()}
interfaces, a given task must finish waiting for a given grace period
before it can start waiting for the next one.
\QuickQ{}
Under what conditions can \co{synchronize_srcu()} be safely
used within an SRCU read-side critical section?
\QuickA{}
In principle, you can use
\co{synchronize_srcu()} with a given \co{srcu_struct}
within an SRCU read-side critical section that uses some other
\co{srcu_struct}.
In practice, however, doing this is almost certainly a bad idea.
In particular, the code shown in
Figure~\ref{fig:defer:Multistage SRCU Deadlocks}
could still result in deadlock.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 idx = srcu_read_lock(&ssa);
2 synchronize_srcu(&ssb);
3 srcu_read_unlock(&ssa, idx);
4
5 /* . . . */
6
7 idx = srcu_read_lock(&ssb);
8 synchronize_srcu(&ssa);
9 srcu_read_unlock(&ssb, idx);
\end{verbatim}
}
\caption{Multistage SRCU Deadlocks}
\label{fig:defer:Multistage SRCU Deadlocks}
\end{figure}
\QuickQ{}
Why doesn't \co{list_del_rcu()} poison both the \co{next}
and \co{prev} pointers?
\QuickA{}
Poisoning the \co{next} pointer would interfere
with concurrent RCU readers, who must use this pointer.
However, RCU readers are forbidden from using the \co{prev}
pointer, so it may safely be poisoned.
\QuickQ{}
Normally, any pointer subject to \co{rcu_dereference()} \emph{must}
always be updated using \co{rcu_assign_pointer()}.
What is an exception to this rule?
\QuickA{}
One such exception is when a multi-element linked
data structure is initialized as a unit while inaccessible to other
CPUs, and then a single \co{rcu_assign_pointer()} is used
to plant a global pointer to this data structure.
The initialization-time pointer assignments need not use
\co{rcu_assign_pointer()}, though any such assignments that
happen after the structure is globally visible \emph{must} use
\co{rcu_assign_pointer()}.
However, unless this initialization code is on an impressively hot
code-path, it is probably wise to use \co{rcu_assign_pointer()}
anyway, even though it is in theory unnecessary.
It is all too easy for a ``minor'' change to invalidate your cherished
assumptions about the initialization happening privately.
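For concreteness, the exception described above might look
something like the following sketch, in which the structure type
\co{foo}, its fields, and the global pointer \co{gp} are
hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct foo *p;

p = kmalloc(sizeof(*p), GFP_KERNEL);
if (p == NULL)
  return -ENOMEM;
p->a = 1;  /* not yet visible to readers, so */
p->b = 2;  /* plain assignments suffice...   */
rcu_assign_pointer(gp, p);  /* ...but publication
                               needs rcu_assign_pointer() */
\end{verbatim}
\end{minipage}
\vspace{5pt}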
\QuickQ{}
Are there any downsides to the fact that these traversal and update
primitives can be used with any of the RCU API family members?
\QuickA{}
It can sometimes be difficult for automated
code checkers such as ``sparse'' (or indeed for human beings) to
work out which type of RCU read-side critical section a given
RCU traversal primitive corresponds to.
For example, consider the code shown in
Figure~\ref{fig:defer:Diverse RCU Read-Side Nesting}.
\begin{figure}[htbp]
{ \centering
\begin{verbatim}
1 rcu_read_lock();
2 preempt_disable();
3 p = rcu_dereference(global_pointer);
4
5 /* . . . */
6
7 preempt_enable();
8 rcu_read_unlock();
\end{verbatim}
}
\caption{Diverse RCU Read-Side Nesting}
\label{fig:defer:Diverse RCU Read-Side Nesting}
\end{figure}
Is the \co{rcu_dereference()} primitive in an RCU Classic
or an RCU Sched critical section?
What would you have to do to figure this out?
\QuickQ{}
Why wouldn't any deadlock in the RCU implementation in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
also be a deadlock in any other RCU implementation?
\QuickA{}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
1 void foo(void)
2 {
3 spin_lock(&my_lock);
4 rcu_read_lock();
5 do_something();
6 rcu_read_unlock();
7 do_something_else();
8 spin_unlock(&my_lock);
9 }
10
11 void bar(void)
12 {
13 rcu_read_lock();
14 spin_lock(&my_lock);
15 do_some_other_thing();
16 spin_unlock(&my_lock);
17 do_whatever();
18 rcu_read_unlock();
19 }
\end{verbatim}
}
\caption{Deadlock in Lock-Based RCU Implementation}
\label{fig:defer:Deadlock in Lock-Based RCU Implementation}
\end{figure}
Suppose the functions \co{foo()} and \co{bar()} in
Figure~\ref{fig:defer:Deadlock in Lock-Based RCU Implementation}
are invoked concurrently from different CPUs.
Then \co{foo()} will acquire \co{my_lock} on line~3,
while \co{bar()} will acquire \co{rcu_gp_lock} on
line~13.
When \co{foo()} advances to line~4, it will attempt to
acquire \co{rcu_gp_lock}, which is held by \co{bar()}.
Then when \co{bar()} advances to line~14, it will attempt
to acquire \co{my_lock}, which is held by \co{foo()}.
Each function is then waiting for a lock that the other
holds, a classic deadlock.
Other RCU implementations neither spin nor block in
\co{rcu_read_lock()}, hence avoiding deadlocks.
\QuickQ{}
Why not simply use reader-writer locks in the RCU implementation
in
Figure~\ref{fig:defer:Lock-Based RCU Implementation}
in order to allow RCU readers to proceed in parallel?
\QuickA{}
One could in fact use reader-writer locks in this manner.
However, textbook reader-writer locks suffer from memory
contention, so that the RCU read-side critical sections would
need to be quite long to actually permit parallel
execution~\cite{McKenney03a}.
On the other hand, use of a reader-writer lock that is
read-acquired in \co{rcu_read_lock()} would avoid the
deadlock condition noted above.
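A minimal sketch of such a reader-writer-lock-based implementation,
using POSIX primitives and omitting error checking, might look as
follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
static pthread_rwlock_t rcu_gp_rwlock =
  PTHREAD_RWLOCK_INITIALIZER;

static void rcu_read_lock(void)
{
  pthread_rwlock_rdlock(&rcu_gp_rwlock);
}

static void rcu_read_unlock(void)
{
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}

static void synchronize_rcu(void)
{
  /* Waits for all pre-existing readers. */
  pthread_rwlock_wrlock(&rcu_gp_rwlock);
  pthread_rwlock_unlock(&rcu_gp_rwlock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}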
\QuickQ{}
Wouldn't it be cleaner to acquire all the locks, and then
release them all in the loop from lines~15-18 of
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}?
After all, with this change, there would be a point in time
when there were no readers, simplifying things greatly.
\QuickA{}
Making this change would re-introduce the deadlock, so
no, it would not be cleaner.
\QuickQ{}
Is the implementation shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
free from deadlocks?
Why or why not?
\QuickA{}
One deadlock is where a lock is
held across \co{synchronize_rcu()}, and that same lock is
acquired within an RCU read-side critical section.
However, this situation will deadlock any correctly designed
RCU implementation.
After all, the \co{synchronize_rcu()} primitive must wait for all
pre-existing RCU read-side critical sections to complete,
but if one of those critical sections is spinning on a lock
held by the thread executing the \co{synchronize_rcu()},
we have a deadlock inherent in the definition of RCU.
Another deadlock happens when attempting to nest RCU read-side
critical sections.
This deadlock is peculiar to this implementation, and might
be avoided by using recursive locks, or by using reader-writer
locks that are read-acquired by \co{rcu_read_lock()} and
write-acquired by \co{synchronize_rcu()}.
However, if we exclude the above two cases,
this implementation of RCU does not introduce any deadlock
situations.
This is because the only time some other thread's lock is acquired
is when executing \co{synchronize_rcu()}, and in that case, the lock
is immediately released, prohibiting a deadlock cycle that
does not involve a lock held across the \co{synchronize_rcu()},
which is the first case above.
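The first (implementation-independent) deadlock can be produced
by code along the following lines, in which the lock and the
helper function are hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void updater(void)
{
  spin_lock(&my_lock);
  synchronize_rcu();  /* waits on reader(), which
                         is waiting on my_lock */
  spin_unlock(&my_lock);
}

void reader(void)
{
  rcu_read_lock();
  spin_lock(&my_lock);  /* waits on updater() */
  do_something();
  spin_unlock(&my_lock);
  rcu_read_unlock();
}
\end{verbatim}
\end{minipage}
\vspace{5pt}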
\QuickQ{}
Isn't one advantage of the RCU algorithm shown in
Figure~\ref{fig:defer:Per-Thread Lock-Based RCU Implementation}
that it uses only primitives that are widely available,
for example, in POSIX pthreads?
\QuickA{}
This is indeed an advantage, but do not forget that
\co{rcu_dereference()} and \co{rcu_assign_pointer()}
are still required, which means \co{volatile} manipulation
for \co{rcu_dereference()} and memory barriers for
\co{rcu_assign_pointer()}.
Of course, many Alpha CPUs require memory barriers for both
primitives.
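For example, in a user-level implementation, these two primitives
might be sketched roughly as follows; the barrier-primitive names
follow Linux-kernel conventions, and the details will vary with
your compiler and hardware:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define rcu_dereference(p) \
  ({ \
    typeof(p) _p = *(volatile typeof(p) *)&(p); \
    smp_read_barrier_depends(); /* DEC Alpha */ \
    _p; \
  })

#define rcu_assign_pointer(p, v) \
  ({ \
    smp_mb(); /* order prior initialization */ \
    (p) = (v); \
  })
\end{verbatim}
\end{minipage}
\vspace{5pt}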
\QuickQ{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
\QuickA{}
Indeed, this would deadlock any legal RCU implementation.
But is \co{rcu_read_lock()} \emph{really} participating in
the deadlock cycle?
If you believe that it is, then please
ask yourself this same question when looking at the
RCU implementation in
Section~\ref{defer:RCU Based on Quiescent States}.
\QuickQ{}
How can the grace period possibly elapse in 40 nanoseconds when
\co{synchronize_rcu()} contains a 10-millisecond delay?
\QuickA{}
The update-side test was run in absence of readers, so the
\co{poll()} system call was never invoked.
In addition, the actual code has this \co{poll()}
system call commented out, the better to evaluate the
true overhead of the update-side code.
Any production uses of this code would be better served by
using the \co{poll()} system call, but then again,
production uses would be even better served by other implementations
shown later in this section.
\QuickQ{}
Why not simply make \co{rcu_read_lock()} wait when a concurrent
\co{synchronize_rcu()} has been waiting too long in
the RCU implementation in
Figure~\ref{fig:defer:RCU Implementation Using Single Global Reference Counter}?
Wouldn't that prevent \co{synchronize_rcu()} from starving?
\QuickA{}
Although this would in fact eliminate the starvation, it would
also mean that \co{rcu_read_lock()} would spin or block waiting
for the writer, which is in turn waiting on readers.
If one of these readers is attempting to acquire a lock that
the spinning/blocking \co{rcu_read_lock()} holds, we again
have deadlock.
In short, the cure is worse than the disease.
See Section~\ref{defer:Starvation-Free Counter-Based RCU}
for a proper cure.
\QuickQ{}
Why the memory barrier on line~5 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
given that there is a spin-lock acquisition immediately after?
\QuickA{}
The spin-lock acquisition only guarantees that the spin-lock's
critical section will not ``bleed out'' to precede the
acquisition.
It in no way guarantees that code preceding the spin-lock
acquisition won't be reordered into the critical section.
Such reordering could cause a removal from an RCU-protected
list to be reordered to follow the complementing of
\co{rcu_idx}, which could allow a newly starting RCU
read-side critical section to see the recently removed
data element.
Exercise for the reader: use a tool such as Promela/spin
to determine which (if any) of the memory barriers in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
are really needed.
See Section~\ref{chp:formal:Formal Verification}
for information on using these tools.
The first correct and complete response will be credited.
\QuickQ{}
Why is the counter flipped twice in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}?
Shouldn't a single flip-and-wait cycle be sufficient?
\QuickA{}
Both flips are absolutely required.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Line~8 of \co{rcu_read_lock()} in
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}
picks up \co{rcu_idx}, finding its value to be zero.
\item Line~8 of \co{synchronize_rcu()} in
Figure~\ref{fig:defer:RCU Update Using Global Reference-Count Pair}
complements the value of \co{rcu_idx}, setting its
value to one.
\item	Lines~10-13 of \co{synchronize_rcu()} find that the
	value of \co{rcu_refcnt[0]} is zero, so that
	\co{synchronize_rcu()} returns immediately.
(Recall that the question is asking what happens if
lines~14-20 are omitted.)
\item	Lines~9 and 10 of \co{rcu_read_lock()} store the
	value zero to this thread's instance of \co{rcu_read_idx}
	and increment \co{rcu_refcnt[0]}, respectively.
Execution then proceeds into the RCU read-side critical
section.
\label{defer:rcu_rcgp:RCU Read Side Start}
\item Another instance of \co{synchronize_rcu()} again complements
\co{rcu_idx}, this time setting its value to zero.
Because \co{rcu_refcnt[1]} is zero, \co{synchronize_rcu()}
returns immediately.
(Recall that \co{rcu_read_lock()} incremented
\co{rcu_refcnt[0]}, not \co{rcu_refcnt[1]}!)
\label{defer:rcu_rcgp:RCU Grace Period Start}
\item The grace period that started in
step~\ref{defer:rcu_rcgp:RCU Grace Period Start}
has been allowed to end, despite
the fact that the RCU read-side critical section
that started beforehand in
step~\ref{defer:rcu_rcgp:RCU Read Side Start}
has not completed.
This violates RCU semantics, and could allow the update
to free a data element that the RCU read-side critical
section was still referencing.
\end{enumerate}
Exercise for the reader: What happens if \co{rcu_read_lock()}
is preempted for a very long time (hours!) just after
line~8?
Does this implementation operate correctly in that case?
Why or why not?
The first correct and complete response will be credited.
\QuickQ{}
Given that atomic increment and decrement are so expensive,
why not just use non-atomic increment on line~10 and a
non-atomic decrement on line~25 of
Figure~\ref{fig:defer:RCU Read-Side Using Global Reference-Count Pair}?
\QuickA{}
Using non-atomic operations would cause increments and decrements
to be lost, in turn causing the implementation to fail.
See Section~\ref{defer:Scalable Counter-Based RCU}
for a safe way to use non-atomic operations in
\co{rcu_read_lock()} and \co{rcu_read_unlock()}.
\QuickQ{}
Come off it!
We can see the \co{atomic_read()} primitive in
\co{rcu_read_lock()}!!!
So why are you trying to pretend that \co{rcu_read_lock()}
contains no atomic operations???
\QuickA{}
The \co{atomic_read()} primitive does not actually execute
atomic machine instructions, but rather does a normal load
from an \co{atomic_t}.
Its sole purpose is to keep the compiler's type-checking happy.
If the Linux kernel ran on 8-bit CPUs, it would also need to
prevent ``store tearing'', which could happen due to the need
to store a 16-bit pointer with two eight-bit accesses on some
8-bit systems.
But thankfully, it seems that no one runs Linux on 8-bit systems.
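Returning to \co{atomic_read()}: one plausible definition (the
Linux kernel's exact definition has varied over time and across
architectures) is simply a volatile load of the \co{atomic_t}'s
counter field, for example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* A plain load: no atomic instructions required. */
#define atomic_read(v)  (*(volatile int *)&(v)->counter)
\end{verbatim}
\end{minipage}
\vspace{5pt}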
\QuickQ{}
Great, if we have $N$ threads, we can have $2N$ ten-millisecond
waits (one set per \co{flip_counter_and_wait()} invocation),
and even that assumes that we wait only once for each thread.
Don't we need the grace period to complete \emph{much} more quickly?
\QuickA{}
Keep in mind that we only wait for a given thread if that thread
is still in a pre-existing RCU read-side critical section,
and that waiting for one hold-out thread gives all the other
threads a chance to complete any pre-existing RCU read-side
critical sections that they might still be executing.
So the only way that we would wait for $2N$ intervals
would be if the last thread still remained in a pre-existing
RCU read-side critical section despite all the waiting for
all the prior threads.
In short, this implementation will not wait unnecessarily.
However, if you are stress-testing code that uses RCU, you
might want to comment out the \co{poll()} statement in
order to better catch bugs that incorrectly retain a reference
to an RCU-protected data element outside of an RCU
read-side critical section.
\QuickQ{}
All of these toy RCU implementations have either atomic operations
in \co{rcu_read_lock()} and \co{rcu_read_unlock()},
or \co{synchronize_rcu()}
overhead that increases linearly with the number of threads.
Under what circumstances could an RCU implementation enjoy
light-weight implementations for all three of these primitives,
all having deterministic ($O\left(1\right)$) overheads and latencies?
\QuickA{}
Special-purpose uniprocessor implementations of RCU can attain
this ideal~\cite{PaulEMcKenney2009BloatwatchRCU}.
\QuickQ{}
If any even value is sufficient to tell \co{synchronize_rcu()}
to ignore a given task, why doesn't line~10 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
simply assign zero to \co{rcu_reader_gp}?
\QuickA{}
Assigning zero (or any other even-numbered constant)
would in fact work, but assigning the value of
\co{rcu_gp_ctr} can provide a valuable debugging aid,
as it gives the developer an idea of when the corresponding
thread last exited an RCU read-side critical section.
\QuickQ{}
Why are the memory barriers on lines~17 and 29 of
Figure~\ref{fig:defer:Free-Running Counter Using RCU}
needed?
Aren't the memory barriers inherent in the locking
primitives on lines~18 and 28 sufficient?
\QuickA{}
These memory barriers are required because the locking
primitives are only guaranteed to confine the critical
section.
The locking primitives are under absolutely no obligation
to keep other code from bleeding into the critical section.
The pair of memory barriers is therefore required to prevent
this sort of code motion, whether performed by the compiler
or by the CPU.
\QuickQ{}
Couldn't the update-side optimization described in
Section~\ref{defer:Scalable Counter-Based RCU With Shared Grace Periods}
be applied to the implementation shown in
Figure~\ref{fig:defer:Free-Running Counter Using RCU}?
\QuickA{}
Indeed it could, with a few modifications.
This work is left as an exercise for the reader.
\QuickQ{}
Is the possibility of readers being preempted in
line~3 of Figure~\ref{fig:defer:Free-Running Counter Using RCU}
a real problem, in other words, is there a real sequence
of events that could lead to failure?
If not, why not?
If so, what is the sequence of events, and how can the
failure be addressed?
\QuickA{}
It is a real problem, there is a sequence of events leading to
failure, and there are a number of possible ways of
addressing it.
For more details, see the Quick Quizzes near the end of
Section~\ref{defer:Nestable RCU Based on Free-Running Counter}.
The reason for locating the discussion there is (1) to give you
more time to think about it, and (2) that the nesting support
added in that section greatly reduces the time required to
overflow the counter.
\QuickQ{}
Why not simply maintain a separate per-thread nesting-level
variable, as was done in the previous section, rather than having
all this complicated bit manipulation?
\QuickA{}
The apparent simplicity of the separate per-thread variable
is a red herring.
This approach incurs much greater complexity in the guise
of careful ordering of operations, especially if signal
handlers are to be permitted to contain RCU read-side
critical sections.
But don't take my word for it, code it up and see what you
end up with!
\QuickQ{}
Given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
how could you double the time required to overflow the global
\co{rcu_gp_ctr}?
\QuickA{}
One way would be to replace the magnitude comparison on
lines~33 and 34 with an inequality check of the per-thread
\co{rcu_reader_gp} variable against
\co{rcu_gp_ctr+RCU_GP_CTR_BOTTOM_BIT}.
\QuickQ{}
Again, given the algorithm shown in
Figure~\ref{fig:defer:Nestable RCU Using a Free-Running Counter},
is counter overflow fatal?
Why or why not?
If it is fatal, what can be done to fix it?
\QuickA{}
It can indeed be fatal.
To see this, consider the following sequence of events:
\begin{enumerate}
\item Thread~0 enters \co{rcu_read_lock()}, determines
that it is not nested, and therefore fetches the
value of the global \co{rcu_gp_ctr}.
Thread~0 is then preempted for an extremely long time
(before storing to its per-thread \co{rcu_reader_gp}
variable).
\item Other threads repeatedly invoke \co{synchronize_rcu()},
so that the new value of the global \co{rcu_gp_ctr}
is now \co{RCU_GP_CTR_BOTTOM_BIT}
less than it was when thread~0 fetched it.
\item Thread~0 now starts running again, and stores into
its per-thread \co{rcu_reader_gp} variable.
The value it stores is
\co{RCU_GP_CTR_BOTTOM_BIT+1}
greater than that of the global \co{rcu_gp_ctr}.
\item Thread~0 acquires a reference to RCU-protected data
element~A.
\item Thread 1 now removes the data element~A that thread~0
just acquired a reference to.
\item Thread 1 invokes \co{synchronize_rcu()}, which
increments the global \co{rcu_gp_ctr} by
\co{RCU_GP_CTR_BOTTOM_BIT}.
It then checks all of the per-thread \co{rcu_reader_gp}
variables, but thread~0's value (incorrectly) indicates
that it started after thread~1's call to
\co{synchronize_rcu()}, so thread~1 does not wait
for thread~0 to complete its RCU read-side critical
section.
\item Thread 1 then frees up data element~A, which thread~0
is still referencing.
\end{enumerate}
Note that this scenario can also occur in the implementation presented in
Section~\ref{defer:RCU Based on Free-Running Counter}.
One strategy for fixing this problem is to use 64-bit
counters so that the time required to overflow them would exceed
the useful lifetime of the computer system.
Note that non-antique members of the 32-bit x86 CPU family
allow atomic manipulation of 64-bit counters via the
\co{cmpxchg8b} instruction.
Another strategy is to limit the rate at which grace periods are
permitted to occur in order to achieve a similar effect.
For example, \co{synchronize_rcu()} could record the last time
that it was invoked, and any subsequent invocation would then
check this time and block as needed to force the desired
spacing.
For example, if the low-order four bits of the counter were
reserved for nesting, and if grace periods were permitted to
occur at most ten times per second, then it would take more
than 300 days for the counter to overflow.
However, this approach is not helpful if there is any possibility
that the system will be fully loaded with CPU-bound high-priority
real-time threads for the full 300 days.
(A remote possibility, perhaps, but best to consider it ahead
of time.)
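A rough sketch of this rate-limiting approach, assuming a
hypothetical \co{do_synchronize_rcu()} that carries out the actual
grace-period wait and ignoring the locking needed to serialize
concurrent callers, might look as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
static struct timeval rcu_last_gp;

void synchronize_rcu(void)
{
  struct timeval now;
  long delta_ms;

  gettimeofday(&now, NULL);
  delta_ms = (now.tv_sec - rcu_last_gp.tv_sec) * 1000 +
             (now.tv_usec - rcu_last_gp.tv_usec) / 1000;
  if (delta_ms < 100)
    poll(NULL, 0, 100 - delta_ms); /* >= 100 ms spacing */
  do_synchronize_rcu();
  gettimeofday(&rcu_last_gp, NULL);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}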
A third approach is to administratively abolish real-time threads
from the system in question.
In this case, the preempted process will age up in priority,
thus getting to run long before the counter has a chance to
overflow.
Of course, this approach is less than helpful for real-time
applications.
A final approach would be for \co{rcu_read_lock()} to recheck
the value of the global \co{rcu_gp_ctr} after storing to its
per-thread \co{rcu_reader_gp} counter, retrying if the new
value of the global \co{rcu_gp_ctr} is inappropriate.
This works, but introduces non-deterministic execution time
into \co{rcu_read_lock()}.
On the other hand, if your application is being preempted long
enough for the counter to overflow, you have no hope of
deterministic execution time in any case!
% @@@ A fourth approach is rcu_nest32.[hc].
\QuickQ{}
Doesn't the additional memory barrier shown on line~14 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side},
greatly increase the overhead of \co{rcu_quiescent_state()}?
\QuickA{}
Indeed it does!
An application using this implementation of RCU should therefore
invoke \co{rcu_quiescent_state()} sparingly, instead using
\co{rcu_read_lock()} and \co{rcu_read_unlock()} most of the
time.
However, this memory barrier is absolutely required so that
other threads will see the store on lines~12-13 before any
subsequent RCU read-side critical sections executed by the
caller.
\QuickQ{}
Why are the two memory barriers on lines~19 and 22 of
Figure~\ref{fig:defer:Quiescent-State-Based RCU Read Side}
needed?
\QuickA{}
The memory barrier on line~19 ensures that any RCU read-side
critical sections that might precede the
call to \co{rcu_thread_offline()} won't be reordered by either
the compiler or the CPU to follow the assignment on lines~20-21.
The memory barrier on line~22 is, strictly speaking, unnecessary,
as it is illegal to have any RCU read-side critical sections
following the call to \co{rcu_thread_offline()}.
\QuickQ{}
To be sure, the clock frequencies of ca-2008 Power
systems were quite high, but even a 5GHz clock
frequency is insufficient to allow
loops to be executed in 50~picoseconds!
What is going on here?
\QuickA{}
Since the measurement loop contains a pair of empty functions,
the compiler optimizes it away.
The measurement loop takes 1,000 passes between each call to
\co{rcu_quiescent_state()}, so this measurement is roughly
one thousandth of the overhead of a single call to
\co{rcu_quiescent_state()}.
\QuickQ{}
Why would the fact that the code is in a library make
any difference for how easy it is to use the RCU
implementation shown in
Figures~\ref{fig:defer:Quiescent-State-Based RCU Read Side} and
\ref{fig:defer:RCU Update Side Using Quiescent States}?
\QuickA{}
A library function has absolutely no control over the caller,
and thus cannot force the caller to invoke \co{rcu_quiescent_state()}
periodically.
On the other hand, a library function that made many references
to a given RCU-protected data structure might be able to invoke
\co{rcu_thread_online()} upon entry,
\co{rcu_quiescent_state()} periodically, and
\co{rcu_thread_offline()} upon exit.
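For example, such a library function might be structured roughly
as follows, where the number of lookups and the lookup helper are
of course hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
void my_library_function(void)
{
  int i;

  rcu_thread_online();
  for (i = 0; i < NUM_LOOKUPS; i++) {
    rcu_read_lock();
    do_one_lookup(i);
    rcu_read_unlock();
    if (i % 1000 == 999)
      rcu_quiescent_state(); /* let grace periods advance */
  }
  rcu_thread_offline();
}
\end{verbatim}
\end{minipage}
\vspace{5pt}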
\QuickQ{}
But what if you hold a lock across a call to
\co{synchronize_rcu()}, and then acquire that same lock within
an RCU read-side critical section?
This should be a deadlock, but how can a primitive that
generates absolutely no code possibly participate in a
deadlock cycle?
\QuickA{}
Please note that the RCU read-side critical section is in
effect extended beyond the enclosing
\co{rcu_read_lock()} and \co{rcu_read_unlock()}, out to
the previous and next call to \co{rcu_quiescent_state()}.
This \co{rcu_quiescent_state()} can be thought of as a
\co{rcu_read_unlock()} immediately followed by an
\co{rcu_read_lock()}.
Even so, the actual deadlock itself will involve the lock
acquisition in the RCU read-side critical section and
the \co{synchronize_rcu()}, never the \co{rcu_quiescent_state()}.
\QuickQ{}
Given that grace periods are prohibited within RCU read-side
critical sections, how can an RCU data structure possibly be
updated while in an RCU read-side critical section?
\QuickA{}
This situation is one reason for the existence of asynchronous
grace-period primitives such as \co{call_rcu()}.
This primitive may be invoked within an RCU read-side critical
section, and the specified RCU callback will in turn be invoked
at a later time, after a grace period has elapsed.
The ability to perform an RCU update while within an RCU read-side
critical section can be extremely convenient, and is analogous
to a (mythical) unconditional read-to-write upgrade for
reader-writer locking.
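For example, an element might be removed and its freeing deferred
from within an RCU read-side critical section roughly as follows,
where the list, its lock, the element type, and the callback are
hypothetical:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
rcu_read_lock();
list_for_each_entry_rcu(p, &my_list, list) {
  if (p->key == key) {
    spin_lock(&my_list_lock);
    list_del_rcu(&p->list);
    spin_unlock(&my_list_lock);
    call_rcu(&p->rcu, my_free_callback);
    break;
  }
}
rcu_read_unlock();
\end{verbatim}
\end{minipage}
\vspace{5pt}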
\QuickQ{}
The statistical-counter implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
(\url{count_end.c})
used a global lock to guard the summation in \co{read_count()},
which resulted in poor performance and negative scalability.
How could you use RCU to provide \co{read_count()} with
excellent performance and good scalability?
(Keep in mind that \co{read_count()}'s scalability will
necessarily be limited by its need to scan all threads'
counters.)
\QuickA{}
Hint: place the global variable \co{finalcount} and the
array \co{counterp[]} into a single RCU-protected struct.
At initialization time, this structure would be allocated
and set to all zero and \co{NULL}.
The \co{inc_count()} function would be unchanged.
The \co{read_count()} function would use \co{rcu_read_lock()}
instead of acquiring \co{final_mutex}, and would need to
use \co{rcu_dereference()} to acquire a reference to the
current structure.
The \co{count_register_thread()} function would set the
array element corresponding to the newly created thread
to reference that thread's per-thread \co{counter} variable.
The \co{count_unregister_thread()} function would need to
allocate a new structure, acquire \co{final_mutex},
copy the old structure to the new one, add the outgoing
thread's \co{counter} variable to the total, \co{NULL}
the pointer to this same \co{counter} variable,
use \co{rcu_assign_pointer()} to install the new structure
in place of the old one, release \co{final_mutex},
wait for a grace period, and finally free the old structure.
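In other words, \co{read_count()} might look something like the
following sketch, in which the structure and field names are
merely illustrative:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct countarray {
  unsigned long finalcount;
  unsigned long *counterp[NR_THREADS];
};

static struct countarray *countarrayp;

unsigned long read_count(void)
{
  struct countarray *cap;
  unsigned long sum;
  int t;

  rcu_read_lock();
  cap = rcu_dereference(countarrayp);
  sum = cap->finalcount;
  for (t = 0; t < NR_THREADS; t++)
    if (cap->counterp[t] != NULL)
      sum += *cap->counterp[t];
  rcu_read_unlock();
  return sum;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}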
Does this really work?
Why or why not?
See
Section~\ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
on
page~\pageref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters}
for more details.
\QuickQ{}
Section~\ref{sec:count:Applying Specialized Parallel Counters}
showed a fanciful pair of code fragments that dealt with counting
I/O accesses to removable devices.
These code fragments suffered from high overhead on the fastpath
(starting an I/O) due to the need to acquire a reader-writer
lock.
How would you use RCU to provide excellent performance and
scalability?
(Keep in mind that the performance of the common-case first
code fragment that does I/O accesses is much more important
than that of the device-removal code fragment.)
\QuickA{}
Hint: replace the read-acquisitions of the reader-writer lock
with RCU read-side critical sections, then adjust the
device-removal code fragment to suit.
See
Section~\ref{sec:together:RCU and Counters for Removable I/O Devices}
on
Page~\pageref{sec:together:RCU and Counters for Removable I/O Devices}
for one solution to this problem.
\QuickQAC{chp:Data Structures}{Data Structures}
\QuickQ{}
But there are many types of hash tables, of which the chained
hash tables described here are but one type.
Why the focus on chained hash tables?
\QuickA{}
Chained hash tables are completely partitionable, and thus
well-suited to concurrent use.
There are other completely-partitionable hash tables, for
example, split-ordered list~\cite{OriShalev2006SplitOrderListHash},
but they are considerably more complex.
We therefore start with chained hash tables.
\QuickQ{}
But isn't the double comparison on lines~15-18 in
Figure~\ref{fig:datastruct:Hash-Table Lookup} inefficient
in the case where the key fits into an unsigned long?
\QuickA{}
Indeed it is!
However, hash tables quite frequently store information with
keys such as character strings that do not necessarily fit
into an unsigned long.
Simplifying the hash-table implementation for the case where
keys always fit into unsigned longs is left as an exercise
for the reader.
\QuickQ{}
Given the negative scalability of the Schr\"odinger's
Zoo application across sockets, why not just run multiple
copies of the application, with each copy having a subset
of the animals and confined to run on a single socket?
\QuickA{}
You can do just that!
In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster.
This practice is called ``sharding'', and is heavily used in
practice by large web-based
retailers~\cite{DeCandia:2007:DAH:1323293.1294281}.
However, if you are going to shard on a per-socket basis within
a multisocket system, why not buy separate smaller and cheaper
single-socket systems, and then run one shard of the database
on each of those systems?
\QuickQ{}
But if elements in a hash table can be deleted concurrently
with lookups, doesn't that mean that a lookup could return
a reference to a data element that was deleted immediately
after it was looked up?
\QuickA{}
Yes it can!
This is why \co{hashtab_lookup()} must be invoked within an
RCU read-side critical section, and it is why
\co{hashtab_add()} and \co{hashtab_del()} must also use
RCU-aware list-manipulation primitives.
Finally, this is why the caller of \co{hashtab_del()} must
wait for a grace period (e.g., by calling \co{synchronize_rcu()})
before freeing the deleted element.
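In other words, a deletion might be structured roughly as follows;
the exact signatures of the locking functions are assumptions made
for this sketch:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
hashtab_lock_mod(htp, hash);   /* exclude other updaters */
hashtab_del(htep);             /* readers may still hold */
hashtab_unlock_mod(htp, hash); /*   references to htep   */
synchronize_rcu();             /* wait for those readers */
free(htep);                    /* now safe to free       */
\end{verbatim}
\end{minipage}
\vspace{5pt}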
\QuickQ{}
The dangers of extrapolating from eight CPUs to 60 CPUs were
made quite clear in
Section~\ref{sec:datastruct:Hash-Table Performance}.
But why should extrapolating up from 60 CPUs be any safer?
\QuickA{}
It isn't any safer, and a useful exercise would be to run these
programs on larger systems.
That said, other testing has shown that RCU read-side primitives
offer consistent performance and scalability up to at least 1024 CPUs.
\QuickQ{}
The code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
computes the hash twice!
Why this blatant inefficiency?
\QuickA{}
The reason is that the old and new hash tables might have
completely different hash functions, so that a hash computed
for the old table might be completely irrelevant to the
new table.
\QuickQ{}
How does the code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
protect against the resizing process progressing past the
selected bucket?
\QuickA{}
It does not provide any such protection.
That is instead the job of the update-side concurrency-control
functions described next.
\QuickQ{}
The code in
Figures~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
and~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}
computes the hash and executes the bucket-selection logic twice for
updates!
Why this blatant inefficiency?
\QuickA{}
This approach allows the \co{hashtorture.h} testing infrastructure
to be reused.
That said, a production-quality resizable hash table would likely
be optimized to avoid this double computation.
Carrying out this optimization is left as an exercise for the reader.
\QuickQ{}
In the \co{hashtab_lookup()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
the code carefully finds the right bucket in the new hash table
if the element to be looked up has already been distributed
by a concurrent resize operation.
This seems wasteful for RCU-protected lookups.
Why not just stick with the old hash table in this case?
\QuickA{}
Suppose that a resize operation begins and distributes half of
the old table's buckets to the new table.
Suppose further that a thread adds a new element that goes into
one of the already-distributed buckets, and that this same thread
now looks up this newly added element.
If lookups unconditionally traversed only the old hash table,
this thread would get a lookup failure for the element that it
just added, which certainly sounds like a bug to me!
\QuickQ{}
The \co{hashtab_del()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
does not always remove the element from the old hash table.
Doesn't this mean that readers might access this newly removed
element after it has been freed?
\QuickA{}
No.
The \co{hashtab_del()} function omits removing the element
from the old hash table only if the resize operation has
already progressed beyond the bucket containing the just-deleted
element.
But this means that new \co{hashtab_lookup()} operations will
use the new hash table when looking up that element.
Therefore, only old \co{hashtab_lookup()} operations that started
before the \co{hashtab_del()} might encounter the newly
removed element.
This means that \co{hashtab_del()} need only wait for an
RCU grace period to avoid inconveniencing
\co{hashtab_lookup()} operations.
\QuickQ{}
In the \co{hashtab_resize()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
what guarantees that the update to \co{->ht_new} on line~29
will be seen as happening before the update to \co{->ht_resize_cur}
on line~36 from the perspective of \co{hashtab_lookup()},
\co{hashtab_add()}, and \co{hashtab_del()}?
\QuickA{}
The \co{synchronize_rcu()} on line~30 of
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
ensures that all pre-existing RCU readers have completed between
the time that we install the new hash-table reference on
line~29 and the time that we update \co{->ht_resize_cur} on
line~36.
This means that any reader that sees a non-negative value
of \co{->ht_resize_cur} cannot have started before the
assignment to \co{->ht_new}, and thus must be able to see
the reference to the new hash table.
\QuickQ{}
Couldn't the \co{hashtorture.h} code be modified to accommodate
a version of \co{hashtab_lock_mod()} that subsumes the
\co{ht_get_bucket()} functionality?
\QuickA{}
It probably could, and doing so would benefit all of the
per-bucket-locked hash tables presented in this chapter.
Making this modification is left as an exercise for the
reader.
\QuickQ{}
How much do these specializations really save?
Are they really worth it?
\QuickA{}
The answer to the first question is left as an exercise to
the reader.
Try specializing the resizable hash table and see how much
performance improvement results.
The second question cannot be answered in general, but must
instead be answered with respect to a specific use case.
Some use cases are extremely sensitive to performance and
scalability, while others are less so.
\QuickQAC{chp:Validation}{Validation}
\QuickQ{}
When in computing is the willingness to follow a fragmentary
plan critically important?
\QuickA{}
There are any number of situations, but perhaps the most important
situation is when no one has ever created anything resembling
the program to be developed.
In this case, the only way to create a credible plan is to
implement the program, create the plan, and implement it a
second time.
But whoever implements the program for the first time has no
choice but to follow a fragmentary plan because any detailed
plan created in ignorance cannot survive first contact with
the real world.
And perhaps this is one reason why evolution has favored insanely
optimistic human beings who are happy to follow fragmentary plans!
\QuickQ{}
Suppose that you are writing a script that processes the
output of the \co{time} command, which looks as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{verbatim}
real 0m0.132s
user 0m0.040s
sys 0m0.008s
\end{verbatim}
\end{minipage}
\vspace{5pt}
The script is required to check its input for errors, and to
give appropriate diagnostics if fed erroneous \co{time} output.
What test inputs should you provide to this program to test it
for use with \co{time} output generated by single-threaded programs?
\QuickA{}
\begin{enumerate}
\item Do you have a test case in which all the time is
consumed in user mode by a CPU-bound program?
\item Do you have a test case in which all the time is
consumed in system mode by a CPU-bound program?
\item Do you have a test case in which all three times
are zero?
\item Do you have a test case in which the ``user'' and ``sys''
times sum to more than the ``real'' time?
(This would of course be completely legitimate in
a multithreaded program.)
\item	Do you have a set of test cases in which one of the
	times uses more than one second?
\item	Do you have a set of test cases in which one of the
	times uses more than ten seconds?
\item Do you have a set of test cases in which one of the
times has non-zero minutes? (For example, ``15m36.342s''.)
\item Do you have a set of test cases in which one of the
times has a seconds value of greater than 60?
\item Do you have a set of test cases in which one of the
times overflows 32 bits of milliseconds? 64 bits of
milliseconds?
\item Do you have a set of test cases in which one of the
times is negative?
\item Do you have a set of test cases in which one of the
times has a positive minutes value but a negative
seconds value?
\item Do you have a set of test cases in which one of the
times omits the ``m'' or the ``s''?
\item Do you have a set of test cases in which one of the
times is non-numeric? (For example, ``Go Fish''.)
\item Do you have a set of test cases in which one of the
lines is omitted? (For example, where there is a
``real'' value and a ``sys'' value, but no ``user''
value.)
\item Do you have a set of test cases where one of the
lines is duplicated? Or duplicated, but with a
different time value for the duplicate?
\item Do you have a set of test cases where a given line
has more than one time value? (For example,
``real 0m0.132s 0m0.008s''.)
\item Do you have a set of test cases containing random
characters?
\item In all test cases involving invalid input, did you
generate all permutations?
\item For each test case, do you have an expected outcome
for that test?
\end{enumerate}
If you did not generate test data for a substantial number of
the above cases, you will need to cultivate a more destructive
attitude in order to have a chance of generating high-quality
tests.
Of course, one way to economize on destructiveness is to
generate the tests with the to-be-tested source code at hand,
which is called white-box testing (as opposed to black-box testing).
However, this is no panacea: You will find that it is all too
easy to find your thinking limited by what the program can handle,
thus failing to generate truly destructive inputs.
\QuickQ{}
You are asking me to do all this validation BS before
I even start coding???
That sounds like a great way to never get started!!!
\QuickA{}
If it is your project, for example, a hobby, do what you like.
Any time you waste will be your own, and you have no one else
to answer to for it.
And there is a good chance that the time will not be completely
wasted.
For example, if you are embarking on a first-of-a-kind project,
the requirements are in some sense unknowable anyway.
In this case, the best approach might be to quickly prototype
a number of rough solutions, try them out, and see what works
best.
\QuickQ{}
How can you implement \co{WARN_ON_ONCE()}?
\QuickA{}
If you don't mind having a \co{WARN_ON_ONCE()} that
will sometimes warn twice or three times, simply maintain
a static variable that is initialized to zero.
If the condition triggers, check the static variable, and
if it is non-zero, return.
Otherwise, set it to one, print the message, and return.
If you really need the message to never appear more than once,
perhaps because it is huge, you can use an atomic exchange
operation in place of ``set it to one'' above.
Print the message only if the atomic exchange operation returns
zero.
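For example, such a print-at-most-once version might be sketched
as follows; this is only a sketch, not the Linux kernel's actual
implementation:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define WARN_ON_ONCE(condition) \
  ({ \
    static atomic_t warned = ATOMIC_INIT(0); \
    int _c = !!(condition); \
 \
    if (_c && !atomic_xchg(&warned, 1)) \
      printk("WARNING at %s:%d\n", \
             __FILE__, __LINE__); \
    _c; \
  })
\end{verbatim}
\end{minipage}
\vspace{5pt}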
\QuickQ{}
Why would anyone bother copying existing code in pen on paper???
Doesn't that just increase the probability of transcription errors?
\QuickA{}
If you are worried about transcription errors, please allow me
to be the first to introduce you to a really cool tool named
\co{diff}.
In addition, carrying out the copying can be quite valuable:
\begin{enumerate}
\item If you are copying a lot of code, you are probably failing
to take advantage of an opportunity for abstraction.
The act of copying code can provide great motivation
for abstraction.
\item Copying the code gives you an opportunity to think about
whether the code really works in its new setting.
Is there some non-obvious constraint, such as the need
to disable interrupts or to hold some lock?
\item Copying the code also gives you time to consider whether
there is some better way to get the job done.
\end{enumerate}
So, yes, copy the code!
\QuickQ{}
This procedure is ridiculously over-engineered!
How can you expect to get a reasonable amount of software
written doing it this way???
\QuickA{}
Indeed, repeatedly copying code by hand is laborious and slow.
However, when combined with heavy-duty stress testing and
proofs of correctness, this approach is also extremely effective
for complex parallel code where ultimate performance and
reliability are required and where debugging is difficult.
The Linux-kernel RCU implementation is a case in point.
On the other hand, if you are writing a simple single-threaded
shell script to manipulate some data, then you would be
best-served by a different methodology.
For example, you might enter each command one at a time
into an interactive shell with a test data set to make
sure that it did what you wanted, then copy-and-paste the
successful commands into your script.
Finally, test the script as a whole.
If you have a friend or colleague who is willing to help out,
pair programming can work very well, as can any number of
formal design- and code-review processes.
And if you are writing code as a hobby, then do whatever you like.
In short, different types of software need different development
methodologies.
\QuickQ{}
Suppose that you had a very large number of systems at your
disposal.
For example, at current cloud prices, you can purchase a
huge amount of CPU time at a reasonably low cost.
Why not use this approach to get close enough to certainty
for all practical purposes?
\QuickA{}
This approach might well be a valuable addition to your
validation arsenal.
But it does have a few limitations:
\begin{enumerate}
\item Some bugs have extremely low probabilities of occurrence,
but nevertheless need to be fixed.
For example, suppose that the Linux kernel's RCU
implementation had a bug that is triggered only once
per century of machine time on average.
A century of CPU time is hugely expensive even on
the cheapest cloud platforms, but we could expect
this bug to result in more than 2,000 failures per day
on the more than 100 million Linux instances in the
world as of 2011.
\item The bug might well have zero probability of occurrence
on your test setup, which means that you won't see it
no matter how much machine time you burn testing it.
\end{enumerate}
Of course, if your code is small enough, formal validation
may be helpful, as discussed in
Section~\ref{chp:formal:Formal Verification}.
But beware: formal validation of your code will not find
errors in your assumptions, misunderstanding of the
requirements, misunderstanding of the software or hardware
primitives you use, or errors that you did not think to construct
a proof for.
\QuickQ{}
Say what???
When I plug the earlier example of five tests each with a
10\% failure rate into the formula, I get 59,050\% and that
just doesn't make sense!!!
\QuickA{}
You are right, that makes no sense at all.
Remember that a probability is a number between zero and one,
so that you need to divide a percentage by 100 to get a
probability.
So 10\% is a probability of 0.1, which when plugged into the
formula yields $1 - (1 - 0.1)^5 = 0.4095$, which rounds to 41\%,
quite sensibly matching the earlier result.
\QuickQ{}
In Equation~\ref{eq:debugging:Binomial Number of Tests Required},
are the logarithms base-10, base-2, or base-$e$?
\QuickA{}
It does not matter.
You will get the same answer no matter what base of logarithms
you use because the result is a pure ratio of logarithms.
The only constraint is that you use the same base for both
the numerator and the denominator.
\QuickQ{}
Suppose that a bug causes a test failure three times per hour
on average.
How long must the test run error-free to provide 99.9\%
confidence that the fix significantly reduced the probability
of failure?
\QuickA{}
We set $n$ to $3$ and $P$ to $99.9$ in
Equation~\ref{eq:debugging:Error-Free Test Duration}, resulting in:
\begin{equation}
T = - \frac{1}{3} \log \frac{100 - 99.9}{100} = 2.3
\end{equation}
If the test runs without failure for 2.3 hours, we can be 99.9\%
certain that the fix reduced the probability of failure.
\QuickQ{}
Doing the summation of all the factorials and exponentials
is a real pain.
Isn't there an easier way?
\QuickA{}
One approach is to use the open-source symbolic manipulation
program named ``maxima''.
Once you have installed this program, which is a part of many
Debian-based Linux distributions, you can run it and give the
\co{load(distrib);} command followed by any number of
\co{bfloat(cdf_poisson(m,l));} commands, where the \co{m}
is replaced by the desired value of $m$ and the \co{l}
is replaced by the desired value of $\lambda$.
In particular, the \co{bfloat(cdf_poisson(2,24));} command
results in \co{1.181617112359357b-8}, which matches the value
given by Equation~\ref{eq:debugging:Possion CDF}.
Alternatively, you can use the rough-and-ready method described in
Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing}.
\QuickQ{}
But wait!!!
Given that there has to be \emph{some} number of failures
(including the possibility of zero failures),
shouldn't the summation shown in
Equation~\ref{eq:debugging:Possion CDF}
approach the value $1$ as $m$ goes to infinity?
\QuickA{}
Indeed it should.
And it does.
To see this, note that $e^{-\lambda}$ does not depend on $i$,
which means that it can be pulled out of the summation as follows:
\begin{equation}
e^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^i}{i!}
\end{equation}
The remaining summation is exactly the Taylor series for
$e^\lambda$, yielding:
\begin{equation}
e^{-\lambda} e^\lambda
\end{equation}
The two exponentials are reciprocals, and therefore cancel,
resulting in exactly $1$, as required.
\QuickQ{}
How is this approach supposed to help if the corruption affected some
unrelated pointer, which then caused the corruption???
\QuickA{}
Indeed, that can happen.
Many CPUs have hardware-debugging facilities that can help you
locate that unrelated pointer.
Furthermore, if you have a core dump, you can search the core
dump for pointers referencing the corrupted region of memory.
You can also look at the data layout of the corruption, and
check pointers whose type matches that layout.
You can also step back and test the modules making up your
program more intensively, which will likely confine the corruption
to the module responsible for it.
If this makes the corruption vanish, consider adding additional
argument checking to the functions exported from each module.
Nevertheless, this is a hard problem, which is why I used the
words ``a bit of a dark art''.
\QuickQ{}
But I did the bisection, and ended up with a huge commit.
What do I do now?
\QuickA{}
A huge commit?
Shame on you!
This is but one reason why you are supposed to keep the commits small.
And that is your answer: Break up the commit into bite-sized
pieces and bisect the pieces.
In my experience, the act of breaking up the commit is often
sufficient to make the bug painfully obvious.
\QuickQ{}
Why don't existing conditional-locking primitives provide this
spurious-failure functionality?
\QuickA{}
There are locking algorithms that depend on conditional-locking
primitives telling them the truth.
For example, if conditional-lock failure signals that
some other thread is already working on a given job,
spurious failure might cause that job to never get done,
possibly resulting in a hang.
\QuickQ{}
That is ridiculous!!!
After all, isn't getting the correct answer later than one would like
\emph{better} than getting an incorrect answer???
\QuickA{}
This question fails to consider the option of choosing not to
compute the answer at all, and in doing so, also fails to consider
the costs of computing the answer.
For example, consider short-term weather forecasting, for which
accurate models exist, but which require large (and expensive)
clustered supercomputers, at least if you want to actually run
the model faster than the weather.
And in this case, any performance bug that prevents the model from
running faster than the actual weather prevents any forecasting.
Given that the whole purpose of purchasing the large clustered
supercomputer was to forecast weather, if you cannot run the
model faster than the weather, you would be better off not running
the model at all.
More severe examples may be found in the area of safety-critical
real-time computing.
\QuickQ{}
But if you are going to put in all the hard work of parallelizing
an application, why not do it right?
Why settle for anything less than optimal performance and
linear scalability?
\QuickA{}
Although I do heartily salute your spirit and aspirations,
you are forgetting that there may be high costs due to delays
in the program's completion.
For an extreme example, suppose that a 40\% performance shortfall
from a single-threaded application is causing one person to die
each day.
Suppose further that in a day you could hack together a
quick and dirty
parallel program that ran 50\% faster on an eight-CPU system
than the sequential version, but that an optimal parallel
program would require four months of painstaking design, coding,
debugging, and tuning.
It is safe to say that more than 100 people would prefer the
quick and dirty version.
\QuickQ{}
But what about other sources of error, for example, due to
interactions between caches and memory layout?
\QuickA{}
Changes in memory layout can indeed result in unrealistic
decreases in execution time.
For example, suppose that a given microbenchmark almost
always overflows the L0 cache's associativity, but with just the right
memory layout, it all fits.
If this is a real concern, consider running your microbenchmark
using huge pages (or within the kernel or on bare metal) in
order to completely control the memory layout.
\QuickQ{}
Wouldn't the techniques suggested to isolate the code under
test also affect that code's performance, particularly if
it is running within a larger application?
\QuickA{}
Indeed it might, although in most microbenchmarking efforts
you would extract the code under test from the enclosing
application.
Nevertheless, if for some reason you must keep the code under
test within the application, you will very likely need to use
the techniques discussed in
Section~\ref{sec:debugging:Detecting Interference}.
\QuickQ{}
This approach is just plain weird!
Why not use means and standard deviations, like we were taught
in our statistics classes?
\QuickA{}
Because mean and standard deviation were not designed to do this job.
To see this, try applying mean and standard deviation to the
following data set, given a 1\% relative error in measurement:
\begin{quote}
49,548.4 49,549.4 49,550.2 49,550.9 49,550.9 49,551.0
49,551.5 49,552.1 49,899.0 49,899.3 49,899.7 49,899.8
49,900.1 49,900.4 52,244.9 53,333.3 53,333.3 53,706.3
53,706.3 54,084.5
\end{quote}
The problem is that mean and standard deviation do not rest on
any sort of measurement-error assumption, and they will therefore
see the difference between the values near 49,500 and those near
49,900 as being statistically significant, when in fact they are
well within the bounds of estimated measurement error.
Of course, it is possible to create a script similar to
that in
Figure~\ref{fig:count:Statistical Elimination of Interference}
that uses standard deviation rather than absolute difference
to get a similar effect,
and this is left as an exercise for the interested reader.
Be careful to avoid divide-by-zero errors arising from strings
of identical data values!
\QuickQ{}
But what if all the y-values in the trusted group of data
are exactly zero?
Won't that cause the script to reject any non-zero value?
\QuickA{}
Indeed it will!
But if your performance measurements often produce a value of
exactly zero, perhaps you need to take a closer look at your
performance-measurement code.
Note that many approaches based on mean and standard deviation
will have similar problems with this sort of dataset.
\QuickQAC{chp:formal:Formal Verification}{Formal Verification}
\QuickQ{}
Why is there an unreached statement in
locker? After all, isn't this a \emph{full} state-space
search?
\QuickA{}
The locker process is an infinite loop, so control
never reaches the end of this process.
However, since there are no monotonically increasing variables,
Promela is able to model this infinite loop with a small
number of states.
\QuickQ{}
What are some Promela code-style issues with this example?
\QuickA{}
There are several:
\begin{enumerate}
\item The declaration of {\tt sum} should be moved to within
the init block, since it is not used anywhere else.
\item The assertion code should be moved outside of the
initialization loop. The initialization loop can
then be placed in an atomic block, greatly reducing
the state space (by how much?).
\item The atomic block covering the assertion code should
be extended to include the initialization of {\tt sum}
and {\tt j}, and also to cover the assertion.
This also reduces the state space (again, by how
much?).
\end{enumerate}
\QuickQ{}
Is there a more straightforward way to code the do-od statement?
\QuickA{}
Yes.
Replace it with {\tt if-fi} and remove the two {\tt break} statements.
\QuickQ{}
Why are there atomic blocks at lines 12-21
and lines 44-56, when the operations within those atomic
blocks have no atomic implementation on any current
production microprocessor?
\QuickA{}
Because those operations are for the benefit of the
assertion only. They are not part of the algorithm itself.
There is therefore no harm in marking them atomic, and
so marking them greatly reduces the state space that must
be searched by the Promela model.
\QuickQ{}
Is the re-summing of the counters on lines 24-27
\emph{really} necessary?
\QuickA{}
Yes. To see this, delete these lines and run the model.
Alternatively, consider the following sequence of steps:
\begin{enumerate}
\item One process is within its RCU read-side critical
section, so that the value of {\tt ctr[0]} is zero and
the value of {\tt ctr[1]} is two.
\item An updater starts executing, and sees that the sum of
the counters is two so that the fastpath cannot be
executed. It therefore acquires the lock.
\item A second updater starts executing, and fetches the value
of {\tt ctr[0]}, which is zero.
\item The first updater adds one to {\tt ctr[0]}, flips
the index (which now becomes zero), then subtracts
one from {\tt ctr[1]} (which now becomes one).
\item The second updater fetches the value of {\tt ctr[1]},
which is now one.
\item The second updater now incorrectly concludes that it
is safe to proceed on the fastpath, despite the fact
that the original reader has not yet completed.
\end{enumerate}
\QuickQ{}
Yeah, that's just great!
Now, just what am I supposed to do if I don't happen to have a
machine with 40GB of main memory???
\QuickA{}
Relax, there are a number of lawful answers to
this question:
\begin{enumerate}
\item Further optimize the model, reducing its memory consumption.
\item Work out a pencil-and-paper proof, perhaps starting with the
comments in the code in the Linux kernel.
\item Devise careful torture tests, which, though they cannot prove
the code correct, can find hidden bugs.
\item There is some movement towards tools that do model
checking on clusters of smaller machines.
However, please note that we have not actually used such
tools ourselves, courtesy of some large machines that Paul has
occasional access to.
\item Wait for memory sizes of affordable systems to expand
to fit your problem.
\item Use one of a number of cloud-computing services to rent
a large system for a short time period.
\end{enumerate}
\QuickQ{}
Why not simply increment \co{rcu_update_flag}, and then only
increment \co{dynticks_progress_counter} if the old value
of \co{rcu_update_flag} was zero???
\QuickA{}
This fails in presence of NMIs.
To see this, suppose an NMI was received just after
\co{rcu_irq_enter()} incremented \co{rcu_update_flag},
but before it incremented \co{dynticks_progress_counter}.
The instance of \co{rcu_irq_enter()} invoked by the NMI
would see that the original value of \co{rcu_update_flag}
was non-zero, and would therefore refrain from incrementing
\co{dynticks_progress_counter}.
This would leave the RCU grace-period machinery no clue that the
NMI handler was executing on this CPU, so that any RCU read-side
critical sections in the NMI handler would lose their RCU protection.
The possibility of NMI handlers, which by definition cannot
be masked, does complicate this code.
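To make the failure mode concrete, here is a sketch of the
\emph{proposed} (and broken) variant; the names follow those used in
the text, but the code is purely illustrative and elides memory barriers:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Broken sketch: increment dynticks_progress_counter
 * only if the old value of rcu_update_flag was zero. */
void rcu_irq_enter(void)
{
  if (rcu_update_flag++ == 0) {
    /*
     * If an NMI arrives right here, its nested call to
     * rcu_irq_enter() sees rcu_update_flag != 0 and thus
     * skips the increment below.  The grace-period
     * machinery therefore still believes this CPU to be
     * idle, so any RCU read-side critical sections in
     * the NMI handler lose their protection.
     */
    dynticks_progress_counter++;
  }
}
\end{verbatim}
\end{minipage}
\vspace{5pt}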
\QuickQ{}
But if line~7 finds that we are the outermost interrupt,
wouldn't we \emph{always} need to increment
\co{dynticks_progress_counter}?
\QuickA{}
Not if we interrupted a running task!
In that case, \co{dynticks_progress_counter} would
have already been incremented by \co{rcu_exit_nohz()},
and there would be no need to increment it again.
\QuickQ{}
Can you spot any bugs in any of the code in this section?
\QuickA{}
Read the next section to see if you were correct.
\QuickQ{}
Why isn't the memory barrier in \co{rcu_exit_nohz()}
and \co{rcu_enter_nohz()} modeled in Promela?
\QuickA{}
Promela assumes sequential consistency, so
it is not necessary to model memory barriers.
In fact, one must instead explicitly model lack of memory barriers,
for example, as shown in
Figure~\ref{fig:analysis:QRCU Unordered Summation} on
page~\pageref{fig:analysis:QRCU Unordered Summation}.
\QuickQ{}
Isn't it a bit strange to model \co{rcu_exit_nohz()}
followed by \co{rcu_enter_nohz()}?
Wouldn't it be more natural to instead model entry before exit?
\QuickA{}
It probably would be more natural, but we will need
this particular order for the liveness checks that we will add later.
\QuickQ{}
Wait a minute!
In the Linux kernel, both \co{dynticks_progress_counter} and
\co{rcu_dyntick_snapshot} are per-CPU variables.
So why are they instead being modeled as single global variables?
\QuickA{}
Because the grace-period code processes each
CPU's \co{dynticks_progress_counter} and
\co{rcu_dyntick_snapshot} variables separately,
we can collapse the state onto a single CPU.
If the grace-period code were instead to do something special
given specific values on specific CPUs, then we would indeed need
to model multiple CPUs.
But fortunately, we can safely confine ourselves to two CPUs, the
one running the grace-period processing and the one entering and
leaving dynticks-idle mode.
\QuickQ{}
Given there are a pair of back-to-back changes to
\co{gp_state} on lines~25 and 26,
how can we be sure that line~25's changes won't be lost?
\QuickA{}
Recall that Promela and spin trace out
every possible sequence of state changes.
Therefore, timing is irrelevant: Promela/spin will be quite
happy to jam the entire rest of the model between those two
statements unless some state variable specifically prohibits
doing so.
\QuickQ{}
But what would you do if you needed the statements in a single
\co{EXECUTE_MAINLINE()} group to execute non-atomically?
\QuickA{}
The easiest thing to do would be to put
each such statement in its own \co{EXECUTE_MAINLINE()}
statement.
\QuickQ{}
But what if the \co{dynticks_nohz()} process had
``if'' or ``do'' statements with conditions,
where the statement bodies of these constructs
needed to execute non-atomically?
\QuickA{}
One approach, as we will see in a later section,
is to use explicit labels and ``goto'' statements.
For example, the construct:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
if
:: i == 0 -> a = -1;
:: else -> a = -2;
fi;
\end{verbatim}
\end{minipage}
\vspace{5pt}
could be modeled as something like:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
EXECUTE_MAINLINE(stmt1,
  if
  :: i == 0 -> goto stmt1_then;
  :: else -> goto stmt1_else;
  fi)
stmt1_then: skip;
EXECUTE_MAINLINE(stmt1_then1, a = -1; goto stmt1_end)
stmt1_else: skip;
EXECUTE_MAINLINE(stmt1_else1, a = -2)
stmt1_end: skip;
\end{verbatim}
\end{minipage}
\vspace{5pt}
However, it is not clear that the macro is helping much in the case
of the ``if'' statement, so these sorts of situations will
be open-coded in the following sections.
\QuickQ{}
Why are lines~45 and 46 (the \co{in_dyntick_irq = 0;}
and the \co{i++;}) executed atomically?
\QuickA{}
These lines of code pertain to controlling the
model, not to the code being modeled, so there is no reason to
model them non-atomically.
The motivation for modeling them atomically is to reduce the size
of the state space.
\QuickQ{}
What property of interrupts is this \co{dynticks_irq()}
process unable to model?
\QuickA{}
One such property is nested interrupts,
which are handled in the following section.
\QuickQ{}
Does Paul \emph{always} write his code in this painfully incremental
manner?
\QuickA{}
Not always, but more and more frequently.
In this case, Paul started with the smallest slice of code that
included an interrupt handler, because he was not sure how best
to model interrupts in Promela.
Once he got that working, he added other features.
(But if he was doing it again, he would start with a ``toy'' handler.
For example, he might have the handler increment a variable twice and
have the mainline code verify that the value was always even.)
Why the incremental approach?
Consider the following, attributed to Brian W. Kernighan:
\begin{quote}
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible,
you are, by definition, not smart enough to debug it.
\end{quote}
This means that any attempt to optimize the production of code should
place at least 66\% of its emphasis on optimizing the debugging process,
even at the expense of increasing the time and effort spent coding.
Incremental coding and testing is one way to optimize the debugging
process, at the expense of some increase in coding effort.
Paul uses this approach because he rarely has the luxury of
devoting full days (let alone weeks) to coding and debugging.
\QuickQ{}
But what happens if an NMI handler starts running before
an irq handler completes, and if that NMI handler continues
running until a second irq handler starts?
\QuickA{}
This cannot happen within the confines of a single CPU.
The first irq handler cannot complete until the NMI handler
returns.
Therefore, if each of the \co{dynticks} and \co{dynticks_nmi}
variables have taken on an even value during a given time
interval, the corresponding CPU really was in a quiescent
state at some time during that interval.
\QuickQ{}
This is still pretty complicated.
Why not just have a \co{cpumask_t} that has a bit set for
each CPU that is in dyntick-idle mode, clearing the bit
when entering an irq or NMI handler, and setting it upon
exit?
\QuickA{}
Although this approach would be functionally correct, it
would result in excessive irq entry/exit overhead on
large machines.
In contrast, the approach laid out in this section allows
each CPU to touch only per-CPU data on irq and NMI entry/exit,
resulting in much lower irq entry/exit overhead, especially
on large machines.
\QuickQ{}
But x86 has strong memory ordering! Why would you need to
formalize its memory model?
\QuickA{}
Actually, academics consider the x86 memory model to be weak
because it can allow prior stores to be reordered with
subsequent loads.
From an academic viewpoint, a strong memory model is one
that allows absolutely no reordering, so that all threads
agree on the order of all operations visible to them.
\QuickQ{}
Why does line~8
of Figure~\ref{fig:sec:formal:CPPMEM Litmus Test}
initialize the registers?
Why not instead initialize them on lines~4 and~5?
\QuickA{}
Either way works.
However, in general, it is better to use initialization than
explicit instructions.
The explicit instructions are used in this example to demonstrate
their use.
In addition, many of the litmus tests available on the tool's
web site (\url{http://www.cl.cam.ac.uk/~pes20/ppcmem/}) were
automatically generated, which generates explicit
initialization instructions.
\QuickQ{}
But whatever happened to line~17 of
Figure~\ref{fig:sec:formal:CPPMEM Litmus Test},
the one that is the \co{Fail:} label?
\QuickA{}
The PowerPC implementation of \co{atomic_add_return()}
loops when the \co{stwcx} instruction fails, which it communicates
by setting non-zero status in the condition-code register,
which in turn is tested by the \co{bne} instruction.
Because actually modeling the loop would result in state-space
explosion, we instead branch to the \co{Fail:} label,
terminating the model with the initial value of 2 in
thread~1's \co{r3} register, which
will not trigger the \co{exists} assertion.
There is some debate about whether this trick is universally
applicable, but I have not seen an example where it fails.
\QuickQ{}
Does the ARM Linux kernel have a similar bug?
\QuickA{}
ARM does not have this particular bug because it places
\co{smp_mb()} before and after the \co{atomic_add_return()}
function's assembly-language implementation.
PowerPC no longer has this bug; it has long since been fixed.
Finding any other bugs that the Linux kernel might have is left
as an exercise for the reader.
\QuickQAC{chp:Putting It All Together}{Putting It All Together}
\QuickQ{}
Why on earth did we need that global lock in the first place?
\QuickA{}
A given thread's \co{__thread} variables vanish when that
thread exits.
It is therefore necessary to synchronize any operation that
accesses other threads' \co{__thread} variables with
thread exit.
Without such synchronization, accesses to the \co{__thread} variables
of a just-exited thread will result in segmentation faults.
\QuickQ{}
Just what is the accuracy of \co{read_count()}, anyway?
\QuickA{}
Refer to
Figure~\ref{fig:count:Per-Thread Statistical Counters} on
Page~\pageref{fig:count:Per-Thread Statistical Counters}.
Clearly, if there are no concurrent invocations of \co{inc_count()},
\co{read_count()} will return an exact result.
However, if there \emph{are} concurrent invocations of
\co{inc_count()}, then the sum is in fact changing as
\co{read_count()} performs its summation.
That said, because thread creation and exit are excluded by
\co{final_mutex}, the pointers in \co{counterp} remain constant.
Let's imagine a mythical machine that is able to take an
instantaneous snapshot of its memory.
Suppose that this machine takes such a snapshot at the
beginning of \co{read_count()}'s execution, and another
snapshot at the end of \co{read_count()}'s execution.
Then \co{read_count()} will access each thread's counter
at some time between these two snapshots, and will therefore
obtain a result that is bounded by those of the two snapshots,
inclusive.
The overall sum will therefore be bounded by the pair of sums that
would have been obtained from each of the two snapshots (again,
inclusive).
The expected error is therefore half of the difference between
the pair of sums that would have been obtained from each of the
two snapshots, that is to say, half of the execution time of
\co{read_count()} multiplied by the number of expected calls to
\co{inc_count()} per unit time.
Or, for those who prefer equations:
\begin{equation}
\epsilon = \frac{T_r R_i}{2}
\end{equation}
where $\epsilon$ is the expected error in \co{read_count()}'s
return value,
$T_r$ is the time that \co{read_count()} takes to execute,
and $R_i$ is the rate of \co{inc_count()} calls per unit time.
(And of course, $T_r$ and $R_i$ should use the same units of
time: microseconds and calls per microsecond, seconds and calls
per second, or whatever, as long as they are the same units.)
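For example, under these (purely illustrative) assumptions, if
\co{read_count()} takes 5~microseconds to execute and
\co{inc_count()} is invoked 10 times per microsecond, the expected
error is $5 \times 10 / 2 = 25$ counts.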
\QuickQ{}
Hey!!!
Line~45 of
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
modifies a value in a pre-existing \co{countarray} structure!
Didn't you say that this structure, once made available to
\co{read_count()}, remained constant???
\QuickA{}
Indeed I did say that.
And it would be possible to make \co{count_register_thread()}
allocate a new structure, much as \co{count_unregister_thread()}
currently does.
But this is unnecessary.
Recall the derivation of the error bounds of \co{read_count()}
that was based on the snapshots of memory.
Because new threads start with initial \co{counter} values of
zero, the derivation holds even if we add a new thread partway
through \co{read_count()}'s execution.
So, interestingly enough, when adding a new thread, this
implementation gets the effect of allocating a new structure,
but without actually having to do the allocation.
\QuickQ{}
Wow!
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}
contains 69 lines of code, compared to only 42 in
Figure~\ref{fig:count:Per-Thread Statistical Counters}.
Is this extra complexity really worth it?
\QuickA{}
This of course needs to be decided on a case-by-case basis.
If you need an implementation of \co{read_count()} that
scales linearly, then the lock-based implementation shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}
simply will not work for you.
On the other hand, if calls to \co{read_count()} are sufficiently
rare, then the lock-based version is simpler and might thus be
better, although much of the size difference is due
to the structure definition, memory allocation, and \co{NULL}
return checking.
Of course, a better question is ``Why doesn't the language
implement cross-thread access to \co{__thread} variables?''
After all, such an implementation would make both the locking
and the use of RCU unnecessary.
This would in turn enable an implementation that
was even simpler than the one shown in
Figure~\ref{fig:count:Per-Thread Statistical Counters}, but
with all the scalability and performance benefits of the
implementation shown in
Figure~\ref{fig:together:RCU and Per-Thread Statistical Counters}!
\QuickQ{}
But can't the approach shown in
Figure~\ref{fig:together:Correlated Measurement Fields}
result in extra cache misses, in turn resulting in additional
read-side overhead?
\QuickA{}
Indeed it can.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbatim}
 1 struct measurement {
 2   double meas_1;
 3   double meas_2;
 4   double meas_3;
 5 };
 6
 7 struct animal {
 8   char name[40];
 9   double age;
10   struct measurement *mp;
11   struct measurement meas;
12   char photo[0]; /* large bitmap. */
13 };
\end{verbatim}
}
\caption{Localized Correlated Measurement Fields}
\label{fig:together:Localized Correlated Measurement Fields}
\end{figure}
One way to avoid this cache-miss overhead is shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields}:
Simply embed an instance of a \co{measurement} structure
named \co{meas}
into the \co{animal} structure, and point the \co{->mp}
field at this \co{->meas} field.
Measurement updates can then be carried out as follows:
\begin{enumerate}
\item Allocate a new \co{measurement} structure and place
the new measurements into it.
\item Use \co{rcu_assign_pointer()} to point \co{->mp} to
this new structure.
\item Wait for a grace period to elapse, for example using
either \co{synchronize_rcu()} or \co{call_rcu()}.
\item Copy the measurements from the new \co{measurement}
structure into the embedded \co{->meas} field.
\item Use \co{rcu_assign_pointer()} to point \co{->mp}
back to the old embedded \co{->meas} field.
\item After another grace period elapses, free up the
new \co{measurement} field.
\end{enumerate}
This approach uses a heavier weight update procedure to eliminate
the extra cache miss in the common case.
The extra cache miss will be incurred only while an update is
actually in progress.
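The following is a minimal sketch (not the book's own listing) of this
update procedure, assuming a hypothetical \co{update_measurement()}
helper, the structures shown in
Figure~\ref{fig:together:Localized Correlated Measurement Fields},
and a user-level environment providing \co{rcu_assign_pointer()},
\co{synchronize_rcu()}, and \co{free()}:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Sketch only: carry out the update steps listed above.
 * Assumes the caller has already filled in *new_mp. */
void update_measurement(struct animal *ap,
                        struct measurement *new_mp)
{
  rcu_assign_pointer(ap->mp, new_mp); /* Publish new values. */
  synchronize_rcu();       /* Wait for readers of ->meas. */
  ap->meas = *new_mp;      /* Copy into embedded field. */
  rcu_assign_pointer(ap->mp, &ap->meas); /* Repoint readers. */
  synchronize_rcu();       /* Wait for readers of new_mp. */
  free(new_mp);            /* Now safe to free. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}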
\QuickQ{}
But how does this scan work while a resizable hash table
is being resized?
In that case, neither the old nor the new hash table is
guaranteed to contain all the elements in the hash table!
\QuickA{}
True, resizable hash tables as described in
Section~\ref{sec:datastruct:Non-Partitionable Data Structures}
cannot be fully scanned while being resized.
One simple way around this is to acquire the
\co{hashtab} structure's \co{->ht_lock} while scanning,
but this prevents more than one scan from proceeding
concurrently.
Another approach is for updates to mutate the old hash
table as well as the new one while resizing is in
progress.
This would allow scans to find all elements in the old
hash table.
Implementing this is left as an exercise for the reader.
\QuickQAC{sec:advsync:Advanced Synchronization}{Advanced Synchronization}
\QuickQ{}
How on earth could the assertion on line~21 of the code in
Figure~\ref{fig:advsync:Parallel Hardware is Non-Causal} on
page~\pageref{fig:advsync:Parallel Hardware is Non-Causal}
\emph{possibly} fail?
\QuickA{}
The key point that the intuitive analysis missed is that
nothing prevents the assignment to C from overtaking
the assignment to A as both race to reach {\tt thread2()}.
This is explained in the remainder of this section.
\QuickQ{}
Great... So how do I fix it?
\QuickA{}
The easiest fix is to replace the \co{barrier()} on
line~12 with an \co{smp_mb()}.
\QuickQ{}
What assumption is the code fragment
in Figure~\ref{fig:advsync:Software Logic Analyzer}
making that might not be valid on real hardware?
\QuickA{}
The code assumes that as soon as a given CPU stops
seeing its own value, it will immediately see the
final agreed-upon value.
On real hardware, some of the CPUs might well see several
intermediate results before converging on the final value.
\QuickQ{}
How could CPUs possibly have different views of the
value of a single variable \emph{at the same time?}
\QuickA{}
Many CPUs have write buffers that record the values of
recent writes, which are applied once the corresponding
cache line makes its way to the CPU.
Therefore, it is quite possible for each CPU to see a
different value for a given variable at a single point
in time --- and for main memory to hold yet another value.
One of the reasons that memory barriers were invented was
to allow software to deal gracefully with situations like
this one.
\QuickQ{}
Why do CPUs~2 and 3 come to agreement so quickly, when it
takes so long for CPUs~1 and 4 to come to the party?
\QuickA{}
CPUs~2 and 3 are a pair of hardware threads on the same
core, sharing the same cache hierarchy, and therefore have
very low communications latencies.
This is a NUMA, or, more accurately, a NUCA effect.
This leads to the question of why CPUs~2 and 3 ever disagree
at all.
One possible reason is that they each might have a small amount
of private cache in addition to a larger shared cache.
Another possible reason is instruction reordering, given the
short 10-nanosecond duration of the disagreement and the
total lack of memory barriers in the code fragment.
\QuickQ{}
But if the memory barriers do not unconditionally force
ordering, how the heck can a device driver reliably execute
sequences of loads and stores to MMIO registers?
\QuickA{}
MMIO registers are a special case because they appear
in uncached regions of physical memory.
Memory barriers \emph{do} unconditionally force ordering
of loads and stores to uncached memory.
See Section~@@@ for more information on memory barriers
and MMIO regions.
\QuickQ{}
How do we know that modern hardware guarantees that at least
one of the loads will see the value stored by the other thread
in the ears-to-mouths scenario?
\QuickA{}
The scenario is as follows, with A and B both initially zero: \\
{\tt ~~~~CPU~0: A=1; smp\_mb(); r1=B;} \\
{\tt ~~~~CPU~1: B=1; smp\_mb(); r2=A;}
If neither of the loads see the corresponding store, when both
CPUs finish, both \co{r1} and \co{r2} will be equal to zero.
Let's suppose that \co{r1} is equal to zero.
Then we know that CPU~0's load from B happened before CPU~1's
store to B: After all, we would have had \co{r1} equal to one
otherwise.
But given that CPU~0's load from B happened before CPU~1's store
to B, memory-barrier pairing guarantees that CPU~0's store to A
happens before CPU~1's load from A, which in turn guarantees that
\co{r2} will be equal to one, not zero.
Therefore, at least one of \co{r1} and \co{r2} must be nonzero,
which means that at least one of the loads saw the value from
the corresponding store, as claimed.
\QuickQ{}
How can the other ``Only one store'' entries in
Table~\ref{tab:advsync:Memory-Barrier Combinations}
be used?
\QuickA{}
For combination~2, if CPU~1's load from B sees a value prior
to CPU~2's store to B, then we know that CPU~2's load from A
will return the same value as CPU~1's load from A, or some later
value.
For combination~4, if CPU~2's load from B sees the value from
CPU~1's store to B, then we know that CPU~2's load from A
will return the same value as CPU~1's load from A, or some later
value.
For combination~8, if CPU~2's load from A sees CPU~1's store
to A, then we know that CPU~1's load from B will return the same
value as CPU~2's load from A, or some later value.
\QuickQ{}
How could the assertion {\tt b==2} on
page~\pageref{codesample:advsync:What Can You Count On? 1}
possibly fail?
\QuickA{}
If the CPU is not required to see all of its loads and
stores in order, then the {\tt b=1+a} might well see an
old version of the variable ``a''.
This is why it is so very important that each CPU or thread
see all of its own loads and stores in program order.
\QuickQ{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 2}
possibly leak memory?
\QuickA{}
Only the first execution of the critical section should
see {\tt p==NULL}.
However, if there is no global ordering of critical sections for
{\tt mylock}, then how can you say that a particular one was
first?
If several different executions of that critical section thought
that they were first, they would all see {\tt p==NULL}, and
they would all allocate memory.
All but one of those allocations would be leaked.
This is why it is so very important that all the critical sections
for a given exclusive lock appear to execute in some well-defined
order.
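A critical section of the following general form (a sketch only,
using the \co{acquire_lock()}/\co{release_lock()} pseudocode
convention used elsewhere in this book rather than the referenced
code sample itself) illustrates the problem:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Sketch: one-time allocation guarded by mylock.  If several
 * executions of this critical section could each believe
 * themselves to be first, each would see p == NULL and
 * allocate, leaking all but one of the allocations. */
acquire_lock(&mylock);
if (p == NULL)
  p = malloc(sizeof(*p));
release_lock(&mylock);
\end{verbatim}
\end{minipage}
\vspace{5pt}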
\QuickQ{}
How could the code on
page~\pageref{codesample:advsync:What Can You Count On? 2}
possibly count backwards?
\QuickA{}
Suppose that the counter started out with the value zero,
and that three executions of the critical section had therefore
brought its value to three.
If the fourth execution of the critical section is not constrained
to see the most recent store to this variable, it might well see
the original value of zero, and therefore set the counter to
one, which would be going backwards.
This is why it is so very important that loads from a given variable
in a given critical
section see the last store from the last prior critical section to
store to that variable.
\QuickQ{}
What effect does the following sequence have on the
order of stores to variables ``a'' and ``b''? \\
{\tt ~~~~a = 1;} \\
{\tt ~~~~b = 1;} \\
{\tt ~~~~<write barrier>}
\QuickA{}
Absolutely none. This barrier {\em would} ensure that the
assignments to ``a'' and ``b'' happened before any subsequent
assignments, but it does nothing to enforce any order of
assignments to ``a'' and ``b'' themselves.
\QuickQ{}
What sequence of LOCK-UNLOCK operations \emph{would}
act as a full memory barrier?
\QuickA{}
A series of two back-to-back LOCK-UNLOCK operations, or, somewhat
less conventionally, an UNLOCK operation followed by a LOCK
operation.
\QuickQ{}
What (if any) CPUs have memory-barrier instructions
from which these semi-permeable locking primitives might
be constructed?
\QuickA{}
Itanium is one example.
The identification of any others is left as an
exercise for the reader.
\QuickQ{}
Given that operations grouped in curly braces are executed
concurrently, which of the rows of
Table~\ref{tab:advsync:Lock-Based Critical Sections}
are legitimate reorderings of the assignments to variables
``A'' through ``F'' and the LOCK/UNLOCK operations?
(The order in the code is A, B, LOCK, C, D, UNLOCK, E, F.)
Why or why not?
\QuickA{}
\begin{enumerate}
\item Legitimate, executed in order.
\item Legitimate, the lock acquisition was executed concurrently
with the last assignment preceding the critical section.
\item Illegitimate, the assignment to ``F'' must follow the LOCK
operation.
\item Illegitimate, the LOCK must complete before any operation in
the critical section. However, the UNLOCK may legitimately
be executed concurrently with subsequent operations.
\item Legitimate, the assignment to ``A'' precedes the UNLOCK,
as required, and all other operations are in order.
\item Illegitimate, the assignment to ``C'' must follow the LOCK.
\item Illegitimate, the assignment to ``D'' must precede the UNLOCK.
\item Legitimate, all assignments are ordered with respect to the
LOCK and UNLOCK operations.
\item Illegitimate, the assignment to ``A'' must precede the UNLOCK.
\end{enumerate}
\QuickQ{}
What are the constraints for
Table~\ref{tab:advsync:Ordering With Multiple Locks}?
\QuickA{}
All CPUs must see the following ordering constraints:
\begin{enumerate}
\item LOCK M precedes B, C, and D.
\item UNLOCK M follows A, B, and C.
\item LOCK Q precedes F, G, and H.
\item UNLOCK Q follows E, F, and G.
\end{enumerate}
\QuickQAC{chp:Ease of Use}{Ease of Use}
\QuickQ{}
Can a similar algorithm be used when deleting elements?
\QuickA{}
Yes.
However, since each thread must hold the locks of three
consecutive elements to delete the middle one, if there
are $N$ threads, there must be $2N+1$ elements (rather than
just $N+1$) in order to avoid deadlock.
\QuickQ{}
Yetch!
What ever possessed someone to come up with an algorithm
that deserves to be shaved as much as this one does???
\QuickA{}
That would be Paul.
He was considering the \emph{Dining Philosopher's Problem}, which
involves a rather unsanitary spaghetti dinner attended by
five philosophers.
Given that there are five plates and but five forks on the table, and
given that each philosopher requires two forks at a time to eat,
one is supposed to come up with a fork-allocation algorithm that
avoids deadlock.
Paul's response was ``Sheesh! Just get five more forks!''.
This in itself was OK, but Paul then applied this same solution to
circular linked lists.
This would not have been so bad either, but he had to go and tell
someone about it!
\QuickQ{}
Give an exception to this rule.
\QuickA{}
One exception would be a difficult and complex algorithm that
was the only one known to work in a given situation.
Another exception would be a difficult and complex algorithm
that was nonetheless the simplest of the set known to work in
a given situation.
However, even in these cases, it may be very worthwhile to spend
a little time trying to come up with a simpler algorithm!
After all, if you managed to invent the first algorithm
to do some task, it shouldn't be that hard to go on to
invent a simpler one.
\QuickQAC{chp:Conflicting Visions of the Future}{Conflicting Visions of the Future}
\QuickQ{}
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
\QuickA{}
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
\QuickQ{}
Why would it matter that oft-written variables shared the cache
line with the lock variable?
\QuickA{}
If the lock is in the same cacheline as some of the variables
that it is protecting, then writes to those variables by one CPU
will invalidate that cache line for all the other CPUs.
These invalidations will
generate large numbers of conflicts and retries, perhaps even
degrading performance and scalability compared to locking.
\QuickQ{}
Why are relatively small updates important to HTM performance
and scalability?
\QuickA{}
The larger the updates, the greater the probability of conflict,
and thus the greater probability of retries, which degrade
performance.
\QuickQ{}
How could a red-black tree possibly efficiently enumerate all
elements of the tree regardless of choice of synchronization
mechanism???
\QuickA{}
In many cases, the enumeration need not be exact.
In these cases, hazard pointers or RCU may be used to protect
readers with low probability of conflict with any given insertion
or deletion.
\QuickQ{}
But why can't a debugger emulate single stepping by setting
breakpoints at successive lines of the transaction, relying
on the retry to retrace the steps of the earlier instances
of the transaction?
\QuickA{}
This scheme might work with reasonably high probability, but it
can fail in ways that would be quite surprising to most users.
To see this, consider the following transaction:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 begin_trans();
2 if (a) {
3   do_one_thing();
4   do_another_thing();
5 } else {
6   do_a_third_thing();
7   do_a_fourth_thing();
8 }
9 end_trans();
\end{verbatim}
\end{minipage}
\vspace{5pt}
Suppose that the user sets a breakpoint at line 3, which triggers,
aborting the transaction and entering the debugger.
Suppose that between the time that the breakpoint triggers
and the debugger gets around to stopping all the threads, some
other thread sets the value of \co{a} to zero.
When the poor user attempts to single-step the program, surprise!
The program is now in the else-clause instead of the then-clause.
This is \emph{not} what I call an easy-to-use debugger.
\QuickQ{}
But why would \emph{anyone} need an empty lock-based critical
section???
\QuickA{}
See the answer to the Quick Quiz in
Section~\ref{sec:locking:Exclusive Locks}.
However, it is claimed that given a strongly atomic HTM
implementation without forward-progress guarantees, any
memory-based locking design based on empty critical sections
will operate correctly in the presence of transactional
lock elision.
Although I have not seen a proof of this statement, there
is a straightforward rationale for this claim.
The main idea is that in a strongly atomic HTM implementation,
the results of a given transaction are not visible until
after the transaction completes successfully.
Therefore, if you can see that a transaction has started,
it is guaranteed to have already completed, which means
that a subsequent empty lock-based critical section will
successfully ``wait'' on it---after all, there is no waiting
required.
This line of reasoning does not apply to weakly atomic
systems (including many STM implementations), and it also
does not apply to lock-based programs that use means other
than memory to communicate.
One such means is the passage of time (for example, in
hard real-time systems) or flow of priority (for example,
in soft real-time systems).
Locking designs that rely on priority boosting are of particular
interest.
\QuickQ{}
Can't transactional lock elision trivially handle locking's
time-based messaging semantics
by simply choosing not to elide empty lock-based critical sections?
\QuickA{}
It could do so, but this would be both unnecessary and
insufficient.
It would be unnecessary in cases where the empty critical section
was due to conditional compilation.
Here, it might well be that the only purpose of the lock was to
protect data, so eliding it completely would be the right thing
to do.
In fact, leaving the empty lock-based critical section would
degrade performance and scalability.
On the other hand, it is possible for a non-empty lock-based
critical section to be relying on both the data-protection
and the time-based messaging semantics of locking.
Using transactional lock elision in such a case would be
incorrect, and would result in bugs.
\QuickQ{}
Given modern hardware~\cite{PeterOkech2009InherentRandomness},
how can anyone possibly expect parallel software relying
on timing to work?
\QuickA{}
The short answer is that on commonplace commodity hardware,
synchronization designs based on any sort of fine-grained
timing are foolhardy and cannot be expected to operate correctly
under all conditions.
That said, there are systems designed for hard real-time use
that are much more deterministic.
In the (very unlikely) event that you are using such a system,
here is a toy example showing how time-based synchronization can
work.
Again, do \emph{not} try this on commodity microprocessors,
as they have highly nondeterministic performance characteristics.
This example uses multiple worker threads along with a control
thread.
Each worker thread corresponds to an outbound data feed, and
records the current time (for example, from the
\co{clock_gettime()} system call) in a per-thread
\co{my_timestamp} variable after executing each unit
of work.
The real-time nature of this example results in the following
set of constraints:
\begin{enumerate}
\item It is a fatal error for a given worker thread to fail
to update its timestamp for a time period of more than
\co{MAX_LOOP_TIME}.
\item Locks are used sparingly to access and update global
state.
\item Locks are granted in strict FIFO order within
a given thread priority.
\end{enumerate}
When worker threads complete their feed, they must disentangle
themselves from the rest of the application and place a status
value in a per-thread \co{my_status} variable that is initialized
to -1.
Threads do not exit; they instead are placed on a thread pool
to accommodate later processing requirements.
The control thread assigns (and re-assigns) worker threads as
needed, and also maintains a histogram of thread statuses.
The control thread runs at a real-time priority no higher than
that of the worker threads.
Worker threads' code is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
 1 int my_status = -1; /* Thread local. */
 2
 3 while (continue_working()) {
 4   enqueue_any_new_work();
 5   wp = dequeue_work();
 6   do_work(wp);
 7   my_timestamp = clock_gettime(...);
 8 }
 9
10 acquire_lock(&departing_thread_lock);
11
12 /*
13  * Disentangle from application, might
14  * acquire other locks, can take much longer
15  * than MAX_LOOP_TIME, especially if many
16  * threads exit concurrently.
17  */
18 my_status = get_return_status();
19 release_lock(&departing_thread_lock);
20
21 /* thread awaits repurposing. */
\end{verbatim}
\end{minipage}
\vspace{5pt}
The control thread's code is as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
 1 for (;;) {
 2   for_each_thread(t) {
 3     ct = clock_gettime(...);
 4     d = ct - per_thread(my_timestamp, t);
 5     if (d >= MAX_LOOP_TIME) {
 6       /* thread departing. */
 7       acquire_lock(&departing_thread_lock);
 8       release_lock(&departing_thread_lock);
 9       i = per_thread(my_status, t);
10       status_hist[i]++; /* Bug if TLE! */
11     }
12   }
13   /* Repurpose threads as needed. */
14 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Line~5 uses the passage of time to deduce that the thread
has exited, executing lines~6-10 if so.
The empty lock-based critical section on lines~7 and~8
guarantees that any thread in the process of exiting
completes (remember that locks are granted in FIFO order!).
Once again, do not try this sort of thing on commodity
microprocessors.
After all, it is difficult enough to get right on systems
specifically designed for hard real-time use!
\QuickQ{}
But the \co{boostee()} function in
Figure~\ref{fig:future:Exploiting Priority Boosting}
alternatively acquires its locks in reverse order!
Won't this result in deadlock?
\QuickA{}
No deadlock will result.
To arrive at deadlock, two different threads must each
acquire the two locks in opposite orders, which does not
happen in this example.
However, deadlock detectors such as
lockdep~\cite{JonathanCorbet2006lockdep}
will flag this as a false positive.
\QuickQAC{cha:app:Important Questions}{Important Questions}
\QuickQ{}
What SMP coding errors can you see in these examples?
See \url{time.c} for full code.
\QuickA{}
\begin{enumerate}
\item Missing \co{barrier()} or \co{volatile} casts on tight loops.
\item Missing memory barriers on the update side.
\item Lack of synchronization between producer and consumer.
\end{enumerate}
\QuickQ{}
How could there be such a large gap between successive
consumer reads?
See \url{timelocked.c} for full code.
\QuickA{}
\begin{enumerate}
\item The consumer might be preempted for long time periods.
\item A long-running interrupt might delay the consumer.
\item The producer might also be running on a faster CPU than is the
consumer (for example, one of the CPUs might have had to
decrease its
clock frequency due to heat-dissipation or power-consumption
constraints).
\end{enumerate}
\QuickQAC{app:primitives:Synchronization Primitives}{Synchronization Primitives}
\QuickQ{}
Give an example of a parallel program that could be written
without synchronization primitives.
\QuickA{}
There are many examples.
One of the simplest would be a parametric study using a
single independent variable.
If the program {\tt run\_study} took a single argument,
then we could use the following bash script to run two
instances in parallel, as might be appropriate on a
two-CPU system:
{ \scriptsize \tt run\_study 1 > 1.out\& run\_study 2 > 2.out; wait}
One could of course argue that the bash ampersand operator and
the ``wait'' primitive are in fact synchronization primitives.
If so, then consider that
this script could be run manually in two separate
command windows, so that the only synchronization would be
supplied by the user himself or herself.
\QuickQ{}
What problems could occur if the variable {\tt counter} were
incremented without the protection of {\tt mutex}?
\QuickA{}
On CPUs with load-store architectures, incrementing {\tt counter}
might compile into something like the following:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
LOAD counter,r0
INC r0
STORE r0,counter
\end{verbatim}
\end{minipage}
\vspace{5pt}
On such machines, two threads might simultaneously load the
value of {\tt counter}, each increment it, and each store the
result.
The new value of {\tt counter} will then only be one greater
than before, despite two threads each incrementing it.
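To see this effect in action, consider the following minimal program
(a sketch using POSIX threads rather than this book's API), which may
well print a final count noticeably smaller than 2,000,000 when run
on multicore hardware:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Sketch only: demonstrates the lost-update problem. */
#include <pthread.h>
#include <stdio.h>

static long counter;

static void *incrementer(void *arg)
{
  int i;

  for (i = 0; i < 1000000; i++)
    counter++;  /* Unprotected read-modify-write. */
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;

  pthread_create(&t1, NULL, incrementer, NULL);
  pthread_create(&t2, NULL, incrementer, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %ld (expected 2000000)\n", counter);
  return 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}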
\QuickQ{}
How could you work around the lack of a per-thread-variable
API on systems that do not provide it?
\QuickA{}
One approach would be to create an array indexed by
{\tt smp\_thread\_id()}, and another would be to use a hash
table to map from {\tt smp\_thread\_id()} to an array
index --- which is in fact what this
set of APIs does in pthread environments.
Another approach would be for the parent to allocate a structure
containing fields for each desired per-thread variable, then
pass this to the child during thread creation.
However, this approach can impose large software-engineering
costs in large systems.
To see this, imagine if all global variables in a large system
had to be declared in a single file, regardless of whether or
not they were C static variables!
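For example, on POSIX systems one could fall back on thread-specific
data, roughly as in the following sketch (the \co{my_counter()}
helper is hypothetical and is not part of this book's API):
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
/* Sketch only: per-thread storage via POSIX
 * thread-specific data. */
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

static void counter_key_init(void)
{
  (void)pthread_key_create(&counter_key, free);
}

/* Return a pointer to the calling thread's private counter,
 * creating it on first use. */
static long *my_counter(void)
{
  long *p;

  (void)pthread_once(&counter_once, counter_key_init);
  p = pthread_getspecific(counter_key);
  if (p == NULL) {
    p = calloc(1, sizeof(*p));
    (void)pthread_setspecific(counter_key, p);
  }
  return p;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}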
\QuickQAC{chp:app:whymb:Why Memory Barriers?}{Why Memory Barriers?}
\QuickQ{}
Where does a writeback message originate from and where does
it go to?
\QuickA{}
The writeback message originates from a given CPU, or in some
designs from a given level of a given CPU's cache---or even
from a cache that might be shared among several CPUs.
The key point is that a given cache does not have room for
a given data item, so some other piece of data must be ejected
from the cache to make room.
If there is some other piece of data that is duplicated in some
other cache or in memory, then that piece of data may be simply
discarded, with no writeback message required.
On the other hand, if every piece of data that might be ejected
has been modified so that the only up-to-date copy is in this
cache, then one of those data items must be copied somewhere
else.
This copy operation is undertaken using a ``writeback message''.
The destination of the writeback message has to be something
that is able to store the new value.
This might be main memory, but it also might be some other cache.
If it is a cache, it is normally a higher-level cache for the
same CPU, for example, a level-1 cache might write back to a
level-2 cache.
However, some hardware designs permit cross-CPU writebacks,
so that CPU~0's cache might send a writeback message to CPU~1.
This would normally be done if CPU~1 had somehow indicated
an interest in the data, for example, by having recently
issued a read request.
In short, a writeback message is sent from some part of the
system that is short of space, and is received by some other
part of the system that can accommodate the data.
\QuickQ{}
What happens if two CPUs attempt to invalidate the
same cache line concurrently?
\QuickA{}
One of the CPUs gains access
to the shared bus first,
and that CPU ``wins''. The other CPU must invalidate its copy of the
cache line and transmit an ``invalidate acknowledge'' message
to the other CPU. \\
Of course, the losing CPU can be expected to immediately issue a
``read invalidate'' transaction, so the winning CPU's victory will
be quite ephemeral.
\QuickQ{}
When an ``invalidate'' message appears in a large multiprocessor,
every CPU must give an ``invalidate acknowledge'' response.
Wouldn't the resulting ``storm'' of ``invalidate acknowledge''
responses totally saturate the system bus?
\QuickA{}
It might, if large-scale multiprocessors were in fact implemented
that way. Larger multiprocessors, particularly NUMA machines,
tend to use so-called ``directory-based'' cache-coherence
protocols to avoid this and other problems.
\QuickQ{}
If SMP machines are really using message passing
anyway, why bother with SMP at all?
\QuickA{}
There has been quite a bit of controversy on this topic over
the past few decades. One answer is that the cache-coherence
protocols are quite simple, and therefore can be implemented
directly in hardware, gaining bandwidths and latencies
unattainable by software message passing. Another answer is that
the real truth is to be found in economics due to the relative
prices of large SMP machines and that of clusters of smaller
SMP machines. A third answer is that the SMP programming
model is easier to use than that of distributed systems, but
a rebuttal might note the appearance of HPC clusters and MPI.
And so the argument continues.
\QuickQ{}
How does the hardware handle the delayed transitions
described above?
\QuickA{}
Usually by adding additional states, though these additional
states need not be actually stored with the cache line, due to
the fact that only a few lines at a time will be transitioning.
The need to delay transitions is but one issue that results in
real-world cache coherence protocols being much more complex than
the over-simplified MESI protocol described in this appendix.
Hennessy and Patterson's classic introduction to computer
architecture~\cite{Hennessy95a} covers many of these issues.
\QuickQ{}
What sequence of operations would put the CPUs' caches
all back into the ``invalid'' state?
\QuickA{}
There is no such sequence, at least in absence of special
``flush my cache'' instructions in the CPU's instruction set.
Most CPUs do have such instructions.
\QuickQ{}
But if the main purpose of store buffers is to hide acknowledgment
latencies in multiprocessor cache-coherence protocols, why
do uniprocessors also have store buffers?
\QuickA{}
Because the purpose of store buffers is not just to hide
acknowledgement latencies in multiprocessor cache-coherence protocols,
but to hide memory latencies in general.
Because memory is much slower than is cache on uniprocessors,
store buffers on uniprocessors can help to hide write-miss
latencies.
\QuickQ{}
In step~1 above, why does CPU~0 need to issue a ``read invalidate''
rather than a simple ``invalidate''?
\QuickA{}
Because the cache line in question contains more than just the
variable \co{a}.
\QuickQ{}
In step~1 of the first scenario in
Section~\ref{sec:app:whymb:Invalidate Queues and Memory Barriers},
why is an ``invalidate'' sent instead of a ``read invalidate''
message?
Doesn't CPU~0 need the values of the other variables that share
this cache line with ``a''?
\QuickA{}
CPU~0 already has the values of these variables, given that it
has a read-only copy of the cache line containing ``a''.
Therefore, all CPU~0 need do is to cause the other CPUs to discard
their copies of this cache line.
An ``invalidate'' message therefore suffices.
\QuickQ{}
Say what???
Why do we need a memory barrier here, given that the CPU cannot
possibly execute the \co{assert()} until after the
\co{while} loop completes?
\QuickA{}
CPUs are free to speculatively execute, which can have the effect
of executing the assertion before the \co{while} loop completes.
Furthermore, compilers normally assume that only the currently
executing thread is updating the variables, and this assumption
allows the compiler to hoist the load of \co{a} to precede the
loop.
In fact, some compilers would transform the loop to a branch
around an infinite loop as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 void foo(void)
 2 {
 3   a = 1;
 4   smp_mb();
 5   b = 1;
 6 }
 7
 8 void bar(void)
 9 {
10   if (b == 0)
11     for (;;)
12       continue;
13   smp_mb();
14   assert(a == 1);
15 }
\end{verbatim}
\end{minipage}
\vspace{5pt}
Given this optimization, the assertion could clearly fire.
You should use volatile casts or (where available) C++
relaxed atomics to prevent the compiler from optimizing
your parallel code into oblivion.
In short, both compilers and CPUs are quite aggressive about
optimizing, so you must clearly communicate your constraints
to them, using compiler directives and memory barriers.
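As a rough illustration of the volatile-cast approach (a sketch only,
using simplified macros in the style of the Linux kernel rather than
its actual definitions, and assuming the declarations of \co{a},
\co{b}, and \co{smp_mb()} from the example above), \co{foo()} and
\co{bar()} might be coded as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
/* Simplified volatile-access macros (gcc typeof extension). */
#define READ_ONCE(x)     (*(volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

void foo(void)
{
  WRITE_ONCE(a, 1);
  smp_mb();
  WRITE_ONCE(b, 1);
}

void bar(void)
{
  while (READ_ONCE(b) == 0)
    continue;  /* Load of b can no longer be hoisted. */
  smp_mb();
  assert(READ_ONCE(a) == 1);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}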
\QuickQ{}
Does the guarantee that each CPU sees its own memory accesses
in order also guarantee that each user-level thread will see
its own memory accesses in order?
Why or why not?
\QuickA{}
No. Consider the case where a thread migrates from one CPU to
another, and where the destination CPU perceives the source
CPU's recent memory operations out of order. To preserve
user-mode sanity, kernel hackers must use memory barriers in
the context-switch path. However, the locking already required
to safely do a context switch should automatically provide
the memory barriers needed to cause the user-level task to see
its own accesses in order. That said, if you are designing a
super-optimized scheduler, either in the kernel or at user level,
please keep this scenario in mind!
\QuickQ{}
Could this code be fixed by inserting a memory barrier
between CPU~1's ``while'' and assignment to ``c''?
Why or why not?
\QuickA{}
No. Such a memory barrier would only force ordering local to CPU~1.
It would have no effect on the relative ordering of CPU~0's and
CPU~1's accesses, so the assertion could still fail.
However, all mainstream computer systems provide one mechanism
or another to provide ``transitivity'', which provides
intuitive causal ordering: if B saw the effects of A's accesses,
and C saw the effects of B's accesses, then C must also see
the effects of A's accesses.
In short, hardware designers have taken at least a little pity
on software developers.
\QuickQ{}
Suppose that lines~3-5 for CPUs~1 and 2 in
Table~\ref{tab:app:whymb:Memory Barrier Example 3}
are in an interrupt
handler, and that the CPU~2's line~9 is run at process level.
What changes, if any, are required to enable the code to work
correctly, in other words, to prevent the assertion from firing?
\QuickA{}
The assertion will need to be written to ensure that the load of
``e'' precedes that of ``a''.
In the Linux kernel, the barrier() primitive may be used to accomplish
this in much the same way that the memory barrier was used in the
assertions in the previous examples.
\QuickQ{}
If CPU~2 executed an \co{assert(e==0||c==1)} in the example in
Table~\ref{tab:app:whymb:Memory Barrier Example 3},
would this assert ever trigger?
\QuickA{}
The result depends on whether the CPU supports ``transitivity.''
In other words, CPU~0 stored to ``e'' after seeing CPU~1's
store to ``c'', with a memory barrier between CPU~0's load
from ``c'' and store to ``e''.
If some other CPU sees CPU~0's store to ``e'', is it also
guaranteed to see CPU~1's store?
All CPUs I am aware of claim to provide transitivity.
\QuickQ{}
Why is Alpha's \co{smp_read_barrier_depends()} an
\co{smp_mb()} rather than \co{smp_rmb()}?
\QuickA{}
First, Alpha has only \co{mb} and \co{wmb} instructions,
so \co{smp_rmb()} would be implemented by the Alpha \co{mb}
instruction in either case.
More importantly, \co{smp_read_barrier_depends()} must
order subsequent stores.
For example, consider the following code:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 p = global_pointer;
2 smp_read_barrier_depends();
3 if (do_something_with(p->a, p->b) == 0)
4   p->hey_look = 1;
\end{verbatim}
\end{minipage}
\vspace{5pt}
Here the store to \co{p->hey_look} must be ordered,
not just the loads from \co{p->a} and \co{p->b}.
\QuickQAC{app:rcuimpl:Read-Copy Update Implementations}{Read-Copy Update Implementations}
\QuickQ{}
Why is sleeping prohibited within Classic RCU read-side
critical sections?
\QuickA{}
Because sleeping implies a context switch, which in Classic RCU is
a quiescent state, and RCU's grace-period detection requires that
quiescent states never appear in RCU read-side critical sections.
\QuickQ{}
Why not permit sleeping in Classic RCU read-side critical sections
by eliminating context switch as a quiescent state, leaving user-mode
execution and idle loop as the remaining quiescent states?
\QuickA{}
This would mean that a system undergoing heavy kernel-mode
execution load (e.g., due to kernel threads) might never
complete a grace period, which
would cause it to exhaust memory sooner or later.
\QuickQ{}
Why is it OK to assume that updates separated by
{\tt synchronize\_sched()} will be performed in order?
\QuickA{}
Because this property is required for the {\tt synchronize\_sched()}
aspect of RCU to work at all.
For example, consider a code sequence that removes an object
from a list, invokes {\tt synchronize\_sched()}, then frees
the object.
If this property did not hold, then that object might appear
to be freed before it was
removed from the list, which is precisely the situation that
{\tt synchronize\_sched()} is supposed to prevent!
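For example, the usual removal pattern (a sketch with illustrative
names, not a specific kernel code sequence) relies on exactly this
ordering:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
list_del_rcu(&p->list);  /* Unlink the element. */
synchronize_sched();     /* Wait for pre-existing readers. */
kfree(p);                /* Safe: no reader still references p. */
\end{verbatim}
\end{minipage}
\vspace{5pt}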
\QuickQ{}
Why must line~17 in {\tt synchronize\_srcu()}
(Figure~\ref{fig:app:rcuimpl:Update-Side Implementation})
precede the release of the mutex on line~18?
What would have to change to permit these two lines to be
interchanged?
Would such a change be worthwhile?
Why or why not?
\QuickA{}
Suppose that the order was reversed, and that CPU~0
has just reached line~13 of
{\tt synchronize\_srcu()}, while both CPU~1 and CPU~2 start executing
another {\tt synchronize\_srcu()} each, and CPU~3 starts executing a
{\tt srcu\_read\_lock()}.
Suppose that CPU~1 reaches line~6 of {\tt synchronize\_srcu()}
just before CPU~0 increments the counter on line~13.
Most importantly, suppose that
CPU~3 executes {\tt srcu\_read\_lock()}
out of order with the following SRCU read-side critical section,
so that it acquires a reference to some SRCU-protected data
structure \emph{before} CPU~0 increments {\tt sp->completed}, but
executes the {\tt srcu\_read\_lock()} \emph{after} CPU~0 does
this increment.
Then CPU~0 will \emph{not} wait for CPU~3 to complete its
SRCU read-side critical section before exiting the ``while''
loop on lines~15-16 and releasing the mutex (remember, the
CPU could be reordering the code).
Now suppose that CPU~2 acquires the mutex next,
and again increments {\tt sp->completed}.
This CPU will then have to wait for CPU~3 to exit its SRCU
read-side critical section before exiting the loop on
lines~15-16 and releasing the mutex.
But suppose that CPU~3 again executes out of order,
completing the {\tt srcu\_read\_unlock()} prior to
executing a final reference to the pointer it obtained
when entering the SRCU read-side critical section.
CPU~1 will then acquire the mutex, but see that the
{\tt sp->completed} counter has incremented twice, and
therefore take the early exit.
The caller might well free up the element that CPU~3 is
still referencing (due to CPU~3's out-of-order execution).
To prevent this perhaps improbable, but entirely possible,
scenario, the final {\tt synchronize\_sched()} must precede
the mutex release in {\tt synchronize\_srcu()}.
Another approach would be to change the comparison on
line~7 of {\tt synchronize\_srcu()} to check for at
least three increments of the counter.
However, such a change would increase the latency of a
``bulk update'' scenario, where a hash table is being updated
or unloaded using multiple threads.
In the current code, the latency of the resulting concurrent
{\tt synchronize\_srcu()} calls would take at most two SRCU
grace periods, while with this change, three would be required.
More experience will be required to determine which approach
is really better.
For one thing, there must first be some use of SRCU with
multiple concurrent updaters.
\QuickQ{}
Wait a minute!
With all those new locks, how do you avoid deadlock?
\QuickA{}
Deadlock is avoided by never holding more than one of the
\co{rcu_node} structures' locks at a given time.
This algorithm uses two more locks, one to prevent CPU hotplug
operations from running concurrently with grace-period advancement
(\co{onofflock}) and another
to permit only one CPU at a time from forcing a quiescent state
to end quickly (\co{fqslock}).
These are subject to a locking hierarchy, so that
\co{fqslock} must be acquired before
\co{onofflock}, which in turn must be acquired before
any of the \co{rcu_node} structures' locks.
Also, as a practical matter, refusing to ever hold more than
one of the \co{rcu_node} locks means that it is unnecessary
to track which ones are held.
Such tracking would be painful as well as unnecessary.
\QuickQ{}
Why stop at a 64-times reduction?
Why not go for a few orders of magnitude instead?
\QuickA{}
RCU works with no problems on
systems with a few hundred CPUs, so allowing 64 CPUs to contend on
a single lock leaves plenty of headroom.
Keep in mind that these locks are acquired quite rarely, as each
CPU will check in about one time per grace period, and grace periods
extend for milliseconds.
\QuickQ{}
But I don't care about McKenney's lame excuses in the answer to
Quick Quiz 2!!!
I want to get the number of CPUs contending on a single lock down
to something reasonable, like sixteen or so!!!
\QuickA{}
OK, have it your way, then!
Set \co{CONFIG_RCU_FANOUT=16} and (for \co{NR_CPUS=4096})
you will get a
three-level hierarchy with 256 \co{rcu_node} structures
at the lowest level, 16 \co{rcu_node} structures as intermediate
nodes, and a single root-level \co{rcu_node}.
The penalty you will pay is that more \co{rcu_node} structures
will need to be scanned when checking to see which CPUs need help
completing their quiescent states (256 instead of only 64).
\QuickQ{}
OK, so what is the story with the colors?
\QuickA{}
Data structures analogous to \co{rcu_state} (including
\co{rcu_ctrlblk}) are yellow,
those containing the bitmaps used to determine when CPUs have checked
in are pink,
and the per-CPU \co{rcu_data} structures are blue.
The data structures used to conserve energy
(such as \co{rcu_dynticks}) will be colored green.
\QuickQ{}
Given such an egregious bug, why does Linux run at all?
\QuickA{}
Because the Linux kernel contains device drivers that are (relatively)
well behaved.
Few if any of them spin in RCU read-side critical sections for the
many milliseconds that would be required to provoke this bug.
The bug nevertheless does need to be fixed, and this variant of
RCU does fix it.
\QuickQ{}
But doesn't this state diagram indicate that dyntick-idle CPUs will
get hit with reschedule IPIs? Won't that wake them up?
\QuickA{}
No.
Keep in mind that RCU is handling groups of CPUs.
One particular group might contain both dyntick-idle CPUs and
CPUs in normal mode that have somehow managed to avoid passing through
a quiescent state.
Only the latter group will be sent a reschedule IPI; the dyntick-idle
CPUs will merely be marked as being in an extended quiescent state.
\QuickQ{}
But what happens if a CPU tries to report going through a quiescent
state (by clearing its bit) before the bit-setting CPU has finished?
\QuickA{}
There are three cases to consider here:
\begin{enumerate}
\item A CPU corresponding to a not-yet-initialized leaf
\co{rcu_node} structure tries to report a quiescent state.
This CPU will see its bit already cleared, so will give up on
reporting its quiescent state.
Some later quiescent state will serve for the new grace period.
\item A CPU corresponding to a leaf \co{rcu_node} structure that
is currently being initialized tries to report a quiescent
state.
This CPU will see that the \co{rcu_node} structure's
\co{->lock} is held, so will spin until it is
released.
But once the lock is released, the \co{rcu_node}
structure will have been initialized, reducing to the
following case.
\item A CPU corresponding to a leaf \co{rcu_node} that has
already been initialized tries to report a quiescent state.
This CPU will find its bit set, and will therefore clear it.
If it is the last CPU for that leaf node, it will
move up to the next level of the hierarchy.
However, this CPU cannot possibly be the last CPU in the
system to report a quiescent state, given that the CPU
doing the initialization cannot yet have checked in.
\end{enumerate}
So, in all three cases, the potential race is resolved correctly.
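The three cases above can be summarized by the following hedged
sketch of per-leaf quiescent-state reporting.
The \co{sketch_rcu_node} structure and \co{sketch_report_qs()}
function are illustrative, loosely modeled on the text rather than
copied from the Linux kernel:
{\small
\begin{verbatim}
struct sketch_rcu_node {
    spinlock_t lock;
    unsigned long qsmask; /* bit per CPU still owing a QS */
};

static void sketch_report_qs(struct sketch_rcu_node *rnp,
                             unsigned long mask)
{
    unsigned long flags;

    /* Case 2: spins here until initialization drops the lock. */
    spin_lock_irqsave(&rnp->lock, flags);
    if (!(rnp->qsmask & mask)) {
        /* Case 1: bit already clear, give up on reporting. */
        spin_unlock_irqrestore(&rnp->lock, flags);
        return;
    }
    /* Case 3: record this CPU's quiescent state. */
    rnp->qsmask &= ~mask;
    if (rnp->qsmask == 0) {
        /* Last CPU for this leaf: propagate upwards (not shown). */
    }
    spin_unlock_irqrestore(&rnp->lock, flags);
}
\end{verbatim}
}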
\QuickQ{}
And what happens if \emph{all} CPUs try to report going
through a quiescent
state before the bit-setting CPU has finished, thus ending the new
grace period before it starts?
\QuickA{}
The bit-setting CPU cannot pass through a
quiescent state during initialization, as it has irqs disabled.
Its bits therefore remain non-zero, preventing the grace period from
ending until the data structure has been fully initialized.
\QuickQ{}
And what happens if one CPU comes out of dyntick-idle mode and then
passed through a quiescent state just as another CPU notices that the
first CPU was in dyntick-idle mode?
Couldn't they both attempt to report a quiescent state at the same
time, resulting in confusion?
\QuickA{}
They will both attempt to acquire the lock on the same leaf
\co{rcu_node} structure.
The first one to acquire the lock will report the quiescent state
and clear the appropriate bit, and the second one to acquire the
lock will see that this bit has already been cleared.
\QuickQ{}
But what if \emph{all} the CPUs end up in dyntick-idle mode?
Wouldn't that prevent the current RCU grace period from ever ending?
\QuickA{}
Indeed it will!
However, CPUs that have RCU callbacks are not permitted to enter
dyntick-idle mode, so the only way that \emph{all} the CPUs could
possibly end up in dyntick-idle mode would be if there were
absolutely no RCU callbacks in the system.
And if there are no RCU callbacks in the system, then there is no
need for the RCU grace period to end.
In fact, there is no need for the RCU grace period to even
\emph{start}.
RCU will restart if some irq handler does a \co{call_rcu()},
which will cause an RCU callback to appear on the corresponding CPU,
which will force that CPU out of dyntick-idle mode, which will in turn
permit the current RCU grace period to come to an end.
\QuickQ{}
Given that \co{force_quiescent_state()} is a three-phase state
machine, don't we have triple the scheduling latency due to scanning
all the CPUs?
\QuickA{}
Ah, but the three phases will not execute back-to-back on the same CPU,
and, furthermore, the first (initialization) phase doesn't do any
scanning.
Therefore, the scheduling-latency hit of the three-phase algorithm
is no different than that of a single-phase algorithm.
If the scheduling latency becomes a problem, one approach would be to
recode the state machine to scan the CPUs incrementally, most likely
by keeping state on a per-leaf-\co{rcu_node} basis.
But first show me a problem in the real world, \emph{then}
I will consider fixing it!
\QuickQ{}
But the other reason to hold \co{->onofflock} is to prevent
multiple concurrent online/offline operations, right?
\QuickA{}
Actually, no!
The CPU-hotplug code's synchronization design prevents multiple
concurrent CPU online/offline operations, so only one CPU
online/offline operation can be executing at any given time.
Therefore, the only purpose of \co{->onofflock} is to prevent a CPU
online or offline operation from running concurrently with grace-period
initialization.
\QuickQ{}
Given all these acquisitions of the global \co{->onofflock},
won't there
be horrible lock contention when running with thousands of CPUs?
\QuickA{}
Actually, there can be only three acquisitions of this lock per grace
period, and each grace period lasts many milliseconds.
One of the acquisitions is by the CPU performing initialization for the
current grace period, and the other two are by CPUs onlining or
offlining some CPU.
These latter two cannot run concurrently due to the CPU-hotplug
locking, so at most two CPUs can be contending for this lock at any
given time.
Lock contention on \co{->onofflock} should therefore
be no problem, even on systems with thousands of CPUs.
\QuickQ{}
Why not simplify the code by merging the detection of dyntick-idle
CPUs with that of offline CPUs?
\QuickA{}
Such merging might well eventually prove to be the right thing to do.
In the meantime, however, there are some challenges:
\begin{enumerate}
\item CPUs are not allowed to go into dyntick-idle mode while they
have RCU callbacks pending, but CPUs \emph{are} allowed to go
offline with callbacks pending.
This means that CPUs going offline need to have their callbacks
migrated to some other CPU; thus, we cannot allow CPUs to
simply go quietly offline.
\item Present-day Linux systems run with \co{NR_CPUS}
much larger than the actual number of CPUs.
A unified approach could thus end up uselessly waiting on
CPUs that are not just offline, but which never existed in
the first place.
\item RCU is already operational when CPUs get onlined one
at a time during boot, and therefore must handle the online
process.
This onlining must exclude grace-period initialization, so
the \co{->onofflock} must still be used.
\item CPUs often switch into and out of dyntick-idle mode
extremely frequently, so it is not reasonable to use the
heavyweight online/offline code path for entering and exiting
dyntick-idle mode.
\end{enumerate}
\QuickQ{}
Why not simply disable bottom halves (softirq) when acquiring
the \co{rcu_data} structure's \co{lock}?
Wouldn't this be faster?
\QuickA{}
Because this lock can be acquired from functions
called by \co{call_rcu()}, which in turn can be
invoked from irq handlers.
Therefore, irqs \emph{must} be disabled when
holding this lock.
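A hedged sketch of the resulting locking pattern follows;
\co{sketch_enqueue_callback()} is a hypothetical wrapper, \co{rdp}
is assumed to point to the chosen \co{rcu_data} structure, and the
callback-enqueue step itself is elided:
{\small
\begin{verbatim}
/* Kernel-context sketch, not the actual call_rcu() path. */
static void sketch_enqueue_callback(struct rcu_data *rdp)
{
    unsigned long flags;

    /*
     * Hardirqs must be off while holding ->lock: call_rcu()
     * can run from a hardirq handler, so disabling only
     * softirqs would permit self-deadlock on this same lock.
     */
    spin_lock_irqsave(&rdp->lock, flags);
    /* ... enqueue the new callback on rdp's list ... */
    spin_unlock_irqrestore(&rdp->lock, flags);
}
\end{verbatim}
}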
\QuickQ{}
How about the \co{qsmask} and \co{qsmaskinit}
fields for the leaf \co{rcu_node} structures?
Doesn't there have to be some way to work out
which of the bits in these fields corresponds
to each CPU covered by the \co{rcu_node} structure
in question?
\QuickA{}
Indeed there does!
The \co{grpmask} field in each CPU's \co{rcu_data}
structure does this job.
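One plausible encoding, consistent with the later discussion of the
unsigned-long \co{mask} argument, assigns each CPU a single bit within
its leaf's bitmask.
In this sketch, \co{grplo} is assumed to be the lowest-numbered CPU
covered by the leaf in question:
{\small
\begin{verbatim}
/* Hedged sketch: map a CPU to its bit within its leaf. */
rdp->grpmask = 1UL << (cpu - rnp->grplo);

/* Reporting a quiescent state then clears exactly that bit. */
rnp->qsmask &= ~rdp->grpmask;
\end{verbatim}
}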
\QuickQ{}
But why bother setting \co{qs_pending} to one when a CPU
is coming online, given that being offline is an extended
quiescent state that should cover any ongoing grace period?
\QuickA{}
Because this helps to resolve the race that occurs when a CPU
comes online just as a new grace period is starting.
\QuickQ{}
Why record the last completed grace period number in
\co{passed_quiesc_completed}?
Doesn't that cause this RCU implementation to be vulnerable
to quiescent states seen while no grace period was in progress
being incorrectly applied to the next grace period that starts?
\QuickA{}
We record the last completed grace period number in order
to avoid races where a quiescent state noted near the end of
one grace period is incorrectly applied to the next grace
period, especially for dyntick and CPU-offline grace periods.
Therefore, \co{force_quiescent_state()} and friends all
check the last completed grace period number to avoid such races.
Now these dyntick and CPU-offline grace periods are only checked
for when a grace period is actually active.
The only quiescent states that can be recorded when no grace
period is in progress are self-detected quiescent states,
which are recorded in the \co{passed_quiesc_completed},
\co{passed_quiesc}, and \co{qs_pending} fields.
These variables are initialized every time the corresponding
CPU notices that a new grace period has started, preventing
any obsolete quiescent states from being applied to the
new grace period.
All that said, optimizing grace-period latency may require that
\co{gpnum} be tracked in addition to \co{completed}.
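A hedged sketch of this filter appears below.
The field names follow the text, but the helper function itself is
illustrative, and its \co{completed} argument is assumed to be the
grace-period-completion number against which the current grace period
is measured:
{\small
\begin{verbatim}
/*
 * A self-detected quiescent state counts only if it was
 * recorded after the previous grace period completed, that
 * is, while the current grace period was in effect.
 */
static int sketch_qs_still_valid(struct rcu_data *rdp,
                                 long completed)
{
    return rdp->passed_quiesc &&
           rdp->passed_quiesc_completed == completed;
}
\end{verbatim}
}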
\QuickQ{}
What is the point of running a system with \co{NR_CPUS}
way bigger than the actual number of CPUs?
\QuickA{}
Because this allows producing a single binary of the Linux kernel
that runs on a wide variety of systems, greatly easing administration
and validation.
\QuickQ{}
Why not simply have multiple lists rather than this funny
multi-tailed list?
\QuickA{}
Because this multi-tailed approach, due to Lai Jiangshan,
simplifies callback processing.
\QuickQ{}
So some poor CPU has to note quiescent states on behalf of
each and every offline CPU?
Yecch!
Won't that result in excessive overheads in the not-uncommon
case of a system with a small number of CPUs but a large value
for \co{NR_CPUS}?
\QuickA{}
Actually, no it will not!
Offline CPUs are excluded from both the \co{qsmask} and
\co{qsmaskinit} bit masks, so RCU normally ignores them.
However, there are races with online/offline operations that
can result in an offline CPU having its \co{qsmask} bit set.
These races must of course be handled correctly, and the way
they are handled is to permit other CPUs to note that RCU
is waiting on a quiescent state from an offline CPU.
\QuickQ{}
So what guards the earlier fields in this structure?
\QuickA{}
Nothing does, as they are constants set at compile time
or boot time.
Of course, the fields internal to each \co{rcu_node}
in the \co{->node} array may change, but they are
guarded separately.
\QuickQ{}
I thought that RCU read-side processing was supposed to
be \emph{fast}!
The functions shown in
Figure~\ref{fig:app:rcuimpl:rcutreewt:RCU Read-Side Critical Sections}
have so much junk in them that they just \emph{have} to be slow!
What gives here?
\QuickA{}
Appearances can be deceiving.
The \co{preempt_disable()}, \co{preempt_enable()},
\co{local_bh_disable()}, and \co{local_bh_enable()} each
do a single non-atomic manipulation of local data.
Even that assumes \co{CONFIG_PREEMPT}; otherwise,
the \co{preempt_disable()} and \co{preempt_enable()}
functions emit no code, not even compiler directives.
The \co{__acquire()} and \co{__release()} functions
emit no code (not even compiler directives), but are instead
used by the \co{sparse} semantic-parsing bug-finding program.
Finally, \co{rcu_read_acquire()} and \co{rcu_read_release()}
emit no code (not even compiler directives) unless the
``lockdep'' lock-order debugging facility is enabled, in
which case they can indeed be somewhat expensive.
In short, unless you are a kernel hacker who has enabled
debugging options, these functions are extremely cheap,
and in some cases, absolutely free of overhead.
And, in the words of a Portland-area furniture retailer,
``free is a \emph{very} good price''.
\QuickQ{}
Why not simply use \co{__get_cpu_var()} to pick up a
reference to the
current CPU's \co{rcu_data} structure on line~13 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree call-rcu}?
\QuickA{}
Because we might be called either from \co{call_rcu()}
(in which case we would need \co{__get_cpu_var(rcu_data)})
or from \co{call_rcu_bh()} (in which case we would need
\co{__get_cpu_var(rcu_bh_data)}).
Using the \co{->rda[]} array of whichever
\co{rcu_state} structure we were passed works correctly
regardless of which API \co{__call_rcu()} was invoked from
(suggested by Lai Jiangshan~\cite{LaiJiangshan2008NewClassicAlgorithm}).
\QuickQ{}
Given that \co{rcu_pending()} is always called twice
on lines~29-32 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-check-callbacks},
shouldn't there be some way to combine the checks of the
two structures?
\QuickA{}
Sorry, but this was a trick question.
The C language's short-circuit boolean expression evaluation
means that \co{__rcu_pending()} is invoked on
\co{rcu_bh_state} only if the prior invocation on
\co{rcu_state} returns zero.
The reason the two calls are in this order is that
``rcu'' is used more heavily than is ``rcu\_bh'', so
the first call is more likely to return non-zero than
is the second.
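The following hedged sketch shows the shape of this short-circuit
evaluation; it follows the code described in the figure, but should
not be taken as an exact copy of any particular kernel version:
{\small
\begin{verbatim}
int rcu_pending(int cpu)
{
    /* The "rcu_bh" check runs only if the "rcu" check
     * finds nothing to do. */
    return __rcu_pending(&rcu_state,
                         &per_cpu(rcu_data, cpu)) ||
           __rcu_pending(&rcu_bh_state,
                         &per_cpu(rcu_bh_data, cpu));
}
\end{verbatim}
}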
\QuickQ{}
Shouldn't line~42 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-check-callbacks}
also check for \co{in_hardirq()}?
\QuickA{}
No.
The \co{rcu_read_lock_bh()} primitive disables
softirq, not hardirq.
Because \co{call_rcu_bh()} need only wait for pre-existing
``rcu\_bh'' read-side critical sections to complete,
we need only check \co{in_softirq()}.
\QuickQ{}
But don't we also need to check that a grace period is
actually in progress in \co{__rcu_process_callbacks()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-process-callbacks}?
\QuickA{}
Indeed we do!
And the first thing that \co{force_quiescent_state()} does
is to perform exactly that check.
\QuickQ{}
What happens if two CPUs attempt to start a new grace
period concurrently in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcutree rcu-process-callbacks}?
\QuickA{}
One of the CPUs will be the first to acquire the root
\co{rcu_node} structure's lock, and that CPU will start
the grace period.
The other CPU will then acquire the lock and invoke
\co{rcu_start_gp()}, which, seeing that a grace period
is already in progress, will immediately release the
lock and return.
\QuickQ{}
How does the code traverse a given path through
the \co{rcu_node} hierarchy from root to leaves?
\QuickA{}
It turns out that the code never needs to do such a traversal,
so there is nothing special in place to handle this.
\QuickQ{}
C-preprocessor macros are \emph{so} 1990s!
Why not get with the times and convert \co{RCU_DATA_PTR_INIT()}
in Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init}
to be a function?
\QuickA{}
Because, although it is possible to pass a reference to
a particular CPU's instance of a per-CPU variable to a function,
there does not appear to be a good way to pass a reference to
the full set of instances of a given per-CPU variable to
a function.
One could of course build an array of pointers, then pass a
reference to the array in, but that is part of what
the \co{RCU_DATA_PTR_INIT()} macro is doing in the first place.
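A hedged, simplified sketch of such a macro is shown below; the
\co{SKETCH_RCU_DATA_PTR_INIT()} name is hypothetical.
The key point is that the name of the per-CPU variable is substituted
textually, something no C function can do, because a function can
receive a pointer to one CPU's instance but not a handle on the entire
set of instances:
{\small
\begin{verbatim}
#define SKETCH_RCU_DATA_PTR_INIT(rsp, rcu_data_var) \
do { \
    int __i; \
    /* Record a pointer to each CPU's instance. */ \
    for_each_possible_cpu(__i) \
        (rsp)->rda[__i] = &per_cpu(rcu_data_var, __i); \
} while (0)
\end{verbatim}
}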
\QuickQ{}
What happens if a CPU comes online between the time
that the last online CPU is notified on lines~25-26 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init}
and the time that \co{register_cpu_notifier()} is invoked
on line~27?
\QuickA{}
Only one CPU is online at this point, so the only way another
CPU can come online is if this CPU puts it online, which it
is not doing.
\QuickQ{}
Why call \co{cpu_quiet()} on line~41 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-init-percpu-data},
given that we are excluding grace periods with various
locks, and given that any earlier grace periods would not have
been waiting on this previously-offlined CPU?
\QuickA{}
A new grace period might have started just after the
\co{->onofflock} was released on line~40.
The \co{cpu_quiet()} will help expedite such a grace period.
\QuickQ{}
But what if the \co{rcu_node} hierarchy has only a single
structure, as it would on a small system?
What prevents concurrent grace-period initialization in that
case, given the code in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}?
\QuickA{}
The later acquisition of the sole \co{rcu_node} structure's
\co{->lock} on line~16 excludes grace-period initialization,
which must acquire this same lock in order to initialize this
sole \co{rcu_node} structure for the new grace period.
The \co{->onofflock} is needed only for multi-node hierarchies,
and is used in that case as an alternative to acquiring and
holding \emph{all} of the \co{rcu_node} structures'
\co{->lock} fields, which would be incredibly painful on
large systems.
\QuickQ{}
But does line~25 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
ever really exit the loop?
Why or why not?
\QuickA{}
The only way that line~25 could exit the loop is if \emph{all}
CPUs were to be put offline.
This cannot happen in the Linux kernel as of 2.6.28, though
other environments have been designed to offline all CPUs
during the normal shutdown procedure.
\QuickQ{}
Suppose that line~26 got executed seriously out of order in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu},
so that \co{lastcomp} is set to some prior grace period, but
so that the current grace period is still waiting on the
now-offline CPU?
In this case, won't the call to \co{cpu_quiet()} fail to
report the quiescent state, thus causing the grace period
to wait forever for this now-offline CPU?
\QuickA{}
First, the lock acquisitions on lines~16 and 12 would prevent
the execution of line~26 from being pushed that far out of
order.
Nevertheless, even if line~26 managed to be misordered that
dramatically, what would happen is that \co{force_quiescent_state()}
would eventually be invoked, and would notice that the current
grace period was waiting for a quiescent state from an offline
CPU.
Then \co{force_quiescent_state()} would report the extended
quiescent state on behalf of the offlined CPU.
\QuickQ{}
Given that an offline CPU is in an extended quiescent state,
why does line~28 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
need to care which grace period it is
dealing with?
\QuickA{}
It really does not need to care in this case.
However, because it \emph{does} need to care in many other
cases, the \co{cpu_quiet()} function does take the
grace-period number as an argument, so some value must be
supplied.
\QuickQ{}
But this list movement in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-offline-cpu}
makes all of the going-offline CPU's callbacks go through
another grace period, even if they were ready to invoke.
Isn't that inefficient?
Furthermore, couldn't an unfortunate pattern of CPUs going
offline then coming back online prevent a given callback from
ever being invoked?
\QuickA{}
It is inefficient, but it is simple.
Given that this is not a commonly executed code path, this
is the right tradeoff.
The starvation case would be a concern, except that the
online and offline process involves multiple grace periods.
\QuickQ{}
Why not just expand \co{note_new_gpnum()} inline into
\co{check_for_new_grace_period()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Noting New Grace Periods}?
\QuickA{}
Because \co{note_new_gpnum()} must be called for each new
grace period, including both those started by this CPU and
those started by other CPUs.
In contrast, \co{check_for_new_grace_period()} is called only
for the case where some other CPU started the grace period.
\QuickQ{}
But there has been no initialization yet at line~15 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}!
What happens if a CPU notices the new grace period and
immediately attempts to report a quiescent state?
Won't it get confused?
\QuickA{}
There are two cases of interest.
In the first case, there is only a single \co{rcu_node}
structure in the hierarchy.
Since the CPU executing in \co{rcu_start_gp()} is currently
holding that \co{rcu_node} structure's lock, the CPU
attempting to report the quiescent state will not be able
to acquire this lock until initialization is complete,
at which point the quiescent state will be reported
normally.
In the second case, there are multiple \co{rcu_node} structures,
and the leaf \co{rcu_node} structure corresponding to the
CPU that is attempting to report the quiescent state already
has that CPU's \co{->qsmask} bit cleared.
Therefore, the CPU attempting to report the quiescent state
will give up, and some later quiescent state for that CPU
will be applied to the new grace period.
\QuickQ{}
Hey!
Shouldn't we hold the non-leaf \co{rcu_node} structures'
locks when munging their state in line~37 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}???
\QuickA{}
There is no need to hold their locks.
The reasoning is as follows:
\begin{enumerate}
\item The new grace period cannot end, because the running CPU
(which is initializing it) won't pass through a
quiescent state.
Therefore, there is no race with another invocation
of \co{rcu_start_gp()}.
\item The running CPU holds \co{->onofflock}, so there
is no race with CPU-hotplug operations.
\item The leaf \co{rcu_node} structures are not yet initialized,
so they have all of their \co{->qsmask} bits cleared.
This means that any other CPU attempting to report
a quiescent state will stop at the leaf level,
and thus cannot race with the current CPU for non-leaf
\co{rcu_node} structures.
\item The RCU tracing functions access, but do not modify,
the \co{rcu_node} structures' fields.
Races with these functions are therefore harmless.
\end{enumerate}
\QuickQ{}
Why can't we merge the loop spanning lines~36-37 with
the loop spanning lines~40-44 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Starting a Grace Period}?
\QuickA{}
If we were to do so, we would either be needlessly acquiring locks
for the non-leaf \co{rcu_node} structures or would need
ugly checks for a given node being a leaf node on each pass
through the loop.
(Recall that we must acquire the locks for the leaf
\co{rcu_node} structures due to races with CPUs attempting
to report quiescent states.)
Nevertheless, it is quite possible that experience on very large
systems will show that such merging is in fact the right thing
to do.
\QuickQ{}
What prevents lines~11-12 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-check-quiescent-state}
from reporting a quiescent state from a prior
grace period against the current grace period?
\QuickA{}
If this could occur, it would be a serious bug, since the
CPU in question might be in an RCU read-side critical section
that started before the beginning of the current grace period.
There are several cases to consider for the CPU in question:
\begin{enumerate}
\item It remained online and active throughout.
\item It was in dynticks-idle mode for at least part of the current
grace period.
\item It was offline for at least part of the current grace period.
\end{enumerate}
In the first case, the prior grace period could not have
ended without this CPU explicitly reporting a quiescent
state, which would leave \co{->qs_pending} zero.
This in turn would mean that lines~7-8 would return, so
that control would not reach \co{cpu_quiet()} unless
\co{check_for_new_grace_period()} had noted the new grace
period.
However, if the current grace period had been noted, it would
also have set \co{->passed_quiesc} to zero, in which case
lines~9-10 would have returned, again meaning that \co{cpu_quiet()}
would not be invoked.
Finally, the only way that \co{->passed_quiesc} could be set to one
would be if \co{rcu_check_callbacks()} was invoked by
a scheduling-clock interrupt that occurred somewhere between
lines~5 and 9 of \co{rcu_check_quiescent_state()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-check-quiescent-state}.
However, this would be a case of a quiescent state occurring
in the \emph{current} grace period, which would be totally
legitimate to report against the current grace period.
So this case is correctly covered.
In the second case, where the CPU in question spent part of
the current grace period in dynticks-idle mode, note that
dynticks-idle mode is an extended quiescent state, hence
it is again permissible to report this quiescent state against
the current grace period.
In the third case, where the CPU in question spent part of the
current grace period offline, note that offline CPUs are in
an extended quiescent state, which is again permissible to
report against the current grace period.
So quiescent states from prior grace periods are never reported
against the current grace period.
\QuickQ{}
How do lines~22-23 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for cpu-quiet}
know that it is safe to promote the running CPU's RCU
callbacks?
\QuickA{}
Because the specified CPU has not yet passed through a quiescent
state, and because we hold the corresponding leaf node's lock,
we know that the current grace period cannot possibly have
ended yet.
Therefore, there is no danger that any of the callbacks currently
queued were registered after the next grace period started, given
that they have already been queued and the next grace period
has not yet started.
\QuickQ{}
Given that argument \co{mask} on line 2 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for cpu-quiet-msk}
is an unsigned long, how can it possibly deal with systems
with more than 64 CPUs?
\QuickA{}
Because \co{mask} is specific to the specified leaf \co{rcu_node}
structure, it need only be large enough to represent the
CPUs corresponding to that particular \co{rcu_node} structure.
Since at most 64 CPUs may be associated with a given
\co{rcu_node} structure (32 CPUs on 32-bit systems),
the unsigned long \co{mask} argument suffices.
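A hedged sketch of the resulting bit manipulation and upward
propagation follows.
It is modeled on the description above rather than copied from the
kernel, the \co{->parent} pointer is assumed, and per-node locking is
elided for clarity:
{\small
\begin{verbatim}
/* Clear this CPU's (or child's) bit in the current leaf.
 * (Per-node locking elided for clarity.) */
rnp->qsmask &= ~mask;
while (rnp->qsmask == 0 && rnp->parent != NULL) {
    /* Everyone reported: move one level up the hierarchy. */
    mask = rnp->grpmask;   /* our bit within the parent */
    rnp = rnp->parent;
    rnp->qsmask &= ~mask;  /* an unsigned long again suffices */
}
\end{verbatim}
}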
\QuickQ{}
How do RCU callbacks on dynticks-idle or offline CPUs
get invoked?
\QuickA{}
They don't.
CPUs with RCU callbacks are not permitted to enter dynticks-idle
mode, so dynticks-idle CPUs never have RCU callbacks.
When CPUs go offline, their RCU callbacks are migrated to
an online CPU, so offline CPUs never have RCU callbacks, either.
Thus, there is no need to invoke callbacks on dynticks-idle
or offline CPUs.
\QuickQ{}
Why would lines~14-17 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for rcu-do-batch}
need to adjust the tail pointers?
\QuickA{}
If any of the tail pointers reference the last callback
in the sublist that was ready to invoke, they must be
changed to instead reference the \co{->nxtlist} pointer.
This situation occurs when the sublists
immediately following the ready-to-invoke sublist are empty.
\QuickQ{}
But how does the code in
Figure~\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode}
handle nested NMIs?
\QuickA{}
It does not have to handle nested NMIs, because NMIs do not nest.
\QuickQ{}
Why isn't there a memory barrier between lines~8 and 9 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for dyntick-save-progress-counter}?
Couldn't this cause the code to fetch even-numbered values
from both the \co{->dynticks} and \co{->dynticks_nmi} fields,
even though these two fields never were zero at the same time?
\QuickA{}
First, review the code in
Figures~\ref{fig:app:rcuimpl:rcutreewt:Entering and Exiting Dyntick-Idle Mode},
\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode}, and
\ref{fig:app:rcuimpl:rcutreewt:Interrupts from Dyntick-Idle Mode},
and note that \co{dynticks} and \co{dynticks_nmi} will never
have odd values simultaneously (see especially lines~6 and 17 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:NMIs from Dyntick-Idle Mode},
and recall that interrupts cannot happen from NMIs).
Of course, given the placement of the memory barriers in these
functions, it might \emph{appear} to another CPU that both
counters were odd at the same time, but logically this cannot
happen, and would indicate that the CPU had in fact passed
through dynticks-idle mode.
Now, let's suppose that at the time line~8 fetches \co{->dynticks},
the value of \co{->dynticks_nmi} was an odd number, and that at the
time line~9 fetches \co{->dynticks_nmi}, the value of
\co{->dynticks} was an odd number.
Given that both counters cannot be odd simultaneously, there must
have been a time between these two fetches when both counters
were even, and thus a time when the CPU was in dynticks-idle
mode, which is a quiescent state, as required.
So, why can't the \co{&&} on line~13 of
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for dyntick-save-progress-counter}
be replaced with an \co{==}?
Well, it could be, but this would likely be more confusing
than helpful.
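The check under discussion can be sketched as follows; \co{rdtp} is
assumed to point to the CPU's dynticks state, and only the even/even
test is shown:
{\small
\begin{verbatim}
int snap, snap_nmi, in_dyntick_qs;

snap = rdtp->dynticks;
/* No barrier needed here, per the argument above. */
snap_nmi = rdtp->dynticks_nmi;

/* Both even: the CPU was (or is) in dyntick-idle mode. */
in_dyntick_qs = ((snap & 0x1) == 0) &&
                ((snap_nmi & 0x1) == 0);
\end{verbatim}
}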
\QuickQ{}
Why wait the extra couple jiffies on lines~12-13 in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for check-cpu-stall}?
\QuickA{}
This added delay gives the offending CPU a better chance of
reporting on itself, thus getting a decent stack trace of
the stalled code.
Of course, if the offending CPU is spinning with interrupts
disabled, it will never report on itself, so other CPUs
do so after a short delay.
\QuickQ{}
What prevents the grace period from ending before the
stall warning is printed in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for print-cpu-stall}?
\QuickA{}
The caller checked that this CPU still had not reported a
quiescent state, and because preemption is disabled, there is
no way that a quiescent state could have been reported in
the meantime.
\QuickQ{}
Why does \co{print_other_cpu_stall()} in
Figure~\ref{fig:app:rcuimpl:rcutreewt:Code for print-other-cpu-stall}
need to check for the grace period ending when
\co{print_cpu_stall()} did not?
\QuickA{}
The other CPUs might pass through a quiescent state at any time,
so the grace period might well have ended in the meantime.
\QuickQ{}
Why is it important that blocking primitives
called from within a preemptible-RCU read-side critical section be
subject to priority inheritance?
\QuickA{}
Because blocked readers stall RCU grace periods,
which can result in OOM.
For example, if a reader did a \co{wait_event()} within
an RCU read-side critical section, and that event never occurred,
then RCU grace periods would stall indefinitely, guaranteeing that
the system would OOM sooner or later.
There must therefore be some way to cause these readers to progress
through their read-side critical sections in order to avoid such OOMs.
Priority boosting is one way to force such progress, but only if
readers are restricted to blocking such that they can be awakened via
priority boosting.
Of course, there are other methods besides priority inheritance
that handle the priority inversion problem, including priority ceiling,
preemption disabling, and so on.
However, there are good reasons why priority inheritance is the approach
used in the Linux kernel, so this is what is used for RCU.
\QuickQ{}
Could the prohibition against using primitives
that would block in a non-\co{CONFIG_PREEMPT} kernel be lifted,
and if so, under what conditions?
\QuickA{}
If testing and benchmarking demonstrated that the
preemptible RCU worked well enough that classic RCU could be dispensed
with entirely, and if priority inheritance was implemented for blocking
synchronization primitives
such as \co{semaphore}s, then those primitives could be
used in RCU read-side critical sections.
\QuickQ{}
How is it possible for lines~38-43 of
\co{__rcu_advance_callbacks()} to be executed when
lines~7-37 have not?
Won't they both be executed just after a counter flip, and
never at any other time?
\QuickA{}
Consider the following sequence of events:
\begin{enumerate}
\item CPU 0 executes lines~5-12 of
\co{rcu_try_flip_idle()}.
\item CPU 1 executes \co{__rcu_advance_callbacks()}.
Because \co{rcu_ctrlblk.completed} has been
incremented, lines~7-37 execute.
However, none of the \co{rcu_flip_flag} variables
have been set, so lines~38-43 do \emph{not} execute.
\item CPU 0 executes lines~13-15 of
\co{rcu_try_flip_idle()}.
\item Later, CPU 1 again executes \co{__rcu_advance_callbacks()}.
The counter has not been incremented since the earlier
execution, but the \co{rcu_flip_flag} variables have
all been set, so only lines~38-43 are executed.
\end{enumerate}
\QuickQ{}
What problems could arise if the lines containing
\co{ACCESS_ONCE()} in \co{rcu_read_unlock()}
were reordered by the compiler?
\QuickA{}
\begin{enumerate}
\item If the \co{ACCESS_ONCE()} were omitted from the
fetch of \co{rcu_flipctr_idx} (line~14), then the compiler
would be within its rights to eliminate \co{idx}.
It would also be free to compile the \co{rcu_flipctr}
decrement as a fetch-increment-store sequence, separately
fetching \co{rcu_flipctr_idx} for both the fetch and
the store.
If an NMI were to occur between the fetch and the store, and
if the NMI handler contained an \co{rcu_read_lock()},
then the value of \co{rcu_flipctr_idx} would change
in the meantime, resulting in corruption of the
\co{rcu_flipctr} values, destroying the ability
to correctly identify grace periods.
\item Another failure that could result from omitting the
\co{ACCESS_ONCE()} from line~14 is due to
the compiler reordering this statement to follow the
decrement of \co{rcu_read_lock_nesting}
(line~16).
In this case, if an NMI were to occur between these two
statements, then any \co{rcu_read_lock()} in the
NMI handler could corrupt \co{rcu_flipctr_idx},
causing the wrong \co{rcu_flipctr} to be
decremented.
As with the analogous situation in \co{rcu_read_lock()},
this could result in premature grace-period termination,
an indefinite grace period, or even both.
\item If \co{ACCESS_ONCE()} macros were omitted such that
the update of \co{rcu_read_lock_nesting} could be
interchanged by the compiler with the decrement of
\co{rcu_flipctr}, and if an NMI occurred in between,
any \co{rcu_read_lock()} in the NMI handler would
incorrectly conclude that it was protected by an enclosing
\co{rcu_read_lock()}, and fail to increment the
\co{rcu_flipctr} variables.
\end{enumerate}
It is not clear that the \co{ACCESS_ONCE()} on the
fetch of \co{rcu_read_lock_nesting} (line~7) is required.
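For reference, the classical kernel definition of \co{ACCESS_ONCE()}
is a volatile cast, which forbids the compiler from refetching,
omitting, or reordering the marked access relative to other volatile
accesses.
The usage line below is illustrative rather than an exact quote of
\co{rcu_read_unlock()}:
{\small
\begin{verbatim}
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/* Illustrative use: a single, non-refetchable load. */
idx = ACCESS_ONCE(rcu_flipctr_idx);
\end{verbatim}
}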
\QuickQ{}
What problems could arise if the lines containing
\co{ACCESS_ONCE()} in \co{rcu_read_unlock()}
were reordered by the CPU?
\QuickA{}
Absolutely none! The code in \co{rcu_read_unlock()}
interacts with the scheduling-clock interrupt handler
running on the same CPU, and is thus insensitive to reorderings
because CPUs always see their own accesses as if they occurred
in program order.
Other CPUs do access the \co{rcu_flipctr}, but because these
other CPUs don't access any of the other variables, ordering is
irrelevant.
\QuickQ{}
What problems could arise in
\co{rcu_read_unlock()} if irqs were not disabled?
\QuickA{}
\begin{enumerate}
\item Disabling irqs has the side effect of disabling preemption.
Suppose that this code were to be preempted in the midst
of line~17 between selecting the current CPU's copy
of the \co{rcu_flipctr} array and the decrement of
the element indicated by \co{rcu_flipctr_idx}.
Execution might well resume on some other CPU.
If this resumption happened concurrently with an
\co{rcu_read_lock()} or \co{rcu_read_unlock()}
running on the original CPU,
an increment or decrement might be lost, resulting in either
premature termination of a grace period, indefinite extension
of a grace period, or even both.
\item Failing to disable preemption can also defeat RCU priority
boosting, which relies on \co{rcu_read_lock_nesting}
to determine which tasks to boost.
If preemption occurred between the update of
\co{rcu_read_lock_nesting} (line~16) and of
\co{rcu_flipctr} (line~17), then a grace
period might be stalled until this task resumed.
But because the RCU priority booster has no way of knowing
that this particular task is stalling grace periods, needed
boosting will never occur.
Therefore, if there are CPU-bound realtime tasks running,
the preempted task might never resume, stalling grace periods
indefinitely, and eventually resulting in OOM.
\end{enumerate}
Of course, both of these situations could be handled by disabling
preemption rather than disabling irqs.
(The CPUs I have access to do not show much difference between these
two alternatives, but others might.)
\QuickQ{}
Suppose that the irq disabling in
\co{rcu_read_lock()} was replaced by preemption disabling.
What effect would that have on \co{GP_STAGES}?
\QuickA{}
No finite value of \co{GP_STAGES} suffices.
The following scenario, courtesy of Oleg Nesterov, demonstrates this:
Suppose that low-priority Task~A has executed
\co{rcu_read_lock()} on CPU 0,
and thus has incremented \co{per_cpu(rcu_flipctr, 0)[0]},
which thus has a value of one.
Suppose further that Task~A is now preempted indefinitely.
Given this situation, consider the following sequence of events:
\begin{enumerate}
\item Task~B starts executing \co{rcu_read_lock()}, also on
CPU 0, picking up the low-order bit of
\co{rcu_ctrlblk.completed}, which is still equal to zero.
\item Task~B is interrupted by a sufficient number of scheduling-clock
interrupts to allow the current grace-period stage to complete,
and also by sufficiently long-running interrupts to allow the
RCU grace-period state machine to advance the
\co{rcu_ctrlblk.completed} counter so that its bottom bit
is now equal to one and all CPUs have acknowledged this
increment operation.
\item CPU 1 starts summing the index==0 counters, starting with
\co{per_cpu(rcu_flipctr, 0)[0]}, which is equal to one
due to Task~A's increment.
CPU 1's local variable \co{sum} is therefore equal to one.
\item Task~B returns from interrupt, resuming its execution of
\co{rcu_read_lock()}, incrementing
\co{per_cpu(rcu_flipctr, 0)[0]}, which now has a value
of two.
\item Task~B is migrated to CPU 2.
\item Task~B completes its RCU read-side critical section, and
executes \co{rcu_read_unlock()}, which decrements
\co{per_cpu(rcu_flipctr, 2)[0]}, which is now -1.
\item CPU 1 now adds \co{per_cpu(rcu_flipctr, 1)[0]} and
\co{per_cpu(rcu_flipctr, 2)[0]} to its
local variable \co{sum}, obtaining the value zero.
\item CPU 1 then incorrectly concludes that all prior RCU read-side
critical sections have completed, and advances to the next
RCU grace-period stage.
This means that some other task might well free up data
structures that Task~A is still using!
\end{enumerate}
This sequence of events could repeat indefinitely, so that no finite
value of \co{GP_STAGES} could prevent disrupting Task~A.
This sequence of events demonstrates the importance of the promise
made by CPUs that acknowledge an increment of
\co{rcu_ctrlblk.completed}, as the problem illustrated by the
above sequence of events is caused by Task~B's repeated failure
to honor this promise.
Therefore, more-pervasive changes to the grace-period state will be
required in order for \co{rcu_read_lock()} to be able to safely
dispense with irq disabling.
\QuickQ{}
Why can't the \co{rcu_dereference()}
precede the memory barrier?
\QuickA{}
Because the memory barrier is being executed in
an interrupt handler, and interrupts are exact in the sense that
a single value of the PC is saved upon interrupt, so that the
interrupt occurs at a definite place in the code.
Therefore, if the
\co{rcu_dereference()} were to precede the memory barrier,
the interrupt would have had to have occurred after the
\co{rcu_dereference()}, and therefore
the interrupt would also have had to have occurred after the
\co{rcu_read_lock()} that begins the RCU read-side critical
section.
This would have forced the \co{rcu_read_lock()} to use
the earlier value of the grace-period counter, which would in turn
have meant that the corresponding \co{rcu_read_unlock()}
would have had to precede the first ``Old counters zero [0]'' rather
than the second one.
This in turn would have meant that the read-side critical section
would have been much shorter --- which would have been
counter-productive,
given that the point of this exercise was to identify the longest
possible RCU read-side critical section.
\QuickQ{}
What is a more precise way to say ``CPU~0
might see CPU~1's increment as early as CPU~1's last previous
memory barrier''?
\QuickA{}
First, it is important to note that the problem with
the less-precise statement is that it gives the impression that there
might be a single global timeline, which there is not, at least not for
popular microprocessors.
Second, it is important to note that memory barriers are all about
perceived ordering, not about time.
Finally, a more precise way of stating the above statement would be as
follows: ``If CPU~0 loads the value resulting from CPU~1's
increment, then any subsequent load by CPU~0 will see the
values from any relevant stores by CPU~1 if these stores
preceded CPU~1's last prior memory barrier.''
Even this more-precise version leaves some wiggle room.
The word ``subsequent'' must be understood to mean ``ordered after'',
either by an explicit memory barrier or by the CPU's underlying
memory ordering.
In addition, the memory barriers must be strong enough to order
the relevant operations.
For example, CPU~1's last prior memory barrier must order stores
(for example, \co{smp_wmb()} or \co{smp_mb()}).
Similarly, if CPU~0 needs an explicit memory barrier to
ensure that its later load follows the one that saw the increment,
then this memory barrier needs to be an \co{smp_rmb()}
or \co{smp_mb()}.
In general, much care is required when proving parallel algorithms.
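The guarantee can also be expressed as the following hedged two-CPU
sketch in the style of a message-passing test; \co{x}, \co{counter},
\co{r1}, and \co{r2} are illustrative variables, not kernel symbols:
{\small
\begin{verbatim}
int x, counter;  /* shared */
int r1, r2;      /* CPU 0's local results */

void cpu1(void)  /* the incrementing CPU */
{
    x = 1;       /* relevant store */
    smp_wmb();   /* CPU 1's last prior memory barrier */
    counter++;   /* the increment */
}

void cpu0(void)  /* the observing CPU */
{
    r1 = counter; /* suppose this sees CPU 1's increment */
    smp_rmb();    /* orders CPU 0's two loads */
    r2 = x;       /* then r2 is guaranteed to equal 1 */
}
\end{verbatim}
}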