| % debugging/debugging.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Validation}{Validation}{qqzdebugging} |
| % |
| \Epigraph{If it is not tested, it doesn't work.}{Unknown} |
| |
I have had a few parallel programs work the first time, but that is only
because I have written an extremely large number of parallel programs over
the past few decades.
| And I have had far more parallel programs that fooled me into thinking |
| that they were working correctly the first time than actually were working |
| the first time. |
| |
| I thus need to validate my parallel programs. |
The basic trick behind validation is to realize that the computer knows
what is wrong.
| It is therefore your job to force it to tell you. |
| This chapter can therefore be thought of as a short course in |
| machine interrogation. |
| But you can leave the good-cop/bad-cop routine at home. |
| This chapter covers much more sophisticated and effective methods, |
| especially given that most computers couldn't tell a good cop from a |
| bad cop, at least as far as we know. |
| |
| A longer course may be found in many recent books on validation, as |
| well as at least one older but valuable |
| one~\cite{GlenfordJMyers1979}. |
| Validation is an extremely important topic that cuts across all forms |
| of software, and is worth intensive study in its own right. |
| However, this book is primarily about concurrency, so this chapter will do |
| little more than scratch the surface of this critically important topic. |
| |
| \Cref{sec:debugging:Introduction} |
| introduces the philosophy of debugging. |
| \Cref{sec:debugging:Tracing} |
| discusses tracing, |
| \cref{sec:debugging:Assertions} |
| discusses assertions, and |
| \cref{sec:debugging:Static Analysis} |
| discusses static analysis. |
| \Cref{sec:debugging:Code Review} |
| describes some unconventional approaches to code review that can |
| be helpful when the fabled 10,000 eyes happen not to be looking at your code. |
| \Cref{sec:debugging:Probability and Heisenbugs} |
| overviews the use of probability for validating parallel software. |
| Because performance and scalability are first-class requirements |
| for parallel programming, |
| \cref{sec:debugging:Performance Estimation} covers these |
| topics. |
| Finally, |
| \cref{sec:debugging:Summary} |
| gives a fanciful summary and a short list of statistical traps to avoid. |
| |
| But never forget that the three best debugging tools are a thorough |
| understanding of the requirements, a solid design, and a good night's |
| sleep! |
| |
| \section{Introduction} |
| \label{sec:debugging:Introduction} |
| % |
| % \epigraph{If debugging is the process of removing software bugs, then |
| % programming must be the process of putting them in.} |
| % {Edsger W.~Dijkstra} |
| \epigraph{Debugging is like being the detective in a crime movie where |
| you are also the murderer.} |
| {Filipe Fortes} |
| % https://twitter.com/fortes/status/399339918213652480?lang=en Nov 9, 2013 |
| |
| \Cref{sec:debugging:Where Do Bugs Come From?} |
| discusses the sources of bugs, and |
| \cref{sec:debugging:Required Mindset} |
| overviews the mindset required when validating software. |
| \Cref{sec:debugging:When Should Validation Start?} |
| discusses when you should start validation, and |
| \cref{sec:debugging:The Open Source Way} describes the |
| surprisingly effective open-source regimen of code review and |
| community testing. |
| |
| \subsection{Where Do Bugs Come From?} |
| \label{sec:debugging:Where Do Bugs Come From?} |
| |
| Bugs come from developers. |
| The basic problem is that the human brain did not evolve with computer |
| software in mind. |
| Instead, the human brain evolved in concert with other human brains and |
| with animal brains. |
| Because of this history, the following three characteristics of computers |
| often come as a shock to human intuition: |
| |
| \begin{enumerate} |
| \item Computers lack common sense, despite huge sacrifices at the |
| altar of artificial intelligence. |
| \item Computers fail to understand user intent, or more formally, |
| computers generally lack a theory of mind. |
| \item Computers cannot do anything useful with a fragmentary plan, |
| instead requiring that every detail of all possible scenarios |
| be spelled out in full. |
| \end{enumerate} |
| |
| The first two points should be uncontroversial, as they are illustrated |
| by any number of failed products, perhaps most famously Clippy and |
| Microsoft Bob. |
| By attempting to relate to users as people, these two products raised |
| common-sense and theory-of-mind expectations that they proved incapable |
| of meeting. |
Perhaps the set of software assistants now available on smartphones
will fare better, but as of 2021 reviews are mixed.
| That said, the developers working on them by all accounts still develop |
| the old way: |
| The assistants might well benefit end users, but not so much their own |
| developers. |
| |
| This human love of fragmentary plans deserves more explanation, |
| especially given that it is a classic two-edged sword. |
| This love of fragmentary plans is apparently due to the assumption that |
| the person carrying out the plan will have (1)~common sense and (2)~a good |
| understanding of the intent and requirements driving the plan. |
| This latter assumption is especially likely to hold in the common case |
| where the person doing the planning and the person carrying out the plan |
| are one and the same: |
In this case, the plan will be revised almost
subconsciously as obstacles arise, especially when that person has
a good understanding of the problem at hand.
| In fact, the love of fragmentary plans has served human beings well, |
in part because it is better to take random actions that have some
| chance of locating food than to starve to death while attempting to plan |
| the unplannable. |
| However, the usefulness of fragmentary plans in the |
| everyday life of which we are all experts is no guarantee of their future |
| usefulness in stored-program computers. |
| |
| Furthermore, the need to follow fragmentary plans has had important effects |
| on the human psyche, due to the fact |
| that throughout much of human history, life was often difficult and dangerous. |
| It should come as no surprise that executing a fragmentary plan that has |
| a high probability of a violent encounter with sharp teeth and claws |
| requires almost insane levels of optimism---a level of optimism that actually |
| is present in most human beings. |
| These insane levels of optimism extend to self-assessments of programming |
| ability, as evidenced by the effectiveness of (and the controversy over) |
| code-interviewing techniques~\cite{RegBraithwaite2007FizzBuzz}. |
| In fact, the clinical term for a human being with less-than-insane |
| levels of optimism is ``clinically depressed''. |
| Such people usually have extreme difficulty functioning in their daily |
| lives, underscoring the perhaps counter-intuitive importance of insane |
| levels of optimism to a normal, healthy life. |
Furthermore, if you are not insanely optimistic, you are less likely
| to start a difficult but worthwhile project.\footnote{ |
| There are some famous exceptions to this rule of thumb. |
	Some people take on difficult or risky projects in order to at
	least temporarily escape from their depression.
| Others have nothing to lose: |
| The project is literally a matter of life or death.} |
| |
| \QuickQuiz{ |
| When in computing is it necessary to follow a |
| fragmentary plan? |
| }\QuickQuizAnswer{ |
| There are any number of situations, but perhaps the most important |
| situation is when no one has ever created anything resembling |
| the program to be developed. |
| In this case, the only way to create a credible plan is to |
| implement the program, create the plan, and implement it a |
| second time. |
| But whoever implements the program for the first time has no |
| choice but to follow a fragmentary plan because any detailed |
| plan created in ignorance cannot survive first contact with |
| the real world. |
| |
| And perhaps this is one reason why evolution has favored insanely |
| optimistic human beings who are happy to follow fragmentary plans! |
| }\QuickQuizEnd |
| |
| An important special case is the project that, while valuable, is not |
| valuable enough to justify the time required to implement it. |
| This special case is quite common, and one early symptom is the |
| unwillingness of the decision-makers to invest enough to actually |
| implement the project. |
| A natural reaction is for the developers to produce an unrealistically |
| optimistic estimate in order to be permitted to start the project. |
| If the organization is strong enough and its decision-makers ineffective |
| enough, the project might succeed despite the resulting schedule slips |
| and budget overruns. |
| However, if the organization is not strong enough and if the decision-makers |
| fail to cancel the project as soon as it becomes clear that the estimates |
| are garbage, then the project might well kill the organization. |
| This might result in another organization picking up the project and |
| either completing it, canceling it, or being killed by it. |
| A given project might well succeed only after killing several |
| organizations. |
| One can only hope that the organization that eventually makes a success |
| of a serial-organization-killer project maintains a suitable |
| level of humility, lest it be killed by its next such project. |
| |
| \QuickQuiz{ |
| Who cares about the organization? |
| After all, it is the project that is important! |
| }\QuickQuizAnswer{ |
| Yes, projects are important, but if you like being paid for your |
| work, you need organizations as well as projects. |
| }\QuickQuizEnd |
| |
Important though insane levels of optimism might be, they are a key source
of bugs (and perhaps also of organizational failure).
| The question is therefore ``How to maintain the optimism required to start |
| a large project while at the same time injecting enough reality to keep |
| the bugs down to a dull roar?'' |
| The next section examines this conundrum. |
| |
| \subsection{Required Mindset} |
| \label{sec:debugging:Required Mindset} |
| |
| When carrying out any validation effort, keep the following |
| definitions firmly in mind: |
| |
| \begin{enumerate} |
| \item The only bug-free programs are trivial programs. |
| \item A reliable program has no known bugs. |
| \end{enumerate} |
| |
| From these definitions, it logically follows that any reliable |
| non-trivial program contains at least one bug that you do not |
| know about. |
| Therefore, any validation effort undertaken on a non-trivial program |
| that fails to find any bugs is itself a failure. |
| Validation is therefore an exercise in destruction. |
This means that if you are the type of person who enjoys breaking things,
validation is just the job for you.
| |
| \begin{SaveVerbatim}{VerbDebuggingQQZ} |
| real 0m0.132s |
| user 0m0.040s |
| sys 0m0.008s |
| \end{SaveVerbatim} |
| |
| \QuickQuiz{ |
| Suppose that you are writing a script that processes the |
| output of the \co{time} command, which looks as follows: |
| |
| \begin{center} |
| \fbox{\BUseVerbatim[boxwidth=2in,baseline=c]{VerbDebuggingQQZ}} |
| \end{center} |
| |
| The script is required to check its input for errors, and to |
| give appropriate diagnostics if fed erroneous \co{time} output. |
| What test inputs should you provide to this program to test it |
| for use with \co{time} output generated by single-threaded programs? |
| }\QuickQuizAnswer{ |
| Can you say ``Yes'' to all the following questions? |
| |
| \begin{enumerate} |
| \item Do you have a test case in which all the time is |
| consumed in user mode by a CPU-bound program? |
| \item Do you have a test case in which all the time is |
| consumed in system mode by a CPU-bound program? |
| \item Do you have a test case in which all three times |
| are zero? |
| \item Do you have a test case in which the \qco{user} and \qco{sys} |
| times sum to more than the \qco{real} time? |
| (This would of course be completely legitimate in |
| a multithreaded program.) |
	\item Do you have a set of test cases in which one of the
		times uses more than one second?
	\item Do you have a set of test cases in which one of the
		times uses more than ten seconds?
| \item Do you have a set of test cases in which one of the |
| times has non-zero minutes? |
| (For example, \qco{15m36.342s}.) |
| \item Do you have a set of test cases in which one of the |
| times has a seconds value of greater than 60? |
| \item Do you have a set of test cases in which one of the |
| times overflows 32 bits of milliseconds? |
| 64 bits of milliseconds? |
| \item Do you have a set of test cases in which one of the |
| times is negative? |
| \item Do you have a set of test cases in which one of the |
| times has a positive minutes value but a negative |
| seconds value? |
| \item Do you have a set of test cases in which one of the |
| times omits the \qco{m} or the \qco{s}? |
| \item Do you have a set of test cases in which one of the |
| times is non-numeric? |
| (For example, \qco{Go Fish}.) |
| \item Do you have a set of test cases in which one of the |
| lines is omitted? |
| (For example, where there is a \qco{real} value and |
| a \qco{sys} value, but no \qco{user} value.) |
| \item Do you have a set of test cases where one of the |
| lines is duplicated? |
| Or duplicated, but with a different time value for |
| the duplicate? |
| \item Do you have a set of test cases where a given line |
| has more than one time value? |
| (For example, \qco{real 0m0.132s 0m0.008s}.) |
| \item Do you have a set of test cases containing random |
| characters? |
| \item In all test cases involving invalid input, did you |
| generate all permutations? |
| \item For each test case, do you have an expected outcome |
| for that test? |
| \end{enumerate} |
| |
| If you did not generate test data for a substantial number of |
| the above cases, you will need to cultivate a more destructive |
| attitude in order to have a chance of generating high-quality |
| tests. |
| |
| Of course, one way to economize on destructiveness is to |
| generate the tests with the to-be-tested source code at hand, |
| which is called white-box testing (as opposed to black-box testing). |
| However, this is no panacea: |
| You will find that it is all too easy to find your thinking |
| limited by what the program can handle, thus failing to generate |
| truly destructive inputs. |
| }\QuickQuizEnd |
| |
| But perhaps you are a super-programmer whose code is always perfect |
| the first time every time. |
| If so, congratulations! |
| Feel free to skip this chapter, but |
| I do hope that you will forgive my skepticism. |
You see, I have met too many people who claimed to be able to write
perfect code the first time, which is not too surprising given the
previous discussion of optimism and over-confidence.
| And even if you really are a super-programmer, you just might |
| find yourself debugging lesser mortals' work. |
| |
| One approach for the rest of us is to alternate between our normal |
| state of insane optimism |
| (Sure, I can program that!\@) and severe pessimism |
| (It seems to work, but I just know that there have to be more bugs hiding |
| in there somewhere!). |
| It helps if you enjoy breaking things. |
| If you don't, or if your joy in breaking things is limited to breaking |
| \emph{other} people's things, find someone who does love breaking your |
| code and have them help you break it. |
| |
| Another helpful frame of mind is to hate it when other people find bugs in |
| your code. |
| This hatred can help motivate you to torture your code beyond all reason |
| in order to increase the probability that you will be the one to find |
| the bugs. |
| Just make sure to suspend this hatred long enough to sincerely thank |
| anyone who does find a bug in your code! |
After all, by so doing, they saved you the trouble of tracking it down,
quite possibly at great personal expense as they dredged through your code.
| |
| Yet another helpful frame of mind is studied skepticism. |
| You see, believing that you understand the code means you can learn |
| absolutely nothing about it. |
| Ah, but you know that you completely understand the code because you |
| wrote or reviewed it? |
| Sorry, but the presence of bugs suggests that your understanding is at |
| least partially fallacious. |
| One cure is to write down what you know to be true and double-check this |
| knowledge, as discussed in |
| \crefrange{sec:debugging:Tracing}{sec:debugging:Code Review}. |
| Objective reality \emph{always} overrides whatever you |
| might think you know. |
| |
| One final frame of mind is to consider the possibility that someone's |
| life depends on your code being correct. |
| One way of looking at this is that consistently making good things happen |
| requires a lot of focus on a lot of bad things that might happen, with |
| an eye towards preventing or otherwise handling those bad things.\footnote{ |
| For more on this philosophy, see the chapter entitled |
| ``The Power of Negative Thinking'' |
| from Chris Hadfield's excellent book entitled |
| ``An Astronaut's Guide to Life on Earth.''} |
| The prospect of these bad things might also motivate you to torture your |
| code into revealing the whereabouts of its bugs. |
| |
This wide variety of frames of mind opens the door to having multiple
people, each with a different frame of mind and level of optimism,
contribute to the same project.
| This can work well, if properly organized. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/TortureTux}} |
| \caption{Validation and the Geneva Convention} |
| \ContributedBy{fig:debugging:Validation and the Geneva Convention}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/TortureLaptop}} |
| \caption{Rationalizing Validation} |
| \ContributedBy{fig:debugging:Rationalizing Validation}{Melissa Broussard} |
| \end{figure} |
| |
| Some people might see vigorous validation as a form of torture, as |
| depicted in |
| \cref{fig:debugging:Validation and the Geneva Convention}.\footnote{ |
| The cynics among us might question whether these people are |
| afraid that validation will find bugs that they will then be |
| required to fix.} |
| Such people might do well to remind themselves that, Tux cartoons aside, |
| they are really torturing an inanimate object, as shown in |
| \cref{fig:debugging:Rationalizing Validation}. |
| Rest assured that those who fail to torture their code are doomed to be |
| tortured by it! |
| |
| However, this leaves open the question of exactly when during the project |
| lifetime validation should start, a topic taken up by the next section. |
| |
| \subsection{When Should Validation Start?} |
| \label{sec:debugging:When Should Validation Start?} |
| |
| Validation should start exactly when the project starts. |
| |
| To see this, consider that tracking down a bug is much harder in a large |
| program than in a small one. |
| Therefore, to minimize the time and effort required to track down bugs, |
| you should test small units of code. |
| Although you won't find all the bugs this way, you will find a substantial |
| fraction, and it will be much easier to find and fix the ones you do find. |
| Testing at this level can also alert you to larger flaws in your overall |
| design, minimizing the time you waste writing code that is broken |
| by design. |
| |
| But why wait until you have code before validating your design?\footnote{ |
| The old saying ``First we must code, then we have incentive to |
| think'' notwithstanding.} |
| Hopefully reading \cref{chp:Hardware and its Habits,% |
| chp:Tools of the Trade} provided you with the information |
| required to avoid some regrettably common design flaws, |
| but discussing your design with a colleague or even simply writing it |
| down can help flush out additional flaws. |
| |
| However, it is all too often the case that waiting to start validation |
| until you have a design is waiting too long. |
Mightn't your natural level of optimism have caused you to start the
design before you fully understood the requirements?
| The answer to this question will almost always be ``yes''. |
| One good way to avoid flawed requirements is to get to know your users. |
| To really serve them well, you will have to live among them. |
| |
| \QuickQuiz{ |
| You are asking me to do all this validation BS before |
| I even start coding??? |
| That sounds like a great way to never get started!!! |
| }\QuickQuizAnswer{ |
| If it is your project, for example, a hobby, do what you like. |
| Any time you waste will be your own, and you have no one else |
| to answer to for it. |
| And there is a good chance that the time will not be completely |
| wasted. |
| For example, if you are embarking on a first-of-a-kind project, |
| the requirements are in some sense unknowable anyway. |
| In this case, the best approach might be to quickly prototype |
| a number of rough solutions, try them out, and see what works |
| best. |
| |
| On the other hand, if you are being paid to produce a system that |
| is broadly similar to existing systems, you owe it to your users, |
| your employer, and your future self to validate early and often. |
| }\QuickQuizEnd |
| |
| First-of-a-kind projects often use different methodologies such as |
| rapid prototyping or agile. |
Here, the main goal of early prototypes is not to create correct
implementations, but rather to learn the project's requirements.
| But this does not mean that you omit validation; it instead means that |
| you approach it differently. |
| |
| One such approach takes a Darwinian view, with the validation suite |
| eliminating code that is not fit to solve the problem at hand. |
| From this viewpoint, a vigorous validation suite is essential to the |
| fitness of your software. |
| However, taking this approach to its logical conclusion is quite humbling, |
| as it requires us developers to admit that our carefully crafted changes |
| to the codebase are, from a Darwinian standpoint, random mutations. |
| On the other hand, this conclusion is supported by long experience |
| indicating that seven percent of fixes introduce at least one |
| bug~\cite{RexBlack2012SQA}. |
| |
| How vigorous should your validation suite be? |
| If the bugs it finds aren't threatening the very foundations of your |
| software design, then it is not yet vigorous enough. |
| After all, your design is just as prone to bugs as is your code, and |
| the earlier you find and fix the bugs in your design, the less time |
| you will waste coding those design bugs. |
| |
| \QuickQuiz{ |
| Are you actually suggesting that it is possible to test |
| correctness into software??? |
| Everyone knows that is impossible!!! |
| }\QuickQuizAnswer{ |
| Please note that the text used the word ``validation'' rather |
| than the word ``testing''. |
| The word ``validation'' includes formal methods as well as |
| testing, for more on which please see |
| \cref{chp:Formal Verification}. |
| |
| But as long as we are bringing up things that everyone should |
| know, let's remind ourselves that Darwinian evolution is |
| not about correctness, but rather about survival. |
| As is software. |
| My goal as a developer is not that my software be attractive |
| from a theoretical viewpoint, but rather that it survive |
| whatever its users throw at it. |
| |
| Although the notion of correctness does have its uses, its |
| fundamental limitation is that the specification against which |
| correctness is judged will also have bugs. |
| This means nothing more nor less than that traditional correctness |
| proofs prove that the code in question contains the intended |
| set of bugs! |
| |
| Alternative definitions of correctness instead focus on the |
| lack of problematic properties, for example, proving that the |
| software has no use-after-free bugs, no \co{NULL} pointer |
| dereferences, no array-out-of-bounds references, and so on. |
| Make no mistake, finding and eliminating such classes of bugs |
| can be highly useful. |
| But the fact remains that the lack of certain classes of bugs |
| does nothing to demonstrate fitness for any specific purpose. |
| |
| Therefore, usage-driven validation remains critically important. |
| |
| Besides, it is also impossible to verify correctness into your |
| software, especially given the problematic need to verify both |
| the verifier and the specification. |
| }\QuickQuizEnd |
| |
| It is worth reiterating that this advice applies to first-of-a-kind |
| projects. |
| If you are instead doing a project in a well-explored area, you would |
| be quite foolish to refuse to learn from previous experience. |
But you should still start validating right at the beginning of the
project, hopefully guided by others' hard-won knowledge of both
requirements and pitfalls.
| |
| An equally important question is ``When should validation stop?'' |
| The best answer is ``Some time after the last change.'' |
| Every change has the potential to create a bug, and thus every |
| change must be validated. |
| Furthermore, validation development should continue through the |
| full lifetime of the project. |
| After all, the Darwinian perspective above implies that bugs are |
| adapting to your validation suite. |
| Therefore, unless you continually improve your validation suite, your |
| project will naturally accumulate hordes of validation-suite-immune bugs. |
| |
But life is a tradeoff, and every bit of time invested in validation
suites is a bit of time that cannot be invested in directly improving
the project itself.
| These sorts of choices are never easy, and it can be just as damaging |
| to overinvest in validation as it can be to underinvest. |
| But this is just one more indication that life is not easy. |
| |
| Now that we have established that you should start validation when |
| you start the project (if not earlier!), and that both validation and |
| validation development should continue throughout the lifetime of that |
| project, the following sections cover a number of validation techniques |
| and methods that have proven their worth. |
| |
| \subsection{The Open Source Way} |
| \label{sec:debugging:The Open Source Way} |
| |
| The open-source programming methodology has proven quite effective, and |
| includes a regimen of intense code review and testing. |
| |
| I can personally attest to the effectiveness of the open-source community's |
| intense code review. |
| One of my first patches to the Linux kernel involved a distributed |
| filesystem where one node might write to a given file that another node |
| has mapped into memory. |
| In this case, it is necessary to invalidate the affected pages from |
| the mapping in order to allow the filesystem to maintain coherence |
| during the write operation. |
| I coded up a first attempt at a patch, and, in keeping with the open-source |
| maxim ``post early, post often'', I posted the patch. |
| I then considered how I was going to test it. |
| |
| But before I could even decide on an overall test strategy, I got a |
| reply to my posting pointing out a few bugs. |
| I fixed the bugs and reposted the patch, and returned to thinking |
| out my test strategy. |
| However, before I had a chance to write any test code, I received |
| a reply to my reposted patch, pointing out more bugs. |
| This process repeated itself many times, and I am not sure that I |
| ever got a chance to actually test the patch. |
| |
| This experience brought home the truth of the open-source saying: |
| Given enough eyeballs, all bugs are shallow~\cite{EricSRaymond99b}. |
| |
| However, when you post some code or a given patch, it is worth |
| asking a few questions: |
| |
| \begin{enumerate} |
| \item How many of those eyeballs are actually going to look at your code? |
| \item How many will be experienced and clever enough to actually find |
| your bugs? |
| \item Exactly when are they going to look? |
| \end{enumerate} |
| |
| I was lucky: |
| There was someone out there who wanted the functionality |
| provided by my patch, who had long experience with distributed filesystems, |
| and who looked at my patch almost immediately. |
| If no one had looked at my patch, there would have been no review, and |
| therefore none of those bugs would have been located. |
| If the people looking at my patch had lacked experience with distributed |
| filesystems, it is unlikely that they would have found all the bugs. |
Had they waited months or even years to look, I likely would have forgotten
how the patch was supposed to work, making it much more difficult to
fix the bugs they found.
| |
However, we must not forget the second tenet of open-source development,
namely, intensive testing.
| For example, a great many people test the Linux kernel. |
| Some test patches as they are submitted, perhaps even yours. |
Others test the -next tree, which is helpful, but there are likely to be
several weeks or even months of delay between the time that you write the
patch and the time that it appears in the -next tree, by which time the
patch will not be quite as fresh in your mind.
| Still others test maintainer trees, which often have a similar time delay. |
| |
| Quite a few people don't test code until it is committed to mainline, |
| or the master source tree (Linus's tree in the case of the Linux kernel). |
| If your maintainer won't accept your patch until it has been tested, |
| this presents you with a deadlock situation: |
| Your patch won't be accepted until it is tested, but it won't be tested |
| until it is accepted. |
| Nevertheless, people who test mainline code are still relatively |
| aggressive, given that many people and organizations do not test code |
| until it has been pulled into a Linux distro. |
| |
| And even if someone does test your patch, there is no guarantee that they |
| will be running the hardware and software configuration and workload |
| required to locate your bugs. |
| |
| Therefore, even when writing code for an open-source project, you need to |
| be prepared to develop and run your own test suite. |
| Test development is an underappreciated and very valuable skill, so be |
| sure to take full advantage of any existing test suites available to |
| you. |
| Important as test development is, we must leave further discussion of it |
| to books dedicated to that topic. |
| The following sections therefore discuss locating bugs in your code given that |
| you already have a good test suite. |
| |
| \section{Tracing} |
| \label{sec:debugging:Tracing} |
| % |
| \epigraph{The machine knows what is wrong. |
| Make it tell you.}{Unknown} |
| |
| When all else fails, add a \co{printk()}! |
| Or a \co{printf()}, if you are working with user-mode C-language applications. |
| |
| The rationale is simple: |
| If you cannot figure out how execution reached a given point in the code, |
| sprinkle print statements earlier in the code to work out what happened. |
| You can get a similar effect, and with more convenience and flexibility, |
| by using a debugger such as gdb (for user applications) or kgdb |
| (for debugging Linux kernels). |
| Much more sophisticated tools exist, with some of the more recent |
| offering the ability to rewind backwards in time from the point |
| of failure. |
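
For example, a hypothetical user-mode debug print might look like
the following, with \co{state} standing in for whatever variable
you happen to suspect:

\begin{VerbatimN}
printf("DEBUG %s:%d: state=%d\n",
       __FILE__, __LINE__, state);
\end{VerbatimN}

Running the failing test then shows which such statements were
reached, and with what values, narrowing down the point of failure.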
| |
| These brute-force testing tools are all valuable, especially now |
| that typical systems have more than 64K of memory and CPUs running |
| faster than 4\,MHz. |
| Much has been |
| written about these tools, so this chapter will add only a little more. |
| |
However, these tools all have a serious shortcoming when you need a
fastpath to tell you what is going wrong, namely, they often have
excessive overheads.
| There are special tracing technologies for this purpose, which typically |
| leverage data ownership techniques |
| (see \cref{chp:Data Ownership}) |
| to minimize the overhead of runtime data collection. |
| One example within the Linux kernel is |
| ``trace events''~\cite{StevenRostedt2010perfTraceEventP1,StevenRostedt2010perfTraceEventP2,StevenRostedt2010perfTraceEventP3,StevenRostedt2010perfHP+DeathlyMacros}, |
| which uses per-CPU buffers to allow data to be collected with |
| extremely low overhead. |
| Even so, enabling tracing can sometimes change timing enough to |
| hide bugs, resulting in \emph{heisenbugs}, which are discussed in |
| \cref{sec:debugging:Probability and Heisenbugs} |
| and especially \cref{sec:debugging:Hunting Heisenbugs}. |
| In the kernel, BPF can do data reduction in the kernel, reducing |
| the overhead of transmitting the needed information from the kernel |
| to userspace~\cite{BrendanGregg2019BPFperftools}. |
| In userspace code, there is a huge number of tools that can help you. |
| One good starting point is Brendan Gregg's blog.\footnote{ |
| \url{http://www.brendangregg.com/blog/}} |
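
For example, within the Linux kernel, a minimal sketch of this
low-overhead approach is \co{trace_printk()}, which records its
output in the lock-free per-CPU trace buffers rather than writing
to the console (\co{state} is again a hypothetical variable):

\begin{VerbatimN}
/* Output appears in the kernel trace buffer, */
/* e.g., debugfs tracing/trace, not console.  */
trace_printk("fastpath: cpu=%d state=%d\n",
	     raw_smp_processor_id(), state);
\end{VerbatimN}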
| |
| Even if you avoid heisenbugs, other pitfalls await you. |
| For example, although the machine really does know all, |
| what it knows is almost always way more than your head can hold. |
| For this reason, high-quality test suites normally come with sophisticated |
| scripts to analyze the voluminous output. |
| But beware---scripts will only notice what you tell them to. |
| My \co{rcutorture} scripts are a case in point: |
| Early versions of those scripts were quite satisfied with a test run |
| in which RCU \IXpl{grace period} stalled indefinitely. |
| This of course resulted in the scripts being modified to detect RCU |
| grace-period stalls, but this does not change the fact that the scripts |
| will only detect problems that I make them detect. |
| But note well that unless you have a solid design, you won't know what |
| your script should check for! |
| |
| Another problem with tracing and especially with \co{printk()} calls |
| is that their overhead can rule out production use. |
| In such cases, assertions can be helpful. |
| |
| \section{Assertions} |
| \label{sec:debugging:Assertions} |
| % |
| \epigraph{No man really becomes a fool until he stops asking questions.} |
| {Charles P. Steinmetz} |
| |
| Assertions are usually implemented in the following manner: |
| |
| \begin{VerbatimN} |
| if (something_bad_is_happening()) |
| complain(); |
| \end{VerbatimN} |
| |
This pattern is often encapsulated into C-preprocessor macros or
language intrinsics.
For example, in the Linux kernel, this might be represented as
\co{WARN_ON(something_bad_is_happening())}.
Of course, if \co{something_bad_is_happening()} returns true quite
frequently, the resulting output might obscure reports of other problems,
in which case
\co{WARN_ON_ONCE(something_bad_is_happening())} might be more appropriate.
| |
| \QuickQuiz{ |
| How can you implement \co{WARN_ON_ONCE()}? |
| }\QuickQuizAnswer{ |
| If you don't mind \co{WARN_ON_ONCE()} sometimes warning more |
| than once, simply maintain a static variable that is initialized |
| to zero. |
| If the condition triggers, check the variable, and |
| if it is non-zero, return. |
| Otherwise, set it to one, print the message, and return. |
| |
| If you really need the message to never appear more than once, |
| you can use an atomic exchange operation in place of ``set it |
| to one'' above. |
| Print the message only if the atomic exchange operation returns |
| zero. |
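
	For example, here is a minimal user-mode sketch of the
	never-more-than-once variant, using a GCC atomic builtin
	as the atomic exchange operation:

\begin{VerbatimN}
/* Sketch only; requires <stdio.h>. */
#define WARN_ON_ONCE(cond)				\
do {							\
	static int warned;				\
							\
	if ((cond) &&					\
	    !__atomic_exchange_n(&warned, 1,		\
				 __ATOMIC_RELAXED))	\
		fprintf(stderr, "WARNING at %s:%d\n",	\
			__FILE__, __LINE__);		\
} while (0)
\end{VerbatimN}

	The first trigger exchanges a one into \co{warned} and gets
	the initial zero in return, so that trigger and no other
	prints the message.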
| }\QuickQuizEnd |
| |
| In parallel code, one bad something that might happen is that |
| a function expecting to be called under a particular lock might be called |
| without that lock being held. |
| Such functions sometimes have header comments stating something like |
| ``The caller must hold \co{foo_lock} when calling this function'', but |
| such a comment does no good unless someone actually reads it. |
| An executable statement carries far more weight. |
| The Linux kernel's lockdep |
| facility~\cite{JonathanCorbet2006lockdep,StevenRostedt2011locdepCryptic} |
| therefore provides a \co{lockdep_assert_held()} function that checks |
| whether the specified lock is held. |
| Of course, lockdep incurs significant overhead, and thus might not be |
| helpful in production. |
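
For example, the sketch below makes the header comment's demand
executable (\co{struct foo} and \co{foo_lock} are hypothetical):

\begin{VerbatimN}
/* Caller must hold foo_lock. */
static void foo_update(struct foo *fp, int newval)
{
	lockdep_assert_held(&foo_lock);	/* Warn if not! */
	fp->val = newval;
}
\end{VerbatimN}

In lockdep-enabled debug kernels, any call with \co{foo_lock} not
held produces a warning, while in non-lockdep kernels the check
compiles to nothing.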
| |
| An especially bad parallel-code something is unexpected concurrent |
| access to data. |
| The \IXBacrfst{kcsan}~\cite{JonathanCorbet2019KCSAN} |
| uses existing markings such as \co{READ_ONCE()} and \co{WRITE_ONCE()} |
| to determine which concurrent accesses deserve warning messages. |
| KCSAN has a significant false-positive rate, especially from the |
| viewpoint of developers thinking in terms of C as assembly language |
| with additional syntax. |
| KCSAN therefore provides a \co{data_race()} construct to forgive |
| known-benign \IXpl{data race}, and also the \co{ASSERT_EXCLUSIVE_ACCESS()} |
| and \co{ASSERT_EXCLUSIVE_WRITER()} assertions to explicitly check for data |
| races~\cite{MarcoElver2020FearDataRaceDetector1,MarcoElver2020FearDataRaceDetector2}. |
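
For example, an update-side function that believes itself to be the
sole writer might say so explicitly, as in the following sketch
(\co{struct foo} and its \co{->counter} field are hypothetical):

\begin{VerbatimN}
/* Only one foo_inc() at a time, please! */
static void foo_inc(struct foo *fp)
{
	/* KCSAN: Complain about concurrent writers. */
	ASSERT_EXCLUSIVE_WRITER(fp->counter);
	WRITE_ONCE(fp->counter, fp->counter + 1);
}
\end{VerbatimN}

Concurrent marked readers remain legitimate, but any write to
\co{->counter} that races with \co{foo_inc()} will draw a KCSAN
warning.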
| |
| So what can be done in cases where checking is necessary, but where the |
| overhead of runtime checking cannot be tolerated? |
| One approach is static analysis, which is discussed in the next section. |
| |
| \section{Static Analysis} |
| \label{sec:debugging:Static Analysis} |
| % |
| \epigraph{A lot of automation isn't a replacement of |
| humans but of mind-numbing behavior.} |
| {Summarized from Stewart Butterfield} |
| |
| Static analysis is a validation technique where one program takes a second |
| program as input, reporting errors and vulnerabilities located in this |
| second program. |
| Interestingly enough, almost all programs are statically analyzed |
| by their compilers or interpreters. |
| These tools are far from perfect, but their ability to locate |
| errors has improved immensely over the past few decades, in part because |
| they now have much more than 64K bytes of memory in which to carry out their |
| analyses. |
| |
| The original UNIX \co{lint} tool~\cite{StephenJohnson1977lint} was |
| quite useful, though much of its functionality has since been incorporated |
| into C compilers. |
| There are nevertheless lint-like tools in use to this day. |
| The sparse static analyzer~\cite{JonathanCorbet2004sparse} |
| finds higher-level issues in the Linux kernel, including: |
| |
| \begin{enumerate} |
\item Misuse of pointers to user-space structures (see the sketch
	following this list).
| \item Assignments from too-long constants. |
| \item Empty \co{switch} statements. |
| \item Mismatched lock acquisition and release primitives. |
| \item Misuse of per-CPU primitives. |
| \item Use of RCU primitives on non-RCU pointers and vice versa. |
| \end{enumerate} |
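
As a sketch of the first of these, consider the following
hypothetical ioctl helper, in which sparse's \co{__user} annotation
flags the commented-out direct dereference:

\begin{VerbatimN}
/* Fetch an int from userspace safely. */
static long my_ioctl_get(int __user *argp, int *valp)
{
	/* *valp = *argp; <-- sparse complains. */
	if (copy_from_user(valp, argp, sizeof(*valp)))
		return -EFAULT;
	return 0;
}
\end{VerbatimN}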
| |
The coccinelle tool can be thought of as a C-syntax-aware
search-and-replace facility.
| There is also a large body of scripts that use this tool to locate |
| and fix some classes of Linux-kernel |
| bugs~\cite{NicolasPalix2011CoccinelleTenYears}. |
| |
| Although it is likely that compilers will continue to increase their |
| static-analysis capabilities, coccinelle and the sparse static analyzer |
| demonstrate the benefits of static analysis outside of the compiler, |
| particularly for finding application-specific bugs. |
| \Crefrange{sec:formal:SAT Solvers}{sec:formal:Stateless Model Checkers} |
| describe more sophisticated forms of static analysis. |
| |
| \section{Code Review} |
| \label{sec:debugging:Code Review} |
| % |
| \epigraph{If a man speaks of my virtues, he steals from me; |
| if he speaks of my vices, then he is my teacher.} |
| {Chinese proverb} |
| |
| Code review is a special case of static analysis with human beings doing |
| the analysis. |
| Human beings are of course subject to inattention, fatigue, and errors, |
| which underscores the importance of static analysis and testing. |
| Done properly, these two activities can automate at least some aspects |
| of code review. |
| However, it does not appear that testing and static analysis will be able |
| to completely replace manual review any time soon. |
| This section therefore covers inspection, walkthroughs, and self-inspection. |
| |
| \subsection{Inspection} |
| \label{sec:debugging:Inspection} |
| |
| Traditionally, formal code inspections take place in face-to-face meetings |
| with formally defined roles: |
| Moderator, developer, and one or two other participants. |
| The developer reads through the code, explaining what it is doing and |
| why it works. |
| The one or two other participants ask questions and raise issues, |
| hopefully exposing the author's invalid assumptions, while the moderator's |
| job is to resolve any resulting conflicts and take notes. |
| This process can be extremely effective at locating bugs, particularly |
| if all of the participants are familiar with the code at hand. |
| |
| However, this face-to-face formal procedure does not necessarily |
| work well in the global Linux kernel community. |
| Instead, individuals review code separately and provide comments via |
| email or IRC\@. |
| The note-taking is provided by email archives or IRC logs, and moderators |
| volunteer their services as required by the occasional flamewar. |
| This process also works reasonably well, particularly if all of the |
| participants are familiar with the code at hand. |
| In fact, one advantage of the Linux kernel community approach over |
| traditional formal inspections is the greater probability of contributions |
| from people \emph{not} familiar with the code, who might not be blinded |
| by the author's invalid assumptions, and who might also test the code. |
| |
| \QuickQuiz{ |
| Just what invalid assumptions are you accusing Linux kernel |
| hackers of harboring??? |
| }\QuickQuizAnswer{ |
| Those wishing a complete answer to this question are encouraged |
| to search the Linux kernel \co{git} repository for commits |
| containing the string \qco{Fixes:}. |
| There were many thousands of them just in the year 2020, including |
| fixes for the following invalid assumptions: |
| |
| \begin{enumerate} |
| \item Testing for a non-zero denominator will prevent |
| divide-by-zero errors. |
| (Hint: |
| Suppose that the test uses 64-bit arithmetic |
| but that the division uses 32-bit arithmetic.) |
| \item Userspace can be trusted to zero out versioned data |
| structures used to communicate with the kernel. |
| (Hint: |
| Sometimes userspace has no idea how large the |
| data structure is.) |
| \item Outdated TCP duplicate selective acknowledgement (D-SACK) |
| packets can be completely ignored. |
| (Hint: |
| These packets might also contain other information.) |
| \item All CPUs are little-endian. |
| \item Once a data structure is no longer needed, all of its |
| memory may be immediately freed. |
| \item All devices can be initialized while in standby mode. |
	\item Developers can be trusted to consistently do correct
		hexadecimal arithmetic.
| \end{enumerate} |
| |
| Those who look at these commits in greater detail will conclude |
| that invalid assumptions are the rule, not the exception. |
| }\QuickQuizEnd |
| |
| It is quite likely that the Linux kernel community's review process |
| is ripe for improvement: |
| |
| \begin{enumerate} |
| \item There is sometimes a shortage of people with the time and |
| expertise required to carry out an effective review. |
| \item Even though all review discussions are archived, they are |
| often ``lost'' in the sense that insights are forgotten and |
| people fail to look up the discussions. |
| This can result in re-insertion of the same old bugs. |
| \item It is sometimes difficult to resolve flamewars when they do |
| break out, especially when the combatants have disjoint |
| goals, experience, and vocabulary. |
| \end{enumerate} |
| |
| Perhaps some of the needed improvements will be provided by |
| continuous-integration-style testing, but there are many bugs more |
| easily found by review than by testing. |
| When reviewing, therefore, it is worthwhile to look at relevant documentation |
| in commit logs, bug reports, and LWN articles. |
| This documentation can help you quickly build up the required expertise. |
| |
| \subsection{Walkthroughs} |
| \label{sec:debugging:Walkthroughs} |
| |
| A traditional code walkthrough is similar to a formal inspection, |
| except that the group |
| ``plays computer'' with the code, driven by specific test cases. |
| A typical walkthrough team has a moderator, a secretary (who records |
| bugs found), a testing expert (who generates the test cases) and |
perhaps one or two others.
| These can be extremely effective, albeit also extremely time-consuming. |
| |
| It has been some decades since I have participated in a formal |
| walkthrough, and I suspect that a present-day walkthrough would |
| use single-stepping debuggers. |
| One could imagine a particularly sadistic procedure as follows: |
| |
| \begin{enumerate} |
| \item The tester presents the test case. |
| \item The moderator starts the code under a debugger, using the |
| specified test case as input. |
| \item Before each statement is executed, the developer is required |
| to predict the outcome of the statement and explain why |
| this outcome is correct. |
| \item If the outcome differs from that predicted by the developer, |
| this is taken as a potential bug. |
| \item In parallel code, a ``concurrency shark'' asks what code |
| might execute concurrently with this code, and why such |
| concurrency is harmless. |
| \end{enumerate} |
| |
| Sadistic, certainly. |
| Effective? |
| Maybe. |
| If the participants have a good understanding of the requirements, |
| software tools, data structures, and algorithms, then walkthroughs |
| can be extremely effective. |
| If not, walkthroughs are often a waste of time. |
| |
| \subsection{Self-Inspection} |
| \label{sec:debugging:Self-Inspection} |
| |
| Although developers are usually not all that effective at inspecting |
| their own code, there are a number of situations where there is no |
| reasonable alternative. |
| For example, the developer might be the only person authorized to look |
| at the code, other qualified developers might all be too busy, or |
| the code in question might be sufficiently bizarre that the developer |
| is unable to convince anyone else to take it seriously until after |
| demonstrating a prototype. |
| In these cases, the following procedure can be quite helpful, |
| especially for complex parallel code: |
| |
| \begin{enumerate} |
\item Write a design document with requirements, diagrams for data structures,
| and rationale for design choices. |
| \item Consult with experts, updating the design document as needed. |
| \item Write the code in pen on paper, correcting errors as you go. |
	Resist the temptation to refer to pre-existing nearly identical
	code sequences; instead, copy them.
| \item At each step, articulate and question your assumptions, |
| inserting assertions or constructing tests to check them. |
| \item If there were errors, copy the code in pen on fresh paper, correcting |
| errors as you go. |
| Repeat until the last two copies are identical. |
| \item Produce proofs of correctness for any non-obvious code. |
| \item Use a source-code control system. |
| Commit early; commit often. |
| \item Test the code fragments from the bottom up. |
| \item When all the code is integrated (but preferably before), |
| do full-up functional and stress testing. |
| \item Once the code passes all tests, write code-level documentation, |
| perhaps as an extension to the design document discussed above. |
| Fix both the code and the test code as needed. |
| \end{enumerate} |
| |
| When I follow this procedure for new RCU code, there are normally only |
| a few bugs left at the end. |
| With a few prominent (and embarrassing) |
| exceptions~\cite{PaulEMcKenney2011RCU3.0trainwreck}, |
| I usually manage to locate these bugs before others do. |
| That said, this is getting more difficult over time as the number and |
| variety of Linux-kernel users increases. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Why would anyone bother copying existing code in pen on paper??? |
| Doesn't that just increase the probability of transcription errors? |
| }\QuickQuizAnswerB{ |
| If you are worried about transcription errors, please allow me |
| to be the first to introduce you to a really cool tool named |
| \co{diff}. |
| In addition, carrying out the copying can be quite valuable: |
| \begin{enumerate} |
| \item If you are copying a lot of code, you are probably failing |
| to take advantage of an opportunity for abstraction. |
| The act of copying code can provide great motivation |
| for abstraction. |
| \item Copying the code gives you an opportunity to think about |
| whether the code really works in its new setting. |
| Is there some non-obvious constraint, such as the need |
| to disable interrupts or to hold some lock? |
| \item Copying the code also gives you time to consider whether |
| there is some better way to get the job done. |
| \end{enumerate} |
| So, yes, copy the code! |
| }\QuickQuizEndB |
| % |
| \QuickQuizM{ |
| This procedure is ridiculously over-engineered! |
| How can you expect to get a reasonable amount of software |
| written doing it this way??? |
| }\QuickQuizAnswerM{ |
| Indeed, repeatedly copying code by hand is laborious and slow. |
| However, when combined with heavy-duty stress testing and |
| proofs of correctness, this approach is also extremely effective |
| for complex parallel code where ultimate performance and |
| reliability are required and where debugging is difficult. |
| The Linux-kernel RCU implementation is a case in point. |
| |
| On the other hand, if you are writing a simple single-threaded |
| shell script, then you would be best-served by a different |
| methodology. |
| For example, enter each command one at a time into an interactive |
| shell with a test data set to make sure that it does what you |
| want, then copy-and-paste the successful commands into your |
| script. |
| Finally, test the script as a whole. |
| |
| If you have a friend or colleague who is willing to help out, |
| pair programming can work very well, as can any number of |
| formal design- and code-review processes. |
| |
| And if you are writing code as a hobby, then do whatever you like. |
| |
| In short, different types of software need different development |
| methodologies. |
| }\QuickQuizEndM |
| % |
| \QuickQuizE{ |
| What do you do if, after all the pen-on-paper copying, you find |
| a bug while typing in the resulting code? |
| }\QuickQuizAnswerE{ |
| The answer, as is often the case, is ``it depends''. |
| If the bug is a simple typo, fix that typo and continue typing. |
| However, if the bug indicates a design flaw, go back to pen |
| and paper. |
| }\QuickQuizEndE |
| } |
| |
| The above procedure works well for new code, but what if you need to |
| inspect code that you have already written? |
| You can of course apply the above procedure for old code in the special |
| case where you wrote one to throw away~\cite{Brooks79}, |
| but the following approach can also be helpful in less desperate |
| circumstances: |
| |
| \begin{enumerate} |
| \item Using your favorite documentation tool (\LaTeX{}, HTML, |
| OpenOffice, or straight ASCII), describe the high-level |
| design of the code in question. |
| Use lots of diagrams to illustrate the data structures |
| and how these structures are updated. |
| \item Make a copy of the code, stripping away all comments. |
| \item Document what the code does statement by statement. |
| \item Fix bugs as you find them. |
| \end{enumerate} |
| |
| This works because describing the code in detail is an excellent way to spot |
| bugs~\cite{GlenfordJMyers1979}. |
| This second procedure is also a good way to get your head around |
| someone else's code, although the first step often suffices. |
| |
| Although review and inspection by others is probably more efficient and |
| effective, the above procedures can be quite helpful in cases where |
| for whatever reason it is not feasible to involve others. |
| |
| At this point, you might be wondering how to write parallel code without |
| having to do all this boring paperwork. |
| Here are some time-tested ways of accomplishing this: |
| |
| \begin{enumerate} |
| \item Write a sequential program that scales through use of |
| available parallel library functions. |
| \item Write sequential plug-ins for a parallel framework, |
| such as map-reduce, BOINC, or a web-application server. |
| \item Fully partition your problems, then implement sequential |
| program(s) that run in parallel without communication. |
| \item Stick to one of the application areas (such as linear algebra) |
| where tools can automatically decompose and parallelize |
| the problem. |
| \item Make extremely disciplined use of parallel-programming |
| primitives, so that the resulting code is easily seen to be correct. |
| But beware: |
| It is always tempting to break the rules ``just a little bit'' |
| to gain better performance or scalability. |
| Breaking the rules often results in general breakage. |
| That is, unless you carefully do the paperwork described in this |
| section. |
| \end{enumerate} |
| |
| But the sad fact is that even if you do the paperwork or use one of |
| the above ways to more-or-less safely avoid paperwork, |
| there will be bugs. |
| If nothing else, more users and a greater variety of users will expose |
| more bugs more quickly, especially if those users are doing things |
| that the original developers did not consider. |
| The next section describes how to handle the probabilistic bugs that |
| occur all too commonly when validating parallel software. |
| |
| \QuickQuiz{ |
| Wait! |
| Why on earth would an abstract piece of software fail only |
| sometimes??? |
| }\QuickQuizAnswer{ |
| Because complexity and concurrency can produce results that |
| are indistinguishable from |
| randomness~\cite{PeterOkech2009InherentRandomness}. |
| For example, a bug in Linux-kernel RCU required the following |
| to hold before that bug would manifest: |
| \begin{enumerate} |
| \item The kernel was built for HPC or real-time use, so that |
| a given CPU's RCU work could be offloaded to some other |
| CPU\@. |
| \item An offloaded CPU went offline just after generating a |
| large quantity of RCU work. |
| \item A special \co{rcu_barrier()} API was invoked just at |
| this time. |
| \item The RCU work from the newly offlined CPU was still being |
| processed after \co{rcu_barrier()} returned. |
| \item One of these remaining RCU work items was related to |
| the code invoking the \co{rcu_barrier()}. |
| \end{enumerate} |
| Making this bug manifest therefore required considerable luck |
| or great testing skill. |
| But the testing skill could be effective only if the bug was |
| known, which of course it was not. |
| Therefore, the manifesting of this bug was very well modeled |
| as a probabilistic process. |
| }\QuickQuizEnd |
| |
| \section{Probability and Heisenbugs} |
| \label{sec:debugging:Probability and Heisenbugs} |
| % |
| \epigraph{With both heisenbugs and impressionist art, the closer you |
| get, the less you see.} |
| {Unknown} |
| |
| So your parallel program fails sometimes. |
| But you used techniques from the earlier sections to locate |
| the problem and now have a fix in place! |
| Congratulations!!! |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Passed-the-stress-test}} |
| \caption{Passed on Merits? |
| Or Dumb Luck?} |
| \ContributedBy{fig:cpu:Passed-the-stress-test}{Melissa Broussard} |
| \end{figure} |
| |
| Now the question is just how much testing is required in order to be |
| certain that |
| you actually fixed the bug, as opposed to just reducing the probability |
| of it occurring on the one hand, having fixed only one of several |
related bugs on the other hand, or having made some ineffectual unrelated
change on yet a third hand.
| In short, what is the answer to the eternal question posed by |
| \cref{fig:cpu:Passed-the-stress-test}? |
| |
| Unfortunately, the honest answer is that an infinite amount of testing |
| is required to attain absolute certainty. |
| |
| \QuickQuiz{ |
| Suppose that you had a very large number of systems at your |
| disposal. |
| For example, at current cloud prices, you can purchase a |
| huge amount of CPU time at low cost. |
| Why not use this approach to get close enough to certainty |
| for all practical purposes? |
| }\QuickQuizAnswer{ |
| This approach might well be a valuable addition to your |
| validation arsenal. |
| But it does have limitations that rule out ``for all practical |
| purposes'': |
| \begin{enumerate} |
| \item Some bugs have extremely low probabilities of occurrence, |
| but nevertheless need to be fixed. |
| For example, suppose that the Linux kernel's RCU |
| implementation had a bug that is triggered only once |
| per million years of machine time on average. |
| A million years of CPU time is hugely expensive even on |
| the cheapest cloud platforms, but we could expect |
| this bug to result in more than 50 failures per day |
| on the more than 20~billion Linux instances in the |
| world as of 2017. |
| \item The bug might well have zero probability of occurrence |
| on your particular cloud-computing test setup, which |
| means that you won't see it no matter how much machine |
| time you burn testing it. |
| For but one example, there have been (and likely |
| still are) Linux-kernel RCU bugs that appear only |
| in preemptible kernels, others that appear only if |
| callbacks are offloaded, still others that appear only |
| in the presence of closely spaced CPU-hotplug operations, |
| and even more that can be provoked only on systems with |
| weak memory ordering. |
| \end{enumerate} |
| Of course, if your code is small enough, formal validation |
| may be helpful, as discussed in |
| \cref{chp:Formal Verification}. |
| But beware: |
| Formal validation of your code will not find errors in your |
| assumptions, misunderstanding of the requirements, |
| misunderstanding of the software or hardware primitives you use, |
| or errors that you did not think to construct a proof for. |
| }\QuickQuizEnd |
| |
| But suppose that we are willing to give up absolute certainty in favor |
| of high probability. |
| Then we can bring powerful statistical tools to bear on this problem. |
| However, this section will focus on simple statistical tools. |
| These tools are extremely helpful, but please note that reading this |
| section is not a substitute for statistics classes.\footnote{ |
| Which I most highly recommend. |
| The few statistics courses I have taken have provided value |
| far beyond that of the time I spent on them.} |
| |
| For our start with simple statistical tools, we need to decide whether |
| we are doing discrete or continuous testing. |
| Discrete testing features well-defined individual test runs. |
| For example, a boot-up test of a Linux kernel patch is an example |
| of a discrete test: |
| The kernel either comes up or it does not. |
| Although you might spend an hour boot-testing your kernel, the number of |
| times you attempted to boot the kernel and the number of times the |
| boot-up succeeded would often be of more interest than the length |
| of time you spent testing. |
| Functional tests tend to be discrete. |
| |
| On the other hand, if my patch involved RCU, I would probably run |
| \co{rcutorture}, which is a kernel module that, strangely enough, tests RCU\@. |
| Unlike booting the kernel, where the appearance of a login prompt |
| signals the successful end of a discrete test, \co{rcutorture} will happily |
| continue torturing RCU until either the kernel crashes or until you |
| tell it to stop. |
| The duration of the \co{rcutorture} test is usually of more |
| interest than the number of times you started and stopped it. |
| Therefore, \co{rcutorture} is an example of a continuous test, a category |
| that includes many stress tests. |
| |
| Statistics for discrete tests are simpler and more familiar than those |
| for continuous tests, and furthermore the statistics for discrete tests |
| can often be pressed into service for continuous tests, though with some |
| loss of accuracy. |
| We therefore start with discrete tests. |
| |
| \subsection{Statistics for Discrete Testing} |
| \label{sec:debugging:Statistics for Discrete Testing} |
| |
| Suppose a bug has a 10\pct\ chance of occurring in a given run and that |
| we do five runs. |
| How do we compute the probability of at least one run failing? |
| Here is one way: |
| |
| \begin{enumerate} |
| \item Compute the probability of a given run succeeding, which is 90\pct. |
| \item Compute the probability of all five runs succeeding, which |
| is 0.9 raised to the fifth power, or about 59\pct. |
| \item Because either all five runs succeed, or at least one fails, |
| subtract the 59\pct\ expected success rate from 100\pct, yielding |
| a 41\pct\ expected failure rate. |
| \end{enumerate} |
| |
| For those preferring formulas, call the probability of a single failure $f$. |
| The probability of a single success is then $1-f$ and the probability |
| that all of $n$ tests will succeed is $S_n$: |
| |
| \begin{equation} |
| S_n = \left(1-f\right)^n |
| \end{equation} |
| |
| The probability of failure is $1-S_n$, or: |
| |
| \begin{equation} |
| F_n = 1-\left(1-f\right)^n |
| \label{eq:debugging:Binomial Failure Rate} |
| \end{equation} |
| |
| \QuickQuiz{ |
| Say what??? |
| When I plug the earlier five-test 10\pct-failure-rate example into |
| the formula, I get 59,050\pct\ and that just doesn't make sense!!! |
| }\QuickQuizAnswer{ |
| You are right, that makes no sense at all. |
| |
| Remember that a probability is a number between zero and one, |
| so that you need to divide a percentage by 100 to get a |
| probability. |
	So 10\pct\ corresponds to a probability of 0.1, which yields a
	failure probability of 0.4095, which rounds to 41\pct\ and
	quite sensibly matches the earlier result.
| }\QuickQuizEnd |
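
If you would rather let the computer do the arithmetic, the following
minimal C sketch (mine, and not part of this book's code samples)
evaluates \cref{eq:debugging:Binomial Failure Rate} for the five-run
example:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void)
{
  double f = 0.1; /* Per-run failure probability. */
  int n = 5;      /* Number of runs. */

  /* F_n = 1 - (1 - f)^n. */
  printf("F_n = %.4f\n", 1.0 - pow(1.0 - f, n));
  return 0;       /* Prints "F_n = 0.4095". */
}
\end{verbatim}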
| |
| So suppose that a given test has been failing 10\pct\ of the time. |
| How many times do you have to run the test to be 99\pct\ sure that |
| your alleged fix actually helped? |
| |
| Another way to ask this question is ``How many times would we need |
| to run the test to cause the probability of failure to rise above 99\pct?'' |
| After all, if we were to run the test enough times that the probability |
| of seeing at least one failure becomes 99\pct, and none of these test |
| runs fail, |
| there is only 1\pct\ probability of this ``success'' being due to dumb luck. |
| And if we plug $f=0.1$ into |
| \cref{eq:debugging:Binomial Failure Rate} and vary $n$, |
| we find that 43 runs gives us a 98.92\pct\ chance of at least one test failing |
| given the original 10\pct\ per-test failure rate, |
| while 44 runs gives us a 99.03\pct\ chance of at least one test failing. |
| So if we run the test on our fix 44 times and see no failures, there |
| is a 99\pct\ probability that our fix really did help. |
| |
| But repeatedly plugging numbers into |
| \cref{eq:debugging:Binomial Failure Rate} |
| can get tedious, so let's solve for $n$: |
| |
| \begin{align} |
| F_n & = 1-\left(1-f\right)^n \\ |
| 1 - F_n & = \left(1-f\right)^n \\ |
| \log \left(1 - F_n\right) & = n \; \log \left(1 - f\right) |
| \end{align} |
| |
| Finally the number of tests required is given by: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - F_n\right)}{\log\left(1 - f\right)} |
| \label{eq:debugging:Binomial Number of Tests Required} |
| \end{equation} |
| |
| Plugging $f=0.1$ and $F_n=0.99$ into |
| \cref{eq:debugging:Binomial Number of Tests Required} |
| gives 43.7, meaning that we need 44 consecutive successful test |
| runs to be 99\pct\ certain that our fix was a real improvement. |
| This matches the number obtained by the previous method, which |
| is reassuring. |
| |
| \QuickQuiz{ |
| In \cref{eq:debugging:Binomial Number of Tests Required}, |
| are the logarithms base-10, base-2, or base-$\euler$? |
| }\QuickQuizAnswer{ |
| It does not matter. |
| You will get the same answer no matter what base of logarithms |
| you use because the result is a pure ratio of logarithms. |
| The only constraint is that you use the same base for both |
| the numerator and the denominator. |
| }\QuickQuizEnd |
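
Those who prefer code to repeated hand calculation might instead use
something like the following sketch of
\cref{eq:debugging:Binomial Number of Tests Required},
which is again not part of this book's code samples:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Runs needed for confidence Fn given failure rate f. */
static double runs_needed(double f, double Fn)
{
  return log(1.0 - Fn) / log(1.0 - f);
}

int main(void)
{
  double n = runs_needed(0.1, 0.99);

  /* Prints "43.7 -> 44 runs". */
  printf("%.1f -> %.0f runs\n", n, ceil(n));
  return 0;
}
\end{verbatim}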
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/debugging/BinomialNRuns}} |
| \caption{Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \label{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \end{figure} |
| |
| \Cref{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| shows a plot of this function. |
| Not surprisingly, the less frequently each test run fails, the more |
| test runs are required to be 99\pct\ confident that the bug has been |
| at least partially fixed. |
| If the bug caused the test to fail only 1\pct\ of the time, then a |
| mind-boggling 458 test runs are required. |
| As the failure probability decreases, the number of test runs required |
| increases, going to infinity as the failure probability goes to zero. |
| |
| The moral of this story is that when you have found a rarely occurring |
| bug, your testing job will be much easier if you can come up with |
| a carefully targeted test (or ``reproducer'') with a much higher failure rate. |
| For example, if your reproducer raised the failure rate from 1\pct\ |
| to 30\pct, then the number of runs required for 99\pct\ confidence |
| would drop from 458 to a more tractable 13. |
| |
| But these thirteen test runs would only give you 99\pct\ confidence that |
| your fix had produced ``some improvement''. |
| Suppose you instead want to have 99\pct\ confidence that your fix reduced |
| the failure rate by an order of magnitude. |
| How many failure-free test runs are required? |
| |
| An order of magnitude improvement from a 30\pct\ failure rate would be |
| a 3\pct\ failure rate. |
| Plugging these numbers into |
| \cref{eq:debugging:Binomial Number of Tests Required} yields: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - 0.99\right)}{\log\left(1 - 0.03\right)} = 151.2 |
| \end{equation} |
| |
| So our order of magnitude improvement requires roughly an order of |
| magnitude more testing. |
| Certainty is impossible, and high probabilities are quite expensive. |
| This is why creating high-quality reproducers is so important: |
| Making tests run more quickly and making failures more |
| probable are essential skills in the development of highly reliable |
| software. |
| These skills will be covered in |
| \cref{sec:debugging:Hunting Heisenbugs}. |
| |
| \subsection{Statistics Abuse for Discrete Testing} |
| \label{sec:debugging:Statistics Abuse for Discrete Testing} |
| |
| But suppose that you have a continuous test that fails about three |
| times every ten hours, and that you fix the bug that you believe was |
| causing the failure. |
| How long do you have to run this test without failure to be 99\pct\ certain |
| that you reduced the probability of failure? |
| |
| Without doing excessive violence to statistics, we could simply |
| redefine a one-hour run to be a discrete test that has a 30\pct\ |
| probability of failure. |
Then the results in the previous section tell us that if the test
| runs for 13 hours without failure, there is a 99\pct\ probability that |
| our fix actually improved the program's reliability. |
| |
| A dogmatic statistician might not approve of this approach, but the sad |
| fact is that the errors introduced by this sort of statistical abuse are |
| usually quite small compared to the errors in your failure-rate estimates. |
| Nevertheless, the next section takes a more rigorous approach. |
| |
| \subsection{Statistics for Continuous Testing} |
\label{sec:debugging:Statistics for Continuous Testing}
| |
| The fundamental formula for failure probabilities is the Poisson |
| distribution: |
| |
| \begin{equation} |
| F_m = \frac{\lambda^m}{m!} \euler^{-\lambda} |
| \label{eq:debugging:Poisson Probability} |
| \end{equation} |
| |
Here $F_m$ is the probability of $m$ failures in the test and
$\lambda$ is the expected number of failures over the duration of the test.
| A rigorous derivation may be found in any advanced probability |
| textbook, for example, Feller's classic ``An Introduction to Probability |
| Theory and Its Applications''~\cite{Feller58}, while a more |
| intuitive derivation may be found in the first edition of |
| this book~\cite[Equations 11.8--11.26]{McKenney2014ParallelProgramming-e1}. |
| |
| Let's rework the example from |
| \cref{sec:debugging:Statistics Abuse for Discrete Testing} |
| using the Poisson distribution. |
| Recall that this example involved an alleged fix for a bug with a 30\pct\ |
| failure rate per hour. |
| If we need to be 99\pct\ certain that the fix actually reduced the failure |
| rate, how long an error-free test run is required? |
| In this case, $m$ is zero, so that |
| \cref{eq:debugging:Poisson Probability} reduces to: |
| |
| \begin{equation} |
| F_0 = \euler^{-\lambda} |
| \end{equation} |
| |
Setting $F_0$ to 0.01 and solving for $\lambda$ results in:
| |
| \begin{equation} |
| \lambda = - \ln 0.01 = 4.6 |
| \end{equation} |
| |
Because we get $0.3$ failures per hour, the number of hours required
is $4.6/0.3 = 15.3$, which is within 20\pct\ of the 13 hours
calculated using the method in
\cref{sec:debugging:Statistics Abuse for Discrete Testing}.
Given that you normally won't know your failure rate to anywhere near
20\pct, the simpler method described in
\cref{sec:debugging:Statistics Abuse for Discrete Testing}
is almost always good and sufficient.
| |
| However, those wanting to learn more about statistics for continuous |
| testing are encouraged to read on. |
| |
| More generally, if we have $n$ failures per unit time, and we want to |
| be $P$\pct\ certain that a fix reduced the failure rate, we can use the |
| following formula: |
| |
| \begin{equation} |
| T = - \frac{1}{n} \ln \frac{100 - P}{100} |
| \label{eq:debugging:Error-Free Test Duration} |
| \end{equation} |
| |
| \QuickQuiz{ |
| Suppose that a bug causes a test failure three times per hour |
| on average. |
| How long must the test run error-free to provide 99.9\pct\ |
| confidence that the fix significantly reduced the probability |
| of failure by at least a little bit? |
| }\QuickQuizAnswer{ |
| We set $n$ to $3$ and $P$ to $99.9$ in |
| \cref{eq:debugging:Error-Free Test Duration}, resulting in: |
| |
| \begin{equation} |
| T = - \frac{1}{3} \ln \frac{100 - 99.9}{100} = 2.3 |
| \end{equation} |
| |
| If the test runs without failure for 2.3 hours, we can be 99.9\pct\ |
| certain that the fix reduced (by at least some small amount) |
| the probability of failure. |
| }\QuickQuizEnd |
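
The following sketch evaluates
\cref{eq:debugging:Error-Free Test Duration}
for both of these examples, with the first result coming to 15.4 hours
because it avoids rounding $\lambda$ to 4.6:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Hours of error-free testing needed, given n
 * failures per hour and desired confidence P. */
static double test_duration(double n, double P)
{
  return -log((100.0 - P) / 100.0) / n;
}

int main(void)
{
  printf("%.1f hours\n", test_duration(0.3, 99.0));
  printf("%.1f hours\n", test_duration(3.0, 99.9));
  return 0; /* Prints 15.4 hours, then 2.3 hours. */
}
\end{verbatim}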
| |
| As before, the less frequently the bug occurs and the greater the |
| required level of confidence, the longer the required error-free test run. |
| |
| Suppose that a given test fails about once every hour, but after a bug |
| fix, a 24-hour test run fails only twice. |
Assuming that the failures stemming from this bug are random occurrences,
| what is the probability of a false negative? |
| In other words, how confident are we that the alleged fix actually had |
| some effect on the bug? |
| This probability may be calculated by summing |
| \cref{eq:debugging:Poisson Probability} as follows: |
| |
| \begin{equation} |
| F_0 + F_1 + \dots + F_{m - 1} + F_m = |
| \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \end{equation} |
| |
| This is the Poisson cumulative distribution function, which can be |
| written more compactly as: |
| |
| \begin{equation} |
| F_{i \le m} = \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \label{eq:debugging:Possion CDF} |
| \end{equation} |
| |
| Here $m$ is the actual number of errors in the long test run with the |
alleged fix applied (in this case, two errors) and $\lambda$ is the
expected number of errors in the long test run prior to applying the
fix (in this case, 24 errors).
Plugging $m=2$ and $\lambda=24$ into this expression gives the probability
of a false negative (that is, of two or fewer failures) as about
$1.2 \times 10^{-8}$.
In other words, we have a high level of confidence
that the fix actually had some relationship to the bug.\footnote{
| Of course, this result in no way excuses you from finding and |
| fixing the bug(s) causing the remaining two failures!} |
| |
| It is often more convenient to work with statistical confidences. |
| When the probability of a false negative is low, the confidence is high, |
| giving this result: |
| |
| \begin{equation} |
| C_{m,\lambda} = 1 - \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \label{eq:debugging:Possion Confidence} |
| \end{equation} |
| |
| Continuing our example with $m=2$ and $\lambda=24$, this gives a confidence |
| of about 0.999999988, or equivalently, 99.9999988\pct. |
| This level of confidence should satisfy all but the purest of purists. |
| |
| But what if we are interested not in ``some relationship'' to the bug, |
| but instead in at least an order of magnitude \emph{reduction} in its |
| expected frequency of occurrence? |
Then we divide $\lambda$ by ten, and plug $m=2$ and $\lambda=2.4$ into
\cref{eq:debugging:Possion Confidence},
which gives but a 43.0\pct\ confidence level.
| This illustrates the sad fact that increasing either statistical |
| confidence or degree of improvement, let alone both, can be quite |
| expensive. |
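
Those who prefer C to symbolic math packages can compute the Poisson
CDF by direct summation, as in the following sketch, which reproduces
both the $1.2 \times 10^{-8}$ false-negative probability and the
43.0\pct\ confidence level:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Poisson CDF: probability of m or fewer events,
 * given an expectation of lambda events. */
static double poisson_cdf(int m, double lambda)
{
  double term = exp(-lambda); /* i = 0 term. */
  double sum = term;

  for (int i = 1; i <= m; i++) {
    term *= lambda / i; /* lambda^i e^-lambda / i! */
    sum += term;
  }
  return sum;
}

int main(void)
{
  /* False negative: ~1.2e-8. */
  printf("%.3g\n", poisson_cdf(2, 24.0));
  /* Confidence of 10x improvement: ~0.430. */
  printf("%.4f\n", 1.0 - poisson_cdf(2, 2.4));
  return 0;
}
\end{verbatim}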
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Doing the summation of all the factorials and exponentials |
| is a real pain. |
| Isn't there an easier way? |
| }\QuickQuizAnswerB{ |
| One approach is to use the open-source symbolic manipulation |
| program named ``maxima''. |
| Once you have installed this program, which is a part of many |
| Linux distributions, you can run it and give the |
| \co{load(distrib);} command followed by any number of |
| \co{bfloat(cdf_poisson(m,lambda));} commands, where the \co{m} is |
	replaced by the value of $m$ (the actual number of failures in the
	actual test of the alleged fix, usually zero) and the \co{lambda}
| is replaced by the value of $\lambda$ (the expected number of |
| failures in that same test prior to applying the alleged fix). |
| You can adjust for the desired improvement, for example, if the |
| run prior to the fix had 24 failures and you are looking for an |
| improvement of at least an order of magnitude, then set $\lambda$ |
| to 2.4 instead of 24. |
| |
| In particular, the \co{bfloat(cdf_poisson(2,24));} command |
| results in \co{1.181617112359357b-8}, which matches the value |
| given by \cref{eq:debugging:Possion CDF}. |
	Similarly, referring to
	\cref{eq:debugging:Possion Confidence},
	the \co{bfloat(1-cdf_poisson(2,2.4));} command results in
	approximately \co{4.3029b-1}, which again matches the 43.0\pct\
	confidence level computed earlier.
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.25} |
| \rowcolors{3}{}{lightgray} |
| \small |
| \centering |
| \begin{tabular}{rrrr} |
| \toprule |
| & \multicolumn{3}{c}{Improvement} \\ |
| \cmidrule(l){2-4} |
| Certainty (\%) |
| & Any |
| & 10x |
| & 100x \\ |
| \cmidrule{1-1} \cmidrule(l){2-4} |
| 90.0 & 2.3 & 23.0 & 230.0 \\ |
| 95.0 & 3.0 & 30.0 & 300.0 \\ |
| 99.0 & 4.6 & 46.1 & 460.5 \\ |
| 99.9 & 6.9 & 69.1 & 690.7 \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Human-Friendly Poisson-Function Display} |
| \label{tab:debugging:Human-Friendly Poisson-Function Display} |
| \end{table} |
| |
	Another approach is to recognize that in the real world,
| it is not all that useful to compute (say) the duration of a test |
| having two or fewer errors that would give a 76.8\pct\ confidence |
| of a 349.2x improvement in reliability. |
| Instead, human beings tend to focus on specific values, for |
| example, a 95\pct\ confidence of a 10x improvement. |
	People also greatly prefer error-free test runs, and so should
	you, because error-free runs reduce your required test durations.
| Therefore, it is quite possible that the values in |
| \cref{tab:debugging:Human-Friendly Poisson-Function Display} |
| will suffice. |
| Simply look up the desired confidence and degree of improvement, |
| and the resulting number will give you the required |
| error-free test duration in terms of the expected time for |
| a single error to appear. |
| So if your pre-fix testing suffered one failure per hour, and the |
| powers that be require a 95\pct\ confidence of a 10x improvement, |
| you need a 30-hour error-free run. |
| |
| Alternatively, you can use the rough-and-ready method described in |
| \cref{sec:debugging:Statistics Abuse for Discrete Testing}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But wait!!! |
| Given that there has to be \emph{some} number of failures |
| (including the possibility of zero failures), shouldn't |
| \cref{eq:debugging:Possion CDF} |
| approach the value $1$ as $m$ goes to infinity? |
| }\QuickQuizAnswerE{ |
| Indeed it should. |
| And it does. |
| |
| To see this, note that $\euler^{-\lambda}$ does not depend on $i$, |
| which means that it can be pulled out of the summation as follows: |
| |
| \begin{equation} |
| \euler^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^i}{i!} |
| \end{equation} |
| |
| The remaining summation is exactly the Taylor series for |
| $\euler^\lambda$, yielding: |
| |
| \begin{equation} |
| \euler^{-\lambda} \euler^\lambda |
| \end{equation} |
| |
| The two exponentials are reciprocals, and therefore cancel, |
| resulting in exactly $1$, as required. |
| }\QuickQuizEndE |
| } |
| |
| The Poisson distribution is a powerful tool for analyzing test results, |
| but the fact is that in this last example there were still two remaining |
| test failures in a 24-hour test run. |
| Such a low failure rate results in very long test runs. |
| The next section discusses counter-intuitive ways of improving this situation. |
| |
| \subsection{Hunting Heisenbugs} |
| \label{sec:debugging:Hunting Heisenbugs} |
| |
| This line of thought also helps explain heisenbugs: |
| Adding tracing and assertions can easily reduce the probability |
| of a bug appearing, which |
is why extremely lightweight tracing and assertion mechanisms are
| so critically important. |
| |
The term ``\IXB{heisenbug}'' was inspired by the \pplsur{Werner}{Heisenberg}
\IX{Uncertainty Principle} from quantum physics, which states that
it is impossible to
exactly quantify a given particle's position and momentum at any given
point in time~\cite{WeinerHeisenberg1927Uncertain}.
Any attempt to more accurately measure that particle's position will
result in increased uncertainty of its momentum and vice versa.
Similarly, attempts to track down
a heisenbug cause its symptoms to radically change or even disappear
completely.\footnote{
| The term ``heisenbug'' is a misnomer, as most heisenbugs are |
| fully explained by the \emph{observer effect} from classical |
| physics. |
| Nevertheless, the name has stuck.} |
| Of course, adding debugging overhead can and sometimes does make bugs |
| more probable. |
| But developers are more likely to remember the frustration of a |
| disappearing heisenbug than the joy inspired by the bug becoming |
| more easily reproduced! |
| |
| If the field of physics inspired the name of this problem, it is only |
| fair that the field of physics should inspire the solution. |
| Fortunately, particle physics is up to the task: |
| Why not create an \IXBhy{anti-}{heisenbug} to annihilate the heisenbug? |
| Or, perhaps more accurately, to annihilate the heisen-ness of |
| the heisenbug? |
| Although producing an anti-heisenbug for a given heisenbug is more an |
| art than a science, the following sections describe a number of ways to |
| do just that: |
| |
| \begin{enumerate} |
| \item Add delay to race-prone regions (\cref{sec:debugging:Add Delay}). |
| \item Increase workload intensity |
| (\cref{sec:debugging:Increase Workload Intensity}). |
| \item Isolate suspicious subsystems |
| (\cref{sec:debugging:Isolate Suspicious Subsystems}). |
| \item Make rare events less rare |
| (\cref{sec:debugging:Make Rare Events Less Rare}). |
| \item Count near misses (\cref{sec:debugging:Count Near Misses}). |
| \item Proactive hunting techniques |
| (\cref{sec:debugging:Proactive Hunting Techniques}). |
| \end{enumerate} |
| |
| These are followed by discussion in |
| \cref{sec:debugging:Heisenbug Discussion}. |
| |
| \subsubsection{Add Delay} |
| \label{sec:debugging:Add Delay} |
| |
| Consider the count-lossy code in |
| \cref{sec:count:Why Isn't Concurrent Counting Trivial?}. |
| Adding \co{printf()} statements will likely greatly reduce or even |
| eliminate the lost counts. |
| However, converting the load-add-store sequence to a load-add-delay-store |
| sequence will greatly increase the incidence of lost counts (try it!). |
| Once you spot a bug involving a \IX{race condition}, it is frequently possible |
| to create an anti-heisenbug by adding delay in this manner. |
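
For those who would like to try it, the following sketch shows one way
of setting up such a load-add-delay-store race, with the thread count,
repetition count, and delay all being arbitrary choices:

\begin{verbatim}
#include <poll.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define NREPS 1000

unsigned long counter; /* Deliberately unprotected. */

static void *incrementer(void *arg)
{
  for (int i = 0; i < NREPS; i++) {
    unsigned long tmp = counter; /* Load. */

    poll(NULL, 0, 1);  /* Delay widens race window. */
    counter = tmp + 1; /* Store. */
  }
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];

  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, incrementer, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);

  /* Far fewer than 2,000 counts will be reported. */
  printf("counter = %lu\n", counter);
  return 0;
}
\end{verbatim}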
| |
Of course, this raises the question of how to find the race condition in
| the first place. |
| Although very lucky developers might accidentally create delay-based |
| anti-heisenbugs when adding debug code, this is in general a dark art. |
| Nevertheless, there are a number of things you can do to find your race |
| conditions. |
| |
| One approach is to recognize that race conditions often end up corrupting |
| some of the data involved in the race. |
| It is therefore good practice to double-check the synchronization of |
| any corrupted data. |
| Even if you cannot immediately recognize the race condition, adding |
| delay before and after accesses to the corrupted data might change |
| the failure rate. |
| By adding and removing the delays in an organized fashion (e.g., binary |
| search), you might learn more about the workings of the race condition. |
| |
| \QuickQuiz{ |
	How is this approach supposed to help if the bug corrupted some
	unrelated pointer, which then caused the corruption that you
	actually observed???
| }\QuickQuizAnswer{ |
| Indeed, that can happen. |
| Many CPUs have hardware-debugging facilities that can help you |
| locate that unrelated pointer. |
| Furthermore, if you have a core dump, you can search the core |
| dump for pointers referencing the corrupted region of memory. |
| You can also look at the data layout of the corruption, and |
| check pointers whose type matches that layout. |
| |
| You can also step back and test the modules making up your |
| program more intensively, which will likely confine the corruption |
| to the module responsible for it. |
| If this makes the corruption vanish, consider adding additional |
| argument checking to the functions exported from each module. |
| |
	Nevertheless, this is a hard problem, which is why I used the
	words ``dark art''.
| }\QuickQuizEnd |
| |
| Another important approach is to |
| vary the software and hardware configuration and look for statistically |
| significant differences in failure rate. |
| For example, back in the 1990s, it was common practice to test on systems |
| having CPUs running at different clock rates, which tended to make some |
| types of race conditions more probable. |
| One way of getting a similar effect today is to test on multi-socket |
| systems, thus incurring the large delays described in |
| \cref{sec:cpu:Overheads}. |
| |
| However you choose to add delays, you can then look more intensively at |
| the code implicated by those delays that make the greatest difference |
| in failure rate. |
| It might be helpful to test that code in isolation, for example. |
| |
| One important aspect of software configuration is the history of |
| changes, which is why \co{git bisect} is so useful. |
| Bisection of the change history can provide very valuable clues as |
| to the nature of the heisenbug, in this case presumably by locating |
| a commit that shows a change in the software's response to the addition |
| or removal of a given delay. |
| |
| \QuickQuiz{ |
| But I did the bisection, and ended up with a huge commit. |
| What do I do now? |
| }\QuickQuizAnswer{ |
| A huge commit? |
| Shame on you! |
| This is but one reason why you are supposed to keep the commits small. |
| |
| And that is your answer: |
| Break up the commit into bite-sized pieces and bisect the pieces. |
| In my experience, the act of breaking up the commit is often |
| sufficient to make the bug painfully obvious. |
| }\QuickQuizEnd |
| |
| Once you locate the suspicious section of code, you can then introduce |
| delays to attempt to increase the probability of failure. |
| As we have seen, increasing the probability of failure makes it much |
| easier to gain high confidence in the corresponding fix. |
| |
| However, it is sometimes quite difficult to track down the problem using |
| normal debugging techniques. |
| The following sections present some other alternatives. |
| |
| \subsubsection{Increase Workload Intensity} |
| \label{sec:debugging:Increase Workload Intensity} |
| |
| It is often the case that a given test suite places relatively |
| low stress on a given subsystem, so that a small change in timing |
| can cause a heisenbug to disappear. |
| One way to create an anti-heisenbug for this case is to increase |
| the workload intensity, which has a good chance of increasing the |
| bug's probability. |
| If the probability is increased sufficiently, it may be possible to |
| add lightweight diagnostics such as tracing without causing the |
| bug to vanish. |
| |
| How can you increase the workload intensity? |
| This depends on the program, but here are some things to try: |
| |
| \begin{enumerate} |
| \item Add more CPUs. |
| \item If the program uses networking, add more network adapters |
| and more or faster remote systems. |
| \item If the program is doing heavy I/O when the problem occurs, |
| either (1) add more storage devices, (2) use faster storage |
| devices, for example, substitute SSDs for disks, |
| or (3) use a RAM-based filesystem to substitute main |
| memory for mass storage. |
| \item Change the size of the problem, for example, if doing a parallel |
| matrix multiply, change the size of the matrix. |
| Larger problems may introduce more complexity, but smaller |
| problems often increase the level of contention. |
| If you aren't sure whether you should go large or go small, |
| just try both. |
| \end{enumerate} |
| |
| However, it is often the case that the bug is in a specific subsystem, |
| and the structure of the program limits the amount of stress that can |
| be applied to that subsystem. |
| The next section addresses this situation. |
| |
| \subsubsection{Isolate Suspicious Subsystems} |
| \label{sec:debugging:Isolate Suspicious Subsystems} |
| |
| If the program is structured such that it is difficult or impossible |
| to apply much stress to a subsystem that is under suspicion, |
| a useful anti-heisenbug is a stress test that tests that subsystem in |
| isolation. |
| The Linux kernel's \co{rcutorture} module takes exactly this approach with RCU\@: |
| Applying more stress to RCU than is feasible in a production environment |
| increases the probability that RCU bugs will be found during testing |
| rather than in production.\footnote{ |
| Though sadly not increased to probability one.} |
| |
| In fact, when creating a parallel program, it is wise to stress-test |
| the components separately. |
| Creating such component-level stress tests can seem like a waste of time, |
| but a little bit of component-level testing can save a huge amount |
| of system-level debugging. |
| |
| \subsubsection{Make Rare Events Less Rare} |
| \label{sec:debugging:Make Rare Events Less Rare} |
| |
| Heisenbugs are sometimes due to rare events, such as |
| memory-allocation failure, conditional-lock-acquisition failure, |
| CPU-hotplug operations, timeouts, packet losses, large-scale changes |
| in state, and so on. |
| The corresponding anti-heisenbug is thus simply to make these rare events |
| happen much more frequently. |
| For example, the TREE03 rcutorture scenario waits only 200~milliseconds |
| between CPU-hotplug operations. |
| For another example, most of the rcutorture scenarios emulate RCU |
| callback flooding every minute. |
| For a final example, a memory-management stress test for x86 CPUs might |
| do well to frequently transition an aligned 2\,MB block of memory back |
| and forth between 2\,MB and 4\,KB pages. |
| |
| Another way to construct an anti-heisenbug for this class of heisenbug |
| is to introduce spurious failures. |
| For example, instead of invoking \co{malloc()} directly, invoke |
| a wrapper function that uses a random number to decide whether |
| to return \co{NULL} unconditionally on the one hand, or to actually |
| invoke \co{malloc()} and return the resulting pointer on the other. |
| Inducing spurious failures is an excellent way to bake robustness into |
| sequential programs as well as parallel programs. |
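
The following sketch shows one way of constructing such a wrapper,
with the \co{inject_percent} knob being a made-up name for the
probability of spurious failure:

\begin{verbatim}
#include <stdlib.h>

int inject_percent; /* 0 = off, 10 = fail 10% of calls. */

/* Fault-injecting malloc() wrapper. */
void *malloc_shouldfail(size_t size)
{
  /* Note: A multithreaded caller would instead want
   * a per-thread pseudo-random-number generator. */
  if (inject_percent && rand() % 100 < inject_percent)
    return NULL; /* Spurious failure. */
  return malloc(size);
}
\end{verbatim}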
| |
| \QuickQuiz{ |
| Why don't conditional-locking primitives provide this |
| spurious-failure functionality? |
| }\QuickQuizAnswer{ |
| There are locking algorithms that depend on conditional-locking |
| primitives telling them the truth. |
| For example, if conditional-lock failure signals that |
| some other thread is already working on a given job, |
| spurious failure might cause that job to never get done, |
| possibly resulting in a hang. |
| |
| However, this is a design choice. |
| You could create a conditional-locking primitive that was subject |
| to spurious failure, and some have. |
| Therefore, when moving to a new software environment, check |
| your assumptions! |
| }\QuickQuizEnd |
| |
| \subsubsection{Count Near Misses} |
| \label{sec:debugging:Count Near Misses} |
| |
| Bugs are often all-or-nothing things, so that a bug either happens |
| or not, with nothing in between. |
| However, it is sometimes possible to define a \emph{near miss} where |
| the bug does not result in a failure, but has likely manifested. |
| For example, suppose your code is making a robot walk. |
| The robot's falling down constitutes a bug in your program, but |
| stumbling and recovering might constitute a near miss. |
| If the robot falls over only once per hour, but stumbles every few |
| minutes, you might be able to speed up your debugging progress by |
| counting the number of stumbles in addition to the number of falls. |
| |
| In concurrent programs, timestamping can sometimes be used to detect |
| near misses. |
| For example, locking primitives incur significant delays, so if there is |
| a too-short delay between a pair of operations that are supposed |
| to be protected by different acquisitions of the same lock, this too-short |
| delay might be counted as a near miss.\footnote{ |
| Of course, in this case, you might be better off using |
| whatever \co{lock_held()} primitive is available |
| in your environment. |
| If there isn't a \co{lock_held()} primitive, create one!} |
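
The following sketch shows one way that such timestamp-based near-miss
counting might look, keeping in mind that \co{MIN_HANDOFF_NS} is a
made-up placeholder that would need to be calibrated to your hardware
and locking primitives:

\begin{verbatim}
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Calibrate for your system! */
#define MIN_HANDOFF_NS 100

static uint64_t last_op_ns;
static pthread_t last_op_thread;
static unsigned long near_misses;

static uint64_t now_ns(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL +
         ts.tv_nsec;
}

/* Call once per operation that is supposed to be
 * protected by its own acquisition of the lock.
 * The accesses are unsynchronized, so the count
 * is a heuristic, not an exact statistic. */
void op_check_near_miss(void)
{
  uint64_t t = now_ns();

  if (last_op_ns &&
      !pthread_equal(last_op_thread, pthread_self()) &&
      t - last_op_ns < MIN_HANDOFF_NS)
    near_misses++; /* Suspiciously fast handoff. */
  last_op_ns = t;
  last_op_thread = pthread_self();
}
\end{verbatim}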
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{debugging/RCUnearMiss}} |
| \caption{RCU Errors and Near Misses} |
| \label{fig:debugging:RCU Errors and Near Misses} |
| \end{figure} |
| |
| For example, a low-probability bug in RCU priority boosting occurred |
| roughly once every hundred hours of focused \co{rcutorture} testing. |
| Because it would take almost 500 hours of failure-free testing to be |
| 99\pct\ certain that the bug's probability had been significantly reduced, |
| the \co{git bisect} process |
| to find the failure would be painfully slow---or would require an extremely |
| large test farm. |
| Fortunately, the RCU operation being tested included not only a wait for |
| an RCU grace period, but also a previous wait for the grace period to start |
| and a subsequent wait for an RCU callback to be |
| invoked after completion of the RCU grace period. |
This distinction between an \co{rcutorture} error and a near miss is
| shown in |
| \cref{fig:debugging:RCU Errors and Near Misses}. |
| To qualify as a full-fledged error, an RCU read-side critical section |
| must extend from the \co{call_rcu()} that initiated a grace period, |
| through the remainder of the previous grace period, through the |
| entirety of the grace period initiated by the \co{call_rcu()} |
| (denoted by the region between the jagged lines), and |
| through the delay from the end of that grace period to the callback |
| invocation, as indicated by the ``Error'' arrow. |
| However, the formal definition of RCU prohibits RCU read-side critical |
| sections from extending across a single grace period, as indicated by |
| the ``Near Miss'' arrow. |
This suggests using near misses as the error condition; however, this
| can be problematic because different CPUs can have different opinions |
| as to exactly where a given |
| grace period starts and ends, as indicated by the jagged lines.\footnote{ |
| In real life, these lines can be much more jagged because idle |
| CPUs can be completely unaware of a great many recent grace |
| periods.} |
| Using the near misses as the error condition could therefore result |
| in false positives, which need to be avoided in the automated |
| \co{rcutorture} testing. |
| |
| By sheer dumb luck, \co{rcutorture} happens to include some statistics that |
| are sensitive to the near-miss version of the grace period. |
| As noted above, these statistics are subject to false positives due to |
| their unsynchronized access to RCU's state variables, |
| but these false positives turn out to be extremely rare on strongly |
| ordered systems such as the IBM mainframe and x86, occurring less than |
| once per thousand hours of testing. |
| |
| These near misses occurred roughly once per hour, about two orders of |
| magnitude more frequently than the actual errors. |
| Use of these near misses allowed the bug's root cause to be identified |
| in less than a week and a high degree of confidence in the fix to be |
| built in less than a day. |
| In contrast, excluding the near misses in favor of the real errors would |
| have required months of debug and validation time. |
| |
To sum up near-miss counting, the general approach is to replace counting
of infrequent failures with counting of more-frequent near misses that
are believed to be correlated with those failures.
These near misses can be considered an anti-heisenbug for the real
failure's heisenbug because the near misses, being more frequent, are
likely to be more robust in the face of changes to your code, for
example, the changes you make to add debugging code.
| |
| \subsubsection{Proactive Hunting Techniques} |
| \label{sec:debugging:Proactive Hunting Techniques} |
| |
Most of the anti-heisenbug techniques discussed in the preceding sections
are backward-looking.
| After all, prior experience is the best guide to knowing which regions |
| of code are prone to race conditions, what aspects of the workload |
| can most profitably be increased in intensity, which subsystems are |
| deserving of suspicion, which rare events are important, and what near |
| misses are good proxies for actual failures. |
| |
| What can you do to get ahead of the game? |
| |
| Getting ahead of the anti-heisenbug game is even more of an art than |
| constructing an anti-heisenbug for a specific situation, but here |
| are some techniques that can be helpful: |
| |
| \begin{enumerate} |
| \item Add delay to sections of concurrent code that required the |
| most analysis, that needed formal verification, or that |
| deviated the most from common concurrency practice. |
| \item Analyze trends in workload intensity, and use the results to |
| guide increasing the intensity of your testing. |
| \item Be most suspicious of new code, especially if it is your new |
| code. |
| \item Instrument your workload, looking for complex operations that |
| occur frequently enough to be an uptime problem but rarely |
| enough to avoid much exposure in your current testing. |
| \item Look for near misses in failure-recovery code and on slowpaths. |
| \end{enumerate} |
| |
| Finally, and most importantly, pay special attention to code that people |
| are the most proud of. |
| After all, people are most likely to be proud of code that is unusual, |
| which means that its bugs (and the bugs in the code that it uses) are |
| likely to escape your usual testing efforts. |
| |
| \subsubsection{Heisenbug Discussion} |
| \label{sec:debugging:Heisenbug Discussion} |
| |
| The alert reader might have noticed that this section was fuzzy and |
| qualitative, in stark contrast to the precise mathematics of |
| \cref{sec:debugging:Statistics for Discrete Testing,% |
| sec:debugging:Statistics Abuse for Discrete Testing,% |
sec:debugging:Statistics for Continuous Testing}.
| If you love precision and mathematics, you may be disappointed to |
| learn that the situations to which this section applies are far |
| more common than those to which the preceding sections apply. |
| |
| In fact, the common case is that although you might have reason to believe |
| that your code has bugs, you have no idea what those bugs are, what |
| causes them, how likely they are to appear, or what conditions affect |
| their probability of appearance. |
| In this all-too-common case, statistics cannot help you.\footnote{ |
	Although if you know what your program is supposed to do and
	if your program is small enough (both of which are less likely
	than you might think), then the formal-verification tools described in
| \cref{chp:Formal Verification} |
| can be helpful.} |
| That is to say, statistics cannot help you \emph{directly}. |
| But statistics can be of great indirect help---\emph{if} you have the |
| humility required to admit that you make mistakes, that you can reduce the |
| probability of these mistakes (for example, by getting enough sleep), and |
| that the number and type of mistakes you made in the past is indicative of |
| the number and type of mistakes that you are likely to make in the future. |
| For example, I have a deplorable tendency to forget to write a small |
| but critical portion of the initialization code, and frequently get most |
| or even all of a parallel program correct---except for a stupid |
| omission in initialization. |
| Once I was willing to admit to myself that I am prone to this type of |
| mistake, it was easier (but not easy!\@) to force myself to double-check |
| my initialization code. |
| Doing this allowed me to find numerous bugs ahead of time. |
| |
| When your quick bug hunt morphs into a long-term quest, it is important |
| to log everything you have tried and what happened. |
| In the common case where the software is changing during the course of |
| your quest, make sure to record the exact version of the software to |
| which each log entry applies. |
| From time to time, reread the entire log in order to make connections |
| between clues encountered at different times. |
Such rereading is especially important upon encountering a surprising
test result.
For example, I once reread my log upon realizing that what I
thought was a failure of the hypervisor to schedule a vCPU was instead
an interrupt storm preventing that vCPU from making forward progress
on the interrupted code.
| If the code you are debugging is new to you, this log is also an |
| excellent place to document the relationships between code and data |
| structures. |
| Keeping a log when you are furiously chasing a difficult bug might seem |
| like needless paperwork, but it has on many occasions saved me from |
| debugging around and around in circles, which can waste far more time |
| than keeping a log ever could. |
| |
| Using Taleb's nomenclature~\cite{NassimTaleb2007BlackSwan}, |
| a white swan is a bug that we can reproduce. |
| We can run a large number of tests, use ordinary statistics to |
| estimate the bug's probability, and use ordinary statistics again |
| to estimate our confidence in a proposed fix. |
| An unsuspected bug is a black swan. |
| We know nothing about it, we have no tests that have yet caused it |
| to happen, and statistics is of no help. |
| Studying our own behavior, especially the number and types of mistakes |
| we make, can turn black swans into grey swans. |
| We might not know exactly what the bugs are, but we have some idea of |
| their number and maybe also of their type. |
| Ordinary statistics is still of no help (at least not until we are |
| able to reproduce one of the bugs), but robust\footnote{ |
| That is to say brutal.} |
| testing methods can be of great help. |
| The goal, therefore, is to use experience and good validation practices |
| to turn the black swans grey, focused testing and analysis to turn the |
| grey swans white, and ordinary methods to fix the white swans. |
| |
| That said, thus far, we have focused solely on bugs in the parallel program's |
| functionality. |
| However, performance is a first-class requirement for a parallel program. |
| Otherwise, why not write a sequential program? |
| To repurpose Kipling, our goal when writing parallel code is to fill |
the unforgiving second with sixty minutes' worth of distance run.
| The next section therefore discusses a number of performance bugs that |
| would be happy to thwart this Kiplingesque goal. |
| |
| \section{Performance Estimation} |
| \label{sec:debugging:Performance Estimation} |
| % |
| \epigraph{There are lies, damn lies, statistics, and benchmarks.} |
| {Unknown} |
| |
Parallel programs usually have performance and scalability requirements.
After all, if performance is not an issue, why not use a sequential
program?
| Ultimate performance and linear scalability might not be necessary, but |
| there is little use for a parallel program that runs slower than its |
| optimal sequential counterpart. |
| And there really are cases where every microsecond matters and every |
| nanosecond is needed. |
| Therefore, for parallel programs, insufficient performance is just as |
| much a bug as is incorrectness. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| That is ridiculous!!! |
| After all, isn't getting a late but correct answer always |
| better than getting an incorrect answer??? |
| }\QuickQuizAnswerB{ |
| This question fails to consider the option of choosing not to |
| compute the answer at all, and in doing so, also fails to consider |
| the costs of computing the answer. |
| For example, consider short-term weather forecasting, for which |
| accurate models exist, but which require large (and expensive) |
| clustered supercomputers, at least if you want to actually run |
| the model faster than the weather. |
| |
| And in this case, any performance bug that prevents the model from |
| running faster than the actual weather prevents any forecasting. |
| Given that the whole purpose of purchasing the large clustered |
| supercomputers was to forecast weather, if you cannot run the |
| model faster than the weather, you would be better off not running |
| the model at all. |
| |
| More severe examples may be found in the area of safety-critical |
| real-time computing. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But if you are going to put in all the hard work of parallelizing |
| an application, why not do it right? |
| Why settle for anything less than optimal performance and |
| linear scalability? |
| }\QuickQuizAnswerE{ |
| Although I do heartily salute your spirit and aspirations, |
| you are forgetting that there may be high costs due to delays |
| in the program's completion. |
| For an extreme example, suppose that a 30\pct\ performance shortfall |
| from a single-threaded application is causing one person to die |
| each day. |
| Suppose further that in a day you could hack together a |
| quick and dirty |
| parallel program that ran 50\pct\ faster on an eight-CPU system. |
| This is of course horrible scalability, given that the seven |
| additional CPUs provide only half a CPU's worth of additional |
| performance. |
| |
| But it might well be that constructing an optimal parallel |
| program would require four months of painstaking design, coding, |
| debugging, and tuning. |
| I would guess that more than 100 people would prefer the |
| quick and dirty version. |
| |
| Once that is done, \emph{maybe} it is worth creating an optimal |
| version. |
| But that depends on what else could be accomplished with this |
| same four months of effort. |
| For example, us carbon-based lifeforms just might feel that |
| saving more lives is more important than saving 6.5 CPUs. |
| |
| Fortunately, most developers of low-level concurrent code play |
| for much lower stakes, allowing much greater freedom to choose. |
| In addition, small improvements in low-level code can benefit |
| huge numbers of users, in many cases justifying considerable |
| investments in performance, scalability, and real-time response. |
| }\QuickQuizEndE |
| } |
| |
Validating a parallel program must therefore include validating its
| performance. |
| But validating performance means having a workload to run and performance |
| criteria with which to evaluate the program at hand. |
| These needs are often met by \emph{performance benchmarks}, which |
| are discussed in the next section. |
| |
| \subsection{Benchmarking} |
| \label{sec:debugging:Benchmarking} |
| |
| Frequent abuse aside, benchmarks are both useful and heavily used, |
| so it is not helpful to be too dismissive of them. |
| Benchmarks span the range from ad hoc test jigs to international |
| standards, but regardless of their level of formality, benchmarks |
| serve four major purposes: |
| |
| \begin{enumerate} |
| \item Providing a fair framework for comparing competing implementations. |
| \item Focusing competitive energy on improving implementations in ways |
| that matter to users. |
| \item Serving as example uses of the implementations being benchmarked. |
| \item Serving as a marketing tool to highlight your software |
| against your competitors' offerings. |
| \end{enumerate} |
| |
| Of course, the only completely fair framework is the intended |
| application itself. |
| So why would anyone who cared about fairness in benchmarking |
| bother creating imperfect benchmarks rather than simply |
| using the application itself as the benchmark? |
| |
| Running the actual application is in fact the best approach where it is practical. |
| Unfortunately, it is often impractical for the following reasons: |
| |
| \begin{enumerate} |
| \item The application might be proprietary, and you |
| might not have the right to run the intended application. |
| \item The application might require more hardware |
| than you have access to. |
| \item The application might use data that you cannot |
| access, for example, due to privacy regulations. |
| \item The application might take longer than is convenient to |
| reproduce a performance or scalability problem.\footnote{ |
| Microbenchmarks can help, but |
| please see \cref{sec:debugging:Microbenchmarking}.} |
| \end{enumerate} |
| |
| Creating a benchmark that approximates |
| the application can help overcome these obstacles. |
| A carefully constructed benchmark can help promote performance, |
| scalability, \IXh{energy}{efficiency}, and much else besides. |
| However, be careful to avoid investing too much into the benchmarking |
| effort. |
| It is after all important to invest at least a little into the |
| application itself~\cite{Gray91}. |
| |
| \subsection{Profiling} |
| \label{sec:debugging:Profiling} |
| |
| In many cases, a fairly small portion of your software is responsible |
| for the majority of the performance and scalability shortfall. |
| However, developers are notoriously unable to identify the actual |
| bottlenecks by inspection. |
For example, in the case of a kernel buffer allocator, all attention focused
on a search of a dense array, which turned out to represent only
a few percent of the allocator's execution time.
| An execution profile collected via a logic analyzer focused attention |
| on the cache misses that were actually responsible for the majority |
| of the problem~\cite{McKenney93}. |
| |
| An old-school but quite effective method of tracking down performance |
| and scalability bugs is to run your program under a debugger, |
| then periodically interrupt it, recording the stacks of all threads |
| at each interruption. |
| The theory here is that if something is slowing down your program, |
| it has to be visible in your threads' executions. |
| |
| That said, there are a number of tools |
| that will usually do a much better job of helping you to focus your |
| attention where it will do the most good. |
| Two popular choices are \co{gprof} and \co{perf}. |
| To use \co{perf} on a single-process program, prefix your command |
| with \co{perf record}, then after the command completes, type |
| \co{perf report}. |
| There is a lot of work on tools for performance debugging of multi-threaded |
| programs, which should make this important job easier. |
| Again, one good starting point is Brendan Gregg's blog.\footnote{ |
| \url{http://www.brendangregg.com/blog/}} |
| |
| \subsection{Differential Profiling} |
| \label{sec:debugging:Differential Profiling} |
| |
| Scalability problems will not necessarily be apparent unless you are running |
| on very large systems. |
| However, it is sometimes possible to detect impending scalability problems |
| even when running on much smaller systems. |
| One technique for doing this is called \emph{differential profiling}. |
| |
| The idea is to run your workload under two different sets of conditions. |
| For example, you might run it on two CPUs, then run it again on four |
| CPUs. |
| You might instead vary the load placed on the system, the number of |
| network adapters, the number of mass-storage devices, and so on. |
| You then collect profiles of the two runs, and mathematically combine |
| corresponding profile measurements. |
| For example, if your main concern is scalability, you might take the |
| ratio of corresponding measurements, and then sort the ratios into |
| descending numerical order. |
| The prime scalability suspects will then be sorted to the top of the |
| list~\cite{McKenney95a,McKenney99b}. |
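
For example, the following sketch combines a pair of hypothetical
profiles, with the function names and sample fractions invented purely
for illustration:

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

struct sample {
  const char *fn;
  double cpus2; /* Profile fraction on two CPUs. */
  double cpus4; /* Profile fraction on four CPUs. */
};

/* Invented numbers, for illustration only. */
static struct sample prof[] = {
  { "spinlock_slowpath", 0.05, 0.20 },
  { "hash_lookup",       0.40, 0.35 },
  { "memcpy",            0.10, 0.11 },
};

static int cmp_ratio_desc(const void *a, const void *b)
{
  const struct sample *sa = a, *sb = b;
  double ra = sa->cpus4 / sa->cpus2;
  double rb = sb->cpus4 / sb->cpus2;

  return (ra < rb) - (ra > rb);
}

int main(void)
{
  int n = sizeof(prof) / sizeof(prof[0]);

  qsort(prof, n, sizeof(prof[0]), cmp_ratio_desc);
  for (int i = 0; i < n; i++)
    printf("%8.2f %s\n",
           prof[i].cpus4 / prof[i].cpus2, prof[i].fn);
  return 0; /* Prime suspect prints first. */
}
\end{verbatim}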
| |
| Some tools such as \co{perf} have built-in differential-profiling |
| support. |
| |
| \subsection{Microbenchmarking} |
| \label{sec:debugging:Microbenchmarking} |
| |
| Microbenchmarking can be useful when deciding which algorithms or |
| data structures are worth incorporating into a larger body of software |
| for deeper evaluation. |
| |
| One common approach to microbenchmarking is to measure the time, |
| run some number of iterations of the code |
| under test, then measure the time again. |
| The difference between the two times divided by the number of iterations |
| gives the measured time required to execute the code under test. |
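
The following sketch illustrates this measure/iterate/measure pattern,
using a trivial stand-in for the code under test and including the
warm-up pass discussed below:

\begin{verbatim}
#include <stdio.h>
#include <time.h>

#define WARMUP 10000     /* Arbitrary choices. */
#define ITERS  100000000

static volatile unsigned long v;

static void code_under_test(void) /* Stand-in. */
{
  v++;
}

static double now(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
  double t0, t1;

  for (long i = 0; i < WARMUP; i++) /* Warm caches. */
    code_under_test();
  t0 = now();
  for (long i = 0; i < ITERS; i++)
    code_under_test();
  t1 = now();
  printf("%.2f ns/iteration\n",
         (t1 - t0) / ITERS * 1e9);
  return 0;
}
\end{verbatim}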
| |
| Unfortunately, this approach to measurement allows any number of errors |
| to creep in, including: |
| |
| \begin{enumerate} |
| \item The measurement will include some of the overhead of |
| the time measurement. |
| This source of error can be reduced to an arbitrarily small |
| value by increasing the number of iterations. |
| \item The first few iterations of the test might incur cache misses |
| or (worse yet) page faults that might inflate the measured |
| value. |
| This source of error can also be reduced by increasing the |
| number of iterations, or it can often be eliminated entirely |
| by running a few warm-up iterations before starting the |
| measurement period. |
| Most systems have ways of detecting whether a given process |
| incurred a page fault, and you should make use of this to |
| reject runs whose performance has been thus impeded. |
| \item Some types of interference, for example, random memory errors, |
| are so rare that they can be dealt with by running a number |
| of sets of iterations of the test. |
| If the level of interference was statistically significant, |
| any performance outliers could be rejected statistically. |
| \item Any iteration of the test might be interfered with by other |
| activity on the system. |
| Sources of interference include other applications, system |
| utilities and daemons, device interrupts, firmware interrupts |
| (including system management interrupts, or SMIs), |
| virtualization, memory errors, and much else besides. |
| Assuming that these sources of interference occur randomly, |
| their effect can be minimized by reducing the number of |
| iterations. |
| \item \IX{Thermal throttling} can understate scalability because increasing |
| CPU activity increases heat generation, and on systems without |
| adequate cooling (most of them!), this can result in the CPU |
| frequency decreasing as the number of CPUs increases.\footnote{ |
| Systems with adequate cooling tend to look like gaming systems.} |
| Of course, if you are testing an application to evaluate its |
| expected behavior when run in production, such thermal throttling |
| is simply a fact of life. |
| Otherwise, if you are interested in theoretical scalability, |
| use a system with adequate cooling or reduce the CPU clock rate |
| to a level that the cooling system can handle. |
| \end{enumerate} |
| |
The first and fourth sources of error provide conflicting advice,
which is one sign that we are living in the real world.
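
Returning to the fifth source of error, one way of checking for thermal
throttling on Linux systems providing the cpufreq \path{sysfs} interface
is to compare a given CPU's clock frequency before and after the
measurement period, for example, using the following sketch:

\begin{listing}
\begin{VerbatimL}
#include <stdio.h>

/* Return the current frequency (kHz) of the specified CPU,
 * or -1 on error.  Assumes Linux with cpufreq enabled. */
long cpu_cur_freq(int cpu)
{
	char path[128];
	long khz = -1;
	FILE *fp;

	snprintf(path, sizeof(path),
	         "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
	         cpu);
	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	if (fscanf(fp, "%ld", &khz) != 1)
		khz = -1;
	fclose(fp);
	return khz;
}
\end{VerbatimL}
\caption{Checking for Thermal Throttling (Sketch)}
\label{lst:debugging:Checking for Thermal Throttling Sketch}
\end{listing}

A substantially lower reading after the run than before it suggests
that thermal throttling might well have distorted the results.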
| |
| \QuickQuiz{ |
| But what about other sources of error, for example, due to |
| interactions between caches and memory layout? |
| }\QuickQuizAnswer{ |
| Changes in memory layout can indeed result in unrealistic |
| decreases in execution time. |
| For example, suppose that a given microbenchmark almost |
| always overflows the L0 cache's associativity, but with just the right |
| memory layout, it all fits. |
| If this is a real concern, consider running your microbenchmark |
| using huge pages (or within the kernel or on bare metal) in |
| order to completely control the memory layout. |
| |
| But note that there are many different possible memory-layout |
| bottlenecks. |
| Benchmarks sensitive to memory bandwidth (such as those involving |
| matrix arithmetic) should spread the running threads across the |
| available cores and sockets to maximize memory parallelism. |
| They should also spread the data across \IXplr{NUMA node}, memory |
| controllers, and DRAM chips to the extent possible. |
| In contrast, benchmarks sensitive to memory latency (including |
| most poorly scaling applications) should instead maximize |
| locality, filling each core and socket in turn before adding |
| another one. |
| }\QuickQuizEnd |
| |
| The following sections discuss ways of dealing with these measurement |
| errors, with |
| \cref{sec:debugging:Isolation} |
| covering isolation techniques that may be used to prevent some forms of |
| interference, |
| and with |
| \cref{sec:debugging:Detecting Interference} |
| covering methods for detecting interference so as to reject measurement |
| data that might have been corrupted by that interference. |
| |
| \subsection{Isolation} |
| \label{sec:debugging:Isolation} |
| |
| The Linux kernel provides a number of ways to isolate a group of |
| CPUs from outside interference. |
| |
| First, let's look at interference by other processes, threads, and tasks. |
The Linux-specific \co{sched_setaffinity()} system call may be used to move
most tasks off of a given set of CPUs and to confine your tests to
that same group.
The user-level \co{taskset} command may be used for
the same purpose, though both \co{sched_setaffinity()} and
\co{taskset} require elevated permissions when applied to tasks
belonging to other users.
| Linux-specific control groups (cgroups) may be used for this same purpose. |
| This approach can be quite effective at reducing interference, and |
| is sufficient in many cases. |
However, it does have limitations; for example, it cannot do anything
about the per-CPU kernel threads that are often used for housekeeping
tasks.
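
For example, the following sketch confines the calling thread to a
single arbitrarily chosen CPU:

\begin{listing}
\begin{VerbatimL}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Confine the calling thread to the specified CPU. */
void confine_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);	/* Allow only this one CPU. */
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}
}
\end{VerbatimL}
\caption{Confining a Thread to One CPU (Sketch)}
\label{lst:debugging:Confining a Thread to One CPU Sketch}
\end{listing}

The zero \co{pid} argument denotes the calling thread, and the
command-line equivalent is ``\co{taskset -c 2 ./mytest}'', where
\co{./mytest} is of course a stand-in for your test program.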
| |
| One way to avoid interference from per-CPU kernel threads is to run |
| your test at a high real-time priority, for example, by using |
| the POSIX \co{sched_setscheduler()} system call. |
| However, note that if you do this, you are implicitly taking on |
| responsibility for avoiding infinite loops, because otherwise |
| your test can prevent part of the kernel from functioning. |
| This is an example of the Spiderman Principle: |
| ``With great power comes great responsibility.'' |
| And although the default real-time throttling settings often address |
| such problems, they might do so by causing your real-time threads |
| to miss their deadlines. |
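
For example, the following sketch requests the \co{SCHED_FIFO}
real-time scheduling policy for the calling process:

\begin{listing}
\begin{VerbatimL}
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Run the calling process at real-time FIFO priority. */
void go_realtime(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}
}
\end{VerbatimL}
\caption{Requesting Real-Time Priority (Sketch)}
\label{lst:debugging:Requesting Real-Time Priority Sketch}
\end{listing}

A low priority such as one usually suffices for microbenchmarking, and
also avoids getting in the way of the kernel's own real-time threads,
for example, threaded interrupt handlers, which typically default to
priority 50 on Linux.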
| |
| These approaches can greatly reduce, and perhaps even eliminate, |
| interference from processes, threads, and tasks. |
However, they do nothing to prevent interference from device
interrupts, at least in the absence of threaded interrupts.
| Linux allows some control of threaded interrupts via the |
| \path{/proc/irq} directory, which contains numerical directories, one |
| per interrupt vector. |
| Each numerical directory contains \co{smp_affinity} and |
| \co{smp_affinity_list}. |
| Given sufficient permissions, you can write a value to these files |
| to restrict interrupts to the specified set of CPUs. |
| For example, either |
| ``\co{echo 3 > /proc/irq/23/smp_affinity}'' |
| or |
| ``\co{echo 0-1 > /proc/irq/23/smp_affinity_list}'' |
| would confine interrupts on vector~23 to CPUs~0 and~1, |
| at least given sufficient privileges. |
| You can use ``\co{cat /proc/interrupts}'' to obtain a list of the interrupt |
| vectors on your system, how many are handled by each CPU, and what |
| devices use each interrupt vector. |
| |
| Running a similar command for all interrupt vectors on your system |
| would confine interrupts to CPUs~0 and~1, leaving the remaining CPUs |
| free of interference. |
| Or mostly free of interference, anyway. |
| It turns out that the scheduling-clock interrupt fires on each CPU |
| that is running in user mode.\footnote{ |
Frederic Weisbecker heads up a \co{NO_HZ_FULL}
| adaptive-ticks project |
| that allows scheduling-clock interrupts to be disabled |
| on CPUs that have only one runnable task. |
| As of 2021, this is largely complete.} |
In addition, you must take care to ensure that the set of CPUs to which
you confine the interrupts is capable of handling the load.
| |
| But this only handles processes and interrupts running in the same |
| operating-system instance as the test. |
What if you are running the test in a guest OS that is itself
running on a hypervisor, for example, Linux running on KVM\@?
| Although you can in theory apply the same techniques at the hypervisor |
| level that you can at the guest-OS level, it is quite common for |
| hypervisor-level operations to be restricted to authorized personnel. |
| In addition, none of these techniques work against firmware-level |
| interference. |
| |
| \QuickQuiz{ |
| Wouldn't the techniques suggested to isolate the code under |
| test also affect that code's performance, particularly if |
| it is running within a larger application? |
| }\QuickQuizAnswer{ |
| Indeed it might, although in most microbenchmarking efforts |
| you would extract the code under test from the enclosing |
| application. |
| Nevertheless, if for some reason you must keep the code under |
| test within the application, you will very likely need to use |
| the techniques discussed in |
| \cref{sec:debugging:Detecting Interference}. |
| }\QuickQuizEnd |
| |
| Of course, if it is in fact the interference that is producing the |
| behavior of interest, you will instead need to promote interference, |
| in which case being unable to prevent it is not a problem. |
| But if you really do need interference-free measurements, then instead |
| of preventing the interference, you might need to detect the interference |
| as described in the next section. |
| |
| \subsection{Detecting Interference} |
| \label{sec:debugging:Detecting Interference} |
| |
| If you cannot prevent interference, perhaps you can detect it |
| and reject results from any affected test runs. |
| \Cref{sec:debugging:Detecting Interference Via Measurement} |
| describes methods of rejection involving additional measurements, |
| while \cref{sec:debugging:Detecting Interference Via Statistics} |
| describes statistics-based rejection. |
| |
| \subsubsection{Detecting Interference Via Measurement} |
| \label{sec:debugging:Detecting Interference Via Measurement} |
| |
| % Sources of interference include other applications, system |
| % utilities and daemons, device interrupts, firmware interrupts |
| % (including system management interrupts, or SMIs), |
| % virtualization, memory errors, and much else besides. |
| |
| Many systems, including Linux, provide means for determining after the |
| fact whether some forms of interference have occurred. |
| For example, process-based interference results in context switches, |
| which, on Linux-based systems, are visible in |
| \path{/proc/<PID>/sched} via the \co{nr_switches} field. |
| Similarly, interrupt-based interference can be detected via the |
| \path{/proc/interrupts} file. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:debugging:Using getrusage() to Detect Context Switches] |
| \begin{VerbatimL} |
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 0 if test results should be rejected. */
int runtest(void)
{
	struct rusage ru1;
	struct rusage ru2;

	if (getrusage(RUSAGE_SELF, &ru1) != 0) {
		perror("getrusage");
		abort();
	}
	/* Run test here. */
	if (getrusage(RUSAGE_SELF, &ru2) != 0) {
		perror("getrusage");
		abort();
	}
	return (ru1.ru_nvcsw == ru2.ru_nvcsw &&
	        ru1.ru_nivcsw == ru2.ru_nivcsw);
}
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Using \tco{getrusage()} to Detect Context Switches} |
| \label{lst:debugging:Using getrusage() to Detect Context Switches} |
| \end{listing} |
| |
Opening and reading files is not the way to achieve low overhead, but
the count of context switches for a given thread may instead be
obtained using the \co{getrusage()} system call, as shown in
| \cref{lst:debugging:Using getrusage() to Detect Context Switches}. |
| This same system call can be used to detect minor page faults (\co{ru_minflt}) |
| and major page faults (\co{ru_majflt}). |
| |
| Unfortunately, detecting memory errors and firmware interference is quite |
| system-specific, as is the detection of interference due to virtualization. |
| Although avoidance is better than detection, and detection is better than |
| statistics, there are times when one must avail oneself of statistics, |
| a topic addressed in the next section. |
| |
| \subsubsection{Detecting Interference Via Statistics} |
| \label{sec:debugging:Detecting Interference Via Statistics} |
| |
| Any statistical analysis will be based on assumptions about the data, |
| and performance microbenchmarks often support the following assumptions: |
| |
| \begin{enumerate} |
| \item Smaller measurements are more likely to be accurate than |
| larger measurements. |
| \item The measurement uncertainty of good data is known. |
| \item A reasonable fraction of the test runs will result in good data. |
| \end{enumerate} |
| |
| The fact that smaller measurements are more likely to be accurate than |
| larger measurements suggests that sorting the measurements in increasing |
| order is likely to be productive.\footnote{ |
| To paraphrase the old saying, ``Sort first and ask questions later.''} |
| The fact that the measurement uncertainty is known allows us to accept |
| measurements within this uncertainty of each other: |
| If the effects of interference are large compared to this uncertainty, |
| this will ease rejection of bad data. |
| Finally, the fact that some fraction (for example, one third) can be |
| assumed to be good allows us to blindly accept the first portion of the |
| sorted list, and this data can then be used to gain an estimate of the |
| natural variation of the measured data, over and above the assumed |
| measurement error. |
| |
| The approach is to take the specified number of leading elements from the |
| beginning of the sorted list, and use these to estimate a typical |
| inter-element delta, which in turn may be multiplied by the number of |
| elements in the list to obtain an upper bound on permissible values. |
| The algorithm then repeatedly considers the next element of the list. |
| If it falls below the upper bound, and if the distance between |
| the next element and the previous element is not too much greater than |
| the average inter-element distance for the portion of the list accepted |
| thus far, then the next element is accepted and the process repeats. |
| Otherwise, the remainder of the list is rejected. |
| |
| \begin{listing} |
| \input{CodeSamples/debugging/datablows@whole.fcv} |
| \caption{Statistical Elimination of Interference} |
| \label{lst:debugging:Statistical Elimination of Interference} |
| \end{listing} |
| |
| \Cref{lst:debugging:Statistical Elimination of Interference} |
| shows a simple \co{sh}/\co{awk} script implementing this notion. |
| Input consists of an x-value followed by an arbitrarily long list of y-values, |
| and output consists of one line for each input line, with fields as follows: |
| |
| \begin{enumerate} |
| \item The x-value. |
| \item The average of the selected data. |
| \item The minimum of the selected data. |
| \item The maximum of the selected data. |
| \item The number of selected data items. |
| \item The number of input data items. |
| \end{enumerate} |
| |
| This script takes three optional arguments as follows: |
| |
| \begin{description} |
| \item [\lopt{divisor}\nf{:}] Number of segments to divide the list |
| into, for example, a divisor of four means that the first quarter of |
| the data elements will be assumed to be good. |
| This defaults to three. |
| \item [\lopt{relerr}\nf{:}] Relative measurement error. |
| The script assumes that values that differ by less than this |
| error are for all intents and purposes equal. |
| This defaults to 0.01, which is equivalent to 1\pct. |
| \item [\lopt{trendbreak}\nf{:}] Ratio of inter-element spacing |
| constituting a break in the trend of the data. |
For example, if the average spacing in the data accepted so far
is 1.5 and the trend-break ratio is 2.0, then a new data value
differing from its predecessor by more than 3.0
constitutes a break in the trend.
(Unless, of course, the relative error is greater than 3.0, in
which case the ``break'' will be ignored.)
| \end{description} |
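
For example, assuming that the script is saved as \co{datablows.sh}
(a hypothetical name standing in for wherever you place it), the
invocation ``\co{datablows.sh --divisor 4 --relerr 0.02 --trendbreak 3 < data.in}''
would trust the first quarter of each line's sorted y-values, treat
values within 2\pct\ of each other as equal, and treat an inter-element
gap of more than three times the average gap seen thus far as a
potential break in the trend.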
| |
| \begin{fcvref}[ln:debugging:datablows:whole] |
| \Clnrefrange{param:b}{param:e} of |
| \cref{lst:debugging:Statistical Elimination of Interference} |
| set the default values for the parameters, and |
| \clnrefrange{parse:b}{parse:e} parse |
| any command-line overriding of these parameters. |
| \end{fcvref} |
| \begin{fcvref}[ln:debugging:datablows:whole:awk] |
| The \co{awk} invocation on \clnref{invoke} sets the values of the |
| \co{divisor}, \co{relerr}, and \co{trendbreak} variables to their |
| \co{sh} counterparts. |
| In the usual \co{awk} manner, |
| \clnrefrange{copy:b}{end} are executed on each input |
| line. |
| The loop spanning \clnref{copy:b,copy:e} copies |
| the input y-values to the |
| \co{d} array, which \clnref{asort} sorts into increasing order. |
| \Clnref{comp_i} computes the number of trustworthy y-values |
| by applying \co{divisor} and rounding up. |
| |
| \Clnrefrange{delta}{comp_max:e} compute the \co{maxdelta} |
| lower bound on the upper bound of y-values. |
To this end, \clnref{maxdelta} multiplies the difference in values over
the trusted region of data by the \co{divisor}, which projects the
difference in values observed across the trusted region onto the entire
set of y-values.
| However, this value might well be much smaller than the relative error, |
| so \clnref{maxdelta1} computes the absolute error (\co{d[i] * relerr}) |
| and adds |
| that to the difference \co{delta} across the trusted portion of the data. |
| \Clnref{comp_max:b,comp_max:e} then compute the maximum of |
| these two values. |
| |
| Each pass through the loop spanning \clnrefrange{add:b}{add:e} |
| attempts to add another |
| data value to the set of good data. |
| \Clnrefrange{chk_engh}{break} compute the trend-break delta, |
| with \clnref{chk_engh} disabling this |
| limit if we don't yet have enough values to compute a trend, |
| and with \clnref{mul_avr} multiplying \co{trendbreak} by the average |
| difference between pairs of data values in the good set. |
| If \clnref{chk_max} determines that the candidate data value would exceed the |
| lower bound on the upper bound (\co{maxdelta}) \emph{and} |
| that the difference between the candidate data value |
| and its predecessor exceeds the trend-break difference (\co{maxdiff}), |
| then \clnref{break} exits the loop: |
| We have the full good set of data. |
| |
| \Clnrefrange{comp_stat:b}{comp_stat:e} then compute and print |
| statistics. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| This approach is just plain weird! |
| Why not use means and standard deviations, like we were taught |
| in our statistics classes? |
| }\QuickQuizAnswerB{ |
| Because mean and standard deviation were not designed to do this job. |
| To see this, try applying mean and standard deviation to the |
| following data set, given a 1\pct\ relative error in measurement: |
| |
| \begin{quote} |
| 49,548.4 49,549.4 49,550.2 49,550.9 49,550.9 49,551.0 |
| 49,551.5 49,552.1 49,899.0 49,899.3 49,899.7 49,899.8 |
| 49,900.1 49,900.4 52,244.9 53,333.3 53,333.3 53,706.3 |
| 53,706.3 54,084.5 |
| \end{quote} |
| |
| The problem is that mean and standard deviation do not rest on |
| any sort of measurement-error assumption, and they will therefore |
see the difference between the values near 49,550 and those near
49,900 as being statistically significant, when in fact they are
| well within the bounds of estimated measurement error. |
| |
| Of course, it is possible to create a script similar to |
| that in |
| \cref{lst:debugging:Statistical Elimination of Interference} |
| that uses standard deviation rather than absolute difference |
| to get a similar effect, |
| and this is left as an exercise for the interested reader. |
| Be careful to avoid divide-by-zero errors arising from strings |
| of identical data values! |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But what if all the y-values in the trusted group of data |
| are exactly zero? |
| Won't that cause the script to reject any non-zero value? |
| }\QuickQuizAnswerE{ |
| Indeed it will! |
| But if your performance measurements often produce a value of |
| exactly zero, perhaps you need to take a closer look at your |
| performance-measurement code. |
| |
| Note that many approaches based on mean and standard deviation |
| will have similar problems with this sort of dataset. |
| }\QuickQuizEndE |
| } |
| |
| Although statistical interference detection can be quite useful, it should |
| be used only as a last resort. |
Where feasible, it is far better to avoid interference in the first place
(\cref{sec:debugging:Isolation}), or, failing that, to detect
interference via measurement
(\cref{sec:debugging:Detecting Interference Via Measurement}).
| |
| Never forget that for any statistical method that you create, there is |
| a dataset out there that will defeat it, thus making fools of both you |
| and your method. |
| Therefore, the more creative you are, the more careful you must be! |
| |
| \section{Summary} |
| \label{sec:debugging:Summary} |
| % |
% \epigraph{To err is human---but it feels divine.}{\emph{Mae West}}
| \epigraph{To err is human! |
| Stop being human!!!}{Ed Nofziger} |
| |
| Although validation never will be an exact science, much can be gained |
| by taking an organized approach to it, as an organized approach will |
| help you choose the right validation tools for your job, avoiding |
| situations like the one fancifully depicted in |
| \cref{fig:debugging:Choose Validation Methods Wisely}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/UseTheRightCannon}} |
| \caption{Choose Validation Methods Wisely} |
| \ContributedBy{fig:debugging:Choose Validation Methods Wisely}{Melissa Broussard} |
| \end{figure} |
| |
| A key choice is that of statistics. |
| Although the methods described in this chapter work very well most of |
| the time, they do have their limitations, courtesy of the Halting |
| Problem~\cite{AlanMTuring1937HaltingProblem,GeoffreyKPullum2000HaltingProblem}. |
| Fortunately for us, there is a huge number of special cases in which |
| we can not only work out whether a program will halt, but also |
| estimate how long it will run before halting, as discussed in |
| \cref{sec:debugging:Performance Estimation}. |
| Furthermore, in cases where a given program might or might not work |
| correctly, we can often establish estimates for what fraction of the |
| time it will work correctly, as discussed in |
| \cref{sec:debugging:Probability and Heisenbugs}. |
| |
| Nevertheless, unthinking reliance on these estimates is brave to the |
| point of foolhardiness. |
| After all, we are summarizing a huge mass of complexity in code and |
| data structures down to a single solitary number. |
| Even though we can get away with such bravery a surprisingly large |
| fraction of the time, abstracting all that code and data away will |
| occasionally cause severe problems. |
| |
| One possible problem is variability, where repeated runs give wildly |
| different results. |
This problem is often addressed using standard deviation; however, using
two numbers to summarize the behavior of a large and complex program is
about as brave as using only one number.
In computer programming, the surprising thing is that use of the
mean, or of the mean and standard deviation, is often sufficient.
| Nevertheless, there are no guarantees. |
| |
| One cause of variation is confounding factors. |
| For example, the CPU time consumed by a linked-list search will depend |
| on the length of the list. |
| Averaging together runs with wildly different list lengths will |
| probably not be useful, and adding a standard deviation to the mean |
| will not be much better. |
The right thing to do would be to control for list length, either by
holding the length constant or by measuring CPU time as a function of
list length.
| |
| Of course, this advice assumes that you are aware of the confounding |
| factors, and Murphy says that you will not be. |
| I have been involved in projects that had confounding factors as diverse |
| as air conditioners (which drew considerable power at startup, thus |
| causing the voltage supplied to the computer to momentarily drop too |
| low, sometimes resulting in failure), cache state (resulting in odd |
| variations in performance), I/O errors (including disk errors, packet |
| loss, and duplicate Ethernet MAC addresses), and even porpoises (which |
| could not resist playing with an array of transponders, which could be |
| otherwise used for high-precision acoustic positioning and navigation). |
| And this is but one reason why a good night's sleep is such an effective |
| debugging tool. |
| |
| In short, validation always will require some measure of the behavior of |
| the system. |
| To be at all useful, this measure must be a severe summarization of the |
| system, which in turn means that it can be misleading. |
| So as the saying goes, ``Be careful. |
| It is a real world out there.'' |
| |
| But what if you are working on the Linux kernel, which as of 2017 was |
| estimated to have more than 20 billion instances running throughout |
| the world? |
| In that case, a bug that occurs once every million years on a single system |
| will be encountered more than 50 times per day across the installed base. |
| A test with a 50\pct\ chance of encountering this bug in a one-hour run |
| would need to increase that bug's probability of occurrence by more than |
| ten orders of magnitude, which poses a severe challenge to |
| today's testing methodologies. |
| One important tool that can sometimes be applied with good effect to |
| such situations is formal verification, the subject of the next chapter, |
| and, more speculatively, \cref{sec:future:Formal Regression Testing?}. |
| |
| The topic of choosing a validation plan, be it testing, formal |
| verification, or both, is taken up by |
| \cref{sec:formal:Choosing a Validation Plan}. |
| |
| \QuickQuizAnswersChp{qqzdebugging} |