| % debugging/debugging.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Validation}{Validation}{qqzdebugging} |
| % |
| \Epigraph{If it is not tested, it doesn't work.}{Unknown} |
| |
I have had a few parallel programs work the first time, but that is only
because I have written an extremely large number of parallel programs over
the past few decades.
| And I have had far more parallel programs that fooled me into thinking |
| that they were working correctly the first time than actually were working |
| the first time. |
| |
| I thus need to validate my parallel programs. |
The basic trick behind validation is to realize that the computer knows
what is wrong.
| It is therefore your job to force it to tell you. |
| This chapter can therefore be thought of as a short course in |
| machine interrogation. |
| But you can leave the good-cop/bad-cop routine at home. |
| This chapter covers much more sophisticated and effective methods, |
| especially given that most computers couldn't tell a good cop from a |
| bad cop, at least as far as we know. |
| |
| A longer course may be found in many recent books on validation, as |
| well as at least one older but valuable |
| one~\cite{GlenfordJMyers1979}. |
| Validation is an extremely important topic that cuts across all forms |
| of software, and is worth intensive study in its own right. |
| However, this book is primarily about concurrency, so this chapter will do |
| little more than scratch the surface of this critically important topic. |
| |
| \Cref{sec:debugging:Introduction} |
| introduces the philosophy of debugging. |
| \Cref{sec:debugging:Tracing} |
| discusses tracing, |
| \cref{sec:debugging:Assertions} |
| discusses assertions, and |
| \cref{sec:debugging:Static Analysis} |
| discusses static analysis. |
| \Cref{sec:debugging:Code Review} |
| describes some unconventional approaches to code review that can |
| be helpful when the fabled 10,000 eyes happen not to be looking at your code. |
| \Cref{sec:debugging:Probability and Heisenbugs} |
| overviews the use of probability for validating parallel software. |
| Because performance and scalability are first-class requirements |
| for parallel programming, |
| \cref{sec:debugging:Performance Estimation} covers these |
| topics. |
| Finally, |
| \cref{sec:debugging:Summary} |
| gives a fanciful summary and a short list of statistical traps to avoid. |
| |
| But never forget that the three best debugging tools are a thorough |
| understanding of the requirements, a solid design, and a good night's |
| sleep! |
| |
| \section{Introduction} |
| \label{sec:debugging:Introduction} |
| % |
| % \epigraph{If debugging is the process of removing software bugs, then |
| % programming must be the process of putting them in.} |
| % {Edsger W.~Dijkstra} |
| \epigraph{Debugging is like being the detective in a crime movie where |
| you are also the murderer.} |
| {Filipe Fortes} |
| % https://twitter.com/fortes/status/399339918213652480?lang=en Nov 9, 2013 |
| |
| \Cref{sec:debugging:Where Do Bugs Come From?} |
| discusses the sources of bugs, and |
| \cref{sec:debugging:Required Mindset} |
| overviews the mindset required when validating software. |
| \Cref{sec:debugging:When Should Validation Start?} |
| discusses when you should start validation, and |
| \cref{sec:debugging:The Open Source Way} describes the |
| surprisingly effective open-source regimen of code review and |
| community testing. |
| |
| \subsection{Where Do Bugs Come From?} |
| \label{sec:debugging:Where Do Bugs Come From?} |
| |
| Bugs come from developers. |
| The basic problem is that the human brain did not evolve with computer |
| software in mind. |
| Instead, the human brain evolved in concert with other human brains and |
| with animal brains. |
| Because of this history, the following three characteristics of computers |
| often come as a shock to human intuition: |
| |
| \begin{enumerate} |
| \item Computers lack common sense, despite huge sacrifices at the |
| altar of artificial intelligence. |
| \item Computers fail to understand user intent, or more formally, |
| computers generally lack a theory of mind. |
| \item Computers cannot do anything useful with a fragmentary plan, |
| instead requiring that every detail of all possible scenarios |
| be spelled out in full. |
| \end{enumerate} |
| |
| The first two points should be uncontroversial, as they are illustrated |
| by any number of failed products, perhaps most famously Clippy and |
| Microsoft Bob. |
| By attempting to relate to users as people, these two products raised |
| common-sense and theory-of-mind expectations that they proved incapable |
| of meeting. |
Perhaps the set of software assistants now available on smartphones
will fare better, but as of 2021 reviews are mixed.
| That said, the developers working on them by all accounts still develop |
| the old way: |
| The assistants might well benefit end users, but not so much their own |
| developers. |
| |
| This human love of fragmentary plans deserves more explanation, |
| especially given that it is a classic two-edged sword. |
| This love of fragmentary plans is apparently due to the assumption that |
| the person carrying out the plan will have (1)~common sense and (2)~a good |
| understanding of the intent and requirements driving the plan. |
| This latter assumption is especially likely to hold in the common case |
| where the person doing the planning and the person carrying out the plan |
| are one and the same: |
In this case, the plan will be revised almost
subconsciously as obstacles arise, especially when that person has
a good understanding of the problem at hand.
| In fact, the love of fragmentary plans has served human beings well, |
in part because it is better to take random actions that have some
| chance of locating food than to starve to death while attempting to plan |
| the unplannable. |
| However, the usefulness of fragmentary plans in the |
| everyday life of which we are all experts is no guarantee of their future |
| usefulness in stored-program computers. |
| |
| Furthermore, the need to follow fragmentary plans has had important effects |
| on the human psyche, due to the fact |
| that throughout much of human history, life was often difficult and dangerous. |
| It should come as no surprise that executing a fragmentary plan that has |
| a high probability of a violent encounter with sharp teeth and claws |
| requires almost insane levels of optimism---a level of optimism that actually |
| is present in most human beings. |
| These insane levels of optimism extend to self-assessments of programming |
| ability, as evidenced by the effectiveness of (and the controversy over) |
| code-interviewing techniques~\cite{RegBraithwaite2007FizzBuzz}. |
| In fact, the clinical term for a human being with less-than-insane |
| levels of optimism is ``clinically depressed''. |
| Such people usually have extreme difficulty functioning in their daily |
| lives, underscoring the perhaps counter-intuitive importance of insane |
| levels of optimism to a normal, healthy life. |
Furthermore, if you are not insanely optimistic, you are less likely
| to start a difficult but worthwhile project.\footnote{ |
| There are some famous exceptions to this rule of thumb. |
	Some people take on difficult or risky projects in order to at
	least temporarily escape from their depression.
| Others have nothing to lose: |
| The project is literally a matter of life or death.} |
| |
| \QuickQuiz{ |
| When in computing is it necessary to follow a |
| fragmentary plan? |
| }\QuickQuizAnswer{ |
| There are any number of situations, but perhaps the most important |
| situation is when no one has ever created anything resembling |
| the program to be developed. |
| In this case, the only way to create a credible plan is to |
| implement the program, create the plan, and implement it a |
| second time. |
| But whoever implements the program for the first time has no |
| choice but to follow a fragmentary plan because any detailed |
| plan created in ignorance cannot survive first contact with |
| the real world. |
| |
| And perhaps this is one reason why evolution has favored insanely |
| optimistic human beings who are happy to follow fragmentary plans! |
| }\QuickQuizEnd |
| |
| An important special case is the project that, while valuable, is not |
| valuable enough to justify the time required to implement it. |
| This special case is quite common, and one early symptom is the |
| unwillingness of the decision-makers to invest enough to actually |
| implement the project. |
| A natural reaction is for the developers to produce an unrealistically |
| optimistic estimate in order to be permitted to start the project. |
| If the organization is strong enough and its decision-makers ineffective |
| enough, the project might succeed despite the resulting schedule slips |
| and budget overruns. |
| However, if the organization is not strong enough and if the decision-makers |
| fail to cancel the project as soon as it becomes clear that the estimates |
| are garbage, then the project might well kill the organization. |
| This might result in another organization picking up the project and |
| either completing it, canceling it, or being killed by it. |
| A given project might well succeed only after killing several |
| organizations. |
| One can only hope that the organization that eventually makes a success |
| of a serial-organization-killer project maintains a suitable |
| level of humility, lest it be killed by its next such project. |
| |
| \QuickQuiz{ |
| Who cares about the organization? |
| After all, it is the project that is important! |
| }\QuickQuizAnswer{ |
| Yes, projects are important, but if you like being paid for your |
| work, you need organizations as well as projects. |
| }\QuickQuizEnd |
| |
Important though insane levels of optimism might be, they are a key source
of bugs (and perhaps also of organizational failure).
| The question is therefore ``How to maintain the optimism required to start |
| a large project while at the same time injecting enough reality to keep |
| the bugs down to a dull roar?'' |
| The next section examines this conundrum. |
| |
| \subsection{Required Mindset} |
| \label{sec:debugging:Required Mindset} |
| |
| When carrying out any validation effort, keep the following |
| definitions firmly in mind: |
| |
| \begin{enumerate} |
| \item The only bug-free programs are trivial programs. |
| \item A reliable program has no known bugs. |
| \end{enumerate} |
| |
| From these definitions, it logically follows that any reliable |
| non-trivial program contains at least one bug that you do not |
| know about. |
| Therefore, any validation effort undertaken on a non-trivial program |
| that fails to find any bugs is itself a failure. |
| Validation is therefore an exercise in destruction. |
This means that if you are the type of person who enjoys breaking things,
validation is just the job for you.
| |
| \begin{SaveVerbatim}{VerbDebuggingQQZ} |
| real 0m0.132s |
| user 0m0.040s |
| sys 0m0.008s |
| \end{SaveVerbatim} |
| |
| \QuickQuiz{ |
| Suppose that you are writing a script that processes the |
| output of the \co{time} command, which looks as follows: |
| |
| \begin{center} |
| \fbox{\BUseVerbatim[boxwidth=2in,baseline=c]{VerbDebuggingQQZ}} |
| \end{center} |
| |
| The script is required to check its input for errors, and to |
| give appropriate diagnostics if fed erroneous \co{time} output. |
| What test inputs should you provide to this program to test it |
| for use with \co{time} output generated by single-threaded programs? |
| }\QuickQuizAnswer{ |
| Can you say ``Yes'' to all the following questions? |
| |
| \begin{enumerate} |
| \item Do you have a test case in which all the time is |
| consumed in user mode by a CPU-bound program? |
| \item Do you have a test case in which all the time is |
| consumed in system mode by a CPU-bound program? |
| \item Do you have a test case in which all three times |
| are zero? |
| \item Do you have a test case in which the \qco{user} and \qco{sys} |
| times sum to more than the \qco{real} time? |
| (This would of course be completely legitimate in |
| a multithreaded program.) |
	\item Do you have a set of test cases in which one of the
		times uses more than one second?
	\item Do you have a set of test cases in which one of the
		times uses more than ten seconds?
| \item Do you have a set of test cases in which one of the |
| times has non-zero minutes? |
| (For example, \qco{15m36.342s}.) |
| \item Do you have a set of test cases in which one of the |
| times has a seconds value of greater than 60? |
| \item Do you have a set of test cases in which one of the |
| times overflows 32 bits of milliseconds? |
| 64 bits of milliseconds? |
| \item Do you have a set of test cases in which one of the |
| times is negative? |
| \item Do you have a set of test cases in which one of the |
| times has a positive minutes value but a negative |
| seconds value? |
| \item Do you have a set of test cases in which one of the |
| times omits the \qco{m} or the \qco{s}? |
| \item Do you have a set of test cases in which one of the |
| times is non-numeric? |
| (For example, \qco{Go Fish}.) |
| \item Do you have a set of test cases in which one of the |
| lines is omitted? |
| (For example, where there is a \qco{real} value and |
| a \qco{sys} value, but no \qco{user} value.) |
| \item Do you have a set of test cases where one of the |
| lines is duplicated? |
| Or duplicated, but with a different time value for |
| the duplicate? |
| \item Do you have a set of test cases where a given line |
| has more than one time value? |
| (For example, \qco{real 0m0.132s 0m0.008s}.) |
| \item Do you have a set of test cases containing random |
| characters? |
| \item In all test cases involving invalid input, did you |
| generate all permutations? |
| \item For each test case, do you have an expected outcome |
| for that test? |
| \end{enumerate} |
| |
| If you did not generate test data for a substantial number of |
| the above cases, you will need to cultivate a more destructive |
| attitude in order to have a chance of generating high-quality |
| tests. |
| |
| Of course, one way to economize on destructiveness is to |
| generate the tests with the to-be-tested source code at hand, |
| which is called white-box testing (as opposed to black-box testing). |
| However, this is no panacea: |
| You will find that it is all too easy to find your thinking |
| limited by what the program can handle, thus failing to generate |
| truly destructive inputs. |
| }\QuickQuizEnd |
| |
| But perhaps you are a super-programmer whose code is always perfect |
| the first time every time. |
| If so, congratulations! |
| Feel free to skip this chapter, but |
| I do hope that you will forgive my skepticism. |
You see, I have met too many people who claimed to be able to write
perfect code the first time, which is not too surprising given the
previous discussion of optimism and over-confidence.
| And even if you really are a super-programmer, you just might |
| find yourself debugging lesser mortals' work. |
| |
| One approach for the rest of us is to alternate between our normal |
| state of insane optimism |
| (Sure, I can program that!\@) and severe pessimism |
| (It seems to work, but I just know that there have to be more bugs hiding |
| in there somewhere!). |
| It helps if you enjoy breaking things. |
| If you don't, or if your joy in breaking things is limited to breaking |
| \emph{other} people's things, find someone who does love breaking your |
| code and have them help you break it. |
| |
| Another helpful frame of mind is to hate it when other people find bugs in |
| your code. |
| This hatred can help motivate you to torture your code beyond all reason |
| in order to increase the probability that you will be the one to find |
| the bugs. |
| Just make sure to suspend this hatred long enough to sincerely thank |
| anyone who does find a bug in your code! |
After all, by so doing, they saved you the trouble of tracking it down,
quite possibly at great personal expense as they dredged through your code.
| |
| Yet another helpful frame of mind is studied skepticism. |
| You see, believing that you understand the code means you can learn |
| absolutely nothing about it. |
| Ah, but you know that you completely understand the code because you |
| wrote or reviewed it? |
| Sorry, but the presence of bugs suggests that your understanding is at |
| least partially fallacious. |
| One cure is to write down what you know to be true and double-check this |
| knowledge, as discussed in |
| \crefrange{sec:debugging:Tracing}{sec:debugging:Code Review}. |
| Objective reality \emph{always} overrides whatever you |
| might think you know. |
| |
| One final frame of mind is to consider the possibility that someone's |
| life depends on your code being correct. |
| One way of looking at this is that consistently making good things happen |
| requires a lot of focus on a lot of bad things that might happen, with |
| an eye towards preventing or otherwise handling those bad things.\footnote{ |
| For more on this philosophy, see the chapter entitled |
| ``The Power of Negative Thinking'' |
| from Chris Hadfield's excellent book entitled |
| ``An Astronaut's Guide to Life on Earth.''} |
| The prospect of these bad things might also motivate you to torture your |
| code into revealing the whereabouts of its bugs. |
| |
This wide variety of frames of mind opens the door to having multiple
people, each with a different frame of mind and level of optimism,
contribute to the same project.
| This can work well, if properly organized. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/TortureTux}} |
| \caption{Validation and the Geneva Convention} |
| \ContributedBy{fig:debugging:Validation and the Geneva Convention}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2in}{!}{\includegraphics{cartoons/TortureLaptop}} |
| \caption{Rationalizing Validation} |
| \ContributedBy{fig:debugging:Rationalizing Validation}{Melissa Broussard} |
| \end{figure} |
| |
| Some people might see vigorous validation as a form of torture, as |
| depicted in |
| \cref{fig:debugging:Validation and the Geneva Convention}.\footnote{ |
| The cynics among us might question whether these people are |
| afraid that validation will find bugs that they will then be |
| required to fix.} |
| Such people might do well to remind themselves that, Tux cartoons aside, |
| they are really torturing an inanimate object, as shown in |
| \cref{fig:debugging:Rationalizing Validation}. |
| Rest assured that those who fail to torture their code are doomed to be |
| tortured by it! |
| |
| However, this leaves open the question of exactly when during the project |
| lifetime validation should start, a topic taken up by the next section. |
| |
| \subsection{When Should Validation Start?} |
| \label{sec:debugging:When Should Validation Start?} |
| |
| Validation should start exactly when the project starts. |
| |
| To see this, consider that tracking down a bug is much harder in a large |
| program than in a small one. |
| Therefore, to minimize the time and effort required to track down bugs, |
| you should test small units of code. |
| Although you won't find all the bugs this way, you will find a substantial |
| fraction, and it will be much easier to find and fix the ones you do find. |
| Testing at this level can also alert you to larger flaws in your overall |
| design, minimizing the time you waste writing code that is broken |
| by design. |
| |
| But why wait until you have code before validating your design?\footnote{ |
| The old saying ``First we must code, then we have incentive to |
| think'' notwithstanding.} |
| Hopefully reading \cref{chp:Hardware and its Habits,% |
| chp:Tools of the Trade} provided you with the information |
| required to avoid some regrettably common design flaws, |
| but discussing your design with a colleague or even simply writing it |
| down can help flush out additional flaws. |
| |
| However, it is all too often the case that waiting to start validation |
| until you have a design is waiting too long. |
Mightn't your natural level of optimism have caused you to start the
design before you fully understood the requirements?
| The answer to this question will almost always be ``yes''. |
| One good way to avoid flawed requirements is to get to know your users. |
| To really serve them well, you will have to live among them. |
| |
| \QuickQuiz{ |
| You are asking me to do all this validation BS before |
| I even start coding??? |
| That sounds like a great way to never get started!!! |
| }\QuickQuizAnswer{ |
| If it is your project, for example, a hobby, do what you like. |
| Any time you waste will be your own, and you have no one else |
| to answer to for it. |
| And there is a good chance that the time will not be completely |
| wasted. |
| For example, if you are embarking on a first-of-a-kind project, |
| the requirements are in some sense unknowable anyway. |
| In this case, the best approach might be to quickly prototype |
| a number of rough solutions, try them out, and see what works |
| best. |
| |
| On the other hand, if you are being paid to produce a system that |
| is broadly similar to existing systems, you owe it to your users, |
| your employer, and your future self to validate early and often. |
| }\QuickQuizEnd |
| |
| First-of-a-kind projects often use different methodologies such as |
| rapid prototyping or agile. |
Here, the main goal of early prototypes is not to create correct
implementations, but rather to learn the project's requirements.
| But this does not mean that you omit validation; it instead means that |
| you approach it differently. |
| |
| One such approach takes a Darwinian view, with the validation suite |
| eliminating code that is not fit to solve the problem at hand. |
| From this viewpoint, a vigorous validation suite is essential to the |
| fitness of your software. |
| However, taking this approach to its logical conclusion is quite humbling, |
| as it requires us developers to admit that our carefully crafted changes |
| to the codebase are, from a Darwinian standpoint, random mutations. |
| On the other hand, this conclusion is supported by long experience |
| indicating that seven percent of fixes introduce at least one |
| bug~\cite{RexBlack2012SQA}. |
| |
| How vigorous should your validation suite be? |
| If the bugs it finds aren't threatening the very foundations of your |
| software design, then it is not yet vigorous enough. |
| After all, your design is just as prone to bugs as is your code, and |
| the earlier you find and fix the bugs in your design, the less time |
| you will waste coding those design bugs. |
| |
| \QuickQuiz{ |
| Are you actually suggesting that it is possible to test |
| correctness into software??? |
| Everyone knows that is impossible!!! |
| }\QuickQuizAnswer{ |
| Please note that the text used the word ``validation'' rather |
| than the word ``testing''. |
| The word ``validation'' includes formal methods as well as |
| testing, for more on which please see |
| \cref{chp:Formal Verification}. |
| |
| But as long as we are bringing up things that everyone should |
| know, let's remind ourselves that Darwinian evolution is |
| not about correctness, but rather about survival. |
| As is software. |
| My goal as a developer is not that my software be attractive |
| from a theoretical viewpoint, but rather that it survive |
| whatever its users throw at it. |
| |
| Although the notion of correctness does have its uses, its |
| fundamental limitation is that the specification against which |
| correctness is judged will also have bugs. |
| This means nothing more nor less than that traditional correctness |
| proofs prove that the code in question contains the intended |
| set of bugs! |
| |
| Alternative definitions of correctness instead focus on the |
| lack of problematic properties, for example, proving that the |
| software has no use-after-free bugs, no \co{NULL} pointer |
| dereferences, no array-out-of-bounds references, and so on. |
| Make no mistake, finding and eliminating such classes of bugs |
| can be highly useful. |
| But the fact remains that the lack of certain classes of bugs |
| does nothing to demonstrate fitness for any specific purpose. |
| |
| Therefore, usage-driven validation remains critically important. |
| |
| Besides, it is also impossible to verify correctness into your |
| software, especially given the problematic need to verify both |
| the verifier and the specification. |
| }\QuickQuizEnd |
| |
| It is worth reiterating that this advice applies to first-of-a-kind |
| projects. |
| If you are instead doing a project in a well-explored area, you would |
| be quite foolish to refuse to learn from previous experience. |
But you should still start validating right at the beginning of the
project, hopefully guided by others' hard-won knowledge of both
requirements and pitfalls.
| |
| An equally important question is ``When should validation stop?'' |
| The best answer is ``Some time after the last change.'' |
| Every change has the potential to create a bug, and thus every |
| change must be validated. |
| Furthermore, validation development should continue through the |
| full lifetime of the project. |
| After all, the Darwinian perspective above implies that bugs are |
| adapting to your validation suite. |
| Therefore, unless you continually improve your validation suite, your |
| project will naturally accumulate hordes of validation-suite-immune bugs. |
| |
But life is a tradeoff, and every bit of time invested in validation
suites is a bit of time that cannot be invested in directly improving
the project itself.
| These sorts of choices are never easy, and it can be just as damaging |
| to overinvest in validation as it can be to underinvest. |
| But this is just one more indication that life is not easy. |
| |
| Now that we have established that you should start validation when |
| you start the project (if not earlier!), and that both validation and |
| validation development should continue throughout the lifetime of that |
| project, the following sections cover a number of validation techniques |
| and methods that have proven their worth. |
| |
| \subsection{The Open Source Way} |
| \label{sec:debugging:The Open Source Way} |
| |
| The open-source programming methodology has proven quite effective, and |
| includes a regimen of intense code review and testing. |
| |
| I can personally attest to the effectiveness of the open-source community's |
| intense code review. |
| One of my first patches to the Linux kernel involved a distributed |
| filesystem where one node might write to a given file that another node |
| has mapped into memory. |
| In this case, it is necessary to invalidate the affected pages from |
| the mapping in order to allow the filesystem to maintain coherence |
| during the write operation. |
| I coded up a first attempt at a patch, and, in keeping with the open-source |
| maxim ``post early, post often'', I posted the patch. |
| I then considered how I was going to test it. |
| |
| But before I could even decide on an overall test strategy, I got a |
| reply to my posting pointing out a few bugs. |
| I fixed the bugs and reposted the patch, and returned to thinking |
| out my test strategy. |
| However, before I had a chance to write any test code, I received |
| a reply to my reposted patch, pointing out more bugs. |
| This process repeated itself many times, and I am not sure that I |
| ever got a chance to actually test the patch. |
| |
| This experience brought home the truth of the open-source saying: |
| Given enough eyeballs, all bugs are shallow~\cite{EricSRaymond99b}. |
| |
| However, when you post some code or a given patch, it is worth |
| asking a few questions: |
| |
| \begin{enumerate} |
| \item How many of those eyeballs are actually going to look at your code? |
| \item How many will be experienced and clever enough to actually find |
| your bugs? |
| \item Exactly when are they going to look? |
| \end{enumerate} |
| |
| I was lucky: |
| There was someone out there who wanted the functionality |
| provided by my patch, who had long experience with distributed filesystems, |
| and who looked at my patch almost immediately. |
| If no one had looked at my patch, there would have been no review, and |
| therefore none of those bugs would have been located. |
| If the people looking at my patch had lacked experience with distributed |
| filesystems, it is unlikely that they would have found all the bugs. |
Had they waited months or even years to look, I likely would have forgotten
how the patch was supposed to work, making it much more difficult to
fix the bugs they found.
| |
However, we must not forget the second tenet of open-source development,
namely, intensive testing.
| For example, a great many people test the Linux kernel. |
| Some test patches as they are submitted, perhaps even yours. |
Others test the -next tree, which is helpful, but there are likely to be
several weeks or even months of delay between the time that you write the
patch and the time that it appears in the -next tree, by which time the
patch will not be quite as fresh in your mind.
| Still others test maintainer trees, which often have a similar time delay. |
| |
| Quite a few people don't test code until it is committed to mainline, |
| or the master source tree (Linus's tree in the case of the Linux kernel). |
| If your maintainer won't accept your patch until it has been tested, |
| this presents you with a deadlock situation: |
| Your patch won't be accepted until it is tested, but it won't be tested |
| until it is accepted. |
| Nevertheless, people who test mainline code are still relatively |
| aggressive, given that many people and organizations do not test code |
| until it has been pulled into a Linux distro. |
| |
| And even if someone does test your patch, there is no guarantee that they |
| will be running the hardware and software configuration and workload |
| required to locate your bugs. |
| |
| Therefore, even when writing code for an open-source project, you need to |
| be prepared to develop and run your own test suite. |
| Test development is an underappreciated and very valuable skill, so be |
| sure to take full advantage of any existing test suites available to |
| you. |
| Important as test development is, we must leave further discussion of it |
| to books dedicated to that topic. |
| The following sections therefore discuss locating bugs in your code given that |
| you already have a good test suite. |
| |
| \section{Tracing} |
| \label{sec:debugging:Tracing} |
| % |
| \epigraph{The machine knows what is wrong. |
| Make it tell you.}{Unknown} |
| |
| When all else fails, add a \co{printk()}! |
| Or a \co{printf()}, if you are working with user-mode C-language applications. |
| |
| The rationale is simple: |
| If you cannot figure out how execution reached a given point in the code, |
| sprinkle print statements earlier in the code to work out what happened. |
| You can get a similar effect, and with more convenience and flexibility, |
| by using a debugger such as gdb (for user applications) or kgdb |
| (for debugging Linux kernels). |
| Much more sophisticated tools exist, with some of the more recent |
| offering the ability to rewind backwards in time from the point |
| of failure. |
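
For example, a hypothetical user-mode debug print might look like
the following, with \co{state} standing in for whatever variable
you happen to suspect:

\begin{VerbatimN}
printf("DEBUG %s:%d: state=%d\n",
       __FILE__, __LINE__, state);
\end{VerbatimN}

Running the failing test then shows which such statements were
reached, and with what values, narrowing down the point of failure.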
| |
| These brute-force testing tools are all valuable, especially now |
| that typical systems have more than 64K of memory and CPUs running |
| faster than 4\,MHz. |
| Much has been |
| written about these tools, so this chapter will add only a little more. |
| |
However, these tools all have a serious shortcoming when you need a
fastpath to tell you what is going wrong, namely, they often have
excessive overheads.
| There are special tracing technologies for this purpose, which typically |
| leverage data ownership techniques |
| (see \cref{chp:Data Ownership}) |
| to minimize the overhead of runtime data collection. |
| One example within the Linux kernel is |
| ``trace events''~\cite{StevenRostedt2010perfTraceEventP1,StevenRostedt2010perfTraceEventP2,StevenRostedt2010perfTraceEventP3,StevenRostedt2010perfHP+DeathlyMacros}, |
| which uses per-CPU buffers to allow data to be collected with |
| extremely low overhead. |
| Even so, enabling tracing can sometimes change timing enough to |
| hide bugs, resulting in \emph{heisenbugs}, which are discussed in |
| \cref{sec:debugging:Probability and Heisenbugs} |
| and especially \cref{sec:debugging:Hunting Heisenbugs}. |
| In the kernel, BPF can do data reduction in the kernel, reducing |
| the overhead of transmitting the needed information from the kernel |
| to userspace~\cite{BrendanGregg2019BPFperftools}. |
| In userspace code, there is a huge number of tools that can help you. |
| One good starting point is Brendan Gregg's blog.\footnote{ |
| \url{http://www.brendangregg.com/blog/}} |
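
For example, within the Linux kernel, a minimal sketch of this
low-overhead approach is \co{trace_printk()}, which records its
output in the lock-free per-CPU trace buffers rather than writing
to the console (\co{state} is again a hypothetical variable):

\begin{VerbatimN}
/* Output appears in the kernel trace buffer, */
/* e.g., debugfs tracing/trace, not console.  */
trace_printk("fastpath: cpu=%d state=%d\n",
	     raw_smp_processor_id(), state);
\end{VerbatimN}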
| |
| Even if you avoid heisenbugs, other pitfalls await you. |
| For example, although the machine really does know all, |
| what it knows is almost always way more than your head can hold. |
| For this reason, high-quality test suites normally come with sophisticated |
| scripts to analyze the voluminous output. |
| But beware---scripts will only notice what you tell them to. |
| My \co{rcutorture} scripts are a case in point: |
| Early versions of those scripts were quite satisfied with a test run |
| in which RCU \IXpl{grace period} stalled indefinitely. |
| This of course resulted in the scripts being modified to detect RCU |
| grace-period stalls, but this does not change the fact that the scripts |
| will only detect problems that I make them detect. |
| But note well that unless you have a solid design, you won't know what |
| your script should check for! |
| |
| Another problem with tracing and especially with \co{printk()} calls |
| is that their overhead can rule out production use. |
| In such cases, assertions can be helpful. |
| |
| \section{Assertions} |
| \label{sec:debugging:Assertions} |
| % |
| \epigraph{No man really becomes a fool until he stops asking questions.} |
| {Charles P. Steinmetz} |
| |
| Assertions are usually implemented in the following manner: |
| |
| \begin{VerbatimN} |
| if (something_bad_is_happening()) |
| complain(); |
| \end{VerbatimN} |
| |
This pattern is often encapsulated into C-preprocessor macros or
language intrinsics.
For example, in the Linux kernel, this might be represented as
\co{WARN_ON(something_bad_is_happening())}.
Of course, if \co{something_bad_is_happening()} returns true quite
frequently, the resulting output might obscure reports of other problems,
in which case
\co{WARN_ON_ONCE(something_bad_is_happening())} might be more appropriate.
| |
| \QuickQuiz{ |
| How can you implement \co{WARN_ON_ONCE()}? |
| }\QuickQuizAnswer{ |
| If you don't mind \co{WARN_ON_ONCE()} sometimes warning more |
| than once, simply maintain a static variable that is initialized |
| to zero. |
| If the condition triggers, check the variable, and |
| if it is non-zero, return. |
| Otherwise, set it to one, print the message, and return. |
| |
| If you really need the message to never appear more than once, |
| you can use an atomic exchange operation in place of ``set it |
| to one'' above. |
| Print the message only if the atomic exchange operation returns |
| zero. |
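
	For example, here is a minimal user-mode sketch of the
	never-more-than-once variant, using a GCC atomic builtin
	as the atomic exchange operation:

\begin{VerbatimN}
/* Sketch only; requires <stdio.h>. */
#define WARN_ON_ONCE(cond)				\
do {							\
	static int warned;				\
							\
	if ((cond) &&					\
	    !__atomic_exchange_n(&warned, 1,		\
				 __ATOMIC_RELAXED))	\
		fprintf(stderr, "WARNING at %s:%d\n",	\
			__FILE__, __LINE__);		\
} while (0)
\end{VerbatimN}

	The first trigger exchanges a one into \co{warned} and gets
	the initial zero in return, so that trigger and no other
	prints the message.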
| }\QuickQuizEnd |
| |
| In parallel code, one bad something that might happen is that |
| a function expecting to be called under a particular lock might be called |
| without that lock being held. |
| Such functions sometimes have header comments stating something like |
| ``The caller must hold \co{foo_lock} when calling this function'', but |
| such a comment does no good unless someone actually reads it. |
| An executable statement carries far more weight. |
| The Linux kernel's lockdep |
| facility~\cite{JonathanCorbet2006lockdep,StevenRostedt2011locdepCryptic} |
| therefore provides a \co{lockdep_assert_held()} function that checks |
| whether the specified lock is held. |
| Of course, lockdep incurs significant overhead, and thus might not be |
| helpful in production. |
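
For example, the sketch below makes the header comment's demand
executable (\co{struct foo} and \co{foo_lock} are hypothetical):

\begin{VerbatimN}
/* Caller must hold foo_lock. */
static void foo_update(struct foo *fp, int newval)
{
	lockdep_assert_held(&foo_lock);	/* Warn if not! */
	fp->val = newval;
}
\end{VerbatimN}

In lockdep-enabled debug kernels, any call with \co{foo_lock} not
held produces a warning, while in non-lockdep kernels the check
compiles to nothing.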
| |
| An especially bad parallel-code something is unexpected concurrent |
| access to data. |
| The \IXBacrfst{kcsan}~\cite{JonathanCorbet2019KCSAN} |
| uses existing markings such as \co{READ_ONCE()} and \co{WRITE_ONCE()} |
| to determine which concurrent accesses deserve warning messages. |
| KCSAN has a significant false-positive rate, especially from the |
| viewpoint of developers thinking in terms of C as assembly language |
| with additional syntax. |
| KCSAN therefore provides a \co{data_race()} construct to forgive |
| known-benign \IXpl{data race}, and also the \co{ASSERT_EXCLUSIVE_ACCESS()} |
| and \co{ASSERT_EXCLUSIVE_WRITER()} assertions to explicitly check for data |
| races~\cite{MarcoElver2020FearDataRaceDetector1,MarcoElver2020FearDataRaceDetector2}. |
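
For example, an update-side function that believes itself to be the
sole writer might say so explicitly, as in the following sketch
(\co{struct foo} and its \co{->counter} field are hypothetical):

\begin{VerbatimN}
/* Only one foo_inc() at a time, please! */
static void foo_inc(struct foo *fp)
{
	/* KCSAN: Complain about concurrent writers. */
	ASSERT_EXCLUSIVE_WRITER(fp->counter);
	WRITE_ONCE(fp->counter, fp->counter + 1);
}
\end{VerbatimN}

Concurrent marked readers remain legitimate, but any write to
\co{->counter} that races with \co{foo_inc()} will draw a KCSAN
warning.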
| |
| So what can be done in cases where checking is necessary, but where the |
| overhead of runtime checking cannot be tolerated? |
| One approach is static analysis, which is discussed in the next section. |
| |
| \section{Static Analysis} |
| \label{sec:debugging:Static Analysis} |
| % |
| \epigraph{A lot of automation isn't a replacement of |
| humans but of mind-numbing behavior.} |
| {Summarized from Stewart Butterfield} |
| |
| Static analysis is a validation technique where one program takes a second |
| program as input, reporting errors and vulnerabilities located in this |
| second program. |
| Interestingly enough, almost all programs are statically analyzed |
| by their compilers or interpreters. |
| These tools are far from perfect, but their ability to locate |
| errors has improved immensely over the past few decades, in part because |
| they now have much more than 64K bytes of memory in which to carry out their |
| analyses. |
| |
| The original UNIX \co{lint} tool~\cite{StephenJohnson1977lint} was |
| quite useful, though much of its functionality has since been incorporated |
| into C compilers. |
| There are nevertheless lint-like tools in use to this day. |
| The sparse static analyzer~\cite{JonathanCorbet2004sparse} |
| finds higher-level issues in the Linux kernel, including: |
| |
| \begin{enumerate} |
\item Misuse of pointers to user-space structures (see the sketch
	following this list).
| \item Assignments from too-long constants. |
| \item Empty \co{switch} statements. |
| \item Mismatched lock acquisition and release primitives. |
| \item Misuse of per-CPU primitives. |
| \item Use of RCU primitives on non-RCU pointers and vice versa. |
| \end{enumerate} |
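
As a sketch of the first of these, consider the following
hypothetical ioctl helper, in which sparse's \co{__user} annotation
flags the commented-out direct dereference:

\begin{VerbatimN}
/* Fetch an int from userspace safely. */
static long my_ioctl_get(int __user *argp, int *valp)
{
	/* *valp = *argp; <-- sparse complains. */
	if (copy_from_user(valp, argp, sizeof(*valp)))
		return -EFAULT;
	return 0;
}
\end{VerbatimN}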
| |
The coccinelle tool can be thought of as a C-syntax-aware
search-and-replace facility.
| There is also a large body of scripts that use this tool to locate |
| and fix some classes of Linux-kernel |
| bugs~\cite{NicolasPalix2011CoccinelleTenYears}. |
| |
| Although it is likely that compilers will continue to increase their |
| static-analysis capabilities, coccinelle and the sparse static analyzer |
| demonstrate the benefits of static analysis outside of the compiler, |
| particularly for finding application-specific bugs. |
| \Crefrange{sec:formal:SAT Solvers}{sec:formal:Stateless Model Checkers} |
| describe more sophisticated forms of static analysis. |
| |
| \section{Code Review} |
| \label{sec:debugging:Code Review} |
| % |
| \epigraph{If a man speaks of my virtues, he steals from me; |
| if he speaks of my vices, then he is my teacher.} |
| {Chinese proverb} |
| |
| Code review is a special case of static analysis with human beings doing |
| the analysis. |
| Human beings are of course subject to inattention, fatigue, and errors, |
| which underscores the importance of static analysis and testing. |
| Done properly, these two activities can automate at least some aspects |
| of code review. |
| However, it does not appear that testing and static analysis will be able |
| to completely replace manual review any time soon. |
| This section therefore covers inspection, walkthroughs, and self-inspection. |
| |
| \subsection{Inspection} |
| \label{sec:debugging:Inspection} |
| |
| Traditionally, formal code inspections take place in face-to-face meetings |
| with formally defined roles: |
| Moderator, developer, and one or two other participants. |
| The developer reads through the code, explaining what it is doing and |
| why it works. |
| The one or two other participants ask questions and raise issues, |
| hopefully exposing the author's invalid assumptions, while the moderator's |
| job is to resolve any resulting conflicts and take notes. |
| This process can be extremely effective at locating bugs, particularly |
| if all of the participants are familiar with the code at hand. |
| |
| However, this face-to-face formal procedure does not necessarily |
| work well in the global Linux kernel community. |
| Instead, individuals review code separately and provide comments via |
| email or IRC\@. |
| The note-taking is provided by email archives or IRC logs, and moderators |
| volunteer their services as required by the occasional flamewar. |
| This process also works reasonably well, particularly if all of the |
| participants are familiar with the code at hand. |
| In fact, one advantage of the Linux kernel community approach over |
| traditional formal inspections is the greater probability of contributions |
| from people \emph{not} familiar with the code, who might not be blinded |
| by the author's invalid assumptions, and who might also test the code. |
| |
| \QuickQuiz{ |
| Just what invalid assumptions are you accusing Linux kernel |
| hackers of harboring??? |
| }\QuickQuizAnswer{ |
| Those wishing a complete answer to this question are encouraged |
| to search the Linux kernel \co{git} repository for commits |
| containing the string \qco{Fixes:}. |
| There were many thousands of them just in the year 2020, including |
| fixes for the following invalid assumptions: |
| |
| \begin{enumerate} |
| \item Testing for a non-zero denominator will prevent |
| divide-by-zero errors. |
| (Hint: |
| Suppose that the test uses 64-bit arithmetic |
| but that the division uses 32-bit arithmetic.) |
| \item Userspace can be trusted to zero out versioned data |
| structures used to communicate with the kernel. |
| (Hint: |
| Sometimes userspace has no idea how large the |
| data structure is.) |
| \item Outdated TCP duplicate selective acknowledgement (D-SACK) |
| packets can be completely ignored. |
| (Hint: |
| These packets might also contain other information.) |
| \item All CPUs are little-endian. |
| \item Once a data structure is no longer needed, all of its |
| memory may be immediately freed. |
| \item All devices can be initialized while in standby mode. |
	\item Developers can be trusted to consistently do correct
		hexadecimal arithmetic.
| \end{enumerate} |
| |
| Those who look at these commits in greater detail will conclude |
| that invalid assumptions are the rule, not the exception. |
| }\QuickQuizEnd |
| |
| It is quite likely that the Linux kernel community's review process |
| is ripe for improvement: |
| |
| \begin{enumerate} |
| \item There is sometimes a shortage of people with the time and |
| expertise required to carry out an effective review. |
| \item Even though all review discussions are archived, they are |
| often ``lost'' in the sense that insights are forgotten and |
| people fail to look up the discussions. |
| This can result in re-insertion of the same old bugs. |
| \item It is sometimes difficult to resolve flamewars when they do |
| break out, especially when the combatants have disjoint |
| goals, experience, and vocabulary. |
| \end{enumerate} |
| |
| Perhaps some of the needed improvements will be provided by |
| continuous-integration-style testing, but there are many bugs more |
| easily found by review than by testing. |
| When reviewing, therefore, it is worthwhile to look at relevant documentation |
| in commit logs, bug reports, and LWN articles. |
| This documentation can help you quickly build up the required expertise. |
| |
| \subsection{Walkthroughs} |
| \label{sec:debugging:Walkthroughs} |
| |
| A traditional code walkthrough is similar to a formal inspection, |
| except that the group |
| ``plays computer'' with the code, driven by specific test cases. |
| A typical walkthrough team has a moderator, a secretary (who records |
| bugs found), a testing expert (who generates the test cases) and |
perhaps one or two others.
| These can be extremely effective, albeit also extremely time-consuming. |
| |
| It has been some decades since I have participated in a formal |
| walkthrough, and I suspect that a present-day walkthrough would |
| use single-stepping debuggers. |
| One could imagine a particularly sadistic procedure as follows: |
| |
| \begin{enumerate} |
| \item The tester presents the test case. |
| \item The moderator starts the code under a debugger, using the |
| specified test case as input. |
| \item Before each statement is executed, the developer is required |
| to predict the outcome of the statement and explain why |
| this outcome is correct. |
| \item If the outcome differs from that predicted by the developer, |
| this is taken as a potential bug. |
| \item In parallel code, a ``concurrency shark'' asks what code |
| might execute concurrently with this code, and why such |
| concurrency is harmless. |
| \end{enumerate} |
| |
| Sadistic, certainly. |
| Effective? |
| Maybe. |
| If the participants have a good understanding of the requirements, |
| software tools, data structures, and algorithms, then walkthroughs |
| can be extremely effective. |
| If not, walkthroughs are often a waste of time. |
| |
| \subsection{Self-Inspection} |
| \label{sec:debugging:Self-Inspection} |
| |
| Although developers are usually not all that effective at inspecting |
| their own code, there are a number of situations where there is no |
| reasonable alternative. |
| For example, the developer might be the only person authorized to look |
| at the code, other qualified developers might all be too busy, or |
| the code in question might be sufficiently bizarre that the developer |
| is unable to convince anyone else to take it seriously until after |
| demonstrating a prototype. |
| In these cases, the following procedure can be quite helpful, |
| especially for complex parallel code: |
| |
| \begin{enumerate} |
\item Write a design document with requirements, diagrams for data structures,
| and rationale for design choices. |
| \item Consult with experts, updating the design document as needed. |
| \item Write the code in pen on paper, correcting errors as you go. |
	Resist the temptation to refer to pre-existing nearly identical
	code sequences; instead, copy them.
| \item At each step, articulate and question your assumptions, |
| inserting assertions or constructing tests to check them. |
| \item If there were errors, copy the code in pen on fresh paper, correcting |
| errors as you go. |
| Repeat until the last two copies are identical. |
| \item Produce proofs of correctness for any non-obvious code. |
| \item Use a source-code control system. |
| Commit early; commit often. |
| \item Test the code fragments from the bottom up. |
| \item When all the code is integrated (but preferably before), |
| do full-up functional and stress testing. |
| \item Once the code passes all tests, write code-level documentation, |
| perhaps as an extension to the design document discussed above. |
| Fix both the code and the test code as needed. |
| \end{enumerate} |
| |
| When I follow this procedure for new RCU code, there are normally only |
| a few bugs left at the end. |
| With a few prominent (and embarrassing) |
| exceptions~\cite{PaulEMcKenney2011RCU3.0trainwreck}, |
| I usually manage to locate these bugs before others do. |
| That said, this is getting more difficult over time as the number and |
| variety of Linux-kernel users increases. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Why would anyone bother copying existing code in pen on paper??? |
| Doesn't that just increase the probability of transcription errors? |
| }\QuickQuizAnswerB{ |
| If you are worried about transcription errors, please allow me |
| to be the first to introduce you to a really cool tool named |
| \co{diff}. |
| In addition, carrying out the copying can be quite valuable: |
| \begin{enumerate} |
| \item If you are copying a lot of code, you are probably failing |
| to take advantage of an opportunity for abstraction. |
| The act of copying code can provide great motivation |
| for abstraction. |
| \item Copying the code gives you an opportunity to think about |
| whether the code really works in its new setting. |
| Is there some non-obvious constraint, such as the need |
| to disable interrupts or to hold some lock? |
| \item Copying the code also gives you time to consider whether |
| there is some better way to get the job done. |
| \end{enumerate} |
| So, yes, copy the code! |
| }\QuickQuizEndB |
| % |
| \QuickQuizM{ |
| This procedure is ridiculously over-engineered! |
| How can you expect to get a reasonable amount of software |
| written doing it this way??? |
| }\QuickQuizAnswerM{ |
| Indeed, repeatedly copying code by hand is laborious and slow. |
| However, when combined with heavy-duty stress testing and |
| proofs of correctness, this approach is also extremely effective |
| for complex parallel code where ultimate performance and |
| reliability are required and where debugging is difficult. |
| The Linux-kernel RCU implementation is a case in point. |
| |
| On the other hand, if you are writing a simple single-threaded |
| shell script, then you would be best-served by a different |
| methodology. |
| For example, enter each command one at a time into an interactive |
| shell with a test data set to make sure that it does what you |
| want, then copy-and-paste the successful commands into your |
| script. |
| Finally, test the script as a whole. |
| |
| If you have a friend or colleague who is willing to help out, |
| pair programming can work very well, as can any number of |
| formal design- and code-review processes. |
| |
| And if you are writing code as a hobby, then do whatever you like. |
| |
| In short, different types of software need different development |
| methodologies. |
| }\QuickQuizEndM |
| % |
| \QuickQuizE{ |
| What do you do if, after all the pen-on-paper copying, you find |
| a bug while typing in the resulting code? |
| }\QuickQuizAnswerE{ |
| The answer, as is often the case, is ``it depends''. |
| If the bug is a simple typo, fix that typo and continue typing. |
| However, if the bug indicates a design flaw, go back to pen |
| and paper. |
| }\QuickQuizEndE |
| } |
| |
| The above procedure works well for new code, but what if you need to |
| inspect code that you have already written? |
| You can of course apply the above procedure for old code in the special |
| case where you wrote one to throw away~\cite{Brooks79}, |
| but the following approach can also be helpful in less desperate |
| circumstances: |
| |
| \begin{enumerate} |
| \item Using your favorite documentation tool (\LaTeX{}, HTML, |
| OpenOffice, or straight ASCII), describe the high-level |
| design of the code in question. |
| Use lots of diagrams to illustrate the data structures |
| and how these structures are updated. |
| \item Make a copy of the code, stripping away all comments. |
| \item Document what the code does statement by statement. |
| \item Fix bugs as you find them. |
| \end{enumerate} |
| |
| This works because describing the code in detail is an excellent way to spot |
| bugs~\cite{GlenfordJMyers1979}. |
| This second procedure is also a good way to get your head around |
| someone else's code, although the first step often suffices. |
| |
| Although review and inspection by others is probably more efficient and |
| effective, the above procedures can be quite helpful in cases where |
| for whatever reason it is not feasible to involve others. |
| |
| At this point, you might be wondering how to write parallel code without |
| having to do all this boring paperwork. |
| Here are some time-tested ways of accomplishing this: |
| |
| \begin{enumerate} |
| \item Write a sequential program that scales through use of |
| available parallel library functions. |
| \item Write sequential plug-ins for a parallel framework, |
| such as map-reduce, BOINC, or a web-application server. |
| \item Fully partition your problems, then implement sequential |
| program(s) that run in parallel without communication. |
| \item Stick to one of the application areas (such as linear algebra) |
| where tools can automatically decompose and parallelize |
| the problem. |
| \item Make extremely disciplined use of parallel-programming |
| primitives, so that the resulting code is easily seen to be correct. |
| But beware: |
| It is always tempting to break the rules ``just a little bit'' |
| to gain better performance or scalability. |
| Breaking the rules often results in general breakage. |
| That is, unless you carefully do the paperwork described in this |
| section. |
| \end{enumerate} |
| |
| But the sad fact is that even if you do the paperwork or use one of |
| the above ways to more-or-less safely avoid paperwork, |
| there will be bugs. |
| If nothing else, more users and a greater variety of users will expose |
| more bugs more quickly, especially if those users are doing things |
| that the original developers did not consider. |
| The next section describes how to handle the probabilistic bugs that |
| occur all too commonly when validating parallel software. |
| |
| \QuickQuiz{ |
| Wait! |
| Why on earth would an abstract piece of software fail only |
| sometimes??? |
| }\QuickQuizAnswer{ |
| Because complexity and concurrency can produce results that |
| are indistinguishable from |
| randomness~\cite{PeterOkech2009InherentRandomness}. |
| For example, a bug in Linux-kernel RCU required the following |
| to hold before that bug would manifest: |
| \begin{enumerate} |
| \item The kernel was built for HPC or real-time use, so that |
| a given CPU's RCU work could be offloaded to some other |
| CPU\@. |
| \item An offloaded CPU went offline just after generating a |
| large quantity of RCU work. |
| \item A special \co{rcu_barrier()} API was invoked just at |
| this time. |
| \item The RCU work from the newly offlined CPU was still being |
| processed after \co{rcu_barrier()} returned. |
| \item One of these remaining RCU work items was related to |
| the code invoking the \co{rcu_barrier()}. |
| \end{enumerate} |
| Making this bug manifest therefore required considerable luck |
| or great testing skill. |
| But the testing skill could be effective only if the bug was |
| known, which of course it was not. |
| Therefore, the manifesting of this bug was very well modeled |
| as a probabilistic process. |
| }\QuickQuizEnd |
| |
| \section{Probability and Heisenbugs} |
| \label{sec:debugging:Probability and Heisenbugs} |
| % |
| \epigraph{With both heisenbugs and impressionist art, the closer you |
| get, the less you see.} |
| {Unknown} |
| |
| So your parallel program fails sometimes. |
| But you used techniques from the earlier sections to locate |
| the problem and now have a fix in place! |
| Congratulations!!! |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Passed-the-stress-test}} |
| \caption{Passed on Merits? |
| Or Dumb Luck?} |
| \ContributedBy{fig:cpu:Passed-the-stress-test}{Melissa Broussard} |
| \end{figure} |
| |
| Now the question is just how much testing is required in order to be |
| certain that |
| you actually fixed the bug, as opposed to just reducing the probability |
| of it occurring on the one hand, having fixed only one of several |
related bugs on the other hand, or having made some ineffectual unrelated
change on yet a third hand.
| In short, what is the answer to the eternal question posed by |
| \cref{fig:cpu:Passed-the-stress-test}? |
| |
| Unfortunately, the honest answer is that an infinite amount of testing |
| is required to attain absolute certainty. |
| |
| \QuickQuiz{ |
| Suppose that you had a very large number of systems at your |
| disposal. |
| For example, at current cloud prices, you can purchase a |
| huge amount of CPU time at low cost. |
| Why not use this approach to get close enough to certainty |
| for all practical purposes? |
| }\QuickQuizAnswer{ |
| This approach might well be a valuable addition to your |
| validation arsenal. |
| But it does have limitations that rule out ``for all practical |
| purposes'': |
| \begin{enumerate} |
| \item Some bugs have extremely low probabilities of occurrence, |
| but nevertheless need to be fixed. |
| For example, suppose that the Linux kernel's RCU |
| implementation had a bug that is triggered only once |
| per million years of machine time on average. |
| A million years of CPU time is hugely expensive even on |
| the cheapest cloud platforms, but we could expect |
| this bug to result in more than 50 failures per day |
| on the more than 20~billion Linux instances in the |
| world as of 2017. |
| \item The bug might well have zero probability of occurrence |
| on your particular cloud-computing test setup, which |
| means that you won't see it no matter how much machine |
| time you burn testing it. |
| For but one example, there have been (and likely |
| still are) Linux-kernel RCU bugs that appear only |
| in preemptible kernels, others that appear only if |
| callbacks are offloaded, still others that appear only |
| in the presence of closely spaced CPU-hotplug operations, |
| and even more that can be provoked only on systems with |
| weak memory ordering. |
| \end{enumerate} |
| Of course, if your code is small enough, formal validation |
| may be helpful, as discussed in |
| \cref{chp:Formal Verification}. |
| But beware: |
| Formal validation of your code will not find errors in your |
| assumptions, misunderstanding of the requirements, |
| misunderstanding of the software or hardware primitives you use, |
| or errors that you did not think to construct a proof for. |
| }\QuickQuizEnd |
| |
| But suppose that we are willing to give up absolute certainty in favor |
| of high probability. |
| Then we can bring powerful statistical tools to bear on this problem. |
| However, this section will focus on simple statistical tools. |
| These tools are extremely helpful, but please note that reading this |
| section is not a substitute for statistics classes.\footnote{ |
| Which I most highly recommend. |
| The few statistics courses I have taken have provided value |
| far beyond that of the time I spent on them.} |
| |
| For our start with simple statistical tools, we need to decide whether |
| we are doing discrete or continuous testing. |
| Discrete testing features well-defined individual test runs. |
| For example, a boot-up test of a Linux kernel patch is an example |
| of a discrete test: |
| The kernel either comes up or it does not. |
| Although you might spend an hour boot-testing your kernel, the number of |
| times you attempted to boot the kernel and the number of times the |
| boot-up succeeded would often be of more interest than the length |
| of time you spent testing. |
| Functional tests tend to be discrete. |
| |
| On the other hand, if my patch involved RCU, I would probably run |
| \co{rcutorture}, which is a kernel module that, strangely enough, tests RCU\@. |
| Unlike booting the kernel, where the appearance of a login prompt |
| signals the successful end of a discrete test, \co{rcutorture} will happily |
| continue torturing RCU until either the kernel crashes or until you |
| tell it to stop. |
| The duration of the \co{rcutorture} test is usually of more |
| interest than the number of times you started and stopped it. |
| Therefore, \co{rcutorture} is an example of a continuous test, a category |
| that includes many stress tests. |
| |
| Statistics for discrete tests are simpler and more familiar than those |
| for continuous tests, and furthermore the statistics for discrete tests |
| can often be pressed into service for continuous tests, though with some |
| loss of accuracy. |
| We therefore start with discrete tests. |
| |
| \subsection{Statistics for Discrete Testing} |
| \label{sec:debugging:Statistics for Discrete Testing} |
| |
| Suppose a bug has a 10\pct\ chance of occurring in a given run and that |
| we do five runs. |
| How do we compute the probability of at least one run failing? |
| Here is one way: |
| |
| \begin{enumerate} |
| \item Compute the probability of a given run succeeding, which is 90\pct. |
| \item Compute the probability of all five runs succeeding, which |
| is 0.9 raised to the fifth power, or about 59\pct. |
| \item Because either all five runs succeed, or at least one fails, |
| subtract the 59\pct\ expected success rate from 100\pct, yielding |
| a 41\pct\ expected failure rate. |
| \end{enumerate} |
| |
| For those preferring formulas, call the probability of a single failure $f$. |
| The probability of a single success is then $1-f$ and the probability |
| that all of $n$ tests will succeed is $S_n$: |
| |
| \begin{equation} |
| S_n = \left(1-f\right)^n |
| \end{equation} |
| |
| The probability of failure is $1-S_n$, or: |
| |
| \begin{equation} |
| F_n = 1-\left(1-f\right)^n |
| \label{eq:debugging:Binomial Failure Rate} |
| \end{equation} |
| |
| \QuickQuiz{ |
| Say what??? |
| When I plug the earlier five-test 10\pct-failure-rate example into |
| the formula, I get 59,050\pct\ and that just doesn't make sense!!! |
| }\QuickQuizAnswer{ |
| You are right, that makes no sense at all. |
| |
| Remember that a probability is a number between zero and one, |
| so that you need to divide a percentage by 100 to get a |
| probability. |
	So 10\pct\ corresponds to a probability of 0.1, which yields a
	failure probability of 0.4095, which rounds to 41\pct\ and
	quite sensibly matches the earlier result.
| }\QuickQuizEnd |
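
If you would rather let the computer do the arithmetic, the following
minimal C sketch (mine, and not part of this book's code samples)
evaluates \cref{eq:debugging:Binomial Failure Rate} for the five-run
example:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

int main(void)
{
  double f = 0.1; /* Per-run failure probability. */
  int n = 5;      /* Number of runs. */

  /* F_n = 1 - (1 - f)^n. */
  printf("F_n = %.4f\n", 1.0 - pow(1.0 - f, n));
  return 0;       /* Prints "F_n = 0.4095". */
}
\end{verbatim}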
| |
| So suppose that a given test has been failing 10\pct\ of the time. |
| How many times do you have to run the test to be 99\pct\ sure that |
| your alleged fix actually helped? |
| |
| Another way to ask this question is ``How many times would we need |
| to run the test to cause the probability of failure to rise above 99\pct?'' |
| After all, if we were to run the test enough times that the probability |
| of seeing at least one failure becomes 99\pct, and none of these test |
| runs fail, |
| there is only 1\pct\ probability of this ``success'' being due to dumb luck. |
| And if we plug $f=0.1$ into |
| \cref{eq:debugging:Binomial Failure Rate} and vary $n$, |
| we find that 43 runs gives us a 98.92\pct\ chance of at least one test failing |
| given the original 10\pct\ per-test failure rate, |
| while 44 runs gives us a 99.03\pct\ chance of at least one test failing. |
| So if we run the test on our fix 44 times and see no failures, there |
| is a 99\pct\ probability that our fix really did help. |
| |
| But repeatedly plugging numbers into |
| \cref{eq:debugging:Binomial Failure Rate} |
| can get tedious, so let's solve for $n$: |
| |
| \begin{align} |
| F_n & = 1-\left(1-f\right)^n \\ |
| 1 - F_n & = \left(1-f\right)^n \\ |
| \log \left(1 - F_n\right) & = n \; \log \left(1 - f\right) |
| \end{align} |
| |
| Finally the number of tests required is given by: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - F_n\right)}{\log\left(1 - f\right)} |
| \label{eq:debugging:Binomial Number of Tests Required} |
| \end{equation} |
| |
| Plugging $f=0.1$ and $F_n=0.99$ into |
| \cref{eq:debugging:Binomial Number of Tests Required} |
| gives 43.7, meaning that we need 44 consecutive successful test |
| runs to be 99\pct\ certain that our fix was a real improvement. |
| This matches the number obtained by the previous method, which |
| is reassuring. |
| |
| \QuickQuiz{ |
| In \cref{eq:debugging:Binomial Number of Tests Required}, |
| are the logarithms base-10, base-2, or base-$\euler$? |
| }\QuickQuizAnswer{ |
| It does not matter. |
| You will get the same answer no matter what base of logarithms |
| you use because the result is a pure ratio of logarithms. |
| The only constraint is that you use the same base for both |
| the numerator and the denominator. |
| }\QuickQuizEnd |
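
Those who prefer code to repeated hand calculation might instead use
something like the following sketch of
\cref{eq:debugging:Binomial Number of Tests Required},
which is again not part of this book's code samples:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Runs needed for confidence Fn given failure rate f. */
static double runs_needed(double f, double Fn)
{
  return log(1.0 - Fn) / log(1.0 - f);
}

int main(void)
{
  double n = runs_needed(0.1, 0.99);

  /* Prints "43.7 -> 44 runs". */
  printf("%.1f -> %.0f runs\n", n, ceil(n));
  return 0;
}
\end{verbatim}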
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/debugging/BinomialNRuns}} |
| \caption{Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \label{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \end{figure} |
| |
| \Cref{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| shows a plot of this function. |
| Not surprisingly, the less frequently each test run fails, the more |
| test runs are required to be 99\pct\ confident that the bug has been |
| at least partially fixed. |
| If the bug caused the test to fail only 1\pct\ of the time, then a |
| mind-boggling 458 test runs are required. |
| As the failure probability decreases, the number of test runs required |
| increases, going to infinity as the failure probability goes to zero. |
| |
| The moral of this story is that when you have found a rarely occurring |
| bug, your testing job will be much easier if you can come up with |
| a carefully targeted test (or ``reproducer'') with a much higher failure rate. |
| For example, if your reproducer raised the failure rate from 1\pct\ |
| to 30\pct, then the number of runs required for 99\pct\ confidence |
| would drop from 458 to a more tractable 13. |
| |
| But these thirteen test runs would only give you 99\pct\ confidence that |
| your fix had produced ``some improvement''. |
| Suppose you instead want to have 99\pct\ confidence that your fix reduced |
| the failure rate by an order of magnitude. |
| How many failure-free test runs are required? |
| |
| An order of magnitude improvement from a 30\pct\ failure rate would be |
| a 3\pct\ failure rate. |
| Plugging these numbers into |
| \cref{eq:debugging:Binomial Number of Tests Required} yields: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - 0.99\right)}{\log\left(1 - 0.03\right)} = 151.2 |
| \end{equation} |
| |
| So our order of magnitude improvement requires roughly an order of |
| magnitude more testing. |
| Certainty is impossible, and high probabilities are quite expensive. |
| This is why creating high-quality reproducers is so important: |
| Making tests run more quickly and making failures more |
| probable are essential skills in the development of highly reliable |
| software. |
| These skills will be covered in |
| \cref{sec:debugging:Hunting Heisenbugs}. |
| |
| \subsection{Statistics Abuse for Discrete Testing} |
| \label{sec:debugging:Statistics Abuse for Discrete Testing} |
| |
| But suppose that you have a continuous test that fails about three |
| times every ten hours, and that you fix the bug that you believe was |
| causing the failure. |
| How long do you have to run this test without failure to be 99\pct\ certain |
| that you reduced the probability of failure? |
| |
| Without doing excessive violence to statistics, we could simply |
| redefine a one-hour run to be a discrete test that has a 30\pct\ |
| probability of failure. |
Then the results in the previous section tell us that if the test
| runs for 13 hours without failure, there is a 99\pct\ probability that |
| our fix actually improved the program's reliability. |
| |
| A dogmatic statistician might not approve of this approach, but the sad |
| fact is that the errors introduced by this sort of statistical abuse are |
| usually quite small compared to the errors in your failure-rate estimates. |
| Nevertheless, the next section takes a more rigorous approach. |
| |
| \subsection{Statistics for Continuous Testing} |
\label{sec:debugging:Statistics for Continuous Testing}
| |
| The fundamental formula for failure probabilities is the Poisson |
| distribution: |
| |
| \begin{equation} |
| F_m = \frac{\lambda^m}{m!} \euler^{-\lambda} |
| \label{eq:debugging:Poisson Probability} |
| \end{equation} |
| |
Here $F_m$ is the probability of $m$ failures in the test and
$\lambda$ is the expected number of failures over the duration of the test.
| A rigorous derivation may be found in any advanced probability |
| textbook, for example, Feller's classic ``An Introduction to Probability |
| Theory and Its Applications''~\cite{Feller58}, while a more |
| intuitive derivation may be found in the first edition of |
| this book~\cite[Equations 11.8--11.26]{McKenney2014ParallelProgramming-e1}. |
| |
| Let's rework the example from |
| \cref{sec:debugging:Statistics Abuse for Discrete Testing} |
| using the Poisson distribution. |
| Recall that this example involved an alleged fix for a bug with a 30\pct\ |
| failure rate per hour. |
| If we need to be 99\pct\ certain that the fix actually reduced the failure |
| rate, how long an error-free test run is required? |
| In this case, $m$ is zero, so that |
| \cref{eq:debugging:Poisson Probability} reduces to: |
| |
| \begin{equation} |
| F_0 = \euler^{-\lambda} |
| \end{equation} |
| |
Setting $F_0$ to 0.01 and solving for $\lambda$ results in:
| |
| \begin{equation} |
| \lambda = - \ln 0.01 = 4.6 |
| \end{equation} |
| |
Because we get $0.3$ failures per hour, the number of hours required
is $4.6/0.3 = 15.3$, which is within 20\pct\ of the 13 hours
calculated using the method in
\cref{sec:debugging:Statistics Abuse for Discrete Testing}.
Given that you normally won't know your failure rate to anywhere near
20\pct, the simpler method described in
\cref{sec:debugging:Statistics Abuse for Discrete Testing}
is almost always good and sufficient.
| |
| However, those wanting to learn more about statistics for continuous |
| testing are encouraged to read on. |
| |
| More generally, if we have $n$ failures per unit time, and we want to |
| be $P$\pct\ certain that a fix reduced the failure rate, we can use the |
| following formula: |
| |
| \begin{equation} |
| T = - \frac{1}{n} \ln \frac{100 - P}{100} |
| \label{eq:debugging:Error-Free Test Duration} |
| \end{equation} |
| |
| \QuickQuiz{ |
| Suppose that a bug causes a test failure three times per hour |
| on average. |
| How long must the test run error-free to provide 99.9\pct\ |
| confidence that the fix significantly reduced the probability |
| of failure by at least a little bit? |
| }\QuickQuizAnswer{ |
| We set $n$ to $3$ and $P$ to $99.9$ in |
| \cref{eq:debugging:Error-Free Test Duration}, resulting in: |
| |
| \begin{equation} |
| T = - \frac{1}{3} \ln \frac{100 - 99.9}{100} = 2.3 |
| \end{equation} |
| |
| If the test runs without failure for 2.3 hours, we can be 99.9\pct\ |
| certain that the fix reduced (by at least some small amount) |
| the probability of failure. |
| }\QuickQuizEnd |
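
The following sketch evaluates
\cref{eq:debugging:Error-Free Test Duration}
for both of these examples, with the first result coming to 15.4 hours
because it avoids rounding $\lambda$ to 4.6:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Hours of error-free testing needed, given n
 * failures per hour and desired confidence P. */
static double test_duration(double n, double P)
{
  return -log((100.0 - P) / 100.0) / n;
}

int main(void)
{
  printf("%.1f hours\n", test_duration(0.3, 99.0));
  printf("%.1f hours\n", test_duration(3.0, 99.9));
  return 0; /* Prints 15.4 hours, then 2.3 hours. */
}
\end{verbatim}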
| |
| As before, the less frequently the bug occurs and the greater the |
| required level of confidence, the longer the required error-free test run. |
| |
| Suppose that a given test fails about once every hour, but after a bug |
| fix, a 24-hour test run fails only twice. |
Assuming that the failures stemming from this bug are random occurrences,
| what is the probability of a false negative? |
| In other words, how confident are we that the alleged fix actually had |
| some effect on the bug? |
| This probability may be calculated by summing |
| \cref{eq:debugging:Poisson Probability} as follows: |
| |
| \begin{equation} |
| F_0 + F_1 + \dots + F_{m - 1} + F_m = |
| \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \end{equation} |
| |
| This is the Poisson cumulative distribution function, which can be |
| written more compactly as: |
| |
| \begin{equation} |
| F_{i \le m} = \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \label{eq:debugging:Possion CDF} |
| \end{equation} |
| |
| Here $m$ is the actual number of errors in the long test run with the |
alleged fix applied (in this case, two errors) and $\lambda$ is the
expected number of errors in the long test run prior to applying the
fix (in this case, 24 errors).
Plugging $m=2$ and $\lambda=24$ into this expression gives the probability
of a false negative (that is, of two or fewer failures) as about
$1.2 \times 10^{-8}$.
In other words, we have a high level of confidence
that the fix actually had some relationship to the bug.\footnote{
| Of course, this result in no way excuses you from finding and |
| fixing the bug(s) causing the remaining two failures!} |
| |
| It is often more convenient to work with statistical confidences. |
| When the probability of a false negative is low, the confidence is high, |
| giving this result: |
| |
| \begin{equation} |
| C_{m,\lambda} = 1 - \sum_{i=0}^m \frac{\lambda^i}{i!} \euler^{-\lambda} |
| \label{eq:debugging:Possion Confidence} |
| \end{equation} |
| |
| Continuing our example with $m=2$ and $\lambda=24$, this gives a confidence |
| of about 0.999999988, or equivalently, 99.9999988\pct. |
| This level of confidence should satisfy all but the purest of purists. |
| |
| But what if we are interested not in ``some relationship'' to the bug, |
| but instead in at least an order of magnitude \emph{reduction} in its |
| expected frequency of occurrence? |
Then we divide $\lambda$ by ten, and plug $m=2$ and $\lambda=2.4$ into
\cref{eq:debugging:Possion Confidence},
which gives but a 43.0\pct\ confidence level.
| This illustrates the sad fact that increasing either statistical |
| confidence or degree of improvement, let alone both, can be quite |
| expensive. |
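
Those who prefer C to symbolic math packages can compute the Poisson
CDF by direct summation, as in the following sketch, which reproduces
both the $1.2 \times 10^{-8}$ false-negative probability and the
43.0\pct\ confidence level:

\begin{verbatim}
#include <math.h>
#include <stdio.h>

/* Poisson CDF: probability of m or fewer events,
 * given an expectation of lambda events. */
static double poisson_cdf(int m, double lambda)
{
  double term = exp(-lambda); /* i = 0 term. */
  double sum = term;

  for (int i = 1; i <= m; i++) {
    term *= lambda / i; /* lambda^i e^-lambda / i! */
    sum += term;
  }
  return sum;
}

int main(void)
{
  /* False negative: ~1.2e-8. */
  printf("%.3g\n", poisson_cdf(2, 24.0));
  /* Confidence of 10x improvement: ~0.430. */
  printf("%.4f\n", 1.0 - poisson_cdf(2, 2.4));
  return 0;
}
\end{verbatim}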
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| Doing the summation of all the factorials and exponentials |
| is a real pain. |
| Isn't there an easier way? |
| }\QuickQuizAnswerB{ |
| One approach is to use the open-source symbolic manipulation |
| program named ``maxima''. |
| Once you have installed this program, which is a part of many |
| Linux distributions, you can run it and give the |
| \co{load(distrib);} command followed by any number of |
| \co{bfloat(cdf_poisson(m,lambda));} commands, where the \co{m} is |
	replaced by the value of $m$ (the actual number of failures in the
	actual test of the alleged fix, usually zero) and the \co{lambda}
| is replaced by the value of $\lambda$ (the expected number of |
| failures in that same test prior to applying the alleged fix). |
| You can adjust for the desired improvement, for example, if the |
| run prior to the fix had 24 failures and you are looking for an |
| improvement of at least an order of magnitude, then set $\lambda$ |
| to 2.4 instead of 24. |
| |
| In particular, the \co{bfloat(cdf_poisson(2,24));} command |
| results in \co{1.181617112359357b-8}, which matches the value |
| given by \cref{eq:debugging:Possion CDF}. |
	Similarly, referring to
	\cref{eq:debugging:Possion Confidence},
	the \co{bfloat(1-cdf_poisson(2,2.4));} command results in
	approximately \co{4.3029b-1}, which again matches the 43.0\pct\
	confidence level computed earlier.
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.25} |
| \rowcolors{3}{}{lightgray} |
| \small |
| \centering |
| \begin{tabular}{rrrr} |
| \toprule |
| & \multicolumn{3}{c}{Improvement} \\ |
| \cmidrule(l){2-4} |
| Certainty (\%) |
| & Any |
| & 10x |
| & 100x \\ |
| \cmidrule{1-1} \cmidrule(l){2-4} |
| 90.0 & 2.3 & 23.0 & 230.0 \\ |
| 95.0 & 3.0 & 30.0 & 300.0 \\ |
| 99.0 & 4.6 & 46.1 & 460.5 \\ |
| 99.9 & 6.9 & 69.1 & 690.7 \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Human-Friendly Poisson-Function Display} |
| \label{tab:debugging:Human-Friendly Poisson-Function Display} |
| \end{table} |
| |
	Another approach is to recognize that in the real world,
| it is not all that useful to compute (say) the duration of a test |
| having two or fewer errors that would give a 76.8\pct\ confidence |
| of a 349.2x improvement in reliability. |
| Instead, human beings tend to focus on specific values, for |
| example, a 95\pct\ confidence of a 10x improvement. |
	People also greatly prefer error-free test runs, and so should
	you, because error-free runs reduce your required test durations.
| Therefore, it is quite possible that the values in |
| \cref{tab:debugging:Human-Friendly Poisson-Function Display} |
| will suffice. |
| Simply look up the desired confidence and degree of improvement, |
| and the resulting number will give you the required |
| error-free test duration in terms of the expected time for |
| a single error to appear. |
| So if your pre-fix testing suffered one failure per hour, and the |
| powers that be require a 95\pct\ confidence of a 10x improvement, |
| you need a 30-hour error-free run. |
| |
| Alternatively, you can use the rough-and-ready method described in |
| \cref{sec:debugging:Statistics Abuse for Discrete Testing}. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But wait!!! |
| Given that there has to be \emph{some} number of failures |
| (including the possibility of zero failures), shouldn't |
| \cref{eq:debugging:Possion CDF} |
| approach the value $1$ as $m$ goes to infinity? |
| }\QuickQuizAnswerE{ |
| Indeed it should. |
| And it does. |
| |
| To see this, note that $\euler^{-\lambda}$ does not depend on $i$, |
| which means that it can be pulled out of the summation as follows: |
| |
| \begin{equation} |
| \euler^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^i}{i!} |
| \end{equation} |
| |
| The remaining summation is exactly the Taylor series for |
| $\euler^\lambda$, yielding: |
| |
| \begin{equation} |
| \euler^{-\lambda} \euler^\lambda |
| \end{equation} |
| |
| The two exponentials are reciprocals, and therefore cancel, |
| resulting in exactly $1$, as required. |
| }\QuickQuizEndE |
| } |
| |
| The Poisson distribution is a powerful tool for analyzing test results, |
| but the fact is that in this last example there were still two remaining |
| test failures in a 24-hour test run. |
| Such a low failure rate results in very long test runs. |
| The next section discusses counter-intuitive ways of improving this situation. |
| |
| \subsection{Hunting Heisenbugs} |
| \label{sec:debugging:Hunting Heisenbugs} |
| |
| This line of thought also helps explain heisenbugs: |
| Adding tracing and assertions can easily reduce the probability |
| of a bug appearing, which |
is why extremely lightweight tracing and assertion mechanisms are
| so critically important. |
| |
The term ``\IXB{heisenbug}'' was inspired by the \pplsur{Werner}{Heisenberg}
\IX{Uncertainty Principle} from quantum physics, which states that
it is impossible to
exactly quantify a given particle's position and momentum at any given
point in time~\cite{WeinerHeisenberg1927Uncertain}.
Any attempt to more accurately measure that particle's position will
result in increased uncertainty of its momentum and vice versa.
Similarly, attempts to track down
a heisenbug cause its symptoms to radically change or even disappear
completely.\footnote{
| The term ``heisenbug'' is a misnomer, as most heisenbugs are |
| fully explained by the \emph{observer effect} from classical |
| physics. |
| Nevertheless, the name has stuck.} |
| Of course, adding debugging overhead can and sometimes does make bugs |
| more probable. |
| But developers are more likely to remember the frustration of a |
| disappearing heisenbug than the joy inspired by the bug becoming |
| more easily reproduced! |
| |
| If the field of physics inspired the name of this problem, it is only |
| fair that the field of physics should inspire the solution. |
| Fortunately, particle physics is up to the task: |
| Why not create an \IXBhy{anti-}{heisenbug} to annihilate the heisenbug? |
| Or, perhaps more accurately, to annihilate the heisen-ness of |
| the heisenbug? |
| Although producing an anti-heisenbug for a given heisenbug is more an |
| art than a science, the following sections describe a number of ways to |
| do just that: |
| |
| \begin{enumerate} |
| \item Add delay to race-prone regions (\cref{sec:debugging:Add Delay}). |
| \item Increase workload intensity |
| (\cref{sec:debugging:Increase Workload Intensity}). |
| \item Isolate suspicious subsystems |
| (\cref{sec:debugging:Isolate Suspicious Subsystems}). |
| \item Make rare events less rare |
| (\cref{sec:debugging:Make Rare Events Less Rare}). |
| \item Count near misses (\cref{sec:debugging:Count Near Misses}). |
| \item Proactive hunting techniques |
| (\cref{sec:debugging:Proactive Hunting Techniques}). |
| \end{enumerate} |
| |
| These are followed by discussion in |
| \cref{sec:debugging:Heisenbug Discussion}. |
| |
| \subsubsection{Add Delay} |
| \label{sec:debugging:Add Delay} |
| |
| Consider the count-lossy code in |
| \cref{sec:count:Why Isn't Concurrent Counting Trivial?}. |
| Adding \co{printf()} statements will likely greatly reduce or even |
| eliminate the lost counts. |
| However, converting the load-add-store sequence to a load-add-delay-store |
| sequence will greatly increase the incidence of lost counts (try it!). |
| Once you spot a bug involving a \IX{race condition}, it is frequently possible |
| to create an anti-heisenbug by adding delay in this manner. |
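
For those who would like to try it, the following sketch shows one way
of setting up such a load-add-delay-store race, with the thread count,
repetition count, and delay all being arbitrary choices:

\begin{verbatim}
#include <poll.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define NREPS 1000

unsigned long counter; /* Deliberately unprotected. */

static void *incrementer(void *arg)
{
  for (int i = 0; i < NREPS; i++) {
    unsigned long tmp = counter; /* Load. */

    poll(NULL, 0, 1);  /* Delay widens race window. */
    counter = tmp + 1; /* Store. */
  }
  return NULL;
}

int main(void)
{
  pthread_t tid[NTHREADS];

  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, incrementer, NULL);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);

  /* Far fewer than 2,000 counts will be reported. */
  printf("counter = %lu\n", counter);
  return 0;
}
\end{verbatim}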
| |
Of course, this raises the question of how to find the race condition in
| the first place. |
| Although very lucky developers might accidentally create delay-based |
| anti-heisenbugs when adding debug code, this is in general a dark art. |
| Nevertheless, there are a number of things you can do to find your race |
| conditions. |
| |
| One approach is to recognize that race conditions often end up corrupting |
| some of the data involved in the race. |
| It is therefore good practice to double-check the synchronization of |
| any corrupted data. |
| Even if you cannot immediately recognize the race condition, adding |
| delay before and after accesses to the corrupted data might change |
| the failure rate. |
| By adding and removing the delays in an organized fashion (e.g., binary |
| search), you might learn more about the workings of the race condition. |
| |
| \QuickQuiz{ |
	How is this approach supposed to help if the bug corrupted some
	unrelated pointer, which then caused the corruption that you
	actually observed???
| }\QuickQuizAnswer{ |
| Indeed, that can happen. |
| Many CPUs have hardware-debugging facilities that can help you |
| locate that unrelated pointer. |
| Furthermore, if you have a core dump, you can search the core |
| dump for pointers referencing the corrupted region of memory. |
| You can also look at the data layout of the corruption, and |
| check pointers whose type matches that layout. |
| |
| You can also step back and test the modules making up your |
| program more intensively, which will likely confine the corruption |
| to the module responsible for it. |
| If this makes the corruption vanish, consider adding additional |
| argument checking to the functions exported from each module. |
| |
	Nevertheless, this is a hard problem, which is why I used the
	words ``dark art''.
| }\QuickQuizEnd |
| |
| Another important approach is to |
| vary the software and hardware configuration and look for statistically |
| significant differences in failure rate. |
| For example, back in the 1990s, it was common practice to test on systems |
| having CPUs running at different clock rates, which tended to make some |
| types of race conditions more probable. |
| One way of getting a similar effect today is to test on multi-socket |
| systems, thus incurring the large delays described in |
| \cref{sec:cpu:Overheads}. |
| |
| However you choose to add delays, you can then look more intensively at |
| the code implicated by those delays that make the greatest difference |
| in failure rate. |
| It might be helpful to test that code in isolation, for example. |
| |
| One important aspect of software configuration is the history of |
| changes, which is why \co{git bisect} is so useful. |
| Bisection of the change history can provide very valuable clues as |
| to the nature of the heisenbug, in this case presumably by locating |
| a commit that shows a change in the software's response to the addition |
| or removal of a given delay. |
| |
| \QuickQuiz{ |
| But I did the bisection, and ended up with a huge commit. |
| What do I do now? |
| }\QuickQuizAnswer{ |
| A huge commit? |
| Shame on you! |
| This is but one reason why you are supposed to keep the commits small. |
| |
| And that is your answer: |
| Break up the commit into bite-sized pieces and bisect the pieces. |
| In my experience, the act of breaking up the commit is often |
| sufficient to make the bug painfully obvious. |
| }\QuickQuizEnd |
| |
| Once you locate the suspicious section of code, you can then introduce |
| delays to attempt to increase the probability of failure. |
| As we have seen, increasing the probability of failure makes it much |
| easier to gain high confidence in the corresponding fix. |
| |
| However, it is sometimes quite difficult to track down the problem using |
| normal debugging techniques. |
| The following sections present some other alternatives. |
| |
| \subsubsection{Increase Workload Intensity} |
| \label{sec:debugging:Increase Workload Intensity} |
| |
| It is often the case that a given test suite places relatively |
| low stress on a given subsystem, so that a small change in timing |
| can cause a heisenbug to disappear. |
| One way to create an anti-heisenbug for this case is to increase |
| the workload intensity, which has a good chance of increasing the |
| bug's probability. |
| If the probability is increased sufficiently, it may be possible to |
| add lightweight diagnostics such as tracing without causing the |
| bug to vanish. |
| |
| How can you increase the workload intensity? |
| This depends on the program, but here are some things to try: |
| |
| \begin{enumerate} |
| \item Add more CPUs. |
| \item If the program uses networking, add more network adapters |
| and more or faster remote systems. |
| \item If the program is doing heavy I/O when the problem occurs, |
| either (1) add more storage devices, (2) use faster storage |
| devices, for example, substitute SSDs for disks, |
| or (3) use a RAM-based filesystem to substitute main |
| memory for mass storage. |
| \item Change the size of the problem, for example, if doing a parallel |
| matrix multiply, change the size of the matrix. |
| Larger problems may introduce more complexity, but smaller |
| problems often increase the level of contention. |
| If you aren't sure whether you should go large or go small, |
| just try both. |
| \end{enumerate} |
| |
| However, it is often the case that the bug is in a specific subsystem, |
| and the structure of the program limits the amount of stress that can |
| be applied to that subsystem. |
| The next section addresses this situation. |
| |
| \subsubsection{Isolate Suspicious Subsystems} |
| \label{sec:debugging:Isolate Suspicious Subsystems} |
| |
| If the program is structured such that it is difficult or impossible |
| to apply much stress to a subsystem that is under suspicion, |
| a useful anti-heisenbug is a stress test that tests that subsystem in |
| isolation. |
| The Linux kernel's \co{rcutorture} module takes exactly this approach with RCU\@: |
| Applying more stress to RCU than is feasible in a production environment |
| increases the probability that RCU bugs will be found during testing |
| rather than in production.\footnote{ |
| Though sadly not increased to probability one.} |
| |
| In fact, when creating a parallel program, it is wise to stress-test |
| the components separately. |
| Creating such component-level stress tests can seem like a waste of time, |
| but a little bit of component-level testing can save a huge amount |
| of system-level debugging. |
| |
| \subsubsection{Make Rare Events Less Rare} |
| \label{sec:debugging:Make Rare Events Less Rare} |
| |
| Heisenbugs are sometimes due to rare events, such as |
| memory-allocation failure, conditional-lock-acquisition failure, |
| CPU-hotplug operations, timeouts, packet losses, large-scale changes |
| in state, and so on. |
| The corresponding anti-heisenbug is thus simply to make these rare events |
| happen much more frequently. |
| For example, the TREE03 rcutorture scenario waits only 200~milliseconds |
| between CPU-hotplug operations. |
| For another example, most of the rcutorture scenarios emulate RCU |
| callback flooding every minute. |
| For a final example, a memory-management stress test for x86 CPUs might |
| do well to frequently transition an aligned 2\,MB block of memory back |
| and forth between 2\,MB and 4\,KB pages. |
| |
| Another way to construct an anti-heisenbug for this class of heisenbug |
| is to introduce spurious failures. |
| For example, instead of invoking \co{malloc()} directly, invoke |
| a wrapper function that uses a random number to decide whether |
| to return \co{NULL} unconditionally on the one hand, or to actually |
| invoke \co{malloc()} and return the resulting pointer on the other. |
| Inducing spurious failures is an excellent way to bake robustness into |
| sequential programs as well as parallel programs. |
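
The following sketch shows one way of constructing such a wrapper,
with the \co{inject_percent} knob being a made-up name for the
probability of spurious failure:

\begin{verbatim}
#include <stdlib.h>

int inject_percent; /* 0 = off, 10 = fail 10% of calls. */

/* Fault-injecting malloc() wrapper. */
void *malloc_shouldfail(size_t size)
{
  /* Note: A multithreaded caller would instead want
   * a per-thread pseudo-random-number generator. */
  if (inject_percent && rand() % 100 < inject_percent)
    return NULL; /* Spurious failure. */
  return malloc(size);
}
\end{verbatim}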
| |
| \QuickQuiz{ |
| Why don't conditional-locking primitives provide this |
| spurious-failure functionality? |
| }\QuickQuizAnswer{ |
| There are locking algorithms that depend on conditional-locking |
| primitives telling them the truth. |
| For example, if conditional-lock failure signals that |
| some other thread is already working on a given job, |
| spurious failure might cause that job to never get done, |
| possibly resulting in a hang. |
| |
| However, this is a design choice. |
| You could create a conditional-locking primitive that was subject |
| to spurious failure, and some have. |
| Therefore, when moving to a new software environment, check |
| your assumptions! |
| }\QuickQuizEnd |
| |
| \subsubsection{Count Near Misses} |
| \label{sec:debugging:Count Near Misses} |
| |
| Bugs are often all-or-nothing things, so that a bug either happens |
| or not, with nothing in between. |
| However, it is sometimes possible to define a \emph{near miss} where |
| the bug does not result in a failure, but has likely manifested. |
| For example, suppose your code is making a robot walk. |
| The robot's falling down constitutes a bug in your program, but |
| stumbling and recovering might constitute a near miss. |
| If the robot falls over only once per hour, but stumbles every few |
| minutes, you might be able to speed up your debugging progress by |
| counting the number of stumbles in addition to the number of falls. |
| |
| In concurrent programs, timestamping can sometimes be used to detect |
| near misses. |
| For example, locking primitives incur significant delays, so if there is |
| a too-short delay between a pair of operations that are supposed |
| to be protected by different acquisitions of the same lock, this too-short |
| delay might be counted as a near miss.\footnote{ |
| Of course, in this case, you might be better off using |
| whatever \co{lock_held()} primitive is available |
| in your environment. |
| If there isn't a \co{lock_held()} primitive, create one!} |
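
The following sketch shows one way that such timestamp-based near-miss
counting might look, keeping in mind that \co{MIN_HANDOFF_NS} is a
made-up placeholder that would need to be calibrated to your hardware
and locking primitives:

\begin{verbatim}
#include <pthread.h>
#include <stdint.h>
#include <time.h>

/* Calibrate for your system! */
#define MIN_HANDOFF_NS 100

static uint64_t last_op_ns;
static pthread_t last_op_thread;
static unsigned long near_misses;

static uint64_t now_ns(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ULL +
         ts.tv_nsec;
}

/* Call once per operation that is supposed to be
 * protected by its own acquisition of the lock.
 * The accesses are unsynchronized, so the count
 * is a heuristic, not an exact statistic. */
void op_check_near_miss(void)
{
  uint64_t t = now_ns();

  if (last_op_ns &&
      !pthread_equal(last_op_thread, pthread_self()) &&
      t - last_op_ns < MIN_HANDOFF_NS)
    near_misses++; /* Suspiciously fast handoff. */
  last_op_ns = t;
  last_op_thread = pthread_self();
}
\end{verbatim}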
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{debugging/RCUnearMiss}} |
| \caption{RCU Errors and Near Misses} |
| \label{fig:debugging:RCU Errors and Near Misses} |
| \end{figure} |
| |
| For example, a low-probability bug in RCU priority boosting occurred |
| roughly once every hundred hours of focused \co{rcutorture} testing. |
| Because it would take almost 500 hours of failure-free testing to be |
| 99\pct\ certain that the bug's probability had been significantly reduced, |
| the \co{git bisect} process |
| to find the failure would be painfully slow---or would require an extremely |
| large test farm. |
| Fortunately, the RCU operation being tested included not only a wait for |
| an RCU grace period, but also a previous wait for the grace period to start |
| and a subsequent wait for an RCU callback to be |
| invoked after completion of the RCU grace period. |
This distinction between an \co{rcutorture} error and a near miss is
| shown in |
| \cref{fig:debugging:RCU Errors and Near Misses}. |
| To qualify as a full-fledged error, an RCU read-side critical section |
| must extend from the \co{call_rcu()} that initiated a grace period, |
| through the remainder of the previous grace period, through the |
| entirety of the grace period initiated by the \co{call_rcu()} |
| (denoted by the region between the jagged lines), and |
| through the delay from the end of that grace period to the callback |
| invocation, as indicated by the ``Error'' arrow. |
| However, the formal definition of RCU prohibits RCU read-side critical |
| sections from extending across a single grace period, as indicated by |
| the ``Near Miss'' arrow. |
This suggests using near misses as the error condition; however, this
| can be problematic because different CPUs can have different opinions |
| as to exactly where a given |
| grace period starts and ends, as indicated by the jagged lines.\footnote{ |
| In real life, these lines can be much more jagged because idle |
| CPUs can be completely unaware of a great many recent grace |
| periods.} |
| Using the near misses as the error condition could therefore result |
| in false positives, which need to be avoided in the automated |
| \co{rcutorture} testing. |
| |
| By sheer dumb luck, \co{rcutorture} happens to include some statistics that |
| are sensitive to the near-miss version of the grace period. |
| As noted above, these statistics are subject to false positives due to |
| their unsynchronized access to RCU's state variables, |
| but these false positives turn out to be extremely rare on strongly |
| ordered systems such as the IBM mainframe and x86, occurring less than |
| once per thousand hours of testing. |
| |
| These near misses occurred roughly once per hour, about two orders of |
| magnitude more frequently than the actual errors. |
| Use of these near misses allowed the bug's root cause to be identified |
| in less than a week and a high degree of confidence in the fix to be |
| built in less than a day. |
| In contrast, excluding the near misses in favor of the real errors would |
| have required months of debug and validation time. |
| |
To sum up near-miss counting, the general approach is to replace counting
of infrequent failures with counting of more-frequent near misses that
are believed to be correlated with those failures.
These near misses can be considered an anti-heisenbug for the real
failure's heisenbug because the near misses, being more frequent, are
likely to be more robust in the face of changes to your code, for
example, the changes you make to add debugging code.
| |
| \subsubsection{Proactive Hunting Techniques} |
| \label{sec:debugging:Proactive Hunting Techniques} |
| |
Most of the anti-heisenbug techniques discussed in the preceding sections
are backward-looking.
| After all, prior experience is the best guide to knowing which regions |
| of code are prone to race conditions, what aspects of the workload |
| can most profitably be increased in intensity, which subsystems are |
| deserving of suspicion, which rare events are important, and what near |
| misses are good proxies for actual failures. |
| |
| What can you do to get ahead of the game? |
| |
| Getting ahead of the anti-heisenbug game is even more of an art than |
| constructing an anti-heisenbug for a specific situation, but here |
| are some techniques that can be helpful: |
| |
| \begin{enumerate} |
| \item Add delay to sections of concurrent code that required the |
| most analysis, that needed formal verification, or that |
| deviated the most from common concurrency practice. |
| \item Analyze trends in workload intensity, and use the results to |
| guide increasing the intensity of your testing. |
| \item Be most suspicious of new code, especially if it is your new |
| code. |
| \item Instrument your workload, looking for complex operations that |
| occur frequently enough to be an uptime problem but rarely |
| enough to avoid much exposure in your current testing. |
| \item Look for near misses in failure-recovery code and on slowpaths. |
| \end{enumerate} |
| |
| Finally, and most importantly, pay special attention to code that people |
| are the most proud of. |
| After all, people are most likely to be proud of code that is unusual, |
| which means that its bugs (and the bugs in the code that it uses) are |
| likely to escape your usual testing efforts. |
| |
| \subsubsection{Heisenbug Discussion} |
| \label{sec:debugging:Heisenbug Discussion} |
| |
| The alert reader might have noticed that this section was fuzzy and |
| qualitative, in stark contrast to the precise mathematics of |
| \cref{sec:debugging:Statistics for Discrete Testing,% |
| sec:debugging:Statistics Abuse for Discrete Testing,% |
sec:debugging:Statistics for Continuous Testing}.
| If you love precision and mathematics, you may be disappointed to |
| learn that the situations to which this section applies are far |
| more common than those to which the preceding sections apply. |
| |
| In fact, the common case is that although you might have reason to believe |
| that your code has bugs, you have no idea what those bugs are, what |
| causes them, how likely they are to appear, or what conditions affect |
| their probability of appearance. |
| In this all-too-common case, statistics cannot help you.\footnote{ |
	Although if you know what your program is supposed to do and
	if your program is small enough (both of which are less likely
	than you might think), then the formal-verification tools described in
| \cref{chp:Formal Verification} |
| can be helpful.} |
| That is to say, statistics cannot help you \emph{directly}. |
| But statistics can be of great indirect help---\emph{if} you have the |
| humility required to admit that you make mistakes, that you can reduce the |
| probability of these mistakes (for example, by getting enough sleep), and |
| that the number and type of mistakes you made in the past is indicative of |
| the number and type of mistakes that you are likely to make in the future. |
| For example, I have a deplorable tendency to forget to write a small |
| but critical portion of the initialization code, and frequently get most |
| or even all of a parallel program correct---except for a stupid |
| omission in initialization. |
| Once I was willing to admit to myself that I am prone to this type of |
| mistake, it was easier (but not easy!\@) to force myself to double-check |
| my initialization code. |
| Doing this allowed me to find numerous bugs ahead of time. |
| |
| When your quick bug hunt morphs into a long-term quest, it is important |
| to log everything you have tried and what happened. |
| In the common case where the software is changing during the course of |
| your quest, make sure to record the exact version of the software to |
| which each log entry applies. |
| From time to time, reread the entire log in order to make connections |
| between clues encountered at different times. |
Such rereading is especially important upon encountering a surprising
test result.
For example, I once reread my log upon realizing that what I
thought was a failure of the hypervisor to schedule a vCPU was instead
an interrupt storm preventing that vCPU from making forward progress
on the interrupted code.
| If the code you are debugging is new to you, this log is also an |
| excellent place to document the relationships between code and data |
| structures. |
| Keeping a log when you are furiously chasing a difficult bug might seem |
| like needless paperwork, but it has on many occasions saved me from |
| debugging around and around in circles, which can waste far more time |
| than keeping a log ever could. |
| |
| Using Taleb's nomenclature~\cite{NassimTaleb2007BlackSwan}, |
| a white swan is a bug that we can reproduce. |
| We can run a large number of tests, use ordinary statistics to |
| estimate the bug's probability, and use ordinary statistics again |
| to estimate our confidence in a proposed fix. |
| An unsuspected bug is a black swan. |
| We know nothing about it, we have no tests that have yet caused it |
| to happen, and statistics is of no help. |
| Studying our own behavior, especially the number and types of mistakes |
| we make, can turn black swans into grey swans. |
| We might not know exactly what the bugs are, but we have some idea of |
| their number and maybe also of their type. |
| Ordinary statistics is still of no help (at least not until we are |
| able to reproduce one of the bugs), but robust\footnote{ |
| That is to say brutal.} |
| testing methods can be of great help. |
| The goal, therefore, is to use experience and good validation practices |
| to turn the black swans grey, focused testing and analysis to turn the |
| grey swans white, and ordinary methods to fix the white swans. |
| |
| That said, thus far, we have focused solely on bugs in the parallel program's |
| functionality. |
| However, performance is a first-class requirement for a parallel program. |
| Otherwise, why not write a sequential program? |
| To repurpose Kipling, our goal when writing parallel code is to fill |
the unforgiving second with sixty minutes' worth of distance run.
| The next section therefore discusses a number of performance bugs that |
| would be happy to thwart this Kiplingesque goal. |
| |
| \section{Performance Estimation} |
| \label{sec:debugging:Performance Estimation} |
| % |
| \epigraph{There are lies, damn lies, statistics, and benchmarks.} |
| {Unknown} |
| |
Parallel programs usually have performance and scalability requirements.
After all, if performance is not an issue, why not use a sequential
program?
| Ultimate performance and linear scalability might not be necessary, but |
| there is little use for a parallel program that runs slower than its |
| optimal sequential counterpart. |
| And there really are cases where every microsecond matters and every |
| nanosecond is needed. |
| Therefore, for parallel programs, insufficient performance is just as |
| much a bug as is incorrectness. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| That is ridiculous!!! |
| After all, isn't getting a late but correct answer always |
| better than getting an incorrect answer??? |
| }\QuickQuizAnswerB{ |
| This question fails to consider the option of choosing not to |
| compute the answer at all, and in doing so, also fails to consider |
| the costs of computing the answer. |
| For example, consider short-term weather forecasting, for which |
| accurate models exist, but which require large (and expensive) |
| clustered supercomputers, at least if you want to actually run |
| the model faster than the weather. |
| |
| And in this case, any performance bug that prevents the model from |
| running faster than the actual weather prevents any forecasting. |
| Given that the whole purpose of purchasing the large clustered |
| supercomputers was to forecast weather, if you cannot run the |
| model faster than the weather, you would be better off not running |
| the model at all. |
| |
| More severe examples may be found in the area of safety-critical |
| real-time computing. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But if you are going to put in all the hard work of parallelizing |
| an application, why not do it right? |
| Why settle for anything less than optimal performance and |
| linear scalability? |
| }\QuickQuizAnswerE{ |
| Although I do heartily salute your spirit and aspirations, |
| you are forgetting that there may be high costs due to delays |
| in the program's completion. |
| For an extreme example, suppose that a 30\pct\ performance shortfall |
| from a single-threaded application is causing one person to die |
| each day. |
| Suppose further that in a day you could hack together a |
| quick and dirty |
| parallel program that ran 50\pct\ faster on an eight-CPU system. |
| This is of course horrible scalability, given that the seven |
| additional CPUs provide only half a CPU's worth of additional |
| performance. |
| |
| But it might well be that constructing an optimal parallel |
| program would require four months of painstaking design, coding, |
| debugging, and tuning. |
| I would guess that more than 100 people would prefer the |
| quick and dirty version. |
| |
| Once that is done, \emph{maybe} it is worth creating an optimal |
| version. |
| But that depends on what else could be accomplished with this |
| same four months of effort. |
| For example, us carbon-based lifeforms just might feel that |
| saving more lives is more important than saving 6.5 CPUs. |
| |
| Fortunately, most developers of low-level concurrent code play |
| for much lower stakes, allowing much greater freedom to choose. |
| In addition, small improvements in low-level code can benefit |
| huge numbers of users, in many cases justifying considerable |
| investments in performance, scalability, and real-time response. |
| }\QuickQuizEndE |
| } |
| |
Validating a parallel program must therefore include validating its
| performance. |
| But validating performance means having a workload to run and performance |
| criteria with which to evaluate the program at hand. |
| These needs are often met by \emph{performance benchmarks}, which |
| are discussed in the next section. |
| |
| \subsection{Benchmarking} |
| \label{sec:debugging:Benchmarking} |
| |
| Frequent abuse aside, benchmarks are both useful and heavily used, |
| so it is not helpful to be too dismissive of them. |
| Benchmarks span the range from ad hoc test jigs to international |
| standards, but regardless of their level of formality, benchmarks |
| serve four major purposes: |
| |
| \begin{enumerate} |
| \item Providing a fair framework for comparing competing implementations. |
| \item Focusing competitive energy on improving implementations in ways |
| that matter to users. |
| \item Serving as example uses of the implementations being benchmarked. |
| \item Serving as a marketing tool to highlight your software |
| against your competitors' offerings. |
| \end{enumerate} |
| |
| Of course, the only completely fair framework is the intended |
| application itself. |
| So why would anyone who cared about fairness in benchmarking |
| bother creating imperfect benchmarks rather than simply |
| using the application itself as the benchmark? |
| |
| Running the actual application is in fact the best approach where it is practical. |
| Unfortunately, it is often impractical for the following reasons: |
| |
| \begin{enumerate} |
| \item The application might be proprietary, and you |
| might not have the right to run the intended application. |
| \item The application might require more hardware |
| than you have access to. |
| \item The application might use data that you cannot |
| access, for example, due to privacy regulations. |
| \item The application might take longer than is convenient to |
| reproduce a performance or scalability problem.\footnote{ |
| Microbenchmarks can help, but |
| please see \cref{sec:debugging:Microbenchmarking}.} |
| \end{enumerate} |
| |
| Creating a benchmark that approximates |
| the application can help overcome these obstacles. |
| A carefully constructed benchmark can help promote performance, |
| scalability, \IXh{energy}{efficiency}, and much else besides. |
| However, be careful to avoid investing too much into the benchmarking |
| effort. |
| It is after all important to invest at least a little into the |
| application itself~\cite{Gray91}. |
| |
| \subsection{Profiling} |
| \label{sec:debugging:Profiling} |
| |
| In many cases, a fairly small portion of your software is responsible |
| for the majority of the performance and scalability shortfall. |
| However, developers are notoriously unable to identify the actual |
| bottlenecks by inspection. |
For example, in the case of a kernel buffer allocator, all attention focused
on a search of a dense array, which turned out to represent only
a few percent of the allocator's execution time.
| An execution profile collected via a logic analyzer focused attention |
| on the cache misses that were actually responsible for the majority |
| of the problem~\cite{McKenney93}. |
| |
| An old-school but quite effective method of tracking down performance |
| and scalability bugs is to run your program under a debugger, |
| then periodically interrupt it, recording the stacks of all threads |
| at each interruption. |
| The theory here is that if something is slowing down your program, |
| it has to be visible in your threads' executions. |
| |
| That said, there are a number of tools |
| that will usually do a much better job of helping you to focus your |
| attention where it will do the most good. |
| Two popular choices are \co{gprof} and \co{perf}. |
| To use \co{perf} on a single-process program, prefix your command |
| with \co{perf record}, then after the command completes, type |
| \co{perf report}. |
| There is a lot of work on tools for performance debugging of multi-threaded |
| programs, which should make this important job easier. |
| Again, one good starting point is Brendan Gregg's blog.\footnote{ |
| \url{http://www.brendangregg.com/blog/}} |
| |
| \subsection{Differential Profiling} |
| \label{sec:debugging:Differential Profiling} |
| |
| Scalability problems will not necessarily be apparent unless you are running |
| on very large systems. |
| However, it is sometimes possible to detect impending scalability problems |
| even when running on much smaller systems. |
| One technique for doing this is called \emph{differential profiling}. |
| |
| The idea is to run your workload under two different sets of conditions. |
| For example, you might run it on two CPUs, then run it again on four |
| CPUs. |
| You might instead vary the load placed on the system, the number of |
| network adapters, the number of mass-storage devices, and so on. |
| You then collect profiles of the two runs, and mathematically combine |
| corresponding profile measurements. |
| For example, if your main concern is scalability, you might take the |
| ratio of corresponding measurements, and then sort the ratios into |
| descending numerical order. |
| The prime scalability suspects will then be sorted to the top of the |
| list~\cite{McKenney95a,McKenney99b}. |
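
For example, the following sketch combines a pair of hypothetical
profiles, with the function names and sample fractions invented purely
for illustration:

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

struct sample {
  const char *fn;
  double cpus2; /* Profile fraction on two CPUs. */
  double cpus4; /* Profile fraction on four CPUs. */
};

/* Invented numbers, for illustration only. */
static struct sample prof[] = {
  { "spinlock_slowpath", 0.05, 0.20 },
  { "hash_lookup",       0.40, 0.35 },
  { "memcpy",            0.10, 0.11 },
};

static int cmp_ratio_desc(const void *a, const void *b)
{
  const struct sample *sa = a, *sb = b;
  double ra = sa->cpus4 / sa->cpus2;
  double rb = sb->cpus4 / sb->cpus2;

  return (ra < rb) - (ra > rb);
}

int main(void)
{
  int n = sizeof(prof) / sizeof(prof[0]);

  qsort(prof, n, sizeof(prof[0]), cmp_ratio_desc);
  for (int i = 0; i < n; i++)
    printf("%8.2f %s\n",
           prof[i].cpus4 / prof[i].cpus2, prof[i].fn);
  return 0; /* Prime suspect prints first. */
}
\end{verbatim}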
| |
| Some tools such as \co{perf} have built-in differential-profiling |
| support. |
| |
| \subsection{Microbenchmarking} |
| \label{sec:debugging:Microbenchmarking} |
| |
| Microbenchmarking can be useful when deciding which algorithms or |
| data structures are worth incorporating into a larger body of software |
| for deeper evaluation. |
| |
| One common approach to microbenchmarking is to measure the time, |
| run some number of iterations of the code |
| under test, then measure the time again. |
| The difference between the two times divided by the number of iterations |
| gives the measured time required to execute the code under test. |
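
The following sketch illustrates this measure/iterate/measure pattern,
using a trivial stand-in for the code under test and including the
warm-up pass discussed below:

\begin{verbatim}
#include <stdio.h>
#include <time.h>

#define WARMUP 10000     /* Arbitrary choices. */
#define ITERS  100000000

static volatile unsigned long v;

static void code_under_test(void) /* Stand-in. */
{
  v++;
}

static double now(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
  double t0, t1;

  for (long i = 0; i < WARMUP; i++) /* Warm caches. */
    code_under_test();
  t0 = now();
  for (long i = 0; i < ITERS; i++)
    code_under_test();
  t1 = now();
  printf("%.2f ns/iteration\n",
         (t1 - t0) / ITERS * 1e9);
  return 0;
}
\end{verbatim}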
| |
| Unfortunately, this approach to measurement allows any number of errors |
| to creep in, including: |
| |
| \begin{enumerate} |
| \item The measurement will include some of the overhead of |
| the time measurement. |
| This source of error can be reduced to an arbitrarily small |
| value by increasing the number of iterations. |
| \item The first few iterations of the test might incur cache misses |
| or (worse yet) page faults that might inflate the measured |
| value. |
| This source of error can also be reduced by increasing the |
| number of iterations, or it can often be eliminated entirely |
| by running a few warm-up iterations before starting the |
| measurement period. |
| Most systems have ways of detecting whether a given process |
| incurred a page fault, and you should make use of this to |
| reject runs whose performance has been thus impeded. |
| \item Some types of interference, for example, random memory errors, |
| are so rare that they can be dealt with by running a number |
| of sets of iterations of the test. |
| If the level of interference was statistically significant, |
| any performance outliers could be rejected statistically. |
| \item Any iteration of the test might be interfered with by other |
| activity on the system. |
| Sources of interference include other applications, system |
| utilities and daemons, device interrupts, firmware interrupts |
| (including system management interrupts, or SMIs), |
| virtualization, memory errors, and much else besides. |
| Assuming that these sources of interference occur randomly, |
| their effect can be minimized by reducing the number of |
| iterations. |
| \item \IX{Thermal throttling} can understate scalability because increasing |
| CPU activity increases heat generation, and on systems without |
| adequate cooling (most of them!), this can result in the CPU |
| frequency decreasing as the number of CPUs increases.\footnote{ |
| Systems with adequate cooling tend to look like gaming systems.} |
| Of course, if you are testing an application to evaluate its |
| expected behavior when run in production, such thermal throttling |
| is simply a fact of life. |
| Otherwise, if you are interested in theoretical scalability, |
| use a system with adequate cooling or reduce the CPU clock rate |
| to a level that the cooling system can handle. |
| \end{enumerate} |
| |
The first and fourth sources of error provide conflicting advice,
which is one sign that we are living in the real world.
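
Returning to the fifth source of error, one way of checking for thermal
throttling on Linux systems providing the cpufreq \path{sysfs} interface
is to compare a given CPU's clock frequency before and after the
measurement period, for example, using the following sketch:

\begin{listing}
\begin{VerbatimL}
#include <stdio.h>

/* Return the current frequency (kHz) of the specified CPU,
 * or -1 on error.  Assumes Linux with cpufreq enabled. */
long cpu_cur_freq(int cpu)
{
	char path[128];
	long khz = -1;
	FILE *fp;

	snprintf(path, sizeof(path),
	         "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq",
	         cpu);
	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	if (fscanf(fp, "%ld", &khz) != 1)
		khz = -1;
	fclose(fp);
	return khz;
}
\end{VerbatimL}
\caption{Checking for Thermal Throttling (Sketch)}
\label{lst:debugging:Checking for Thermal Throttling Sketch}
\end{listing}

A substantially lower reading after the run than before it suggests
that thermal throttling might well have distorted the results.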
| |
| \QuickQuiz{ |
| But what about other sources of error, for example, due to |
| interactions between caches and memory layout? |
| }\QuickQuizAnswer{ |
| Changes in memory layout can indeed result in unrealistic |
| decreases in execution time. |
| For example, suppose that a given microbenchmark almost |
| always overflows the L0 cache's associativity, but with just the right |
| memory layout, it all fits. |
| If this is a real concern, consider running your microbenchmark |
| using huge pages (or within the kernel or on bare metal) in |
| order to completely control the memory layout. |
| |
| But note that there are many different possible memory-layout |
| bottlenecks. |
| Benchmarks sensitive to memory bandwidth (such as those involving |
| matrix arithmetic) should spread the running threads across the |
| available cores and sockets to maximize memory parallelism. |
| They should also spread the data across \IXplr{NUMA node}, memory |
| controllers, and DRAM chips to the extent possible. |
| In contrast, benchmarks sensitive to memory latency (including |
| most poorly scaling applications) should instead maximize |
| locality, filling each core and socket in turn before adding |
| another one. |
| }\QuickQuizEnd |
| |
| The following sections discuss ways of dealing with these measurement |
| errors, with |
| \cref{sec:debugging:Isolation} |
| covering isolation techniques that may be used to prevent some forms of |
| interference, |
| and with |
| \cref{sec:debugging:Detecting Interference} |
| covering methods for detecting interference so as to reject measurement |
| data that might have been corrupted by that interference. |
| |
| \subsection{Isolation} |
| \label{sec:debugging:Isolation} |
| |
| The Linux kernel provides a number of ways to isolate a group of |
| CPUs from outside interference. |
| |
| First, let's look at interference by other processes, threads, and tasks. |
The Linux-specific \co{sched_setaffinity()} system call may be used to move
most tasks off of a given set of CPUs and to confine your tests to
that same group.
The user-level \co{taskset} command may be used for
the same purpose, though both \co{sched_setaffinity()} and
\co{taskset} require elevated permissions when applied to tasks
belonging to other users.
| Linux-specific control groups (cgroups) may be used for this same purpose. |
| This approach can be quite effective at reducing interference, and |
| is sufficient in many cases. |
However, it does have limitations; for example, it cannot do anything
about the per-CPU kernel threads that are often used for housekeeping
tasks.
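
For example, the following sketch confines the calling thread to a
single arbitrarily chosen CPU:

\begin{listing}
\begin{VerbatimL}
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Confine the calling thread to the specified CPU. */
void confine_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);	/* Allow only this one CPU. */
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}
}
\end{VerbatimL}
\caption{Confining a Thread to One CPU (Sketch)}
\label{lst:debugging:Confining a Thread to One CPU Sketch}
\end{listing}

The zero \co{pid} argument denotes the calling thread, and the
command-line equivalent is ``\co{taskset -c 2 ./mytest}'', where
\co{./mytest} is of course a stand-in for your test program.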
| |
| One way to avoid interference from per-CPU kernel threads is to run |
| your test at a high real-time priority, for example, by using |
| the POSIX \co{sched_setscheduler()} system call. |
| However, note that if you do this, you are implicitly taking on |
| responsibility for avoiding infinite loops, because otherwise |
| your test can prevent part of the kernel from functioning. |
| This is an example of the Spiderman Principle: |
| ``With great power comes great responsibility.'' |
| And although the default real-time throttling settings often address |
| such problems, they might do so by causing your real-time threads |
| to miss their deadlines. |
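
For example, the following sketch requests the \co{SCHED_FIFO}
real-time scheduling policy for the calling process:

\begin{listing}
\begin{VerbatimL}
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Run the calling process at real-time FIFO priority. */
void go_realtime(int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		exit(EXIT_FAILURE);
	}
}
\end{VerbatimL}
\caption{Requesting Real-Time Priority (Sketch)}
\label{lst:debugging:Requesting Real-Time Priority Sketch}
\end{listing}

A low priority such as one usually suffices for microbenchmarking, and
also avoids getting in the way of the kernel's own real-time threads,
for example, threaded interrupt handlers, which typically default to
priority 50 on Linux.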
| |
| These approaches can greatly reduce, and perhaps even eliminate, |
| interference from processes, threads, and tasks. |
However, they do nothing to prevent interference from device
interrupts, at least in the absence of threaded interrupts.
| Linux allows some control of threaded interrupts via the |
| \path{/proc/irq} directory, which contains numerical directories, one |
| per interrupt vector. |
| Each numerical directory contains \co{smp_affinity} and |
| \co{smp_affinity_list}. |
| Given sufficient permissions, you can write a value to these files |
| to restrict interrupts to the specified set of CPUs. |
| For example, either |
| ``\co{echo 3 > /proc/irq/23/smp_affinity}'' |
| or |
| ``\co{echo 0-1 > /proc/irq/23/smp_affinity_list}'' |
| would confine interrupts on vector~23 to CPUs~0 and~1, |
| at least given sufficient privileges. |
| You can use ``\co{cat /proc/interrupts}'' to obtain a list of the interrupt |
| vectors on your system, how many are handled by each CPU, and what |
| devices use each interrupt vector. |
| |
| Running a similar command for all interrupt vectors on your system |
| would confine interrupts to CPUs~0 and~1, leaving the remaining CPUs |
| free of interference. |
| Or mostly free of interference, anyway. |
| It turns out that the scheduling-clock interrupt fires on each CPU |
| that is running in user mode.\footnote{ |
Frederic Weisbecker heads up a \co{NO_HZ_FULL}
| adaptive-ticks project |
| that allows scheduling-clock interrupts to be disabled |
| on CPUs that have only one runnable task. |
| As of 2021, this is largely complete.} |
In addition, you must take care to ensure that the set of CPUs to which
you confine the interrupts is capable of handling the load.
| |
| But this only handles processes and interrupts running in the same |
| operating-system instance as the test. |
What if you are running the test in a guest OS that is itself
running on a hypervisor, for example, Linux running on KVM\@?
| Although you can in theory apply the same techniques at the hypervisor |
| level that you can at the guest-OS level, it is quite common for |
| hypervisor-level operations to be restricted to authorized personnel. |
| In addition, none of these techniques work against firmware-level |
| interference. |
| |
| \QuickQuiz{ |
| Wouldn't the techniques suggested to isolate the code under |
| test also affect that code's performance, particularly if |
| it is running within a larger application? |
| }\QuickQuizAnswer{ |
| Indeed it might, although in most microbenchmarking efforts |
| you would extract the code under test from the enclosing |
| application. |
| Nevertheless, if for some reason you must keep the code under |
| test within the application, you will very likely need to use |
| the techniques discussed in |
| \cref{sec:debugging:Detecting Interference}. |
| }\QuickQuizEnd |
| |
| Of course, if it is in fact the interference that is producing the |
| behavior of interest, you will instead need to promote interference, |
| in which case being unable to prevent it is not a problem. |
| But if you really do need interference-free measurements, then instead |
| of preventing the interference, you might need to detect the interference |
| as described in the next section. |
| |
| \subsection{Detecting Interference} |
| \label{sec:debugging:Detecting Interference} |
| |
| If you cannot prevent interference, perhaps you can detect it |
| and reject results from any affected test runs. |
| \Cref{sec:debugging:Detecting Interference Via Measurement} |
| describes methods of rejection involving additional measurements, |
| while \cref{sec:debugging:Detecting Interference Via Statistics} |
| describes statistics-based rejection. |
| |
| \subsubsection{Detecting Interference Via Measurement} |
| \label{sec:debugging:Detecting Interference Via Measurement} |
| |
| % Sources of interference include other applications, system |
| % utilities and daemons, device interrupts, firmware interrupts |
| % (including system management interrupts, or SMIs), |
| % virtualization, memory errors, and much else besides. |
| |
| Many systems, including Linux, provide means for determining after the |
| fact whether some forms of interference have occurred. |
| For example, process-based interference results in context switches, |
| which, on Linux-based systems, are visible in |
| \path{/proc/<PID>/sched} via the \co{nr_switches} field. |
| Similarly, interrupt-based interference can be detected via the |
| \path{/proc/interrupts} file. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:debugging:Using getrusage() to Detect Context Switches] |
| \begin{VerbatimL} |
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 0 if test results should be rejected. */
int runtest(void)
{
	struct rusage ru1;
	struct rusage ru2;

	if (getrusage(RUSAGE_SELF, &ru1) != 0) {
		perror("getrusage");
		abort();
	}
	/* Run test here. */
	if (getrusage(RUSAGE_SELF, &ru2) != 0) {
		perror("getrusage");
		abort();
	}
	return (ru1.ru_nvcsw == ru2.ru_nvcsw &&
	        ru1.ru_nivcsw == ru2.ru_nivcsw);
}
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Using \tco{getrusage()} to Detect Context Switches} |
| \label{lst:debugging:Using getrusage() to Detect Context Switches} |
| \end{listing} |
| |
Opening and reading files is not the way to achieve low overhead, but
the count of context switches for a given thread may instead be
obtained using the \co{getrusage()} system call, as shown in
| \cref{lst:debugging:Using getrusage() to Detect Context Switches}. |
| This same system call can be used to detect minor page faults (\co{ru_minflt}) |
| and major page faults (\co{ru_majflt}). |
| |
| Unfortunately, detecting memory errors and firmware interference is quite |
| system-specific, as is the detection of interference due to virtualization. |
| Although avoidance is better than detection, and detection is better than |
| statistics, there are times when one must avail oneself of statistics, |
| a topic addressed in the next section. |
| |
| \subsubsection{Detecting Interference Via Statistics} |
| \label{sec:debugging:Detecting Interference Via Statistics} |
| |
| Any statistical analysis will be based on assumptions about the data, |
| and performance microbenchmarks often support the following assumptions: |
| |
| \begin{enumerate} |
| \item Smaller measurements are more likely to be accurate than |
| larger measurements. |
| \item The measurement uncertainty of good data is known. |
| \item A reasonable fraction of the test runs will result in good data. |
| \end{enumerate} |
| |
| The fact that smaller measurements are more likely to be accurate than |
| larger measurements suggests that sorting the measurements in increasing |
| order is likely to be productive.\footnote{ |
| To paraphrase the old saying, ``Sort first and ask questions later.''} |
| The fact that the measurement uncertainty is known allows us to accept |
| measurements within this uncertainty of each other: |
| If the effects of interference are large compared to this uncertainty, |
| this will ease rejection of bad data. |
| Finally, the fact that some fraction (for example, one third) can be |
| assumed to be good allows us to blindly accept the first portion of the |
| sorted list, and this data can then be used to gain an estimate of the |
| natural variation of the measured data, over and above the assumed |
| measurement error. |
| |
| The approach is to take the specified number of leading elements from the |
| beginning of the sorted list, and use these to estimate a typical |
| inter-element delta, which in turn may be multiplied by the number of |
| elements in the list to obtain an upper bound on permissible values. |
| The algorithm then repeatedly considers the next element of the list. |
| If it falls below the upper bound, and if the distance between |
| the next element and the previous element is not too much greater than |
| the average inter-element distance for the portion of the list accepted |
| thus far, then the next element is accepted and the process repeats. |
| Otherwise, the remainder of the list is rejected. |
| |
| \begin{listing} |
| \input{CodeSamples/debugging/datablows@whole.fcv} |
| \caption{Statistical Elimination of Interference} |
| \label{lst:debugging:Statistical Elimination of Interference} |
| \end{listing} |
| |
| \Cref{lst:debugging:Statistical Elimination of Interference} |
| shows a simple \co{sh}/\co{awk} script implementing this notion. |
| Input consists of an x-value followed by an arbitrarily long list of y-values, |
| and output consists of one line for each input line, with fields as follows: |
| |
| \begin{enumerate} |
| \item The x-value. |
| \item The average of the selected data. |
| \item The minimum of the selected data. |
| \item The maximum of the selected data. |
| \item The number of selected data items. |
| \item The number of input data items. |
| \end{enumerate} |
| |
| This script takes three optional arguments as follows: |
| |
| \begin{description} |
| \item [\lopt{divisor}\nf{:}] Number of segments to divide the list |
| into, for example, a divisor of four means that the first quarter of |
| the data elements will be assumed to be good. |
| This defaults to three. |
| \item [\lopt{relerr}\nf{:}] Relative measurement error. |
| The script assumes that values that differ by less than this |
| error are for all intents and purposes equal. |
| This defaults to 0.01, which is equivalent to 1\pct. |
| \item [\lopt{trendbreak}\nf{:}] Ratio of inter-element spacing |
| constituting a break in the trend of the data. |
For example, if the average spacing in the data accepted so far
is 1.5 and the trend-break ratio is 2.0, then a new data value
differing from its predecessor by more than 3.0
constitutes a break in the trend.
(Unless, of course, the relative error is greater than 3.0, in
which case the ``break'' will be ignored.)
| \end{description} |
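
For example, assuming that the script is saved as \co{datablows.sh}
(a hypothetical name standing in for wherever you place it), the
invocation ``\co{datablows.sh --divisor 4 --relerr 0.02 --trendbreak 3 < data.in}''
would trust the first quarter of each line's sorted y-values, treat
values within 2\pct\ of each other as equal, and treat an inter-element
gap of more than three times the average gap seen thus far as a
potential break in the trend.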
| |
| \begin{fcvref}[ln:debugging:datablows:whole] |
| \Clnrefrange{param:b}{param:e} of |
| \cref{lst:debugging:Statistical Elimination of Interference} |
| set the default values for the parameters, and |
| \clnrefrange{parse:b}{parse:e} parse |
| any command-line overriding of these parameters. |
| \end{fcvref} |
| \begin{fcvref}[ln:debugging:datablows:whole:awk] |
| The \co{awk} invocation on \clnref{invoke} sets the values of the |
| \co{divisor}, \co{relerr}, and \co{trendbreak} variables to their |
| \co{sh} counterparts. |
| In the usual \co{awk} manner, |
| \clnrefrange{copy:b}{end} are executed on each input |
| line. |
| The loop spanning \clnref{copy:b,copy:e} copies |
| the input y-values to the |
| \co{d} array, which \clnref{asort} sorts into increasing order. |
| \Clnref{comp_i} computes the number of trustworthy y-values |
| by applying \co{divisor} and rounding up. |
| |
| \Clnrefrange{delta}{comp_max:e} compute the \co{maxdelta} |
| lower bound on the upper bound of y-values. |
To this end, \clnref{maxdelta} multiplies the difference in values over
the trusted region of data by the \co{divisor}, which projects the
difference in values observed across the trusted region onto the entire
set of y-values.
| However, this value might well be much smaller than the relative error, |
| so \clnref{maxdelta1} computes the absolute error (\co{d[i] * relerr}) |
| and adds |
| that to the difference \co{delta} across the trusted portion of the data. |
| \Clnref{comp_max:b,comp_max:e} then compute the maximum of |
| these two values. |
| |
| Each pass through the loop spanning \clnrefrange{add:b}{add:e} |
| attempts to add another |
| data value to the set of good data. |
| \Clnrefrange{chk_engh}{break} compute the trend-break delta, |
| with \clnref{chk_engh} disabling this |
| limit if we don't yet have enough values to compute a trend, |
| and with \clnref{mul_avr} multiplying \co{trendbreak} by the average |
| difference between pairs of data values in the good set. |
| If \clnref{chk_max} determines that the candidate data value would exceed the |
| lower bound on the upper bound (\co{maxdelta}) \emph{and} |
| that the difference between the candidate data value |
| and its predecessor exceeds the trend-break difference (\co{maxdiff}), |
| then \clnref{break} exits the loop: |
| We have the full good set of data. |
| |
| \Clnrefrange{comp_stat:b}{comp_stat:e} then compute and print |
| statistics. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| This approach is just plain weird! |
| Why not use means and standard deviations, like we were taught |
| in our statistics classes? |
| }\QuickQuizAnswerB{ |
| Because mean and standard deviation were not designed to do this job. |
| To see this, try applying mean and standard deviation to the |
| following data set, given a 1\pct\ relative error in measurement: |
| |
| \begin{quote} |
| 49,548.4 49,549.4 49,550.2 49,550.9 49,550.9 49,551.0 |
| 49,551.5 49,552.1 49,899.0 49,899.3 49,899.7 49,899.8 |
| 49,900.1 49,900.4 52,244.9 53,333.3 53,333.3 53,706.3 |
| 53,706.3 54,084.5 |
| \end{quote} |
| |
| The problem is that mean and standard deviation do not rest on |
| any sort of measurement-error assumption, and they will therefore |
see the difference between the values near 49,550 and those near
49,900 as being statistically significant, when in fact they are
| well within the bounds of estimated measurement error. |
| |
| Of course, it is possible to create a script similar to |
| that in |
| \cref{lst:debugging:Statistical Elimination of Interference} |
| that uses standard deviation rather than absolute difference |
| to get a similar effect, |
| and this is left as an exercise for the interested reader. |
| Be careful to avoid divide-by-zero errors arising from strings |
| of identical data values! |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But what if all the y-values in the trusted group of data |
| are exactly zero? |
| Won't that cause the script to reject any non-zero value? |
| }\QuickQuizAnswerE{ |
| Indeed it will! |
| But if your performance measurements often produce a value of |
| exactly zero, perhaps you need to take a closer look at your |
| performance-measurement code. |
| |
| Note that many approaches based on mean and standard deviation |
| will have similar problems with this sort of dataset. |
| }\QuickQuizEndE |
| } |
| |
| Although statistical interference detection can be quite useful, it should |
| be used only as a last resort. |
Where feasible, it is far better to avoid interference in the first place
(\cref{sec:debugging:Isolation}), or, failing that, to detect
interference via measurement
(\cref{sec:debugging:Detecting Interference Via Measurement}).
| |
| Never forget that for any statistical method that you create, there is |
| a dataset out there that will defeat it, thus making fools of both you |
| and your method. |
| Therefore, the more creative you are, the more careful you must be! |
| |
| \section{Summary} |
| \label{sec:debugging:Summary} |
| % |
% \epigraph{To err is human---but it feels divine.}{\emph{Mae West}}
| \epigraph{To err is human! |
| Stop being human!!!}{Ed Nofziger} |
| |
| Although validation never will be an exact science, much can be gained |
| by taking an organized approach to it, as an organized approach will |
| help you choose the right validation tools for your job, avoiding |
| situations like the one fancifully depicted in |
| \cref{fig:debugging:Choose Validation Methods Wisely}. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/UseTheRightCannon}} |
| \caption{Choose Validation Methods Wisely} |
| \ContributedBy{fig:debugging:Choose Validation Methods Wisely}{Melissa Broussard} |
| \end{figure} |
| |
| A key choice is that of statistics. |
| Although the methods described in this chapter work very well most of |
| the time, they do have their limitations, courtesy of the Halting |
| Problem~\cite{AlanMTuring1937HaltingProblem,GeoffreyKPullum2000HaltingProblem}. |
| Fortunately for us, there is a huge number of special cases in which |
| we can not only work out whether a program will halt, but also |
| estimate how long it will run before halting, as discussed in |
| \cref{sec:debugging:Performance Estimation}. |
| Furthermore, in cases where a given program might or might not work |
| correctly, we can often establish estimates for what fraction of the |
| time it will work correctly, as discussed in |
| \cref{sec:debugging:Probability and Heisenbugs}. |
| |
| Nevertheless, unthinking reliance on these estimates is brave to the |
| point of foolhardiness. |
| After all, we are summarizing a huge mass of complexity in code and |
| data structures down to a single solitary number. |
| Even though we can get away with such bravery a surprisingly large |
| fraction of the time, abstracting all that code and data away will |
| occasionally cause severe problems. |
| |
| One possible problem is variability, where repeated runs give wildly |
| different results. |
This problem is often addressed using standard deviation; however, using
two numbers to summarize the behavior of a large and complex program is
about as brave as using only one number.
In computer programming, the surprising thing is that use of the
mean, or of the mean and standard deviation, is often sufficient.
| Nevertheless, there are no guarantees. |
| |
| One cause of variation is confounding factors. |
| For example, the CPU time consumed by a linked-list search will depend |
| on the length of the list. |
| Averaging together runs with wildly different list lengths will |
| probably not be useful, and adding a standard deviation to the mean |
| will not be much better. |
The right thing to do would be to control for list length, either by
holding the length constant or by measuring CPU time as a function of
list length.
| |
| Of course, this advice assumes that you are aware of the confounding |
| factors, and Murphy says that you will not be. |
| I have been involved in projects that had confounding factors as diverse |
| as air conditioners (which drew considerable power at startup, thus |
| causing the voltage supplied to the computer to momentarily drop too |
| low, sometimes resulting in failure), cache state (resulting in odd |
| variations in performance), I/O errors (including disk errors, packet |
| loss, and duplicate Ethernet MAC addresses), and even porpoises (which |
| could not resist playing with an array of transponders, which could be |
| otherwise used for high-precision acoustic positioning and navigation). |
| And this is but one reason why a good night's sleep is such an effective |
| debugging tool. |
| |
| In short, validation always will require some measure of the behavior of |
| the system. |
| To be at all useful, this measure must be a severe summarization of the |
| system, which in turn means that it can be misleading. |
| So as the saying goes, ``Be careful. |
| It is a real world out there.'' |
| |
| But what if you are working on the Linux kernel, which as of 2017 was |
| estimated to have more than 20 billion instances running throughout |
| the world? |
| In that case, a bug that occurs once every million years on a single system |
| will be encountered more than 50 times per day across the installed base. |
| A test with a 50\pct\ chance of encountering this bug in a one-hour run |
| would need to increase that bug's probability of occurrence by more than |
| ten orders of magnitude, which poses a severe challenge to |
| today's testing methodologies. |
| One important tool that can sometimes be applied with good effect to |
| such situations is formal verification, the subject of the next chapter, |
| and, more speculatively, \cref{sec:future:Formal Regression Testing?}. |
| |
| The topic of choosing a validation plan, be it testing, formal |
| verification, or both, is taken up by |
| \cref{sec:formal:Choosing a Validation Plan}. |
| |
| \QuickQuizAnswersChp{qqzdebugging} |