% debugging/debugging.tex

\QuickQuizChapter{chp:Validation}{Validation}
%
\Epigraph{If it is not tested, it doesn't work.}{\emph{Unknown}}

I have had a few parallel programs work the first time, but that is only
because I have written a large number of parallel programs over the past
two decades.
And I have had far more parallel programs that fooled me into thinking
that they were working correctly the first time than actually were working
the first time.

I have therefore had great need of validation for my parallel programs.
The basic trick behind parallel validation, as with other software
validation, is to realize that the computer knows what is wrong.
It is therefore your job to force it to tell you.
This chapter can therefore be thought of as a short course in
machine interrogation.\footnote{
	But you can leave the thumbscrews and waterboards at home.
	This chapter covers much more sophisticated and effective
	methods, especially given that most computer
	systems neither feel pain nor fear drowning.
	At least as far as we know.}

A longer course may be found in many recent books on validation, as
well as at least one rather old but quite worthwhile
one~\cite{GlenfordJMyers1979}.
Validation is an extremely important topic that cuts across all forms
of software, and is therefore worth intensive study in its own right.
However, this book is primarily about concurrency, so this chapter
will necessarily do little more than scratch the surface of this
critically important topic.

Section~\ref{sec:debugging:Introduction}
introduces the philosophy of debugging.
Section~\ref{sec:debugging:Tracing}
discusses tracing,
Section~\ref{sec:debugging:Assertions}
discusses assertions, and
Section~\ref{sec:debugging:Static Analysis}
discusses static analysis.
Section~\ref{sec:debugging:Code Review}
describes some unconventional approaches to code review that can
be helpful when the fabled 10,000 eyes happen not to be looking at your code.
Section~\ref{sec:debugging:Probability and Heisenbugs}
overviews the use of probability for validating parallel software.
Because performance and scalability are first-class requirements
for parallel programming,
Section~\ref{sec:debugging:Performance Estimation} covers these
topics.
Finally,
Section~\ref{sec:debugging:Summary}
gives a fanciful summary and a short list of statistical traps to avoid.

But never forget that the two best debugging tools are a solid design
and a good night's sleep!

\section{Introduction}
\label{sec:debugging:Introduction}

Section~\ref{sec:debugging:Where Do Bugs Come From?}
discusses the sources of bugs, and
Section~\ref{sec:debugging:Required Mindset}
overviews the mindset required when validating software.
Section~\ref{sec:debugging:When Should Validation Start?}
discusses when you should start validation, and
Section~\ref{sec:debugging:The Open Source Way} describes the
surprisingly effective open-source regimen of code review and
community testing.

\subsection{Where Do Bugs Come From?}
\label{sec:debugging:Where Do Bugs Come From?}

Bugs come from developers.
The basic problem is that the human brain did not evolve with computer
software in mind.
Instead, the human brain evolved in concert with other human brains and
with animal brains.
Because of this history, the following three characteristics of computers
often come as a shock to human intuition:

\begin{enumerate}
\item Computers typically lack common sense, despite decades of research
sacrificed at the altar of artificial intelligence.
\item Computers generally fail to understand user intent, or more
formally, computers generally lack a theory of mind.
\item Computers usually cannot do anything useful with a fragmentary plan,
instead requiring that each and every detail of each and every
possible scenario be spelled out in full.
\end{enumerate}

The first two points should be uncontroversial, as they are illustrated
by any number of failed products, perhaps most famously Clippy and
Microsoft Bob.
By attempting to relate to users as people, these two products raised
common-sense and theory-of-mind expectations that they proved incapable
of meeting.
Perhaps the set of software assistants that have recently started appearing
on smartphones will fare better.
That said, the developers working on them by all accounts still develop
the old way: The assistants might well benefit end users, but not so
much their own developers.

This human love of fragmentary plans deserves more explanation,
especially given that it is a classic two-edged sword.
This love is apparently due to the assumption that
the person carrying out the plan will have (1)~common sense and (2)~a good
understanding of the intent behind the plan.
This latter assumption is especially likely to hold in the common case
where the person doing the planning and the person carrying out the plan
are one and the same: In this case, the plan will be revised almost
subconsciously as obstacles arise.
Therefore, the love of fragmentary plans has served human beings well,
in part because it is better to take random actions that have a high
probability of locating food than to starve to death while attempting
to plan the unplannable.
However, the past usefulness of fragmentary plans in everyday life is
no guarantee of their future usefulness in stored-program computers.

Furthermore, the need to follow fragmentary plans has had important effects
on the human psyche, due to the fact
that throughout much of human history, life was often difficult and dangerous.
It should come as no surprise that executing a fragmentary plan that has
a high probability of a violent encounter with sharp teeth and claws
requires almost insane levels of optimism---a level of optimism that actually
is present in most human beings.
These insane levels of optimism extend to self-assessments of programming
ability, as evidenced by the effectiveness of (and the controversy over)
interviewing techniques involving coding trivial
programs~\cite{RegBraithwaite2007FizzBuzz}.
In fact, the clinical term for a human being with less-than-insane
levels of optimism is ``clinically depressed''.
Such people usually have
extreme difficulty functioning in their daily lives, underscoring the
perhaps counter-intuitive importance of insane levels of optimism to a
normal, healthy life.
If you are not insanely optimistic, you are less likely to start a difficult
but worthwhile project.\footnote{
	There are some famous exceptions to this rule of thumb.
	One set of exceptions is people who take on difficult or risky
	projects in order to make at least a temporary escape
	from their depression.
	Another set is people who have nothing to lose: the project is
	literally a matter of life or death.}

\QuickQuiz{}
	When in computing is the willingness to follow a fragmentary
	plan critically important?
\QuickQuizAnswer{
	There are any number of situations, but perhaps the most important
	situation is when no one has ever created anything resembling
	the program to be developed.
	In this case, the only way to create a credible plan is to
	implement the program, create the plan, and implement it a
	second time.
	But whoever implements the program for the first time has no
	choice but to follow a fragmentary plan because any detailed
	plan created in ignorance cannot survive first contact with
	the real world.

	And perhaps this is one reason why evolution has favored insanely
	optimistic human beings who are happy to follow fragmentary plans!
} \QuickQuizEnd

An important special case is the project that, while valuable, is not
valuable enough to justify the time required to implement it.
This special case is quite common, and one early symptom is the
unwillingness of the decision-makers to invest enough to actually
implement the project.
A natural reaction is for the developers to produce an unrealistically
optimistic estimate in order to be permitted to start the project.
If the organization (be it open source or proprietary) is strong enough,
it might survive the resulting schedule slips and budget overruns,
so that the project might see the light of day.
However, if the organization is not strong enough and if the decision-makers
fail to cancel the project as soon as it becomes clear that the estimates
are garbage, then the project might well kill the organization.
This might result in another organization picking up the project and
either completing it, cancelling it, or being killed by it.
A given project might well succeed only after killing several
organizations.
One can only hope that the organization that eventually makes a success
of a serial-organization-killer project manages to maintain a suitable
level of humility, lest it be killed by the next project.

Important though insane levels of optimism might be, they are a key source
of bugs (and perhaps failure of organizations).
The question is therefore ``How to maintain the optimism required to start
a large project while at the same time injecting enough reality to keep
the bugs down to a dull roar?''
The next section examines this conundrum.

\subsection{Required Mindset}
\label{sec:debugging:Required Mindset}

When carrying out any validation effort, you should keep the following
definitions in mind:

\begin{enumerate}
\item The only bug-free programs are trivial programs.
\item A reliable program has no known bugs.
\end{enumerate}

From these definitions, it logically follows that any reliable
non-trivial program contains at least one bug that you do not
know about.
Therefore, any validation effort undertaken on a non-trivial program
that fails to find any bugs is itself a failure.
A good validation is therefore an exercise in destruction.
This means that if you are the type of person who enjoys breaking things,
validation is just the right type of job for you.

\QuickQuiz{}
	Suppose that you are writing a script that processes the
	output of the \co{time} command, which looks as follows:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{verbatim}
real    0m0.132s
user    0m0.040s
sys     0m0.008s
\end{verbatim}
\end{minipage}
\vspace{5pt}

	The script is required to check its input for errors, and to
	give appropriate diagnostics if fed erroneous \co{time} output.
	What test inputs should you provide to this program to test it
	for use with \co{time} output generated by single-threaded programs?
\QuickQuizAnswer{
	\begin{enumerate}
	\item Do you have a test case in which all the time is
	consumed in user mode by a CPU-bound program?
	\item Do you have a test case in which all the time is
	consumed in system mode by a CPU-bound program?
	\item Do you have a test case in which all three times
	are zero?
	\item Do you have a test case in which the ``user'' and ``sys''
	times sum to more than the ``real'' time?
	(This would of course be completely legitimate in
	a multithreaded program.)
	\item Do you have a set of test cases in which one of the
	times uses more than one second?
	\item Do you have a set of test cases in which one of the
	times uses more than ten seconds?
	\item Do you have a set of test cases in which one of the
	times has non-zero minutes? (For example, ``15m36.342s''.)
	\item Do you have a set of test cases in which one of the
	times has a seconds value greater than 60?
	\item Do you have a set of test cases in which one of the
	times overflows 32 bits of milliseconds? 64 bits of
	milliseconds?
	\item Do you have a set of test cases in which one of the
	times is negative?
	\item Do you have a set of test cases in which one of the
	times has a positive minutes value but a negative
	seconds value?
	\item Do you have a set of test cases in which one of the
	times omits the ``m'' or the ``s''?
	\item Do you have a set of test cases in which one of the
	times is non-numeric? (For example, ``Go Fish''.)
	\item Do you have a set of test cases in which one of the
	lines is omitted? (For example, where there is a
	``real'' value and a ``sys'' value, but no ``user''
	value.)
	\item Do you have a set of test cases where one of the
	lines is duplicated? Or duplicated, but with a
	different time value for the duplicate?
	\item Do you have a set of test cases where a given line
	has more than one time value? (For example,
	``real 0m0.132s 0m0.008s''.)
	\item Do you have a set of test cases containing random
	characters?
	\item In all test cases involving invalid input, did you
	generate all permutations?
	\item For each test case, do you have an expected outcome
	for that test?
	\end{enumerate}

	If you did not generate test data for a substantial number of
	the above cases, you will need to cultivate a more destructive
	attitude in order to have a chance of generating high-quality
	tests.

	Of course, one way to economize on destructiveness is to
	generate the tests with the to-be-tested source code at hand,
	which is called white-box testing (as opposed to black-box testing).
	However, this is no panacea: You will find that it is all too
	easy to find your thinking limited by what the program can handle,
	thus failing to generate truly destructive inputs.
} \QuickQuizEnd

But perhaps you are a super-programmer whose code is always perfect
the first time every time.
If so, congratulations!
Feel free to skip this chapter, but
I do hope that you will forgive my skepticism.
You see, I have met far more people who claimed to be able
to write perfect code the first time than I have
people who were actually capable of carrying out this feat, which
is not too surprising given the previous discussion of optimism
and over-confidence.
And even if you really are a super-programmer, you just might
find yourself debugging lesser mortals' work.

One approach for the rest of us is to alternate between our normal
state of insane optimism
(Sure, I can program that!) and severe pessimism
(It seems to work, but I just know that there have to be more bugs hiding
in there somewhere!).
It helps if you enjoy breaking things.
If you don't, or if your joy in breaking things is limited to breaking
\emph{other} people's things, find someone who does love breaking your
code and get them to help you test it.

Another helpful frame of mind is to hate it when other people find bugs in
your code.
This hatred can help motivate you to torture your code beyond reason
in order to increase the probability that you find the bugs rather than
someone else.

One final frame of mind is to consider the possibility that someone's
life depends on your code being correct.
This can also motivate you to torture your code into revealing the
whereabouts of its bugs.

This wide variety of frames of mind opens the door to
the possibility of multiple people with different frames of
mind contributing to the project, with varying levels of optimism.
This can work well, if properly organized.

\begin{figure}[tb]
\centering
\resizebox{2in}{!}{\includegraphics{cartoons/TortureTux}}
\caption{Validation and the Geneva Convention}
\ContributedBy{Figure}{fig:debugging:Validation and the Geneva Convention}{Melissa Broussard}
\end{figure}

\begin{figure}[tb]
\centering
\resizebox{2in}{!}{\includegraphics{cartoons/TortureLaptop}}
\caption{Rationalizing Validation}
\ContributedBy{Figure}{fig:debugging:Rationalizing Validation}{Melissa Broussard}
\end{figure}

Some people might see vigorous validation as a form of torture, as
depicted in
Figure~\ref{fig:debugging:Validation and the Geneva Convention}.\footnote{
	More cynical people might question whether these people are instead
	merely afraid that validation will find bugs that they will then
	be expected to fix.}
Such people might do well to remind themselves that, Tux cartoons aside,
they are really torturing an inanimate object, as shown in
Figure~\ref{fig:debugging:Rationalizing Validation}.
In addition, rest assured that those who fail to torture their code are
doomed to be tortured by it.

However, this leaves open the question of exactly when during the project
lifetime validation should start, a topic taken up by the next section.

\subsection{When Should Validation Start?}
\label{sec:debugging:When Should Validation Start?}

Validation should start at the same time that the project starts.

To see this, consider that tracking down a bug is much harder in a large
program than in a small one.
Therefore, to minimize the time and effort required to track down bugs,
you should test small units of code.
Although you won't find all the bugs this way, you will find a substantial
fraction, and it will be much easier to find and fix the ones you do find.
Testing at this level can also alert you to larger flaws in your overall
design, minimizing the time you waste writing code that is quite literally
broken by design.

But why wait until you have code before validating your design?\footnote{
	The old saying ``First we must code, then we have incentive to
	think'' notwithstanding.}
Hopefully reading Chapters~\ref{chp:Hardware and its Habits}
and~\ref{chp:Tools of the Trade} provided you with the information
required to avoid some regrettably common design flaws,
but discussing your design with a colleague or even simply writing it
down can help flush out additional flaws.

However, it is all too often the case that waiting to start validation
until you have a design is waiting too long.
Mightn't your natural level of optimism have caused you to start the design
before you fully understood the requirements?
The answer to this question will almost always be ``yes''.
One good way to avoid flawed requirements is to get to know your users.
To really serve them well, you will have to live among them.

\QuickQuiz{}
	You are asking me to do all this validation BS before
	I even start coding???
	That sounds like a great way to never get started!!!
\QuickQuizAnswer{
	If it is your project, for example, a hobby, do what you like.
	Any time you waste will be your own, and you have no one else
	to answer to for it.
	And there is a good chance that the time will not be completely
	wasted.
	For example, if you are embarking on a first-of-a-kind project,
	the requirements are in some sense unknowable anyway.
	In this case, the best approach might be to quickly prototype
	a number of rough solutions, try them out, and see what works
	best.

	On the other hand, if you are being paid to produce a system that
	is broadly similar to existing systems, you owe it to your users,
	your employer, and your future self to validate early and often.
} \QuickQuizEnd

First-of-a-kind projects
require different approaches to validation, for example,
rapid prototyping.
Here, the main goal of the first few prototypes is to learn how
the project should be implemented, not so much to create a correct
implementation on the first try.
However, it is important to keep in mind that you should not omit
validation, but rather take a radically different approach to it.

Now that we have established that you should start validation when you
start the project, the following sections cover a number of validation
techniques and methods that have proven their worth.

\subsection{The Open Source Way}
\label{sec:debugging:The Open Source Way}

The open-source programming methodology has proven quite effective, and
includes a regimen of intense code review and testing.

I can personally attest to the effectiveness of the open-source community's
intense code review.
One of the first patches I prepared for the Linux kernel involved
a distributed filesystem where a user on one node writes to a given
file at a location that a user on another node has mapped into memory.
In this case, it is necessary to invalidate the affected pages from
the mapping in order to allow the filesystem to maintain coherence
during the write operation.
I coded up a first attempt at a patch, and, in keeping with the open-source
maxim ``post early, post often'', I posted the patch.
I then considered how I was going to test it.

But before I could even decide on an overall test strategy, I got a
reply to my posting pointing out a few bugs.
I fixed the bugs and reposted the patch, and returned to thinking
out my test strategy.
However, before I had a chance to write any test code, I received
a reply to my reposted patch, pointing out more bugs.
This process repeated itself many times, and I am not sure that I
ever got a chance to actually test the patch.

This experience brought home the truth of the open-source saying:
Given enough eyeballs, all bugs are shallow~\cite{EricSRaymond99b}.

However, when you post some code or a given patch, it is worth
asking a few questions:

\begin{enumerate}
\item How many of those eyeballs are actually going to look at your code?
\item How many will be experienced and clever enough to actually find
your bugs?
\item Exactly when are they going to look?
\end{enumerate}

I was lucky: There was someone out there who wanted the functionality
provided by my patch, who had long experience with distributed filesystems,
and who looked at my patch almost immediately.
If no one had looked at my patch, there would have been no review, and
therefore no finding of bugs.
If the people looking at my patch had lacked experience with distributed
filesystems, it is unlikely that they would have found all the bugs.
Had they waited months or even years to look, I likely would have forgotten
how the patch was supposed to work, making it much more difficult to
fix any bugs they found.

However, we must not forget the second tenet of open-source development,
namely intensive testing.
For example, a great many people test the Linux kernel.
Some test patches as they are submitted, perhaps even yours.
Others test the -next tree, which is helpful, but there is likely to be
a delay of several weeks or even months between the time that you write the
patch and the time that it appears in the -next tree, by which time the
patch will not be quite as fresh in your mind.
Still others test maintainer trees, which often have a similar time delay.

Quite a few people don't test code until it is committed to mainline,
or the master source tree (Linus's tree in the case of the Linux kernel).
If your maintainer won't accept your patch until it has been tested,
this presents you with a deadlock situation: your patch won't be accepted
until it is tested, but it won't be tested until it is accepted.
Nevertheless, people who test mainline code are still relatively
aggressive, given that many people and organizations do not test code
until it has been pulled into a Linux distro.

And even if someone does test your patch, there is no guarantee that they
will be running the hardware and software configuration and workload
required to locate your bugs.

Therefore, even when writing code for an open-source project, you need to
be prepared to develop and run your own test suite.
Test development is an underappreciated and very valuable skill, so be
sure to take full advantage of any existing test suites available to
you.
Important as test development is, we will leave further discussion of it
to books dedicated to that topic.
The following sections therefore discuss locating bugs in your code given that
you already have a good test suite.

\section{Tracing}
\label{sec:debugging:Tracing}

When all else fails, add a \co{printk()}!
Or a \co{printf()}, if you are working with user-mode C-language applications.

The rationale is simple: If you cannot figure out how execution reached
a given point in the code, sprinkle print statements earlier in the
code to work out what happened.
You can get a similar effect, and with more convenience and flexibility,
by using a debugger such as gdb (for user applications) or kgdb
(for debugging Linux kernels).
Much more sophisticated tools exist, with some of the more recent
offering the ability to rewind from the point
of failure.

These brute-force testing tools are all valuable, especially now
that typical systems have more than 64K of memory and CPUs running
faster than 4MHz.
Much has been
written about these tools, so this chapter will add little more.

However, these tools all have a serious shortcoming when the job at hand
is to convince the fastpath of a high-performance parallel algorithm
to tell you what is going wrong, namely, they often have excessive
overheads.
There are special tracing technologies for this purpose, which typically
leverage data ownership techniques
(see Chapter~\ref{chp:Data Ownership})
to minimize the overhead of runtime data collection.
One example within the Linux kernel is
``trace events''~\cite{StevenRostedt2010perfTraceEventP1,StevenRostedt2010perfTraceEventP2,StevenRostedt2010perfTraceEventP3,StevenRostedt2010perfHP+DeathlyMacros},
which uses per-CPU buffers to allow data to be collected with
extremely low overhead.
In userspace code, there is a huge number of tools that can help you.
One good starting point is Brendan Gregg's blog.\footnote{
	\url{http://www.brendangregg.com/blog/}}

Even if you avoid heisenbugs, other pitfalls await you.
For example, although the machine really does know all,
what it knows is almost always way more than your head can hold.
For this reason, high-quality test suites normally come with sophisticated
scripts to analyze the voluminous output.
But beware---scripts won't necessarily notice surprising things.
My rcutorture scripts are a case in point: Early versions of those
scripts were quite satisfied with a test run in which RCU grace periods
stalled indefinitely.
This of course resulted in the scripts being modified to detect RCU
grace-period stalls, but this does not change the fact that the scripts
will detect only those problems that I think to make them detect.
But note well that unless you have a solid design, you won't know what
your script should check for!

Another problem with tracing and especially with \co{printk()} calls
is that their overhead is often too much for production use.
In some such cases, assertions can be helpful.

\section{Assertions}
\label{sec:debugging:Assertions}

Assertions are usually implemented in the following manner:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\tt
\scriptsize
\begin{verbatim}
  1 if (something_bad_is_happening())
  2   complain();
\end{verbatim}
\end{minipage}
\vspace{5pt}

This pattern is often encapsulated into C-preprocessor macros or
language intrinsics; for example, in the Linux kernel, this might
be represented as \co{WARN_ON(something_bad_is_happening())}.
Of course, if \co{something_bad_is_happening()} triggers quite frequently,
the resulting output might obscure reports of other problems,
in which case
\co{WARN_ON_ONCE(something_bad_is_happening())} might be more appropriate.

\QuickQuiz{}
	How can you implement \co{WARN_ON_ONCE()}?
\QuickQuizAnswer{
	If you don't mind having a \co{WARN_ON_ONCE()} that
	will sometimes warn twice or three times, simply maintain
	a static variable that is initialized to zero.
	If the condition triggers, check the static variable, and
	if it is non-zero, return.
	Otherwise, set it to one, print the message, and return.

	If you really need the message to never appear more than once,
	perhaps because it is huge, you can use an atomic exchange
	operation in place of ``set it to one'' above.
	Print the message only if the atomic exchange operation returns
	zero.
} \QuickQuizEnd

In parallel code, one especially bad something that might happen is that
a function expecting to be called under a particular lock might be called
without that lock being held.
Such functions sometimes have header comments stating something like
``The caller must hold \co{foo_lock} when calling this function'', but
such a comment does no good unless someone actually reads it.
An executable statement like \co{lock_is_held(&foo_lock)} carries far
more weight.

The Linux kernel's lockdep
facility~\cite{JonathanCorbet2006lockdep,StevenRostedt2011locdepCryptic}
takes this a step further, reporting potential deadlocks as well as
allowing functions to verify that the proper locks are held.
Of course, this additional functionality incurs significant overhead,
so that lockdep is not necessarily appropriate for production use.

So what can be done in cases where checking is necessary, but where the
overhead of runtime checking cannot be tolerated?
One approach is static analysis, which is discussed in the next section.

\section{Static Analysis}
\label{sec:debugging:Static Analysis}

Static analysis is a validation technique where one program takes a second
program as input, reporting errors and vulnerabilities located in this
second program.
Interestingly enough, almost all programs are subjected to static analysis
by their compilers or interpreters.
These tools are of course far from perfect, but their ability to locate
errors has improved immensely over the past few decades, in part because
they now have much more than 64K bytes of memory in which to carry out their
analysis.

The original UNIX \co{lint} tool~\cite{StephenJohnson1977lint} was
quite useful, though much of its functionality has since been incorporated
into C compilers.
There are nevertheless lint-like tools under development and in use to
this day.

The sparse static analyzer~\cite{JonathanCorbet2004sparse}
looks for higher-level issues in the Linux kernel, including:

\begin{enumerate}
\item Misuse of pointers to user-space structures.
\item Assignments from too-long constants.
\item Empty \co{switch} statements.
\item Mismatched lock acquisition and release primitives.
\item Misuse of per-CPU primitives.
\item Use of RCU primitives on non-RCU pointers and vice versa.
\end{enumerate}

Although it is likely that compilers will continue to increase their
static-analysis capabilities, the sparse static analyzer demonstrates
the benefits of static analysis outside of the compiler, particularly
for finding application-specific bugs.

\section{Code Review}
\label{sec:debugging:Code Review}

Various code-review activities are special cases of static analysis, but
with human beings doing the analysis.
This section covers inspection, walkthroughs, and self-inspection.

\subsection{Inspection}
\label{sec:debugging:Inspection}

Traditionally, formal code inspections take place in face-to-face meetings
with formally defined roles: moderator, developer, and one or two other
participants.
The developer reads through the code, explaining what it is doing and
why it works.
The one or two other participants ask questions and raise issues, while
the moderator's job is to resolve any conflicts and to take notes.
This process can be extremely effective at locating bugs, particularly
if all of the participants are familiar with the code at hand.

However, this face-to-face formal procedure does not necessarily
work well in the global Linux kernel community, although it might work
well via an IRC session.
Instead, individuals review code separately and provide comments via
email or IRC.
The note-taking is provided by email archives or IRC logs, and moderators
volunteer their services as appropriate.
Give or take the occasional flamewar, this process also works reasonably
well, particularly if all of the participants are familiar with the
code at hand.\footnote{
	That said, one advantage of the Linux kernel community approach
| over traditional formal inspections is the greater probability of |
| contributions from people \emph{not} familiar with the code, |
| who therefore might not be blinded by the invalid assumptions |
| harbored by those familiar with the code.} |
| |
| It is quite likely that the Linux kernel community's review process |
| is ripe for improvement: |
| |
| \begin{enumerate} |
| \item There is sometimes a shortage of people with the time and |
| expertise required to carry out an effective review. |
| \item Even though all review discussions are archived, they are |
| often ``lost'' in the sense that insights are forgotten and |
| people often fail to look up the discussions. |
| This can result in re-insertion of the same old bugs. |
| \item It is sometimes difficult to resolve flamewars when they do |
| break out, especially when the combatants have disjoint |
| goals, experience, and vocabulary. |
| \end{enumerate} |
| |
| When reviewing, therefore, it is worthwhile to review relevant documentation |
| in commit logs, bug reports, and LWN articles. |
| |
| \subsection{Walkthroughs} |
| \label{sec:debugging:Walkthroughs} |
| |
| A traditional code walkthrough is similar to a formal inspection, |
| except that the group |
| ``plays computer'' with the code, driven by specific test cases. |
| A typical walkthrough team has a moderator, a secretary (who records |
| bugs found), a testing expert (who generates the test cases), and
| perhaps one or two others.
| Walkthroughs can be extremely effective, albeit also extremely time-consuming.
| |
| It has been some decades since I have participated in a formal |
| walkthrough, and I suspect that a present-day walkthrough would |
| use single-stepping debuggers. |
| One could imagine a particularly sadistic procedure as follows: |
| |
| \begin{enumerate} |
| \item The tester presents the test case. |
| \item The moderator starts the code under a debugger, using the |
| specified test case as input. |
| \item Before each statement is executed, the developer is required |
| to predict the outcome of the statement and explain why |
| this outcome is correct. |
| \item If the outcome differs from that predicted by the developer, |
| this is taken as evidence of a potential bug. |
| \item In parallel code, a ``concurrency shark'' asks what code |
| might execute concurrently with this code, and why such |
| concurrency is harmless. |
| \end{enumerate} |
| |
| Sadistic, certainly. |
| Effective? |
| Maybe. |
| If the participants have a good understanding of the requirements, |
| software tools, data structures, and algorithms, then walkthroughs |
| can be extremely effective. |
| If not, walkthroughs are often a waste of time. |
| |
| \subsection{Self-Inspection} |
| \label{sec:debugging:Self-Inspection} |
| |
| Although developers are usually not all that effective at inspecting |
| their own code, there are a number of situations where there is no |
| reasonable alternative. |
| For example, the developer might be the only person authorized to look |
| at the code, other qualified developers might all be too busy, or |
| the code in question might be sufficiently bizarre that the developer |
| is unable to convince anyone else to take it seriously until after |
| demonstrating a prototype. |
| In these cases, the following procedure can be quite helpful, |
| especially for complex parallel code: |
| |
| \begin{enumerate} |
| \item Write a design document with requirements, diagrams for data structures,
| 	and rationale for design choices.
| \item Consult with experts, updating the design document as needed.
| \item Write the code in pen on paper, correcting errors as you go.
| 	Resist the temptation to refer to pre-existing nearly identical code
| 	sequences; instead, copy them.
| \item If there were errors, copy the code in pen on fresh paper, correcting |
| errors as you go. |
| Repeat until the last two copies are identical. |
| \item Produce proofs of correctness for any non-obvious code. |
| \item Where possible, test the code fragments from the bottom up. |
| \item When all the code is integrated, do full-up functional and |
| stress testing. |
| \item Once the code passes all tests, write code-level documentation, |
| perhaps as an extension to the design document discussed above. |
| \end{enumerate} |
| |
| When I faithfully follow this procedure for new RCU code, there are |
| normally only a few bugs left at the end. |
| With a few prominent (and embarrassing) |
| exceptions~\cite{PaulEMcKenney2011RCU3.0trainwreck}, |
| I usually manage to locate these bugs before others do. |
| That said, this is getting more difficult over time as the number and |
| variety of Linux-kernel users increases. |
| |
| \QuickQuiz{} |
| Why would anyone bother copying existing code in pen on paper??? |
| Doesn't that just increase the probability of transcription errors? |
| \QuickQuizAnswer{ |
| If you are worried about transcription errors, please allow me |
| to be the first to introduce you to a really cool tool named |
| \co{diff}. |
| In addition, carrying out the copying can be quite valuable: |
| \begin{enumerate} |
| \item If you are copying a lot of code, you are probably failing |
| to take advantage of an opportunity for abstraction. |
| The act of copying code can provide great motivation |
| for abstraction. |
| \item Copying the code gives you an opportunity to think about |
| whether the code really works in its new setting. |
| Is there some non-obvious constraint, such as the need |
| to disable interrupts or to hold some lock? |
| \item Copying the code also gives you time to consider whether |
| there is some better way to get the job done. |
| \end{enumerate} |
| So, yes, copy the code! |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| This procedure is ridiculously over-engineered! |
| How can you expect to get a reasonable amount of software |
| written doing it this way??? |
| \QuickQuizAnswer{ |
| Indeed, repeatedly copying code by hand is laborious and slow. |
| However, when combined with heavy-duty stress testing and |
| proofs of correctness, this approach is also extremely effective |
| for complex parallel code where ultimate performance and |
| reliability are required and where debugging is difficult. |
| The Linux-kernel RCU implementation is a case in point. |
| |
| On the other hand, if you are writing a simple single-threaded |
| shell script to manipulate some data, then you would be |
| best-served by a different methodology. |
| For example, you might enter each command one at a time |
| into an interactive shell with a test data set to make |
| sure that it did what you wanted, then copy-and-paste the |
| successful commands into your script. |
| Finally, test the script as a whole. |
| |
| If you have a friend or colleague who is willing to help out, |
| pair programming can work very well, as can any number of |
| formal design- and code-review processes. |
| |
| And if you are writing code as a hobby, then do whatever you like. |
| |
| In short, different types of software need different development |
| methodologies. |
| } \QuickQuizEnd |
| |
| The above procedure works well for new code, but what if you need to |
| inspect code that you have already written? |
| You can of course apply the above procedure for old code in the special |
| case where you wrote one to throw away~\cite{Brooks79}, |
| but the following approach can also be helpful in less desperate |
| circumstances: |
| |
| \begin{enumerate} |
| \item Using your favorite documentation tool (\LaTeX{}, HTML, |
| OpenOffice, or straight ASCII), describe the high-level |
| design of the code in question. |
| Use lots of diagrams to illustrate the data structures |
| and how these structures are updated. |
| \item Make a copy of the code, stripping away all comments. |
| \item Document what the code does statement by statement. |
| \item Fix bugs as you find them. |
| \end{enumerate} |
| |
| This works because describing the code in detail is an excellent way to spot |
| bugs~\cite{GlenfordJMyers1979}. |
| Although this second procedure is also a good way to get your head around |
| someone else's code, in many cases, the first step suffices. |
| |
| Although review and inspection by others is probably more efficient and |
| effective, the above procedures can be quite helpful in cases where |
| for whatever reason it is not feasible to involve others. |
| |
| At this point, you might be wondering how to write parallel code without |
| having to do all this boring paperwork. |
| Here are some time-tested ways of accomplishing this: |
| |
| \begin{enumerate} |
| \item Write a sequential program that scales through use of |
| available parallel library functions. |
| \item Write sequential plug-ins for a parallel framework, |
| such as map-reduce, BOINC, or a web-application server. |
| \item Do such a good job of parallel design that the problem |
| is fully partitioned, then just implement sequential |
| program(s) that run in parallel without communication. |
| \item Stick to one of the application areas (such as linear algebra) |
| where tools can automatically decompose and parallelize |
| the problem. |
| \item Make extremely disciplined use of parallel-programming |
| primitives, so that the resulting code is easily seen to be correct. |
| But beware: It is always tempting to break the rules |
| ``just a little bit'' to gain better performance or |
| scalability. |
| Breaking the rules often results in general breakage. |
| That is, unless you carefully do the paperwork described in this |
| section. |
| \end{enumerate} |
| |
| But the sad fact is that even if you do the paperwork or use one of |
| the above ways to more-or-less safely avoid paperwork, |
| there will be bugs. |
| If nothing else, more users and a greater variety of users will expose |
| more bugs more quickly, especially if those users are doing things |
| that the original developers did not consider. |
| The next section describes how to handle the probabilistic bugs that |
| occur all too commonly when validating parallel software. |
| |
| \section{Probability and Heisenbugs} |
| \label{sec:debugging:Probability and Heisenbugs} |
| |
| So your parallel program fails. |
| Sometimes. |
| |
| But you used techniques from the earlier sections to locate |
| the problem and now have a fix in place! |
| Congratulations!!! |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/r-2014-Passed-the-stress-test}} |
| \caption{Passed on Merits? Or Dumb Luck?} |
| \ContributedBy{Figure}{fig:cpu:Passed-the-stress-test}{Melissa Broussard} |
| \end{figure} |
| |
| Now the question is just how much testing is required in order to be
| certain that
| you actually fixed the bug, as opposed to just reducing the probability
| of it occurring on the one hand, having fixed only one of several
| related bugs on the other hand, or having made some ineffectual unrelated
| change on yet a third hand.
| In short, what is the answer to the eternal question posed by |
| Figure~\ref{fig:cpu:Passed-the-stress-test}? |
| |
| Unfortunately, the honest answer is that an infinite amount of testing |
| is required to attain absolute certainty. |
| |
| \QuickQuiz{} |
| Suppose that you had a very large number of systems at your |
| disposal. |
| For example, at current cloud prices, you can purchase a |
| huge amount of CPU time at a reasonably low cost. |
| Why not use this approach to get close enough to certainty |
| for all practical purposes? |
| \QuickQuizAnswer{ |
| This approach might well be a valuable addition to your |
| validation arsenal. |
| But it does have a few limitations: |
| \begin{enumerate} |
| \item Some bugs have extremely low probabilities of occurrence, |
| but nevertheless need to be fixed. |
| For example, suppose that the Linux kernel's RCU |
| implementation had a bug that is triggered only once |
| per century of machine time on average. |
| A century of CPU time is hugely expensive even on |
| the cheapest cloud platforms, but we could expect |
| this bug to result in more than 2,000 failures per day |
| on the more than 100 million Linux instances in the |
| world as of 2011. |
| \item The bug might well have zero probability of occurrence |
| on your test setup, which means that you won't see it |
| no matter how much machine time you burn testing it. |
| \end{enumerate} |
| Of course, if your code is small enough, formal validation |
| may be helpful, as discussed in |
| Chapter~\ref{chp:Formal Verification}. |
| But beware: formal validation of your code will not find |
| errors in your assumptions, misunderstanding of the |
| requirements, misunderstanding of the software or hardware |
| primitives you use, or errors that you did not think to construct |
| a proof for. |
| } \QuickQuizEnd |
| |
| But suppose that we are willing to give up absolute certainty in favor |
| of high probability. |
| Then we can bring powerful statistical tools to bear on this problem. |
| However, this section will focus on simple statistical tools. |
| These tools are extremely helpful, but please note
| that reading this section is not a substitute
| for taking a good set of statistics classes.\footnote{
| Which I most highly recommend. |
| The few statistics courses I have taken have provided value |
| way out of proportion to the time I spent studying for them.} |
| |
| To start with simple statistical tools, we need to decide whether
| we are doing discrete or continuous testing.
| Discrete testing features well-defined individual test runs.
| For example, a boot-up test of a Linux kernel patch is a
| discrete test.
| You boot the kernel, and it either comes up or it does not. |
| Although you might spend an hour boot-testing your kernel, the number of |
| times you attempted to boot the kernel and the number of times the |
| boot-up succeeded would often be of more interest than the length |
| of time you spent testing. |
| Functional tests tend to be discrete. |
| |
| On the other hand, if my patch involved RCU, I would probably run |
| rcutorture, which is a kernel module that, strangely enough, tests RCU. |
| Unlike booting the kernel, where the appearance of a login prompt |
| signals the successful end of a discrete test, rcutorture will happily |
| continue torturing RCU until either the kernel crashes or until you |
| tell it to stop. |
| The duration of the rcutorture test is therefore (usually) of more |
| interest than the number of times you started and stopped it. |
| Therefore, rcutorture is an example of a continuous test, a category |
| that includes many stress tests. |
| |
| The statistics governing discrete and continuous tests differ somewhat.
| However, the statistics for discrete tests are simpler and more
| familiar than those for continuous tests, and furthermore the
| statistics for discrete tests can often be pressed into service
| (with some loss of accuracy) for continuous tests.
| We therefore start with discrete tests. |
| |
| \subsection{Statistics for Discrete Testing} |
| \label{sec:debugging:Statistics for Discrete Testing} |
| |
| Suppose that the bug had a 10\% chance of occurring in |
| a given run and that we do five runs. |
| How do we compute the probability of at least one run failing?
| One way is as follows: |
| |
| \begin{enumerate} |
| \item Compute the probability of a given run succeeding, which is 90\%. |
| \item Compute the probability of all five runs succeeding, which |
| is 0.9 raised to the fifth power, or about 59\%. |
| \item There are only two possibilities: either all five runs succeed, |
| or at least one fails. |
| Therefore, the probability of at least one failure is |
| 59\% taken away from 100\%, or 41\%. |
| \end{enumerate} |
| |
| However, many people find it easier to work with a formula than a series |
| of steps, although if you prefer the above series of steps, have at it! |
| For those who like formulas, call the probability of a single failure $f$. |
| The probability of a single success is then $1-f$ and the probability |
| that all of $n$ tests will succeed is then: |
| |
| \begin{equation} |
| S_n = \left(1-f\right)^n |
| \end{equation} |
| |
| The probability of failure is $1-S_n$, or: |
| |
| \begin{equation} |
| F_n = 1-\left(1-f\right)^n |
| \label{eq:debugging:Binomial Failure Rate} |
| \end{equation} |
| |
| \QuickQuiz{} |
| Say what??? |
| When I plug the earlier example of five tests each with a |
| 10\% failure rate into the formula, I get 59,050\% and that |
| just doesn't make sense!!! |
| \QuickQuizAnswer{ |
| You are right, that makes no sense at all. |
| |
| Remember that a probability is a number between zero and one, |
| so that you need to divide a percentage by 100 to get a |
| probability. |
| So 10\% corresponds to a probability of 0.1, which when plugged into
| the formula yields a failure probability of 0.4095.
| This rounds to 41\%, quite sensibly matching the earlier result.
| } \QuickQuizEnd |
| |
| So suppose that a given test has been failing 10\% of the time. |
| How many times do you have to run the test to be 99\% sure that |
| your supposed fix has actually improved matters? |
| |
| Another way to ask this question is ``How many times would we need |
| to run the test to cause the probability of failure to rise above 99\%?'' |
| After all, if we were to run the test enough times that the probability
| of seeing at least one failure becomes 99\%, then a failure-free result
| has only a 1\% probability of being due to dumb luck.
| And if we plug $f=0.1$ into |
| Equation~\ref{eq:debugging:Binomial Failure Rate} and vary $n$, |
| we find that 43 runs gives us a 98.92\% chance of at least one test failing |
| given the original 10\% per-test failure rate, |
| while 44 runs gives us a 99.03\% chance of at least one test failing. |
| So if we run the test on our fix 44 times and see no failures, there |
| is a 99\% probability that our fix was actually a real improvement. |
| |
| But repeatedly plugging numbers into |
| Equation~\ref{eq:debugging:Binomial Failure Rate} |
| can get tedious, so let's solve for $n$: |
| |
| \begin{eqnarray} |
| F_n = 1-\left(1-f\right)^n \\ |
| 1 - F_n = \left(1-f\right)^n \\ |
| \log \left(1 - F_n\right) = n \; \log \left(1 - f\right) |
| \end{eqnarray} |
| |
| Finally the number of tests required is given by: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - F_n\right)}{\log\left(1 - f\right)} |
| \label{eq:debugging:Binomial Number of Tests Required} |
| \end{equation} |
| |
| Plugging $f=0.1$ and $F_n=0.99$ into |
| Equation~\ref{eq:debugging:Binomial Number of Tests Required} |
| gives 43.7, meaning that we need 44 consecutive successful test |
| runs to be 99\% certain that our fix was a real improvement. |
| This matches the number obtained by the previous method, which |
| is reassuring. |
| |
| \QuickQuiz{} |
| In Equation~\ref{eq:debugging:Binomial Number of Tests Required}, |
| are the logarithms base-10, base-2, or base-$e$? |
| \QuickQuizAnswer{ |
| It does not matter. |
| You will get the same answer no matter what base of logarithms |
| you use because the result is a pure ratio of logarithms. |
| The only constraint is that you use the same base for both |
| the numerator and the denominator. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/debugging/BinomialNRuns}} |
| \caption{Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \label{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| \end{figure} |
| |
| Figure~\ref{fig:debugging:Number of Tests Required for 99 Percent Confidence Given Failure Rate} |
| shows a plot of this function. |
| Not surprisingly, the less frequently each test run fails, the more |
| test runs are required to be 99\% confident that the bug has been |
| fixed. |
| If the bug caused the test to fail only 1\% of the time, then a |
| mind-boggling 458 test runs are required. |
| As the failure probability decreases, the number of test runs required |
| increases, going to infinity as the failure probability goes to zero. |
| |
| The moral of this story is that when you have found a rarely occurring |
| bug, your testing job will be much easier if you can come up with |
| a carefully targeted test with a much higher failure rate. |
| For example, if your targeted test raised the failure rate from 1\% |
| to 30\%, then the number of runs required for 99\% confidence |
| would drop from 458 test runs to a mere thirteen test runs. |
| |
| But these thirteen test runs would only give you 99\% confidence that |
| your fix had produced ``some improvement''. |
| Suppose you instead want to have 99\% confidence that your fix reduced |
| the failure rate by an order of magnitude. |
| How many failure-free test runs are required? |
| |
| An order of magnitude improvement from a 30\% failure rate would be |
| a 3\% failure rate. |
| Plugging these numbers into |
| Equation~\ref{eq:debugging:Binomial Number of Tests Required} yields: |
| |
| \begin{equation} |
| n = \frac{\log\left(1 - 0.99\right)}{\log\left(1 - 0.03\right)} = 151.2 |
| \end{equation} |
| |
| So our order of magnitude improvement requires roughly an order of |
| magnitude more testing. |
| Certainty is impossible, and high probabilities are quite expensive. |
| Clearly making tests run more quickly and making failures more probable |
| are essential skills in the development of highly reliable software. |
| These skills will be covered in |
| Section~\ref{sec:debugging:Hunting Heisenbugs}. |
| |
| \subsection{Abusing Statistics for Discrete Testing} |
| \label{sec:debugging:Abusing Statistics for Discrete Testing} |
| |
| But suppose that you have a continuous test that fails about three |
| times every ten hours, and that you fix the bug that you believe was |
| causing the failure. |
| How long do you have to run this test without failure to be 99\% certain |
| that you reduced the probability of failure? |
| |
| Without doing excessive violence to statistics, we could simply |
| redefine a one-hour run to be a discrete test that has a 30\% |
| probability of failure. |
| Then the results of the previous section tell us that if the test
| runs for 13 hours without failure, there is a 99\% probability that |
| our fix actually improved the program's reliability. |
| |
| A dogmatic statistician might not approve of this approach, but the |
| sad fact is that the errors introduced by this sort of abuse of |
| statistical methodology are usually quite small compared to the |
| errors inherent in your measurements of your program's failure rates. |
| Nevertheless, the next section describes a slightly less dodgy approach. |
| |
| \subsection{Statistics for Continuous Testing} |
| \label{sec:debuggingStatistics for Continuous Testing} |
| |
| The fundamental formula for failure probabilities is the Poisson |
| distribution: |
| |
| \begin{equation} |
| F_m = \frac{\lambda^m}{m!} e^{-\lambda} |
| \label{eq:debugging:Poisson Probability} |
| \end{equation} |
| |
| Here $F_m$ is the probability of $m$ failures in the test and |
| $\lambda$ is the expected failure rate per unit time. |
| A rigorous derivation may be found in any advanced probability |
| textbook, for example, Feller's classic ``An Introduction to Probability |
| Theory and Its Applications''~\cite{Feller58}, while a more |
| intuitive approach may be found in the first edition of |
| this book~\cite{McKenney2014ParallelProgramming-e1}. |
| |
| Let's try reworking the example from |
| Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing} |
| using the Poisson distribution. |
| Recall that this example involved a test with a 30\% failure rate per |
| hour, and that the question was how long the test would need to run |
| error-free |
| on an alleged fix to be 99\% certain that the fix actually reduced the
| failure rate. |
| In this case, the number of failures $m$ is zero, so that
| Equation~\ref{eq:debugging:Poisson Probability} reduces to: |
| |
| \begin{equation} |
| F_0 = e^{-\lambda} |
| \end{equation} |
| |
| Setting $F_0$ to 0.01 and solving for $\lambda$ results in:
| |
| \begin{equation} |
| \lambda = - \log 0.01 = 4.6 |
| \end{equation} |
| |
| Because we get $0.3$ failures per hour, the number of hours required
| is $4.6/0.3 = 15.3$, which is reasonably close to the 13 hours
| calculated using the method in
| Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing}.
| Given that you normally won't know your failure rate to high precision,
| this indicates that the method in |
| Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing} |
| is a good and sufficient substitute for the Poisson distribution in |
| a great many situations. |
| |
| More generally, if we have $n$ failures per unit time, and we want to |
| be P\% certain that a fix reduced the failure rate, we can use the |
| following formula: |
| |
| \begin{equation} |
| T = - \frac{1}{n} \log \frac{100 - P}{100} |
| \label{eq:debugging:Error-Free Test Duration} |
| \end{equation} |
| |
| \QuickQuiz{} |
| Suppose that a bug causes a test failure three times per hour |
| on average. |
| How long must the test run error-free to provide 99.9\% |
| confidence that the fix significantly reduced the probability |
| of failure? |
| \QuickQuizAnswer{ |
| We set $n$ to $3$ and $P$ to $99.9$ in |
| Equation~\ref{eq:debugging:Error-Free Test Duration}, resulting in: |
| |
| \begin{equation} |
| T = - \frac{1}{3} \log \frac{100 - 99.9}{100} = 2.3 |
| \end{equation} |
| |
| If the test runs without failure for 2.3 hours, we can be 99.9\% |
| certain that the fix reduced the probability of failure. |
| } \QuickQuizEnd |
| |
| As before, the less frequently the bug occurs and the greater the |
| required level of confidence, the longer the required error-free test run. |
| |
| Suppose that a given test fails about once every hour, but after a bug |
| fix, a 24-hour test run fails only twice. |
| Assuming that the failure leading to the bug is a random occurrence, |
| what is the probability that the small number of |
| failures in the second run was due to random chance? |
| In other words, how confident should we be that the fix actually |
| had some effect on the bug? |
| This probability may be calculated by summing |
| Equation~\ref{eq:debugging:Poisson Probability} as follows: |
| |
| \begin{equation} |
| F_0 + F_1 + \dots + F_{m - 1} + F_m = |
| \sum_{i=0}^m \frac{\lambda^i}{i!} e^{-\lambda} |
| \end{equation} |
| |
| This is the Poisson cumulative distribution function, which can be |
| written more compactly as: |
| |
| \begin{equation} |
| F_{i \le m} = \sum_{i=0}^m \frac{\lambda^i}{i!} e^{-\lambda} |
| \label{eq:debugging:Possion CDF} |
| \end{equation} |
| |
| Here $m$ is the number of errors in the long test run |
| (in this case, two) and $\lambda$ is the expected number of errors
| in the long test run (in this case, 24).
| Plugging $m=2$ and $\lambda=24$ into this expression gives the probability |
| of two or fewer failures as about |
| $1.2 \times 10^{-8}$; in other words, we have a high level of confidence
| that the fix actually had some relationship to the bug.\footnote{ |
| Of course, this result in no way excuses you from finding and |
| fixing the bug(s) resulting in the remaining two failures!} |
| |
| \QuickQuiz{} |
| Doing the summation of all the factorials and exponentials |
| is a real pain. |
| Isn't there an easier way? |
| \QuickQuizAnswer{ |
| One approach is to use the open-source symbolic manipulation |
| program named ``maxima''. |
| Once you have installed this program, which is a part of many |
| Debian-based Linux distributions, you can run it and give the |
| \co{load(distrib);} command followed by any number of |
| \co{bfloat(cdf_poisson(m,l));} commands, where the \co{m} |
| is replaced by the desired value of $m$ and the \co{l} |
| is replaced by the desired value of $\lambda$. |
| |
| In particular, the \co{bfloat(cdf_poisson(2,24));} command |
| results in \co{1.181617112359357b-8}, which matches the value |
| given by Equation~\ref{eq:debugging:Possion CDF}. |
| |
| Alternatively, you can use the rough-and-ready method described in |
| Section~\ref{sec:debugging:Abusing Statistics for Discrete Testing}. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But wait!!! |
| Given that there has to be \emph{some} number of failures |
| (including the possibility of zero failures), |
| shouldn't the summation shown in |
| Equation~\ref{eq:debugging:Possion CDF} |
| approach the value $1$ as $m$ goes to infinity? |
| \QuickQuizAnswer{ |
| Indeed it should. |
| And it does. |
| |
| To see this, note that $e^{-\lambda}$ does not depend on $i$, |
| which means that it can be pulled out of the summation as follows: |
| |
| \begin{equation} |
| e^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^i}{i!} |
| \end{equation} |
| |
| The remaining summation is exactly the Taylor series for |
| $e^\lambda$, yielding: |
| |
| \begin{equation} |
| e^{-\lambda} e^\lambda |
| \end{equation} |
| |
| The two exponentials are reciprocals, and therefore cancel, |
| resulting in exactly $1$, as required. |
| } \QuickQuizEnd |
| |
| The Poisson distribution is a powerful tool for analyzing test results, |
| but the fact is that in this last example there were still two remaining |
| test failures in a 24-hour test run. |
| Such a low failure rate results in very long test runs. |
| The next section discusses counter-intuitive ways of improving this situation. |
| |
| \subsection{Hunting Heisenbugs} |
| \label{sec:debugging:Hunting Heisenbugs} |
| |
| This line of thought also helps explain heisenbugs:
| adding tracing and assertions can easily reduce the probability
| of a bug appearing.
| And this is why extremely lightweight tracing and assertion mechanisms
| are so critically important.
| |
| The name ``heisenbug'' stems from the Heisenberg Uncertainty Principle |
| from quantum physics, which states that it is impossible to exactly |
| quantify a given particle's position and velocity at any given point |
| in time~\cite{WeinerHeisenberg1927Uncertain}. |
| Any attempt to more accurately measure that particle's position will |
| result in increased uncertainty of its velocity. |
A similar effect occurs for heisenbugs: attempts to track down the heisenbug
cause it to radically change its symptoms or even disappear completely.
| |
| If the field of physics inspired the name of this problem, it is only |
| logical that we should look to the field of physics for the solution. |
| Fortunately, particle physics is up to the task: |
| Why not create an anti-heisenbug to annihilate the heisenbug? |
| |
| This section describes a number of ways to do just that: |
| |
| \begin{enumerate} |
| \item Add delay to race-prone regions. |
| \item Increase workload intensity. |
| \item Test suspicious subsystems in isolation. |
| \item Simulate unusual events. |
| \item Count near misses. |
| \end{enumerate} |
| |
| Although producing an anti-heisenbug for a given heisenbug is more an |
| art than a science, the following sections give some tips on |
| generating the corresponding species of anti-heisenbug. |
| These are followed by a discussion section, |
| Section~\ref{sec:debugging:Heisenbug Discussion}. |
| |
| \subsubsection{Add Delay} |
| \label{sec:debugging:Add Delay} |
| |
| Consider the count-lossy code in |
| Section~\ref{sec:count:Why Isn't Concurrent Counting Trivial?}. |
| Adding \co{printf()} statements will likely greatly reduce or even |
| eliminate the lost counts. |
| However, converting the load-add-store sequence to a load-add-delay-store |
| sequence will greatly increase the incidence of lost counts (try it!). |
| Once you spot a bug involving a race condition, it is frequently possible |
| to create an anti-heisenbug by adding delay in this manner. |
| |
Of course, this raises the question of how to find the race condition in
the first place.
Finding it is a bit of a dark art, but there are a number of things you
can do.
| |
| One approach is to recognize that race conditions often end up corrupting |
| some of the data involved in the race. |
| It is therefore good practice to double-check the synchronization of |
| any corrupted data. |
| Even if you cannot immediately recognize the race condition, adding |
| delay before and after accesses to the corrupted data might change |
| the failure rate. |
| By adding and removing the delays in an organized fashion (e.g., binary |
| search), you might learn more about the workings of the race condition. |
| |
| \QuickQuiz{} |
| How is this approach supposed to help if the corruption affected some |
| unrelated pointer, which then caused the corruption??? |
| \QuickQuizAnswer{ |
| Indeed, that can happen. |
| Many CPUs have hardware-debugging facilities that can help you |
| locate that unrelated pointer. |
| Furthermore, if you have a core dump, you can search the core |
| dump for pointers referencing the corrupted region of memory. |
| You can also look at the data layout of the corruption, and |
| check pointers whose type matches that layout. |
| |
| You can also step back and test the modules making up your |
| program more intensively, which will likely confine the corruption |
| to the module responsible for it. |
| If this makes the corruption vanish, consider adding additional |
| argument checking to the functions exported from each module. |
| |
| Nevertheless, this is a hard problem, which is why I used the |
| words ``a bit of a dark art''. |
| } \QuickQuizEnd |
| |
| Another important approach is to |
| vary the software and hardware configuration and look for statistically |
| significant differences in failure rate. |
| You can then look more intensively at the code implicated by the |
| software or hardware configuration changes that make the greatest |
| difference in failure rate. |
| It might be helpful to test that code in isolation, for example. |
| |
| One important aspect of software configuration is the history of |
| changes, which is why \co{git bisect} is so useful. |
| Bisection of the change history can provide very valuable clues as |
| to the nature of the heisenbug. |
| |
| \QuickQuiz{} |
| But I did the bisection, and ended up with a huge commit. |
| What do I do now? |
| \QuickQuizAnswer{ |
| A huge commit? |
| Shame on you! |
| This is but one reason why you are supposed to keep the commits small. |
| |
| And that is your answer: Break up the commit into bite-sized |
| pieces and bisect the pieces. |
| In my experience, the act of breaking up the commit is often |
| sufficient to make the bug painfully obvious. |
| } \QuickQuizEnd |
| |
| However you locate the suspicious section of code, you can then introduce |
| delays to attempt to increase the probability of failure. |
| As we have seen, increasing the probability of failure makes it much |
| easier to gain high confidence in the corresponding fix. |
| |
| However, it is sometimes quite difficult to track down the problem using |
| normal debugging techniques. |
| The following sections present some other alternatives. |
| |
| \subsubsection{Increase Workload Intensity} |
| \label{sec:debugging:Increase Workload Intensity} |
| |
| It is often the case that a given test suite places relatively |
| low stress on a given subsystem, so that a small change in timing |
| can cause a heisenbug to disappear. |
| One way to create an anti-heisenbug for this case is to increase |
| the workload intensity, which has a good chance of increasing the |
| probability of the bug appearing. |
| If the probability is increased sufficiently, it may be possible to |
| add lightweight diagnostics such as tracing without causing the |
| bug to vanish. |
| |
| How can you increase the workload intensity? |
| This depends on the program, but here are some things to try: |
| |
| \begin{enumerate} |
| \item Add more CPUs. |
| \item If the program uses networking, add more network adapters |
| and more or faster remote systems. |
| \item If the program is doing heavy I/O when the problem occurs, |
| either (1) add more storage devices, (2) use faster storage |
| devices, for example, substitute SSDs for disks, |
| or (3) use a RAM-based filesystem to substitute main |
| memory for mass storage. |
| \item Change the size of the problem, for example, if doing a parallel |
| matrix multiply, change the size of the matrix. |
| Larger problems may introduce more complexity, but smaller |
| problems often increase the level of contention. |
| If you aren't sure whether you should go large or go small, |
| just try both. |
| \end{enumerate} |
| |
| However, it is often the case that the bug is in a specific subsystem, |
| and the structure of the program limits the amount of stress that can |
| be applied to that subsystem. |
| The next section addresses this situation. |
| |
| \subsubsection{Isolate Suspicious Subsystems} |
| \label{sec:debugging:Isolate Suspicious Subsystems} |
| |
| If the program is structured such that it is difficult or impossible |
| to apply much stress to a subsystem that is under suspicion, |
| a useful anti-heisenbug is a stress test that tests that subsystem in |
| isolation. |
| The Linux kernel's rcutorture module takes exactly this approach with |
| RCU: By applying more stress to RCU than is feasible in a production |
| environment, the probability that any RCU bugs will be found during |
| rcutorture testing rather than during production use is increased.\footnote{ |
| Though sadly not increased to probability one.} |
| |
| In fact, when creating a parallel program, it is wise to stress-test |
| the components separately. |
| Creating such component-level stress tests can seem like a waste of time, |
| but a little bit of component-level testing can save a huge amount |
| of system-level debugging. |
| |
| \subsubsection{Simulate Unusual Events} |
| \label{sec:debugging:Simulate Unusual Events} |
| |
| Heisenbugs are sometimes due to unusual events, such as |
| memory-allocation failure, conditional-lock-acquisition failure, |
| CPU-hotplug operations, timeouts, packet losses, and so on. |
| One way to construct an anti-heisenbug for this class of heisenbug |
| is to introduce spurious failures. |
| |
| For example, instead of invoking \co{malloc()} directly, invoke |
| a wrapper function that uses a random number to decide whether |
| to return \co{NULL} unconditionally on the one hand, or to actually |
| invoke \co{malloc()} and return the resulting pointer on the other. |
| Inducing spurious failures is an excellent way to bake robustness into |
| sequential programs as well as parallel programs. |
| |
| \QuickQuiz{} |
| Why don't existing conditional-locking primitives provide this |
| spurious-failure functionality? |
| \QuickQuizAnswer{ |
| There are locking algorithms that depend on conditional-locking |
| primitives telling them the truth. |
| For example, if conditional-lock failure signals that |
| some other thread is already working on a given job, |
| spurious failure might cause that job to never get done, |
| possibly resulting in a hang. |
| } \QuickQuizEnd |
| |
| \subsubsection{Count Near Misses} |
| \label{sec:debugging:Count Near Misses} |
| |
| Bugs are often an all-or-nothing thing, so that either the bug happens |
| or it doesn't, with nothing in between. |
| However, it is sometimes possible to define a \emph{near miss} where |
| the bug does not result in a failure, but has likely manifested. |
| For example, suppose your code is making a robot walk. |
| The robot's falling over constitutes a bug in your program, but |
| stumbling and recovering might constitute a near miss. |
| If the robot falls over only once per hour, but stumbles every few |
| minutes, you might be able to speed up your debugging progress by |
| counting the number of stumbles in addition to the number of falls. |
| |
| In concurrent programs, timestamping can sometimes be used to detect |
| near misses. |
| For example, locking primitives incur significant delays, so if there is |
| a too-short delay between a pair of operations that are supposed |
| to be protected by different acquisitions of the same lock, this too-short |
| delay might be counted as a near miss.\footnote{ |
| Of course, in this case, you might be better off using |
| whatever \co{lock_held()} primitive is available |
| in your environment. |
| If there isn't a \co{lock_held()} primitive, create one!} |
| |
| \begin{figure}[tbp] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{debugging/RCUnearMiss}} |
| \caption{RCU Errors and Near Misses} |
| \label{fig:debugging:RCU Errors and Near Misses} |
| \end{figure} |
| |
| For example, a low-probability bug in RCU priority boosting occurred |
| roughly once every hundred hours of focused rcutorture testing. |
| Because it would take almost 500 hours of failure-free testing to be |
| 99\% certain that the bug's probability had been significantly reduced, |
| the \co{git bisect} process |
| to find the failure would be painfully slow---or would require an extremely |
| large test farm. |
| Fortunately, the RCU operation being tested included not only a wait for |
| an RCU grace period, but also a previous wait for the grace period to start |
| and a subsequent wait for an RCU callback to be |
| invoked after completion of the RCU grace period. |
| This distinction between an \co{rcutorture} error and near miss is |
| shown in |
| Figure~\ref{fig:debugging:RCU Errors and Near Misses}. |
| To qualify as a full-fledged error, an RCU read-side critical section |
| must extend from the \co{call_rcu()} that initiated a grace period, |
| through the remainder of the previous grace period, through the |
| entirety of the grace period initiated by the \co{call_rcu()} |
| (denoted by the region between the jagged lines), and |
| through the delay from the end of that grace period to the callback |
| invocation, as indicated by the ``Error'' arrow. |
| However, the formal definition of RCU prohibits RCU read-side critical |
| sections from extending across a single grace period, as indicated by |
| the ``Near Miss'' arrow. |
This suggests using near misses as the error condition; however, this
| can be problematic because different CPUs can have different opinions |
| as to exactly where a given |
| grace period starts and ends, as indicated by the jagged lines.\footnote{ |
| The jaggedness of these lines is seriously understated because |
| idle CPUs might well be completely unaware of the most recent |
| few hundred grace periods.} |
| Using the near misses as the error condition could therefore result |
| in false positives, which need to be avoided in the automated |
| \co{rcutorture} testing. |
| |
| By sheer dumb luck, \co{rcutorture} happens to include some statistics that |
| are sensitive to the near-miss version of the grace period. |
| As noted above, these statistics are subject to false positives due to |
| their unsynchronized access to RCU's state variables, |
| but these false positives turn out to be extremely rare on strongly |
| ordered systems such as the IBM mainframe and x86, occurring less than |
| once per thousand hours of testing. |
| |
| These near misses occurred roughly once per hour, about two orders of |
| magnitude more frequently than the actual errors. |
| Use of these near misses allowed the bug's root cause to be identified |
| in less than a week and a high degree of confidence in the fix to be |
| built in less than a day. |
| In contrast, excluding the near misses in favor of the real errors would |
| have required months of debug and validation time. |
| |
| To sum up near-miss counting, the general approach is to replace counting |
| of infrequent failures with more-frequent near misses that are believed |
| to be correlated with those failures. |
| These near-misses can be considered an anti-heisenbug to the real failure's |
| heisenbug because the near-misses, being more frequent, are likely to |
| be more robust in the face of changes to your code, for example, the |
| changes you make to add debugging code. |
| |
| \subsubsection{Heisenbug Discussion} |
| \label{sec:debugging:Heisenbug Discussion} |
| |
| The alert reader might have noticed that this section was fuzzy and |
| qualitative, in stark contrast to the precise mathematics of |
| Sections~\ref{sec:debugging:Statistics for Discrete Testing}, |
\ref{sec:debugging:Abusing Statistics for Discrete Testing},
and~\ref{sec:debugging:Statistics for Continuous Testing}.
| If you love precision and mathematics, you may be disappointed to |
| learn that the situations to which this section applies are far |
| more common than those to which the preceding sections apply. |
| |
| In fact, the common case is that although you might have reason to believe |
| that your code has bugs, you have no idea what those bugs are, what |
| causes them, how likely they are to appear, or what conditions affect |
| their probability of appearance. |
| In this all-too-common case, statistics cannot help you.\footnote{ |
| Although if you know what your program is supposed to do and |
	if your program is small enough (both less likely than you
| might think), then the formal-verification tools described in |
| Chapter~\ref{chp:Formal Verification} |
| can be helpful.} |
| That is to say, statistics cannot help you \emph{directly}. |
| But statistics can be of great indirect help---\emph{if} you have the |
| humility required to admit that you make mistakes, that you can reduce the |
| probability of these mistakes (for example, by getting enough sleep), and |
| that the number and type of mistakes you made in the past is indicative of |
| the number and type of mistakes that you are likely to make in the future. |
| For example, I have a deplorable tendency to forget to write a small |
| but critical portion of the initialization code, and frequently get most |
| or even all of a parallel program correct---except for a stupid |
| omission in initialization. |
| Once I was willing to admit to myself that I am prone to this type of |
| mistake, it was easier (but not easy!) to force myself to double-check |
| my initialization code. |
| Doing this allowed me to find numerous bugs ahead of time. |
| |
| Using Taleb's nomenclature~\cite{NassimTaleb2007BlackSwan}, |
| a white swan is a bug that we can reproduce. |
| We can run a large number of tests, use ordinary statistics to |
| estimate the bug's probability, and use ordinary statistics again |
| to estimate our confidence in a proposed fix. |
| An unsuspected bug is a black swan. |
| We know nothing about it, we have no tests that have yet caused it |
| to happen, and statistics is of no help. |
| Studying our own behavior, especially the number and types of mistakes |
| we make, can turn black swans into grey swans. |
| We might not know exactly what the bugs are, but we have some idea of |
| their number and maybe also of their type. |
| Ordinary statistics is still of no help (at least not until we are |
| able to reproduce one of the bugs), but robust\footnote{ |
| That is to say brutal.} |
| testing methods can be of great help. |
| The goal, therefore, is to use experience and good validation practices |
| to turn the black swans grey, focused testing and analysis to turn the |
| grey swans white, and ordinary methods to fix the white swans. |
| |
| That said, thus far, we have focused solely on bugs in the parallel program's |
| functionality. |
| However, because performance is a first-class requirement for a |
| parallel program (otherwise, why not write a sequential program?), |
| the next section discusses performance bugs. |
| |
| \section{Performance Estimation} |
| \label{sec:debugging:Performance Estimation} |
| |
Parallel programs usually have performance and scalability requirements;
after all, if performance is not an issue, why not use a sequential
program?
| Ultimate performance and linear scalability might not be necessary, but |
| there is little use for a parallel program that runs slower than its |
| optimal sequential counterpart. |
| And there really are cases where every microsecond matters and every |
| nanosecond is needed. |
| Therefore, for parallel programs, insufficient performance is just as |
| much a bug as is incorrectness. |
| |
| \QuickQuiz{} |
| That is ridiculous!!! |
| After all, isn't getting the correct answer later than one would like |
| better than getting an incorrect answer??? |
| \QuickQuizAnswer{ |
| This question fails to consider the option of choosing not to |
| compute the answer at all, and in doing so, also fails to consider |
| the costs of computing the answer. |
| For example, consider short-term weather forecasting, for which |
| accurate models exist, but which require large (and expensive) |
| clustered supercomputers, at least if you want to actually run |
| the model faster than the weather. |
| |
| And in this case, any performance bug that prevents the model from |
| running faster than the actual weather prevents any forecasting. |
| Given that the whole purpose of purchasing the large clustered |
| supercomputers was to forecast weather, if you cannot run the |
| model faster than the weather, you would be better off not running |
| the model at all. |
| |
| More severe examples may be found in the area of safety-critical |
| real-time computing. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But if you are going to put in all the hard work of parallelizing |
| an application, why not do it right? |
| Why settle for anything less than optimal performance and |
| linear scalability? |
| \QuickQuizAnswer{ |
| Although I do heartily salute your spirit and aspirations, |
| you are forgetting that there may be high costs due to delays |
| in the program's completion. |
| For an extreme example, suppose that a 40\% performance shortfall |
| from a single-threaded application is causing one person to die |
| each day. |
| Suppose further that in a day you could hack together a |
| quick and dirty |
| parallel program that ran 50\% faster on an eight-CPU system |
| than the sequential version, but that an optimal parallel |
| program would require four months of painstaking design, coding, |
| debugging, and tuning. |
| |
| It is safe to say that more than 100 people would prefer the |
| quick and dirty version. |
| } \QuickQuizEnd |
| |
Validating a parallel program must therefore include validating its
| performance. |
| But validating performance means having a workload to run and performance |
| criteria with which to evaluate the program at hand. |
| These needs are often met by \emph{performance benchmarks}, which |
| are discussed in the next section. |
| |
| \subsection{Benchmarking} |
| \label{sec:debugging:Benchmarking} |
| |
| The old saying goes ``There are lies, damn lies, statistics, |
| and benchmarks.'' |
| However, benchmarks are heavily used, so it is not helpful to |
| be too dismissive of them. |
| |
| Benchmarks span the range from ad hoc test jigs to international |
| standards, but regardless of their level of formality, benchmarks |
| serve four major purposes: |
| |
| \begin{enumerate} |
| \item Providing a fair framework for comparing competing implementations. |
| \item Focusing competitive energy on improving implementations in ways |
| that matter to users. |
| \item Serving as example uses of the implementations being benchmarked. |
| \item Serving as a marketing tool to highlight your software's |
| strong points against your competitors' offerings. |
| \end{enumerate} |
| |
| Of course, the only completely fair framework is the intended |
| application itself. |
| So why would anyone who cared about fairness in benchmarking |
| bother creating imperfect benchmarks rather than simply |
| using the application itself as the benchmark? |
| |
| Running the actual application is in fact the best approach where it is practical. |
| Unfortunately, it is often impractical for the following reasons: |
| |
| \begin{enumerate} |
| \item The application might be proprietary, and you |
| might not have the right to run the intended application. |
| \item The application might require more hardware |
| than you have access to. |
| \item The application might use data that you cannot legally |
| access, for example, due to privacy regulations. |
| \end{enumerate} |
| |
| In these cases, creating a benchmark that approximates |
| the application can help overcome these obstacles. |
| A carefully constructed benchmark can help promote performance, |
| scalability, energy efficiency, and much else besides. |
| |
| \subsection{Profiling} |
| \label{sec:debugging:Profiling} |
| |
| In many cases, a fairly small portion of your software is responsible |
| for the majority of the performance and scalability shortfall. |
| However, developers are notoriously unable to identify the actual |
| bottlenecks by hand. |
| For example, in the case of a kernel buffer allocator, all attention focused |
| on a search of a dense array which turned out to represent only |
| a few percent of the allocator's execution time. |
| An execution profile collected via a logic analyzer focused attention |
| on the cache misses that were actually responsible for the majority |
| of the problem~\cite{McKenney93}. |
| |
| An old-school but quite effective method of tracking down performance |
| and scalability bugs is to run your program under a debugger, |
| then periodically interrupt it, recording the stacks of all threads |
| at each interruption. |
| The theory here is that if something is slowing down your program, |
| it has to be visible in your threads' executions. |
| |
| That said, there are a number of tools |
| that will usually do a much better job of helping you to focus your |
| attention where it will do the most good. |
| Two popular choices are \co{gprof} and \co{perf}. |
| To use \co{perf} on a single-process program, prefix your command |
| with \co{perf record}, then after the command completes, type |
| \co{perf report}. |
| There is a lot of work on tools for performance debugging of multi-threaded |
| programs, which should make this important job easier. |
| Again, one good starting point is Brendan Gregg's blog.\footnote{ |
| \url{http://www.brendangregg.com/blog/}} |
| |
| \subsection{Differential Profiling} |
| \label{sec:debugging:Differential Profiling} |
| |
| Scalability problems will not necessarily be apparent unless you are running |
| on very large systems. |
| However, it is sometimes possible to detect impending scalability problems |
| even when running on much smaller systems. |
| One technique for doing this is called \emph{differential profiling}. |
| |
| The idea is to run your workload under two different sets of conditions. |
| For example, you might run it on two CPUs, then run it again on four |
| CPUs. |
| You might instead vary the load placed on the system, the number of |
| network adapters, the number of mass-storage devices, and so on. |
| You then collect profiles of the two runs, and mathematically combine |
| corresponding profile measurements. |
| For example, if your main concern is scalability, you might take the |
| ratio of corresponding measurements, and then sort the ratios into |
| descending numerical order. |
| The prime scalability suspects will then be sorted to the top of the |
| list~\cite{McKenney95a,McKenney99b}. |
| |
| Some tools such as \co{perf} have built-in differential-profiling |
| support. |
| |
| \subsection{Microbenchmarking} |
| \label{sec:debugging:Microbenchmarking} |
| |
| Microbenchmarking can be useful when deciding which algorithms or |
| data structures are worth incorporating into a larger body of software |
| for deeper evaluation. |
| |
| One common approach to microbenchmarking is to measure the time, |
| run some number of iterations of the code |
| under test, then measure the time again. |
| The difference between the two times divided by the number of iterations |
| gives the measured time required to execute the code under test. |
| |
| Unfortunately, this approach to measurement allows any number of errors |
| to creep in, including: |
| |
| \begin{enumerate} |
| \item The measurement will include some of the overhead of |
| the time measurement. |
| This source of error can be reduced to an arbitrarily small |
| value by increasing the number of iterations. |
| \item The first few iterations of the test might incur cache misses |
| or (worse yet) page faults that might inflate the measured |
| value. |
| This source of error can also be reduced by increasing the |
| number of iterations, or it can often be eliminated entirely |
| by running a few warm-up iterations before starting the |
| measurement period. |
| \item Some types of interference, for example, random memory errors, |
| are so rare that they can be dealt with by running a number |
| of sets of iterations of the test. |
| If the level of interference was statistically significant, |
| any performance outliers could be rejected statistically. |
| \item Any iteration of the test might be interfered with by other |
| activity on the system. |
| Sources of interference include other applications, system |
| utilities and daemons, device interrupts, firmware interrupts |
| (including system management interrupts, or SMIs), |
| virtualization, memory errors, and much else besides. |
| Assuming that these sources of interference occur randomly, |
| their effect can be minimized by reducing the number of |
| iterations. |
| \end{enumerate} |
| |
| The first and fourth sources of interference provide conflicting advice, |
| which is one sign that we are living in the real world. |
| The remainder of this section looks at ways of resolving this conflict. |
| |
| \QuickQuiz{} |
| But what about other sources of error, for example, due to |
| interactions between caches and memory layout? |
| \QuickQuizAnswer{ |
| Changes in memory layout can indeed result in unrealistic |
| decreases in execution time. |
| For example, suppose that a given microbenchmark almost |
| always overflows the L0 cache's associativity, but with just the right |
| memory layout, it all fits. |
| If this is a real concern, consider running your microbenchmark |
| using huge pages (or within the kernel or on bare metal) in |
| order to completely control the memory layout. |
| } \QuickQuizEnd |
| |
| The following sections discuss ways of dealing with these measurement |
| errors, with |
| Section~\ref{sec:debugging:Isolation} |
| covering isolation techniques that may be used to prevent some forms of |
| interference, |
| and with |
| Section~\ref{sec:debugging:Detecting Interference} |
| covering methods for detecting interference so as to reject measurement |
| data that might have been corrupted by that interference. |
| |
| \subsection{Isolation} |
| \label{sec:debugging:Isolation} |
| |
| The Linux kernel provides a number of ways to isolate a group of |
| CPUs from outside interference. |
| |
| First, let's look at interference by other processes, threads, and tasks. |
| The POSIX \co{sched_setaffinity()} system call may be used to move |
| most tasks off of a given set of CPUs and to confine your tests to |
| that same group. |
| The Linux-specific user-level \co{taskset} command may be used for |
| the same purpose, though both \co{sched_setaffinity()} and |
| \co{taskset} require elevated permissions. |
| Linux-specific control groups (cgroups) may be used for this same purpose. |
| This approach can be quite effective at reducing interference, and |
| is sufficient in many cases. |
However, it does have limitations; for example, it cannot do anything
about the per-CPU kernel threads that are often used for housekeeping
tasks.
| |
| One way to avoid interference from per-CPU kernel threads is to run |
| your test at a high real-time priority, for example, by using |
| the POSIX \co{sched_setscheduler()} system call. |
| However, note that if you do this, you are implicitly taking on |
| responsibility for avoiding infinite loops, because otherwise |
| your test will prevent part of the kernel from functioning.\footnote{ |
| This is an example of the Spiderman Principle: ``With great |
| power comes great responsibility.''} |
| |
These approaches can greatly reduce, and perhaps even eliminate,
interference from processes, threads, and tasks.
However, they do nothing to prevent interference from device
interrupts, at least in the absence of threaded interrupts.
| Linux allows some control of threaded interrupts via the |
| \path{/proc/irq} directory, which contains numerical directories, one |
| per interrupt vector. |
| Each numerical directory contains \co{smp_affinity} and |
| \co{smp_affinity_list}. |
| Given sufficient permissions, you can write a value to these files |
| to restrict interrupts to the specified set of CPUs. |
For example, ``\co{sudo sh -c 'echo 3 > /proc/irq/23/smp_affinity'}''
would confine interrupts on vector~23 to CPUs~0 and~1.
(Note that a plain \co{sudo echo 3 > ...} would not work, because the
redirection is carried out by the unprivileged shell.)
The same results may be obtained via
``\co{sudo sh -c 'echo 0-1 > /proc/irq/23/smp_affinity_list'}''.
| You can use ``\co{cat /proc/interrupts}'' to obtain a list of the interrupt |
| vectors on your system, how many are handled by each CPU, and what |
| devices use each interrupt vector. |
| |
| Running a similar command for all interrupt vectors on your system |
| would confine interrupts to CPUs~0 and~1, leaving the remaining CPUs |
| free of interference. |
| Or mostly free of interference, anyway. |
| It turns out that the scheduling-clock interrupt fires on each CPU |
| that is running in user mode.\footnote{ |
| Frederic Weisbecker is working on an adaptive-ticks project |
| that will allow the scheduling-clock interrupt to be shut |
| off on any CPU that has only one runnable task, but as of |
| early 2013, this is unfortunately still work in progress.} |
| In addition, you must take care to ensure that the set of CPUs that you |
| confine the interrupts to is capable of handling the load. |
| |
| But this only handles processes and interrupts running in the same |
| operating-system instance as the test. |
| Suppose that you are running the test in a guest OS that is itself |
| running on a hypervisor, for example, Linux running on KVM? |
| Although you can in theory apply the same techniques at the hypervisor |
| level that you can at the guest-OS level, it is quite common for |
| hypervisor-level operations to be restricted to authorized personnel. |
| In addition, none of these techniques work against firmware-level |
| interference. |
| |
| \QuickQuiz{} |
| Wouldn't the techniques suggested to isolate the code under |
| test also affect that code's performance, particularly if |
| it is running within a larger application? |
| \QuickQuizAnswer{ |
| Indeed it might, although in most microbenchmarking efforts |
| you would extract the code under test from the enclosing |
| application. |
| Nevertheless, if for some reason you must keep the code under |
| test within the application, you will very likely need to use |
| the techniques discussed in |
| Section~\ref{sec:debugging:Detecting Interference}. |
| } \QuickQuizEnd |
| |
| If you find yourself in this painful situation, instead of preventing |
| the interference, you might need to detect the interference as described |
| in the next section. |
| |
| \subsection{Detecting Interference} |
| \label{sec:debugging:Detecting Interference} |
| |
| If you cannot prevent interference, perhaps you can detect the |
| interference after the fact and reject the test runs that were affected |
| by that interference. |
| Section~\ref{sec:debugging:Detecting Interference Via Measurement} |
| describes methods of rejection involving additional measurements, |
| while Section~\ref{sec:debugging:Detecting Interference Via Statistics} |
| describes statistics-based rejection. |
| |
| \subsubsection{Detecting Interference Via Measurement} |
| \label{sec:debugging:Detecting Interference Via Measurement} |
| |
| % Sources of interference include other applications, system |
| % utilities and daemons, device interrupts, firmware interrupts |
| % (including system management interrupts, or SMIs), |
| % virtualization, memory errors, and much else besides. |
| |
| Many systems, including Linux, provide means for determining after the |
| fact whether some forms of interference have occurred. |
| For example, if your test encountered process-based interference, |
| a context switch must have occurred during the test. |
| On Linux-based systems, this context switch will be visible in |
| \path{/proc/<PID>/sched} in the \co{nr_switches} field. |
| Similarly, interrupt-based interference can be detected via the |
| \path{/proc/interrupts} file. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 #include <stdio.h> |
| 2 #include <stdlib.h> |
| 3 #include <sys/time.h> |
| 4 #include <sys/resource.h> |
| 5 |
| 6 /* Return 0 if test results should be rejected. */ |
| 7 int runtest(void) |
| 8 { |
| 9 struct rusage ru1; |
| 10 struct rusage ru2; |
| 11 |
| 12 if (getrusage(RUSAGE_SELF, &ru1) != 0) { |
| 13 perror("getrusage"); |
| 14 abort(); |
| 15 } |
| 16 /* run test here. */ |
| 17 if (getrusage(RUSAGE_SELF, &ru2) != 0) { |
| 18 perror("getrusage"); |
| 19 abort(); |
| 20 } |
| 21 return (ru1.ru_nvcsw == ru2.ru_nvcsw && |
| 22 ru1.ru_nivcsw == ru2.ru_nivcsw); |
| 23 } |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Using \tco{getrusage()} to Detect Context Switches} |
| \label{fig:count:Using getrusage() to Detect Context Switches} |
| \end{figure} |
| |
| Opening and reading files is hardly a low-overhead operation, |
| but fortunately the count of context switches for a given thread |
| may be obtained via the \co{getrusage()} system call, as shown in |
| Figure~\ref{fig:count:Using getrusage() to Detect Context Switches}. |
| This same system call can be used to detect minor page faults (\co{ru_minflt}) |
| and major page faults (\co{ru_majflt}). |
| |
| Unfortunately, detecting memory errors and firmware interference is quite |
| system-specific, as is the detection of interference due to virtualization. |
| Although avoidance is better than detection, and detection is better than |
| statistics, there are times when one must avail oneself of statistics, |
| a topic addressed in the next section. |
| |
| \subsubsection{Detecting Interference Via Statistics} |
| \label{sec:debugging:Detecting Interference Via Statistics} |
| |
| Any statistical analysis will be based on assumptions about the data, |
| and performance microbenchmarks often support the following assumptions: |
| |
| \begin{enumerate} |
| \item Smaller measurements are more likely to be accurate than |
| larger measurements. |
| \item The measurement uncertainty of good data is known. |
| \item A reasonable fraction of the test runs will result in good data. |
| \end{enumerate} |
| |
| The fact that smaller measurements are more likely to be accurate than |
| larger measurements suggests that sorting the measurements in increasing |
| order is likely to be productive.\footnote{ |
| To paraphrase the old saying, ``Sort first and ask questions later.''} |
| The fact that the measurement uncertainty is known allows us to accept |
| measurements within this uncertainty of each other: If the effects of |
| interference are large compared to this uncertainty, this will ease |
| rejection of bad data. |
| Finally, the fact that some fraction of the test runs (for example, one third) can be |
| assumed to be good allows us to blindly accept the first portion of the |
| sorted list, and this data can then be used to gain an estimate of the |
| natural variation of the measured data, over and above the assumed |
| measurement error. |
| |
| The approach is to take the specified number of leading elements from the |
| beginning of the sorted list, and use these to estimate a typical |
| inter-element delta, which in turn may be multiplied by the number of |
| elements in the list to obtain an upper bound on permissible values. |
| The algorithm then repeatedly considers the next element of the list. |
| If it falls below the upper bound, or if the distance between it and |
| the previous element is not too much greater than the average |
| inter-element distance for the portion of the list accepted |
| thus far, then the next element is accepted and the process repeats. |
| Otherwise, the remainder of the list is rejected. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbbox} |
| 1 divisor=3 |
| 2 relerr=0.01 |
| 3 trendbreak=10 |
| 4 while test $# -gt 0 |
| 5 do |
| 6 case "$1" in |
| 7 --divisor) |
| 8 shift |
| 9 divisor=$1 |
| 10 ;; |
| 11 --relerr) |
| 12 shift |
| 13 relerr=$1 |
| 14 ;; |
| 15 --trendbreak) |
| 16 shift |
| 17 trendbreak=$1 |
| 18 ;; |
| 19 esac |
| 20 shift |
| 21 done |
| 22 |
| 23 awk -v divisor=$divisor -v relerr=$relerr \ |
| 24 -v trendbreak=$trendbreak '{ |
| 25 for (i = 2; i <= NF; i++) |
| 26 d[i - 1] = $i; |
| 27 asort(d); |
| 28 i = int((NF + divisor - 1) / divisor); |
| 29 delta = d[i] - d[1]; |
| 30 maxdelta = delta * divisor; |
| 31 maxdelta1 = delta + d[i] * relerr; |
| 32 if (maxdelta1 > maxdelta) |
| 33 maxdelta = maxdelta1; |
| 34 for (j = i + 1; j < NF; j++) { |
| 35 if (j <= 2) |
| 36 maxdiff = d[NF - 1] - d[1]; |
| 37 else |
| 38 maxdiff = trendbreak * \ |
| 39 (d[j - 1] - d[1]) / (j - 2); |
| 40 if (d[j] - d[1] > maxdelta && \ |
| 41 d[j] - d[j - 1] > maxdiff) |
| 42 break; |
| 43 } |
| 44 n = sum = 0; |
| 45 for (k = 1; k < j; k++) { |
| 46 sum += d[k]; |
| 47 n++; |
| 48 } |
| 49 min = d[1]; |
| 50 max = d[j - 1]; |
| 51 avg = sum / n; |
| 52 print $1, avg, min, max, n, NF - 1; |
| 53 }' |
| \end{verbbox} |
| } |
| \centering |
| \theverbbox |
| \caption{Statistical Elimination of Interference} |
| \label{fig:count:Statistical Elimination of Interference} |
| \end{figure} |
| |
| Figure~\ref{fig:count:Statistical Elimination of Interference} |
| shows a simple \co{sh}/\co{awk} script implementing this notion. |
| Input consists of an x-value followed by an arbitrarily long list of y-values, |
| and output consists of one line for each input line, with fields as follows: |
| |
| \begin{enumerate} |
| \item The x-value. |
| \item The average of the selected data. |
| \item The minimum of the selected data. |
| \item The maximum of the selected data. |
| \item The number of selected data items. |
| \item The number of input data items. |
| \end{enumerate} |
| |
| This script takes three optional arguments as follows: |
| |
| \begin{description} |
| \item [\tco{--divisor}\nf{:}] Number of segments to divide the list into, |
| for example, a divisor of four means that the first quarter |
| of the data elements will be assumed to be good. |
| This defaults to three. |
| \item [\tco{--relerr}\nf{:}] Relative measurement error. The script assumes |
| that values that differ by less than this error are for all |
| intents and purposes equal. |
| This defaults to 0.01, which is equivalent to 1\%. |
| \item [\tco{--trendbreak}\nf{:}] Ratio of inter-element spacing constituting |
| a break in the trend of the data. |
| For example, if the average spacing in the data accepted so far |
| is 1.5 and the trend-break ratio is 2.0, then a new data value |
| differing from the previous one by more than 3.0 constitutes |
| a break in the trend. |
| (Unless, of course, the relative error is greater than 3.0, in |
| which case the ``break'' will be ignored.) |
| \end{description} |
| |
| Lines~1-3 of |
| Figure~\ref{fig:count:Statistical Elimination of Interference} |
| set the default values for the parameters, and lines~4-21 parse |
| any command-line overriding of these parameters. |
| The \co{awk} invocation on lines~23 and~24 sets the values of the |
| \co{divisor}, \co{relerr}, and \co{trendbreak} variables to their |
| \co{sh} counterparts. |
| In the usual \co{awk} manner, lines~25-52 are executed on each input |
| line. |
| The loop spanning lines~25 and~26 copies the input y-values to the |
| \co{d} array, which line~27 sorts into increasing order. |
| Line~28 computes the number of y-values that are to be trusted absolutely |
| by applying \co{divisor} and rounding up. |
| |
| Lines~29-33 compute the \co{maxdelta} value used as a lower bound on |
| the upper bound of y-values. |
| To this end, lines~29 and~30 multiply the difference in values over |
| the trusted region of data by the \co{divisor}, which projects that |
| difference across the entire set of y-values. |
| However, this value might well be much smaller than the relative error, |
| so line~31 computes the absolute error (\co{d[i] * relerr}) and adds |
| that to the difference \co{delta} across the trusted portion of the data. |
| Lines~32 and~33 then compute the maximum of these two values. |
| |
| Each pass through the loop spanning lines~34-43 attempts to add another |
| data value to the set of good data. |
| Lines~35-39 compute the trend-break delta, with line~36 disabling this |
| limit if we don't yet have enough values to compute a trend, |
| and with lines~38 and~39 multiplying \co{trendbreak} by the average |
| difference between pairs of data values in the good set. |
| If line~40 determines that the candidate data value would exceed the |
| lower bound on the upper bound (\co{maxdelta}) \emph{and} |
| line~41 determines that the difference between the candidate data value |
| and its predecessor exceeds the trend-break difference (\co{maxdiff}), |
| then line~42 exits the loop: We have the full good set of data. |
| |
| Lines~44-52 then compute and print the statistics for the data set. |
| |
| \QuickQuiz{} |
| This approach is just plain weird! |
| Why not use means and standard deviations, like we were taught |
| in our statistics classes? |
| \QuickQuizAnswer{ |
| Because mean and standard deviation were not designed to do this job. |
| To see this, try applying mean and standard deviation to the |
| following data set, given a 1\% relative error in measurement: |
| |
| \begin{quote} |
| 49,548.4 49,549.4 49,550.2 49,550.9 49,550.9 49,551.0 |
| 49,551.5 49,552.1 49,899.0 49,899.3 49,899.7 49,899.8 |
| 49,900.1 49,900.4 52,244.9 53,333.3 53,333.3 53,706.3 |
| 53,706.3 54,084.5 |
| \end{quote} |
| |
| The problem is that mean and standard deviation do not rest on |
| any sort of measurement-error assumption, and they will therefore |
| see the difference between the values near 49,500 and those near |
| 49,900 as being statistically significant, when in fact they are |
| well within the bounds of estimated measurement error. |
| |
| Of course, it is possible to create a script similar to |
| that in |
| Figure~\ref{fig:count:Statistical Elimination of Interference} |
| that uses standard deviation rather than absolute difference |
| to get a similar effect, |
| and this is left as an exercise for the interested reader. |
| Be careful to avoid divide-by-zero errors arising from strings |
| of identical data values! |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| But what if all the y-values in the trusted group of data |
| are exactly zero? |
| Won't that cause the script to reject any non-zero value? |
| \QuickQuizAnswer{ |
| Indeed it will! |
| But if your performance measurements often produce a value of |
| exactly zero, perhaps you need to take a closer look at your |
| performance-measurement code. |
| |
| Note that many approaches based on mean and standard deviation |
| will have similar problems with this sort of dataset. |
| } \QuickQuizEnd |
| |
| Although statistical interference detection can be quite useful, it should |
| be used only as a last resort. |
| It is far better to avoid interference in the first place |
| (Section~\ref{sec:debugging:Isolation}), or, failing that, to |
| detect interference via measurement |
| (Section~\ref{sec:debugging:Detecting Interference Via Measurement}). |
| |
| \section{Summary} |
| \label{sec:debugging:Summary} |
| |
| \begin{figure}[tbhp] |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/UseTheRightCannon}} |
| \caption{Choose Validation Methods Wisely} |
| \ContributedBy{Figure}{fig:debugging:Choose Validation Methods Wisely}{Melissa Broussard} |
| \end{figure} |
| |
| Although validation never will be an exact science, much can be gained |
| by taking an organized approach to it, as an organized approach will |
| help you choose the right validation tools for your job, avoiding |
| situations like the one fancifully depicted in |
| Figure~\ref{fig:debugging:Choose Validation Methods Wisely}. |
| |
| A key choice is that of statistics. |
| Although the methods described in this chapter work very well most of |
| the time, they do have their limitations. |
| These limitations are inherent because we are attempting to do something |
| that is in general impossible, courtesy of the |
| Halting Problem~\cite{AlanMTuring1937HaltingProblem,GeoffreyKPullum2000HaltingProblem}. |
| Fortunately for us, there are a huge number of special cases in which |
| we can not only work out whether a given program will halt, but also |
| establish estimates for how long it will run before halting, as discussed in |
| Section~\ref{sec:debugging:Performance Estimation}. |
| Furthermore, in cases where a given program might or might not work |
| correctly, we can often establish estimates for what fraction of the |
| time it will work correctly, as discussed in |
| Section~\ref{sec:debugging:Probability and Heisenbugs}. |
| |
| Nevertheless, unthinking reliance on these estimates is brave to the |
| point of foolhardiness. |
| After all, we are summarizing a huge mass of complexity in code and |
| data structures down to a single solitary number. |
| Even though we can get away with such bravery a surprisingly large |
| fraction of the time, abstracting all that code and data away will |
| occasionally cause severe problems. |
| |
| One possible problem is variability, where repeated runs might give |
| wildly different results. |
| This is often dealt with by maintaining a standard deviation as well |
| as a mean, but the fact is that attempting to summarize the behavior |
| of a large and complex program with two numbers is almost as brave as |
| summarizing its behavior with only one number. |
| In computer programming, the surprising thing is that use of the |
| mean, or of the mean and standard deviation, is often sufficient, but |
| there are no guarantees. |
| |
| One cause of variation is confounding factors. |
| For example, the CPU time consumed by a linked-list search will depend |
| on the length of the list. |
| Averaging together runs with wildly different list lengths will |
| probably not be useful, and adding a standard deviation to the mean |
| will not be much better. |
| The right thing to do would be to control for list length, either by |
| holding the length constant or by measuring CPU time as a function of |
| list length. |
| |
| Of course, this advice assumes that you are aware of the confounding |
| factors, and Murphy says that you probably will not be. |
| I have been involved in projects that had confounding factors as |
| diverse as air conditioners (which drew considerable power at startup, |
| thus causing the voltage supplied to the computer to momentarily drop |
| too low, sometimes resulting in failure), cache state (resulting in |
| odd variations in performance), I/O errors (including disk errors, |
| packet loss, and duplicate Ethernet MAC addresses), and even |
| porpoises (which could not resist playing with an array of transponders, |
| which, in the absence of porpoises, could be |
| used for high-precision acoustic positioning and navigation). |
| And this is but one reason why a good night's sleep is such an |
| effective debugging tool. |
| |
| In short, validation always will require some measure of the behavior of |
| the system. |
| Because this measure must be a severe summarization of the system, |
| it can be misleading. |
| So as the saying goes, ``Be careful. It is a real world out there.'' |
| |
| But suppose you are working on the Linux kernel, which as of 2013 has |
| about a billion instances throughout the world? |
| In that case, a bug that would be encountered once every million years |
| will be encountered almost three times per day across the installed |
| base. |
| A test with a 50\% chance of encountering this bug in a one-hour run |
| would need to increase that bug's probability of occurrence by more |
| than nine orders of magnitude, which poses a severe challenge to |
| today's testing methodologies. |
| One important tool that can sometimes be applied with good effect to |
| such situations is formal verification, the subject of the next chapter. |