| % future/formalregress.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Formal Regression Testing?} |
| \label{sec:future:Formal Regression Testing?} |
| % |
| \epigraph{Theory without experiments: |
| Have we gone too far?} |
| {Michael Mitzenmacher} |
| |
| Formal verification has long proven useful in a number of production |
| environments~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}. |
| However, it is an open question as to whether hard-core formal verification |
| will ever be included in the automated regression-test suites used for |
| continuous integration within complex concurrent codebases, such as the |
| Linux kernel. |
| Although there is already a proof of concept for Linux-kernel |
| SRCU~\cite{LanceRoy2017CBMC-SRCU}, this test is for a small portion |
of one of the simplest RCU implementations, and it has proven difficult
to keep current with the ever-changing Linux kernel.
| It is therefore worth asking what would be required to incorporate |
formal-verification tools as first-class members of the Linux kernel's
| regression tests. |
| |
| The following list is a good |
| start~\cite[slide 34]{PaulEMcKenney2015DagstuhlVerification}: |
| |
| \begin{enumerate} |
| \item Any required translation must be automated. |
| \item The environment (including memory ordering) must be correctly |
| handled. |
| \item The memory and CPU overhead must be acceptably modest. |
| \item Specific information leading to the location of the bug |
| must be provided. |
| \item Information beyond the source code and inputs must be |
| modest in scope. |
| \item The bugs located must be relevant to the code's users. |
| \end{enumerate} |
| |
| This list builds on, but is somewhat more modest than, Richard Bornat's |
| dictum: |
| ``Formal-verification researchers should verify the code that |
| developers write, in the language they write it in, running in the |
| environment that it runs in, as they write it.'' |
| The following sections discuss each of the above requirements, followed |
| by a section presenting a scorecard of how well a few tools stack up |
| against these requirements. |
| |
| \QuickQuiz{ |
| This list is ridiculously utopian! |
| Why not stick to the current state of the formal-verification art? |
| }\QuickQuizAnswer{ |
| You are welcome to your opinion on what is and is not utopian, |
| but I will be paying more attention to people actually making |
| progress on the items in that list than to anyone who might be |
| objecting to them. |
| This might have something to do with my long experience with |
| people attempting to talk me out of specific things that their |
| favorite tools cannot handle. |
| |
| In the meantime, please feel free to read the papers written by |
| the people who are actually making progress, for example, this |
| one~\cite{DinoDistefano2019FBstaticAnalysis}. |
| }\QuickQuizEnd |
| |
| \subsection{Automatic Translation} |
| \label{sec:future:Automatic Translation} |
| |
| Although Promela and \co{spin} |
| are invaluable design aids, if you need to formally regression-test |
| your C-language program, you must hand-translate to Promela each time |
| you would like to re-verify your code. |
| If your code happens to be in the Linux kernel, which releases every |
| 60--90 days, you will need to hand-translate from four to six times |
| each year. |
| Over time, human error will creep in, which means that the verification |
| won't match the source code, rendering the verification useless. |
| Repeated verification clearly requires either that the formal-verification |
| tooling input your code directly, or that there be bug-free automatic |
| translation of your code to the form required for verification. |
| |
| PPCMEM and \co{herd} can in theory directly input assembly language |
| and C++ code, but these tools work only on very small litmus tests, |
| which normally means that you must extract the core of your |
| mechanism---by hand. |
| As with Promela and \co{spin}, both PPCMEM and \co{herd} are |
| extremely useful, but they are not well-suited for regression suites. |
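
For example, the core of a store-buffering use of the Linux kernel's
\co{WRITE_ONCE()} and \co{READ_ONCE()} might be hand-extracted into
something like the following litmus-test sketch, whose name and
contents are illustrative rather than drawn from any particular
kernel code path:

\begin{verbatim}
C SB+mb-sketch

{}

P0(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*x, 1);
	smp_mb();
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)
\end{verbatim}

Deciding which accesses and which orderings from the original code
are relevant to such a test is exactly the step that must currently
be carried out by hand.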
| |
| In contrast, \IXaltacr{\co{cbmc}}{cbmc} and \IX{Nidhugg} can input |
| C programs of reasonable |
| (though still quite limited) size, and if their capabilities continue |
| to grow, could well become excellent additions to regression suites. |
| The Coverity static-analysis tool also inputs C programs, and of very |
| large size, including the Linux kernel. |
| Of course, Coverity's static analysis is quite simple compared to that |
| of \co{cbmc} and Nidhugg. |
| On the other hand, Coverity had an all-encompassing definition of |
| ``C program'' that posed special challenges~\cite{AlBessey2010BillionLoCLater}. |
| Amazon Web Services uses a variety of formal-verification tools, |
| including \co{cbmc}, and applies some of these tools to regression |
| testing~\cite{ByronCook2018FormalAmazon}. |
| Google uses a number of relatively simple static analysis tools directly |
| on large Java code bases, which are arguably less diverse than C code |
| bases~\cite{CaitlinSadowski2018staticAnalysisGoogle}. |
| Facebook uses more aggressive forms of formal verification against its |
| code bases, including analysis of concurrency~\cite{DinoDistefano2019FBstaticAnalysis,PeterWOHearn2019incorrectnessLogic}, |
| though not yet on the Linux kernel. |
| Finally, Microsoft has long used static analysis on its code |
| bases~\cite{JamesRLarus2004RightingSoftware}. |
| |
| Given this list, it is clearly possible to create sophisticated |
| formal-verification tools that directly consume production-quality |
| source code. |
| |
| However, one shortcoming of taking C code as input is that it assumes |
| that the compiler is correct. |
| An alternative approach is to take the binary produced by the C compiler |
| as input, thereby accounting for any relevant compiler bugs. |
| This approach has been used in a number of verification efforts, |
| perhaps most notably by the seL4 |
| project~\cite{ThomasSewell2013L4binaryVerification}. |
| |
| \QuickQuiz{ |
| Given the groundbreaking nature of the various verifiers used |
| in the seL4 project, why doesn't this chapter cover them in |
| more depth? |
| }\QuickQuizAnswer{ |
| There can be no doubt that the verifiers used by the seL4 |
| project are quite capable. |
| However, seL4 started as a single-CPU project. |
| And although seL4 has gained multi-processor |
| capabilities, it is currently using very coarse-grained |
| locking that is similar to the Linux kernel's old |
| Big Kernel Lock (BKL)\@. |
| There will hopefully come a day when it makes sense to add |
| seL4's verifiers to a book on parallel programming, but |
| this is not yet that day. |
| }\QuickQuizEnd |
| |
Either way, verifying directly from the source code or from the binary
has the advantage of eliminating human translation errors, which is
critically important for reliable regression testing.
| |
| This is not to say that tools with special-purpose languages are useless. |
| On the contrary, they can be quite helpful for design-time verification, |
| as was discussed in |
| \cref{chp:Formal Verification}. |
| However, such tools are not particularly helpful for automated regression |
| testing, which is in fact the topic of this section. |
| |
| \subsection{Environment} |
| \label{sec:future:Environment} |
| |
| It is critically important that formal-verification tools correctly |
| model their environment. |
| One all-too-common omission is the \IX{memory model}, where a great |
| many formal-verification tools, including Promela/\co{spin}, are |
| restricted to \IXalth{sequential consistency}{sequential}{memory consistency}. |
| The QRCU experience related in |
| \cref{sec:formal:Is QRCU Really Correct?} |
| is an important cautionary tale. |
| |
| Promela and \co{spin} assume sequential consistency, which is not a |
| good match for modern computer systems, as was seen in |
| \cref{chp:Advanced Synchronization: Memory Ordering}. |
| In contrast, one of the great strengths of PPCMEM and \co{herd} |
is their detailed modeling of various CPU families' memory models,
including x86, \ARM, and Power, as well as, in the case of \co{herd},
the Linux-kernel memory model~\cite{Alglave:2018:FSC:3173162.3177156},
which was accepted into the Linux kernel in version v4.17.
| |
| The \co{cbmc} and Nidhugg tools provide some ability to select |
| memory models, but do not provide the variety that PPCMEM and |
| \co{herd} do. |
| However, it is likely that the larger-scale tools will adopt |
| a greater variety of memory models as time goes on. |
| |
| In the longer term, it would be helpful for formal-verification |
| tools to include I/O~\cite{PaulEMcKenney2016LinuxKernelMMIO}, |
| but it may be some time before this comes to pass. |
| |
| Nevertheless, tools that fail to match the environment can still |
| be useful. |
| For example, a great many concurrency bugs would still be bugs on |
| a mythical sequentially consistent system, and these bugs could |
| be located by a tool that over-approximates the system's memory model |
| with sequential consistency. |
However, these tools will fail to find bugs involving missing
| memory-ordering directives, as noted in the aforementioned |
| cautionary tale of |
| \cref{sec:formal:Is QRCU Really Correct?}. |
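
For example, consider the following message-passing sketch in the
same C-litmus format, in which \co{P0()} publishes a value and then
sets a flag, and which is again illustrative rather than extracted
from any real code base:

\begin{verbatim}
C MP+sketch

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*y);
	r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)
\end{verbatim}

A tool assuming sequential consistency would report this
\co{exists} clause unsatisfiable, but weakly ordered systems really
can produce this outcome, and forbidding it requires (for example)
replacing the accesses to \co{y} with \co{smp_store_release()} and
\co{smp_load_acquire()}.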
| |
| \subsection{Overhead} |
| \label{sec:future:Overhead} |
| |
| Almost all hard-core formal-verification tools are exponential |
| in nature, which might seem discouraging until you consider that |
| many of the most interesting software questions are in fact undecidable. |
| However, there are differences in degree, even among exponentials. |
| |
| PPCMEM by design is unoptimized, in order to provide greater assurance |
| that the memory models of interest are accurately represented. |
| The \co{herd} tool optimizes more aggressively, as described in |
| \cref{sec:formal:Axiomatic Approaches}, and is thus orders of magnitude |
| faster than PPCMEM\@. |
| Nevertheless, both PPCMEM and \co{herd} target very small litmus tests |
| rather than larger bodies of code. |
| |
| In contrast, Promela/\co{spin}, \co{cbmc}, and Nidhugg are designed for |
| (somewhat) larger bodies of code. |
| Promela/\co{spin} was used to verify the Curiosity rover's |
| filesystem~\cite{DBLP:journals/amai/GroceHHJX14} and, as noted earlier, |
both \co{cbmc} and Nidhugg were applied to Linux-kernel RCU\@.
| |
| If advances in heuristics continue at the rate of the past three |
| decades, we can look forward to large reductions in overhead for |
| formal verification. |
| That said, combinatorial explosion is still combinatorial explosion, |
| which would be expected to sharply limit the size of programs that |
| could be verified, with or without continued improvements in |
| heuristics. |
| |
| However, the flip side of combinatorial explosion is Philip II of |
| Macedon's timeless advice: |
| ``Divide and rule.'' |
| If a large program can be divided and the pieces verified, the result |
| can be combinatorial \emph{implosion}~\cite{PaulEMcKenney2011Verico}. |
| One natural place to divide is on API boundaries, for example, those |
| of locking primitives. |
| One verification pass can then verify that the locking implementation |
| is correct, and additional verification passes can verify correct |
| use of the locking APIs. |
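
For example, a verification pass checking correct use of locking can
take the locking primitives themselves as given, as in the following
sketch of a lock-based store-buffering test (an approximation of the
lock-based tests used in the comparison below, not their verbatim
contents):

\begin{verbatim}
C SB+lock-sketch

{}

P0(int *x, int *y, spinlock_t *l)
{
	int r1;

	spin_lock(l);
	WRITE_ONCE(*x, 1);
	r1 = READ_ONCE(*y);
	spin_unlock(l);
}

P1(int *x, int *y, spinlock_t *l)
{
	int r1;

	spin_lock(l);
	WRITE_ONCE(*y, 1);
	r1 = READ_ONCE(*x);
	spin_unlock(l);
}

exists (0:r1=0 /\ 1:r1=0)
\end{verbatim}

Because both critical sections are protected by the same lock, a
correct locking implementation forbids the outcome called out by the
\co{exists} clause, and the verification of that implementation can
be carried out separately.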
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-SB+l-o-o-u+l-o-o-u-C@whole.fcv} |
| \caption{Emulating Locking with \tco{cmpxchg_acquire()}} |
| \label{lst:future:Emulating Locking with cmpxchg} |
| \end{listing} |
| |
| \begin{table} |
| \rowcolors{1}{}{lightgray} |
| \renewcommand*{\arraystretch}{1.1} |
| \small |
| \centering |
| \begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3]} |
| \toprule |
| \multicolumn{1}{c}{\# Threads} & \multicolumn{1}{c}{Locking} & |
| \multicolumn{1}{c}{\tco{cmpxchg_acquire}} \\ |
| \midrule |
| 2 & 0.004 & 0.022 \\ |
| 3 & 0.041 & 0.743 \\ |
| 4 & 0.374 & 59.565 \\ |
| 5 & 4.905 & \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Emulating Locking: |
| Performance (s)} |
| \label{tab:future:Emulating Locking: Performance (s)} |
| \end{table} |
| |
| The performance benefits of this approach can be demonstrated using |
| the Linux-kernel memory |
| model~\cite{Alglave:2018:FSC:3173162.3177156}. |
| This model provides \co{spin_lock()} and \co{spin_unlock()} |
| primitives, but these primitives can also be emulated using |
| \co{cmpxchg_acquire()} and \co{smp_store_release()}, as shown in |
| \cref{lst:future:Emulating Locking with cmpxchg} |
(\path{C-SB+l-o-o-u+l-o-o-u.litmus} and \path{C-SB+l-o-o-u+l-o-o-u-C.litmus}).
| \Cref{tab:future:Emulating Locking: Performance (s)} |
| compares the performance and scalability of using the model's |
| \co{spin_lock()} and \co{spin_unlock()} against emulating these |
| primitives as shown in the listing. |
| The difference is not insignificant: |
| At four processes, the model is more than two orders of magnitude |
| faster than emulation! |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:future:formalregress:C-SB+l-o-o-u+l-o-o-u-C:whole] |
| Why bother with a separate \co{filter} command on \clnref{filter_} of |
| \cref{lst:future:Emulating Locking with cmpxchg} |
| instead of just adding the condition to the \co{exists} clause? |
| And wouldn't it be simpler to use \co{xchg_acquire()} instead |
| of \co{cmpxchg_acquire()}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| The \co{filter} clause causes the \co{herd} tool to discard |
| executions at an earlier stage of processing than does |
| the \co{exists} clause, which provides significant speedups. |
| |
| \begin{table} |
| \rowcolors{7}{lightgray}{} |
| \renewcommand*{\arraystretch}{1.1} |
| \small |
| \centering |
| \begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3] |
| S[table-format=3.3]S[table-format=2.3]S[table-format=3.3]} |
| \toprule |
| & & \multicolumn{2}{c}{\tco{cmpxchg_acquire()}} |
| & \multicolumn{2}{c}{\tco{xchg_acquire()}} \\ |
| \cmidrule(l){3-4} \cmidrule(l){5-6} |
| \multicolumn{1}{c}{\#} & \multicolumn{1}{c}{Lock} |
| & \multicolumn{1}{c}{\tco{filter}} |
| & \multicolumn{1}{c}{\tco{exists}} |
| & \multicolumn{1}{c}{\tco{filter}} |
| & \multicolumn{1}{c}{\tco{exists}} \\ |
| \cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6} |
| 2 & 0.004 & 0.022 & 0.039 & 0.027 & 0.058 \\ |
| 3 & 0.041 & 0.743 & 1.653 & 0.968 & 3.203 \\ |
| 4 & 0.374 & 59.565 & 151.962 & 74.818 & 500.96 \\ |
| 5 & 4.905 & & & & \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Emulating Locking: |
| Performance Comparison (s)} |
| \label{tab:future:Emulating Locking: Performance Comparison (s)} |
| \end{table} |
| |
| As for \co{xchg_acquire()}, this atomic operation will do a |
| write whether or not lock acquisition succeeds, which means |
| that a model using \co{xchg_acquire()} will have more operations |
| than one using \co{cmpxchg_acquire()}, which won't do a write |
| in the failed-acquisition case. |
More writes means more combinatorial explosion, as shown in
| \cref{tab:future:Emulating Locking: Performance Comparison (s)} |
(\path{C-SB+l-o-o-u+l-o-o-u.litmus},
\path{C-SB+l-o-o-u+l-o-o-u-C.litmus},
\path{C-SB+l-o-o-u+l-o-o-u-CE.litmus},
\path{C-SB+l-o-o-u+l-o-o-u-X.litmus}, and
\path{C-SB+l-o-o-u+l-o-o-u-XE.litmus}).
| This table clearly shows that \co{cmpxchg_acquire()} |
| outperforms \co{xchg_acquire()} and that use of the |
| \co{filter} clause outperforms use of the \co{exists} clause. |
| }\QuickQuizEnd |
| |
| It would of course be quite useful for tools to automatically divide |
| up large programs, verify the pieces, and then verify the combinations |
| of pieces. |
| In the meantime, verification of large programs will require significant |
| manual intervention. |
This intervention will preferably be mediated by scripting, the better
to reliably carry out repeated verifications on each release, and
eventually in a manner well-suited for continuous integration.
| And Facebook's Infer tool has taken important steps towards doing just |
| that, via compositionality and |
| abstraction~\cite{SamBlackshear2018RacerD,DinoDistefano2019FBstaticAnalysis}. |
| |
| In any case, we can expect formal-verification capabilities to continue |
| to increase over time, and any such increases will in turn increase |
| the applicability of formal verification to regression testing. |
| |
| \subsection{Locate Bugs} |
| \label{sec:future:Locate Bugs} |
| |
| Any software artifact of any size contains bugs. |
| Therefore, a formal-verification tool that reports only the |
| presence or absence of bugs is not particularly useful. |
| What is needed is a tool that gives at least \emph{some} information |
| as to where the bug is located and the nature of that bug. |
| |
The \co{cbmc} output includes a traceback mapping back to the source
code, similar to that of Promela/\co{spin}, as does Nidhugg's output.
| Of course, these tracebacks can be quite long, and analyzing them |
| can be quite tedious. |
| However, doing so is usually quite a bit faster |
| and more pleasant than locating bugs the old-fashioned way. |
| |
| In addition, one of the simplest tests of formal-verification tools is |
| bug injection. |
| After all, not only could any of us write |
| \co{printf("VERIFIED\\n")}, but the plain fact is that |
| developers of formal-verification tools are just as bug-prone as |
| are the rest of us. |
Therefore, formal-verification tools that just proclaim that a
bug exists are fundamentally less trustworthy, because their
claims are more difficult to verify on real-world code.
| |
| All that aside, people writing formal-verification tools are |
| permitted to leverage existing tools. |
| For example, a tool designed to determine only the presence |
| or absence of a serious but rare bug might leverage bisection. |
| If an old version of the program under test did not contain the bug, |
| but a new version did, then bisection could be used to quickly |
| locate the commit that inserted the bug, which might be |
| sufficient information to find and fix the bug. |
| Of course, this sort of strategy would not work well for common |
| bugs because in this case bisection would fail due to all commits |
| having at least one instance of the common bug. |
| |
| Therefore, the execution traces provided |
| by many formal-verification tools will continue to be valuable, |
| particularly for complex and difficult-to-understand bugs. |
In addition, recent work applies an \emph{incorrectness-logic}
| formalism reminiscent of the traditional Hoare logic used for |
| full-up correctness proofs, but with the sole purpose of finding |
| bugs~\cite{PeterWOHearn2019incorrectnessLogic}. |
| |
| \subsection{Minimal Scaffolding} |
| \label{sec:future:Minimal Scaffolding} |
| |
| In the old days, formal-verification researchers demanded a full |
| specification against which the software would be verified. |
| Unfortunately, a mathematically rigorous specification might well |
| be larger than the actual code, and each line of specification |
| is just as likely to contain bugs as is each line of code. |
| A formal verification effort proving that the code faithfully implemented |
| the specification would be a proof of bug-for-bug compatibility between |
| the two, which might not be all that helpful. |
| |
| Worse yet, the requirements for a number of software artifacts, |
| including Linux-kernel RCU, are empirical in |
| nature~\cite{PaulEMcKenney2015RCUreqts1,PaulEMcKenney2015RCUreqts2,PaulEMcKenney2015RCUreqts3}.\footnote{ |
| Or, in formal-verification parlance, Linux-kernel RCU has an |
| \emph{incomplete specification}.} |
| For this common type of software, a complete specification is a |
| polite fiction. |
| Nor are complete specifications any less fictional for hardware, |
| as was made clear by the late-2017 Meltdown and Spectre side-channel |
| attacks~\cite{JannHorn2018MeltdownSpectre}. |
| |
| This situation might cause one to give up all hope of formal verification |
| of real-world software and hardware artifacts, but it turns out that there is |
| quite a bit that can be done. |
| For example, design and coding rules can act as a partial specification, |
| as can assertions contained in the code. |
| And in fact formal-verification tools such as \co{cbmc} and Nidhugg |
check for assertions that can be triggered, implicitly treating
| these assertions as part of the specification. |
Better yet, the assertions are also part of the code, which makes it less
| likely that they will become obsolete, especially if the code is |
| also subjected to stress tests.\footnote{ |
| And you \emph{do} stress-test your code, don't you?} |
| The \co{cbmc} tool also checks for array-out-of-bound references, |
| thus implicitly adding them to the specification. |
| The aforementioned incorrectness logic can also be thought of as using |
| an implicit bugs-not-present |
| specification~\cite{PeterWOHearn2019incorrectnessLogic}. |
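
As a simple illustration of assertions acting as a partial
specification, both \co{cbmc} and Nidhugg can directly consume a
sketch along the following lines, in which the thread functions and
variable names are illustrative:

\begin{verbatim}
#include <assert.h>
#include <pthread.h>

static int x, y, r0, r1;

static void *thread0(void *arg)
{
	x = 1;
	r0 = y;
	return NULL;
}

static void *thread1(void *arg)
{
	y = 1;
	r1 = x;
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t0, NULL, thread0, NULL);
	pthread_create(&t1, NULL, thread1, NULL);
	pthread_join(t0, NULL);
	pthread_join(t1, NULL);

	/* Partial specification:  At least one thread
	   must see the other's store. */
	assert(r0 == 1 || r1 == 1);
	return 0;
}
\end{verbatim}

This assertion holds under sequential consistency, but selecting a
weaker memory model (where the tool at hand provides one) exposes
the store-buffering outcome in which both loads return zero.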
| |
| This implicit-specification approach makes quite a bit of sense, particularly |
| if you look at formal verification not as a full proof of correctness, |
but rather as an alternative form of validation with a different set of
| strengths and weaknesses than the common case, that is, testing. |
| From this viewpoint, software will always have bugs, and therefore any |
| tool of any kind that helps to find those bugs is a very good thing |
| indeed. |
| |
| \subsection{Relevant Bugs} |
| \label{sec:future:Relevant Bugs} |
| |
| Finding bugs---and fixing them---is of course the whole point of any |
| type of validation effort. |
| Clearly, false positives are to be avoided. |
| But even in the absence of false positives, there are bugs and there are bugs. |
| |
| For example, suppose that a software artifact had exactly 100 remaining |
| bugs, each of which manifested on average once every million years |
| of runtime. |
| Suppose further that an omniscient formal-verification tool located |
| all 100 bugs, which the developers duly fixed. |
| What happens to the reliability of this software artifact? |
| |
| The answer is that the reliability \emph{decreases}. |
| |
| To see this, keep in mind that historical experience indicates that |
| about 7\pct\ of fixes introduce a new bug~\cite{RexBlack2012SQA}. |
Therefore, fixing the 100 bugs, which had a combined mean time between failures
| (MTBF) of about 10,000 years, will introduce seven more bugs. |
| Historical statistics indicate that each new bug will have an MTBF |
| much less than 70,000 years. |
| This in turn suggests that the combined MTBF of these seven new bugs |
| will most likely be much less than 10,000 years, which in turn means |
| that the well-intentioned fixing of the original 100 bugs actually |
| decreased the reliability of the overall software. |
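
To make the arithmetic explicit, the 100 original bugs had a
combined failure rate of $100/10^6$ failures per year, for a
combined MTBF of
\[
\frac{10^6~\mbox{years}}{100} = 10{,}000~\mbox{years}.
\]
If each of the seven new bugs had an MTBF of fully 70,000 years,
their combined MTBF would already be $70{,}000/7 = 10{,}000$ years,
so the much smaller per-bug MTBFs indicated by historical statistics
imply a combined MTBF well below 10,000 years.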
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
How do we know that the MTBFs of known bugs are a good estimate
| of the MTBFs of bugs that have not yet been located? |
| }\QuickQuizAnswerB{ |
| We don't, but it does not matter. |
| |
| To see this, note that the 7\pct\ figure only applies to injected |
| bugs that were subsequently located: |
| It necessarily ignores any injected bugs that were never found. |
Therefore, the MTBF statistics of known bugs are likely to be
a good approximation of those of the injected bugs that are
| subsequently located. |
| |
| A key point in this whole section is that we should be more |
| concerned about bugs that inconvenience users than about |
| other bugs that never actually manifest. |
| This of course is \emph{not} to say that we should completely |
| ignore bugs that have not yet inconvenienced users, just that |
| we should properly prioritize our efforts so as to fix the |
| most important and urgent bugs first. |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But the formal-verification tools should immediately find all the |
| bugs introduced by the fixes, so why is this a problem? |
| }\QuickQuizAnswerE{ |
| It is a problem because real-world formal-verification tools |
| (as opposed to those that exist only in the imaginations of |
| the more vociferous proponents of formal verification) are |
| not omniscient, and thus are only able to locate certain types |
| of bugs. |
| For but one example, formal-verification tools are unlikely to |
| spot a bug corresponding to an omitted assertion or, equivalently, |
| a bug corresponding to an undiscovered portion of the specification. |
| }\QuickQuizEndE |
| } |
| |
Worse yet, imagine another software artifact with one bug that fails
once every day on average and 99 more that each fail once every
million years.
| Suppose that a formal-verification tool located the 99 million-year |
| bugs, but failed to find the one-day bug. |
| Fixing the 99 bugs located will take time and effort, decrease |
reliability, and do nothing at all about the pressing once-per-day failure
| that is likely causing embarrassment and perhaps much worse besides. |
| |
| Therefore, it would be best to have a validation tool that |
| preferentially located the most troublesome bugs. |
| However, as noted in |
| \cref{sec:future:Locate Bugs}, |
| it is permissible to leverage additional tools. |
| One powerful tool is none other than plain old testing. |
| Given knowledge of the bug, it should be possible to construct |
| specific tests for it, possibly also using some of the techniques |
| described in |
| \cref{sec:debugging:Hunting Heisenbugs} |
| to increase the probability of the bug manifesting. |
| These techniques should allow calculation of a rough estimate of the |
| bug's raw failure rate, which could in turn be used to prioritize |
| bug-fix efforts. |
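
For example, assuming roughly independent failures, a bug that
manifests $k$ times during $N$ test runs of duration $T$ apiece has
an estimated failure rate under test of
\[
\lambda \approx \frac{k}{N \, T}.
\]
If the test setup accelerated this bug's probability of occurrence
by some factor $A$, the raw failure rate would instead be roughly
$\lambda / A$.
This is only a sketch ($k$, $N$, $T$, and $A$ are illustrative
rather than being outputs of any particular tool), but even rough
estimates of this sort suffice to rank bugs by urgency.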
| |
| \QuickQuiz{ |
| But many formal-verification tools can only find one bug at |
| a time, so that each bug must be fixed before the tool can |
| locate the next. |
| How can bug-fix efforts be prioritized given such a tool? |
| }\QuickQuizAnswer{ |
| One approach is to provide a simple fix that might not be |
| suitable for a production environment, but which allows |
| the tool to locate the next bug. |
| Another approach is to restrict configuration or inputs |
| so that the bugs located thus far cannot occur. |
| There are a number of similar approaches, but the common theme |
| is that fixing the bug from the tool's viewpoint is usually much |
| easier than constructing and validating a production-quality fix, |
| and the key point is to prioritize the larger efforts required |
| to construct and validate the production-quality fixes. |
| }\QuickQuizEnd |
| |
There has been some recent formal-verification work that prioritizes
executions having fewer preemptions, under the reasonable assumption
that executions with smaller numbers of preemptions are more likely.
| |
| Identifying relevant bugs might sound like too much to ask, but it is what |
| is really required if we are to actually increase software reliability. |
| |
| \subsection{Formal Regression Scorecard} |
| \label{sec:future:Formal Regression Scorecard} |
| |
| \begin{table*} |
| % \rowcolors{6}{}{lightgray} |
| %\renewcommand*{\arraystretch}{1.1} |
| \small |
| \centering |
| \setlength{\tabcolsep}{2pt} |
| \begin{tabular}{lcccccccccc} |
| \toprule |
| & & Promela & & PPCMEM & & \tco{herd} & & \tco{cbmc} & & Nidhugg \\ |
| \midrule |
| (1) Automated & |
| & \cellcolor{red!50} & |
| & \cellcolor{orange!50} & |
| & \cellcolor{orange!50} & |
| & \cellcolor{blue!50} & |
| & \cellcolor{blue!50} \\ |
| \addlinespace[3pt] |
| (2) Environment & |
| & \cellcolor{red!50} (MM) & |
| & \cellcolor{green!50} & |
| & \cellcolor{blue!50} & |
| & \cellcolor{yellow!50} (MM) & |
| & \cellcolor{orange!50} (MM) \\ |
| \addlinespace[3pt] |
| (3) Overhead & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{red!50} & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{yellow!50} (SAT) & |
| & \cellcolor{green!50} \\ |
| \addlinespace[3pt] |
| (4) Locate Bugs & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{green!50} & |
| & \cellcolor{green!50} \\ |
| \addlinespace[3pt] |
| (5) Minimal Scaffolding & |
| & \cellcolor{green!50} & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{yellow!50} & |
| & \cellcolor{blue!50} & |
| & \cellcolor{blue!50} \\ |
| \addlinespace[3pt] |
| (6) Relevant Bugs & |
| & \cellcolor{yellow!50} ??? & |
| & \cellcolor{yellow!50} ??? & |
| & \cellcolor{yellow!50} ??? & |
| & \cellcolor{yellow!50} ??? & |
| & \cellcolor{yellow!50} ??? \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Formal Regression Scorecard} |
| \label{tab:future:Formal Regression Scorecard} |
| \end{table*} |
| |
| \Cref{tab:future:Formal Regression Scorecard} |
| shows a rough-and-ready scorecard for the formal-verification tools |
| covered in this chapter. |
Shorter wavelengths are better than longer wavelengths:
Blue is best and red is worst.
| |
| Promela requires hand translation and supports only |
| \IXalth{sequential consistency}{sequential}{memory consistency}, |
| so its first two cells are red. |
| It has reasonable overhead (for formal verification, anyway) |
| and provides a traceback, so its next two cells are yellow. |
| Despite requiring hand translation, Promela handles assertions |
| in a natural way, so its fifth cell is green. |
| |
| PPCMEM usually requires hand translation due to the small size of litmus |
| tests that it supports, so its first cell is orange. |
| It handles several \IX{memory model}s, so its second cell is green. |
| Its overhead is quite high, so its third cell is red. |
| It provides a graphical display of relations among operations, which |
| is not as helpful as a traceback, but is still quite useful, so its |
| fourth cell is yellow. |
| It requires constructing an \co{exists} clause and cannot take |
| intra-process assertions, so its fifth cell is also yellow. |
| |
| The \co{herd} tool has size restrictions similar to those of PPCMEM, |
| so \co{herd}'s first cell is also orange. |
| It supports a wide variety of memory models, so its second cell is blue. |
| It has reasonable overhead, so its third cell is yellow. |
| Its bug-location and assertion capabilities are quite similar to those |
| of PPCMEM, so \co{herd} also gets yellow for the next two cells. |
| |
| The \co{cbmc} tool inputs C code directly, so its first cell is blue. |
| It supports a few memory models, so its second cell is yellow. |
It has reasonable overhead, so its third cell is also yellow;
however, SAT-solver performance might well continue improving.
| It provides a traceback, so its fourth cell is green. |
| It takes assertions directly from the C code, so its fifth cell is blue. |
| |
| Nidhugg also inputs C code directly, so its first cell is also blue. |
| It supports only a couple of memory models, so its second cell is orange. |
Its overhead is quite low (for formal verification), so its
third cell is green.
| It provides a traceback, so its fourth cell is green. |
| It takes assertions directly from the C code, so its fifth cell is blue. |
| |
| So what about the sixth and final row? |
| It is too early to tell how any of the tools do at finding the right bugs, |
| so they are all yellow with question marks. |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| How would testing stack up in the scorecard shown in |
| \cref{tab:future:Formal Regression Scorecard}? |
| }\QuickQuizAnswerB{ |
It would be blue all the way down, with the possible
exception of the third row (overhead), which might well
be marked down for testing's difficulty in finding
improbable bugs.
| |
| On the other hand, improbable bugs are often also |
| irrelevant bugs, so your mileage may vary. |
| |
| Much depends on the size of your installed base. |
| If your code is only ever going to run on (say) 10,000 |
| systems, Murphy can actually be a really nice guy. |
| Everything that can go wrong, will. |
| Eventually. |
| Perhaps in geologic time. |
| |
| But if your code is running on 20~billion systems, |
as the Linux kernel was said to be by late 2017,
| Murphy can be a real jerk! |
| Everything that can go wrong, will, and it can go wrong |
| really quickly!!! |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| But aren't there a great many more formal-verification systems |
| than are shown in |
| \cref{tab:future:Formal Regression Scorecard}? |
| }\QuickQuizAnswerE{ |
| Indeed there are! |
| This table focuses on those that Paul has used, but others are |
| proving to be useful. |
| Formal verification has been heavily used in the seL4 |
| project~\cite{ThomasSewell2013L4binaryVerification}, |
| and its tools can now handle modest levels of concurrency. |
| More recently, Catalin Marinas used Lamport's |
| TLA tool~\cite{Lamport:2002:SST:579617} to locate some |
| forward-progress bugs in the Linux kernel's queued spinlock |
| implementation. |
| Will Deacon fixed these bugs~\cite{WillDeacon2018qspinlock}, |
| and Catalin verified Will's |
| fixes~\cite{CatalinMarinas2018qspinlockTLA}. |
| |
| Lighter-weight formal verification tools have been used heavily |
| in production~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}. |
| }\QuickQuizEndE |
| } |
| |
| Once again, please note that this table rates these tools for use in |
| regression testing. |
| Just because many of them are a poor fit for regression testing does |
not at all mean that they are useless; in fact,
| many of them have proven their worth many times over.\footnote{ |
| For but one example, Promela was used to verify the file system |
| of none other than the Curiosity Rover. |
| Was \emph{your} formal verification tool used on software that |
| currently runs on Mars???} |
| Just not for regression testing. |
| |
| However, this might well change. |
| After all, formal verification tools made impressive strides in the 2010s. |
| If that progress continues, formal verification might well become an |
| indispensable tool in the parallel programmer's validation toolbox. |