% future/formalregress.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Formal Regression Testing?}
\label{sec:future:Formal Regression Testing?}
%
\epigraph{Theory without experiments:
Have we gone too far?}
{Michael Mitzenmacher}
Formal verification has long proven useful in a number of production
environments~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}.
However, it is an open question as to whether hard-core formal verification
will ever be included in the automated regression-test suites used for
continuous integration within complex concurrent codebases, such as the
Linux kernel.
Although there is already a proof of concept for Linux-kernel
SRCU~\cite{LanceRoy2017CBMC-SRCU}, this test covers only a small portion
of one of the simplest RCU implementations, and keeping it current with
the ever-changing Linux kernel has proven difficult.
It is therefore worth asking what would be required to incorporate
formal-verification tools as first-class members of the Linux kernel's
regression tests.
The following list is a good
start~\cite[slide 34]{PaulEMcKenney2015DagstuhlVerification}:
\begin{enumerate}
\item Any required translation must be automated.
\item The environment (including memory ordering) must be correctly
handled.
\item The memory and CPU overhead must be acceptably modest.
\item Specific information leading to the location of the bug
must be provided.
\item Information beyond the source code and inputs must be
modest in scope.
\item The bugs located must be relevant to the code's users.
\end{enumerate}
This list builds on, but is somewhat more modest than, Richard Bornat's
dictum:
``Formal-verification researchers should verify the code that
developers write, in the language they write it in, running in the
environment that it runs in, as they write it.''
The following sections discuss each of the above requirements, followed
by a section presenting a scorecard of how well a few tools stack up
against these requirements.
\QuickQuiz{
This list is ridiculously utopian!
Why not stick to the current state of the formal-verification art?
}\QuickQuizAnswer{
You are welcome to your opinion on what is and is not utopian,
but I will be paying more attention to people actually making
progress on the items in that list than to anyone who might be
objecting to them.
This might have something to do with my long experience with
people attempting to talk me out of specific things that their
favorite tools cannot handle.
In the meantime, please feel free to read the papers written by
the people who are actually making progress, for example, this
one~\cite{DinoDistefano2019FBstaticAnalysis}.
}\QuickQuizEnd
\subsection{Automatic Translation}
\label{sec:future:Automatic Translation}
Although Promela and \co{spin}
are invaluable design aids, if you need to formally regression-test
your C-language program, you must hand-translate to Promela each time
you would like to re-verify your code.
If your code happens to be in the Linux kernel, which releases every
60--90 days, you will need to hand-translate from four to six times
each year.
Over time, human error will creep in, which means that the verification
won't match the source code, rendering the verification useless.
Repeated verification clearly requires either that the formal-verification
tooling input your code directly, or that there be bug-free automatic
translation of your code to the form required for verification.
PPCMEM and \co{herd} can in theory directly input assembly language
and C++ code, but these tools work only on very small litmus tests,
which normally means that you must extract the core of your
mechanism---by hand.
As with Promela and \co{spin}, both PPCMEM and \co{herd} are
extremely useful, but they are not well-suited for regression suites.
In contrast, \IXaltacr{\co{cbmc}}{cbmc} and \IX{Nidhugg} can input
C programs of reasonable
(though still quite limited) size, and if their capabilities continue
to grow, could well become excellent additions to regression suites.
The Coverity static-analysis tool also inputs C programs, including
programs of very large size such as the Linux kernel.
Of course, Coverity's static analysis is quite simple compared to that
of \co{cbmc} and Nidhugg.
On the other hand, Coverity had an all-encompassing definition of
``C program'' that posed special challenges~\cite{AlBessey2010BillionLoCLater}.
Amazon Web Services uses a variety of formal-verification tools,
including \co{cbmc}, and applies some of these tools to regression
testing~\cite{ByronCook2018FormalAmazon}.
Google uses a number of relatively simple static analysis tools directly
on large Java code bases, which are arguably less diverse than C code
bases~\cite{CaitlinSadowski2018staticAnalysisGoogle}.
Facebook uses more aggressive forms of formal verification against its
code bases, including analysis of concurrency~\cite{DinoDistefano2019FBstaticAnalysis,PeterWOHearn2019incorrectnessLogic},
though not yet on the Linux kernel.
Finally, Microsoft has long used static analysis on its code
bases~\cite{JamesRLarus2004RightingSoftware}.
Given this list, it is clearly possible to create sophisticated
formal-verification tools that directly consume production-quality
source code.
However, one shortcoming of taking C code as input is that it assumes
that the compiler is correct.
An alternative approach is to take the binary produced by the C compiler
as input, thereby accounting for any relevant compiler bugs.
This approach has been used in a number of verification efforts,
perhaps most notably by the seL4
project~\cite{ThomasSewell2013L4binaryVerification}.
\QuickQuiz{
Given the groundbreaking nature of the various verifiers used
in the seL4 project, why doesn't this chapter cover them in
more depth?
}\QuickQuizAnswer{
There can be no doubt that the verifiers used by the seL4
project are quite capable.
However, seL4 started as a single-CPU project.
And although seL4 has gained multi-processor
capabilities, it is currently using very coarse-grained
locking that is similar to the Linux kernel's old
Big Kernel Lock (BKL)\@.
There will hopefully come a day when it makes sense to add
seL4's verifiers to a book on parallel programming, but
this is not yet that day.
}\QuickQuizEnd
Either way, verifying directly from the source code or from the binary
has the advantage of eliminating human translation errors, which is
critically important for reliable regression testing.
This is not to say that tools with special-purpose languages are useless.
On the contrary, they can be quite helpful for design-time verification,
as was discussed in
\cref{chp:Formal Verification}.
However, such tools are not particularly helpful for automated regression
testing, which is in fact the topic of this section.
\subsection{Environment}
\label{sec:future:Environment}
It is critically important that formal-verification tools correctly
model their environment.
One all-too-common omission is the \IX{memory model}, where a great
many formal-verification tools, including Promela/\co{spin}, are
restricted to \IXalth{sequential consistency}{sequential}{memory consistency}.
The QRCU experience related in
\cref{sec:formal:Is QRCU Really Correct?}
is an important cautionary tale.
Promela and \co{spin} assume sequential consistency, which is not a
good match for modern computer systems, as was seen in
\cref{chp:Advanced Synchronization: Memory Ordering}.
In contrast, one of the great strengths of PPCMEM and \co{herd}
is their detailed modeling of various CPU families' memory models,
including x86, \ARM, Power, and, in the case of \co{herd},
a Linux-kernel memory model~\cite{Alglave:2018:FSC:3173162.3177156},
which was accepted into Linux kernel v4.17.
The \co{cbmc} and Nidhugg tools provide some ability to select
memory models, but do not provide the variety that PPCMEM and
\co{herd} do.
However, it is likely that the larger-scale tools will adopt
a greater variety of memory models as time goes on.
In the longer term, it would be helpful for formal-verification
tools to include I/O~\cite{PaulEMcKenney2016LinuxKernelMMIO},
but it may be some time before this comes to pass.
Nevertheless, tools that fail to match the environment can still
be useful.
For example, a great many concurrency bugs would still be bugs on
a mythical sequentially consistent system, and these bugs could
be located by a tool that over-approximates the system's memory model
with sequential consistency.
On the other hand, these tools will fail to find bugs involving missing
memory-ordering directives, as noted in the aforementioned
cautionary tale of
\cref{sec:formal:Is QRCU Really Correct?}.
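By way of illustration, consider a store-buffering litmus test along
the following lines (a sketch, with a hypothetical test name; the
store-buffering pattern itself is covered in
\cref{chp:Advanced Synchronization: Memory Ordering}):
\begin{verbatim}
C C-SB-no-barriers

{}

P0(int *x0, int *x1)
{
  int r1;

  WRITE_ONCE(*x0, 1);
  /* Missing smp_mb(). */
  r1 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
  int r1;

  WRITE_ONCE(*x1, 1);
  /* Missing smp_mb(). */
  r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0)
\end{verbatim}
A sequential-consistency-only tool will report the \co{exists} clause
unsatisfiable, but \co{herd} running the Linux-kernel memory model
flags the outcome as allowed, and adding \co{smp_mb()} between the
write and the read in each process is what forbids it.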
\subsection{Overhead}
\label{sec:future:Overhead}
Almost all hard-core formal-verification tools are exponential
in nature, which might seem discouraging until you consider that
many of the most interesting software questions are in fact undecidable.
However, there are differences in degree, even among exponentials.
PPCMEM by design is unoptimized, in order to provide greater assurance
that the memory models of interest are accurately represented.
The \co{herd} tool optimizes more aggressively, as described in
\cref{sec:formal:Axiomatic Approaches}, and is thus orders of magnitude
faster than PPCMEM\@.
Nevertheless, both PPCMEM and \co{herd} target very small litmus tests
rather than larger bodies of code.
In contrast, Promela/\co{spin}, \co{cbmc}, and Nidhugg are designed for
(somewhat) larger bodies of code.
Promela/\co{spin} was used to verify the Curiosity rover's
filesystem~\cite{DBLP:journals/amai/GroceHHJX14} and, as noted earlier,
both \co{cbmc} and Nidhugg were applied to Linux-kernel RCU\@.
If advances in heuristics continue at the rate of the past three
decades, we can look forward to large reductions in overhead for
formal verification.
That said, combinatorial explosion is still combinatorial explosion,
which would be expected to sharply limit the size of programs that
could be verified, with or without continued improvements in
heuristics.
However, the flip side of combinatorial explosion is Philip II of
Macedon's timeless advice:
``Divide and rule.''
If a large program can be divided and the pieces verified, the result
can be combinatorial \emph{implosion}~\cite{PaulEMcKenney2011Verico}.
One natural place to divide is on API boundaries, for example, those
of locking primitives.
One verification pass can then verify that the locking implementation
is correct, and additional verification passes can verify correct
use of the locking APIs.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+l-o-o-u+l-o-o-u-C@whole.fcv}
\caption{Emulating Locking with \tco{cmpxchg_acquire()}}
\label{lst:future:Emulating Locking with cmpxchg}
\end{listing}
\begin{table}
\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering
\begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3]}
\toprule
\multicolumn{1}{c}{\# Threads} & \multicolumn{1}{c}{Locking} &
\multicolumn{1}{c}{\tco{cmpxchg_acquire}} \\
\midrule
2 & 0.004 & 0.022 \\
3 & 0.041 & 0.743 \\
4 & 0.374 & 59.565 \\
5 & 4.905 & \\
\bottomrule
\end{tabular}
\caption{Emulating Locking:
Performance (s)}
\label{tab:future:Emulating Locking: Performance (s)}
\end{table}
The performance benefits of this approach can be demonstrated using
the Linux-kernel memory
model~\cite{Alglave:2018:FSC:3173162.3177156}.
This model provides \co{spin_lock()} and \co{spin_unlock()}
primitives, but these primitives can also be emulated using
\co{cmpxchg_acquire()} and \co{smp_store_release()}, as shown in
\cref{lst:future:Emulating Locking with cmpxchg}
(\path{C-SB+l-o-o-u+l-o-o-*u.litmus} and \path{C-SB+l-o-o-u+l-o-o-u*-C.litmus}).
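In condensed form, ignoring the litmus-test boilerplate and the fact
that the lock variable is a \co{spinlock_t} in one case and an
\co{int} in the other, each process body takes one of the following
two shapes, with register names chosen for illustration:
\begin{verbatim}
/* Verified-primitive form. */
spin_lock(sl);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
spin_unlock(sl);

/* Emulated form. */
r2 = cmpxchg_acquire(sl, 0, 1);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
smp_store_release(sl, 0);

(* At the end of the emulated-form test: *)
filter (0:r2=0 /\ 1:r2=0)
exists (0:r1=0 /\ 1:r1=0)
\end{verbatim}
Because litmus tests cannot contain loops, the emulated form has no
retry loop; instead, the \co{filter} clause discards executions in
which either \co{cmpxchg_acquire()} fails to acquire the lock (that is,
returns nonzero), leaving the \co{exists} clause to ask only about the
critical-section outcome.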
\Cref{tab:future:Emulating Locking: Performance (s)}
compares the performance and scalability of using the model's
\co{spin_lock()} and \co{spin_unlock()} against emulating these
primitives as shown in the listing.
The difference is not insignificant:
At four processes, the model is more than two orders of magnitude
faster than emulation!
\QuickQuiz{
\begin{fcvref}[ln:future:formalregress:C-SB+l-o-o-u+l-o-o-u-C:whole]
Why bother with a separate \co{filter} command on \clnref{filter_} of
\cref{lst:future:Emulating Locking with cmpxchg}
instead of just adding the condition to the \co{exists} clause?
And wouldn't it be simpler to use \co{xchg_acquire()} instead
of \co{cmpxchg_acquire()}?
\end{fcvref}
}\QuickQuizAnswer{
The \co{filter} clause causes the \co{herd} tool to discard
executions at an earlier stage of processing than does
the \co{exists} clause, which provides significant speedups.
\begin{table}
\rowcolors{7}{lightgray}{}
\renewcommand*{\arraystretch}{1.1}
\small
\centering
\begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3]
S[table-format=3.3]S[table-format=2.3]S[table-format=3.3]}
\toprule
& & \multicolumn{2}{c}{\tco{cmpxchg_acquire()}}
& \multicolumn{2}{c}{\tco{xchg_acquire()}} \\
\cmidrule(l){3-4} \cmidrule(l){5-6}
\multicolumn{1}{c}{\#} & \multicolumn{1}{c}{Lock}
& \multicolumn{1}{c}{\tco{filter}}
& \multicolumn{1}{c}{\tco{exists}}
& \multicolumn{1}{c}{\tco{filter}}
& \multicolumn{1}{c}{\tco{exists}} \\
\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6}
2 & 0.004 & 0.022 & 0.039 & 0.027 & 0.058 \\
3 & 0.041 & 0.743 & 1.653 & 0.968 & 3.203 \\
4 & 0.374 & 59.565 & 151.962 & 74.818 & 500.96 \\
5 & 4.905 & & & & \\
\bottomrule
\end{tabular}
\caption{Emulating Locking:
Performance Comparison (s)}
\label{tab:future:Emulating Locking: Performance Comparison (s)}
\end{table}
As for \co{xchg_acquire()}, this atomic operation will do a
write whether or not lock acquisition succeeds, which means
that a model using \co{xchg_acquire()} will have more operations
than one using \co{cmpxchg_acquire()}, which won't do a write
in the failed-acquisition case.
More writes mean more combinatorial explosion, as shown in
\cref{tab:future:Emulating Locking: Performance Comparison (s)}
(\path{C-SB+l-o-o-u+l-o-o-*u.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-C.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-CE.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-X.litmus}, and
\path{C-SB+l-o-o-u+l-o-o-u*-XE.litmus}).
This table clearly shows that \co{cmpxchg_acquire()}
outperforms \co{xchg_acquire()} and that use of the
\co{filter} clause outperforms use of the \co{exists} clause.
}\QuickQuizEnd
It would of course be quite useful for tools to automatically divide
up large programs, verify the pieces, and then verify the combinations
of pieces.
In the meantime, verification of large programs will require significant
manual intervention.
This intervention will preferably be mediated by scripting, the better
to reliably carry out repeated verifications on each release, eventually
in a manner well-suited for continuous integration.
And Facebook's Infer tool has taken important steps towards doing just
that, via compositionality and
abstraction~\cite{SamBlackshear2018RacerD,DinoDistefano2019FBstaticAnalysis}.
In any case, we can expect formal-verification capabilities to continue
to increase over time, and any such increases will in turn increase
the applicability of formal verification to regression testing.
\subsection{Locate Bugs}
\label{sec:future:Locate Bugs}
Any software artifact of any size contains bugs.
Therefore, a formal-verification tool that reports only the
presence or absence of bugs is not particularly useful.
What is needed is a tool that gives at least \emph{some} information
as to where the bug is located and the nature of that bug.
The \co{cbmc} output includes a traceback mapping back to the source
code, similar to Promela/\co{spin}'s, as does Nidhugg.
Of course, these tracebacks can be quite long, and analyzing them
can be quite tedious.
However, doing so is usually quite a bit faster
and more pleasant than locating bugs the old-fashioned way.
In addition, one of the simplest tests of formal-verification tools is
bug injection.
After all, not only could any of us write
\co{printf("VERIFIED\\n")}, but the plain fact is that
developers of formal-verification tools are just as bug-prone as
are the rest of us.
Therefore, formal-verification tools that merely proclaim that a
bug exists, without locating it, are fundamentally less trustworthy
because it is more difficult to check their verdicts against
real-world code.
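One simple injection exercise is to weaken a single ordering primitive
in, say, the emulated-locking test of
\cref{lst:future:Emulating Locking with cmpxchg}, and then check that
the tool notices (a sketch, with the injected bug marked):
\begin{verbatim}
r2 = cmpxchg_acquire(sl, 0, 1);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
/* Injected bug: was smp_store_release(sl, 0). */
WRITE_ONCE(*sl, 0);
\end{verbatim}
With the release store downgraded to a plain store, the formerly
forbidden store-buffering outcome should now be reported as allowed,
and a tool that fails to report it is suspect.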
All that aside, people writing formal-verification tools are
permitted to leverage existing tools.
For example, a tool designed to determine only the presence
or absence of a serious but rare bug might leverage bisection.
If an old version of the program under test did not contain the bug,
but a new version did, then bisection could be used to quickly
locate the commit that inserted the bug, which might be
sufficient information to find and fix the bug.
Of course, this sort of strategy would not work well for common
bugs because in this case bisection would fail due to all commits
having at least one instance of the common bug.
Therefore, the execution traces provided
by many formal-verification tools will continue to be valuable,
particularly for complex and difficult-to-understand bugs.
In addition, recent work applies an \emph{incorrectness-logic}
formalism reminiscent of the traditional Hoare logic used for
full-up correctness proofs, but with the sole purpose of finding
bugs~\cite{PeterWOHearn2019incorrectnessLogic}.
\subsection{Minimal Scaffolding}
\label{sec:future:Minimal Scaffolding}
In the old days, formal-verification researchers demanded a full
specification against which the software would be verified.
Unfortunately, a mathematically rigorous specification might well
be larger than the actual code, and each line of specification
is just as likely to contain bugs as is each line of code.
A formal verification effort proving that the code faithfully implemented
the specification would be a proof of bug-for-bug compatibility between
the two, which might not be all that helpful.
Worse yet, the requirements for a number of software artifacts,
including Linux-kernel RCU, are empirical in
nature~\cite{PaulEMcKenney2015RCUreqts1,PaulEMcKenney2015RCUreqts2,PaulEMcKenney2015RCUreqts3}.\footnote{
Or, in formal-verification parlance, Linux-kernel RCU has an
\emph{incomplete specification}.}
For this common type of software, a complete specification is a
polite fiction.
Nor are complete specifications any less fictional for hardware,
as was made clear by the late-2017 Meltdown and Spectre side-channel
attacks~\cite{JannHorn2018MeltdownSpectre}.
This situation might cause one to give up all hope of formal verification
of real-world software and hardware artifacts, but it turns out that there is
quite a bit that can be done.
For example, design and coding rules can act as a partial specification,
as can assertions contained in the code.
And in fact formal-verification tools such as \co{cbmc} and Nidhugg
check for assertions that can be triggered, implicitly treating
these assertions as part of the specification.
However, the assertions are also part of the code, which makes it less
likely that they will become obsolete, especially if the code is
also subjected to stress tests.\footnote{
And you \emph{do} stress-test your code, don't you?}
The \co{cbmc} tool also checks for array-out-of-bound references,
thus implicitly adding them to the specification.
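As a minimal sketch, using hypothetical names, such an implicit
partial specification might look as follows:
\begin{verbatim}
/* Hypothetical example for illustration only. */
#include <assert.h>

#define N_BUCKETS 16
int buckets[N_BUCKETS];

void record_sample(unsigned int key, int value)
{
  /* Modulus keeps the index in bounds, which
     satisfies the implicit bounds check. */
  unsigned int i = key % N_BUCKETS;

  buckets[i] = value;

  /* Explicit partial specification. */
  assert(buckets[i] == value);
}
\end{verbatim}
Neither the assertion nor the bounds check says anything about what the
rest of the program should do, but both can catch real bugs, and both
live in the code itself.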
The aforementioned incorrectness logic can also be thought of as using
an implicit bugs-not-present
specification~\cite{PeterWOHearn2019incorrectnessLogic}.
This implicit-specification approach makes quite a bit of sense, particularly
if you look at formal verification not as a full proof of correctness,
but rather as an alternative form of validation with a different set of
strengths and weaknesses than the common case, that is, testing.
From this viewpoint, software will always have bugs, and therefore any
tool of any kind that helps to find those bugs is a very good thing
indeed.
\subsection{Relevant Bugs}
\label{sec:future:Relevant Bugs}
Finding bugs---and fixing them---is of course the whole point of any
type of validation effort.
Clearly, false positives are to be avoided.
But even in the absence of false positives, there are bugs and there are bugs.
For example, suppose that a software artifact had exactly 100 remaining
bugs, each of which manifested on average once every million years
of runtime.
Suppose further that an omniscient formal-verification tool located
all 100 bugs, which the developers duly fixed.
What happens to the reliability of this software artifact?
The answer is that the reliability \emph{decreases}.
To see this, keep in mind that historical experience indicates that
about 7\pct\ of fixes introduce a new bug~\cite{RexBlack2012SQA}.
Therefore, fixing the 100 bugs, which had a combined mean time between
failures (MTBF) of about 10,000 years, will introduce seven more bugs.
Historical statistics indicate that each new bug will have an MTBF
much less than 70,000 years.
This in turn suggests that the combined MTBF of these seven new bugs
will most likely be much less than 10,000 years, which in turn means
that the well-intentioned fixing of the original 100 bugs actually
decreased the reliability of the overall software.
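Spelling out the arithmetic, with combined MTBF being the reciprocal
of the summed per-bug failure rates:
\[
  \mathrm{MTBF}_{\mathrm{orig}}
    = \frac{10^6\,\mbox{years}}{100\ \mbox{bugs}}
    = 10^4\,\mbox{years},
  \qquad
  \mathrm{MTBF}_{\mathrm{new}}
    \ll \frac{7 \times 10^4\,\mbox{years}}{7\ \mbox{bugs}}
    = 10^4\,\mbox{years}.
\]
That is, the seven new bugs, taken together, are expected to fail more
often than did the 100 bugs that were just fixed.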
\QuickQuizSeries{%
\QuickQuizB{
How do we know that the MTBFs of known bugs are a good estimate
of the MTBFs of bugs that have not yet been located?
}\QuickQuizAnswerB{
We don't, but it does not matter.
To see this, note that the 7\pct\ figure only applies to injected
bugs that were subsequently located:
It necessarily ignores any injected bugs that were never found.
Therefore, the MTBF statistics of known bugs are likely to be
a good approximation of those of the injected bugs that are
subsequently located.
A key point in this whole section is that we should be more
concerned about bugs that inconvenience users than about
other bugs that never actually manifest.
This of course is \emph{not} to say that we should completely
ignore bugs that have not yet inconvenienced users, just that
we should properly prioritize our efforts so as to fix the
most important and urgent bugs first.
}\QuickQuizEndB
%
\QuickQuizE{
But the formal-verification tools should immediately find all the
bugs introduced by the fixes, so why is this a problem?
}\QuickQuizAnswerE{
It is a problem because real-world formal-verification tools
(as opposed to those that exist only in the imaginations of
the more vociferous proponents of formal verification) are
not omniscient, and thus are only able to locate certain types
of bugs.
For but one example, formal-verification tools are unlikely to
spot a bug corresponding to an omitted assertion or, equivalently,
a bug corresponding to an undiscovered portion of the specification.
}\QuickQuizEndE
}
Worse yet, imagine another software artifact with one bug that fails
once every day on average and 99 more that each fail once every
million years.
Suppose that a formal-verification tool located the 99 million-year
bugs, but failed to find the one-day bug.
Fixing the 99 bugs located will take time and effort, decrease
reliability, and do nothing at all about the pressing once-per-day
failure that is likely causing embarrassment and perhaps much worse besides.
Therefore, it would be best to have a validation tool that
preferentially located the most troublesome bugs.
However, as noted in
\cref{sec:future:Locate Bugs},
it is permissible to leverage additional tools.
One powerful tool is none other than plain old testing.
Given knowledge of the bug, it should be possible to construct
specific tests for it, possibly also using some of the techniques
described in
\cref{sec:debugging:Hunting Heisenbugs}
to increase the probability of the bug manifesting.
These techniques should allow calculation of a rough estimate of the
bug's raw failure rate, which could in turn be used to prioritize
bug-fix efforts.
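One rough back-of-the-envelope formulation (offered here as an
illustration rather than as an established procedure):
if the techniques of
\cref{sec:debugging:Hunting Heisenbugs}
amplify the bug's probability of occurrence by a factor of roughly $A$,
and the bug is then observed $n$ times in $T$ hours of such testing,
the raw failure rate may be estimated as
\[
  \lambda_{\mathrm{raw}} \approx \frac{n}{A\,T},
\]
with the bugs having the largest estimated values of
$\lambda_{\mathrm{raw}}$ being fixed first.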
\QuickQuiz{
But many formal-verification tools can only find one bug at
a time, so that each bug must be fixed before the tool can
locate the next.
How can bug-fix efforts be prioritized given such a tool?
}\QuickQuizAnswer{
One approach is to provide a simple fix that might not be
suitable for a production environment, but which allows
the tool to locate the next bug.
Another approach is to restrict configuration or inputs
so that the bugs located thus far cannot occur.
There are a number of similar approaches, but the common theme
is that fixing the bug from the tool's viewpoint is usually much
easier than constructing and validating a production-quality fix,
and the key point is to prioritize the larger efforts required
to construct and validate the production-quality fixes.
}\QuickQuizEnd
There has been some recent formal-verification work that prioritizes
executions having fewer preemptions, under the reasonable assumption
that smaller numbers of preemptions are more likely.
Identifying relevant bugs might sound like too much to ask, but it is what
is really required if we are to actually increase software reliability.
\subsection{Formal Regression Scorecard}
\label{sec:future:Formal Regression Scorecard}
\begin{table*}
% \rowcolors{6}{}{lightgray}
%\renewcommand*{\arraystretch}{1.1}
\small
\centering
\setlength{\tabcolsep}{2pt}
\begin{tabular}{lcccccccccc}
\toprule
& & Promela & & PPCMEM & & \tco{herd} & & \tco{cbmc} & & Nidhugg \\
\midrule
(1) Automated &
& \cellcolor{red!50} &
& \cellcolor{orange!50} &
& \cellcolor{orange!50} &
& \cellcolor{blue!50} &
& \cellcolor{blue!50} \\
\addlinespace[3pt]
(2) Environment &
& \cellcolor{red!50} (MM) &
& \cellcolor{green!50} &
& \cellcolor{blue!50} &
& \cellcolor{yellow!50} (MM) &
& \cellcolor{orange!50} (MM) \\
\addlinespace[3pt]
(3) Overhead &
& \cellcolor{yellow!50} &
& \cellcolor{red!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} (SAT) &
& \cellcolor{green!50} \\
\addlinespace[3pt]
(4) Locate Bugs &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{green!50} &
& \cellcolor{green!50} \\
\addlinespace[3pt]
(5) Minimal Scaffolding &
& \cellcolor{green!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{blue!50} &
& \cellcolor{blue!50} \\
\addlinespace[3pt]
(6) Relevant Bugs &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? \\
\bottomrule
\end{tabular}
\caption{Formal Regression Scorecard}
\label{tab:future:Formal Regression Scorecard}
\end{table*}
\Cref{tab:future:Formal Regression Scorecard}
shows a rough-and-ready scorecard for the formal-verification tools
covered in this chapter.
Shorter wavelengths are better than longer wavelengths, that is,
blue is best and red is worst.
Promela requires hand translation and supports only
\IXalth{sequential consistency}{sequential}{memory consistency},
so its first two cells are red.
It has reasonable overhead (for formal verification, anyway)
and provides a traceback, so its next two cells are yellow.
Despite requiring hand translation, Promela handles assertions
in a natural way, so its fifth cell is green.
PPCMEM usually requires hand translation due to the small size of litmus
tests that it supports, so its first cell is orange.
It handles several \IX{memory model}s, so its second cell is green.
Its overhead is quite high, so its third cell is red.
It provides a graphical display of relations among operations, which
is not as helpful as a traceback, but is still quite useful, so its
fourth cell is yellow.
It requires constructing an \co{exists} clause and cannot take
intra-process assertions, so its fifth cell is also yellow.
The \co{herd} tool has size restrictions similar to those of PPCMEM,
so \co{herd}'s first cell is also orange.
It supports a wide variety of memory models, so its second cell is blue.
It has reasonable overhead, so its third cell is yellow.
Its bug-location and assertion capabilities are quite similar to those
of PPCMEM, so \co{herd} also gets yellow for the next two cells.
The \co{cbmc} tool inputs C code directly, so its first cell is blue.
It supports a few memory models, so its second cell is yellow.
It has reasonable overhead, so its third cell is also yellow; however,
SAT-solver performance might well continue improving.
It provides a traceback, so its fourth cell is green.
It takes assertions directly from the C code, so its fifth cell is blue.
Nidhugg also inputs C code directly, so its first cell is also blue.
It supports only a couple of memory models, so its second cell is orange.
Its overhead is quite low (for formal verification, anyway), so its
third cell is green.
It provides a traceback, so its fourth cell is green.
It takes assertions directly from the C code, so its fifth cell is blue.
So what about the sixth and final row?
It is too early to tell how any of the tools do at finding the right bugs,
so they are all yellow with question marks.
\QuickQuizSeries{%
\QuickQuizB{
How would testing stack up in the scorecard shown in
\cref{tab:future:Formal Regression Scorecard}?
}\QuickQuizAnswerB{
It would be blue all the way down, with the possible
exception of the third row (overhead), which might well
be marked down for testing's difficulty in finding
improbable bugs.
On the other hand, improbable bugs are often also
irrelevant bugs, so your mileage may vary.
Much depends on the size of your installed base.
If your code is only ever going to run on (say) 10,000
systems, Murphy can actually be a really nice guy.
Everything that can go wrong, will.
Eventually.
Perhaps in geologic time.
But if your code is running on 20~billion systems,
like the Linux kernel was said to be by late 2017,
Murphy can be a real jerk!
Everything that can go wrong, will, and it can go wrong
really quickly!!!
}\QuickQuizEndB
%
\QuickQuizE{
But aren't there a great many more formal-verification systems
than are shown in
\cref{tab:future:Formal Regression Scorecard}?
}\QuickQuizAnswerE{
Indeed there are!
This table focuses on those that Paul has used, but others are
proving to be useful.
Formal verification has been heavily used in the seL4
project~\cite{ThomasSewell2013L4binaryVerification},
and its tools can now handle modest levels of concurrency.
More recently, Catalin Marinas used Lamport's
TLA tool~\cite{Lamport:2002:SST:579617} to locate some
forward-progress bugs in the Linux kernel's queued spinlock
implementation.
Will Deacon fixed these bugs~\cite{WillDeacon2018qspinlock},
and Catalin verified Will's
fixes~\cite{CatalinMarinas2018qspinlockTLA}.
Lighter-weight formal verification tools have been used heavily
in production~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}.
}\QuickQuizEndE
}
Once again, please note that this table rates these tools for use in
regression testing.
Just because many of them are a poor fit for regression testing does
not at all mean that they are useless; in fact,
many of them have proven their worth many times over.\footnote{
For but one example, Promela was used to verify the file system
of none other than the Curiosity Rover.
Was \emph{your} formal verification tool used on software that
currently runs on Mars???}
Just not for regression testing.
However, this might well change.
After all, formal verification tools made impressive strides in the 2010s.
If that progress continues, formal verification might well become an
indispensable tool in the parallel programmer's validation toolbox.