% future/formalregress.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Formal Regression Testing?}
\label{sec:future:Formal Regression Testing?}
%
\epigraph{Theory without experiments:
Have we gone too far?}
{Michael Mitzenmacher}
Formal verification has long proven useful in a number of production
environments~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}.
However, it is an open question as to whether hard-core formal verification
will ever be included in the automated regression-test suites used for
continuous integration within complex concurrent codebases, such as the
Linux kernel.
Although there is already a proof of concept for Linux-kernel
SRCU~\cite{LanceRoy2017CBMC-SRCU}, this test covers only a small portion
of one of the simplest RCU implementations, and keeping it current with
the ever-changing Linux kernel has proven difficult.
It is therefore worth asking what would be required to incorporate
formal-verification tools as first-class members of the Linux kernel's
regression tests.
The following list is a good
start~\cite[slide 34]{PaulEMcKenney2015DagstuhlVerification}:
\begin{enumerate}
\item Any required translation must be automated.
\item The environment (including memory ordering) must be correctly
handled.
\item The memory and CPU overhead must be acceptably modest.
\item Specific information leading to the location of the bug
must be provided.
\item Information beyond the source code and inputs must be
modest in scope.
\item The bugs located must be relevant to the code's users.
\end{enumerate}
This list builds on, but is somewhat more modest than, Richard Bornat's
dictum:
``Formal-verification researchers should verify the code that
developers write, in the language they write it in, running in the
environment that it runs in, as they write it.''
The following sections discuss each of the above requirements, followed
by a section presenting a scorecard of how well a few tools stack up
against these requirements.
\QuickQuiz{
This list is ridiculously utopian!
Why not stick to the current state of the formal-verification art?
}\QuickQuizAnswer{
You are welcome to your opinion on what is and is not utopian,
but I will be paying more attention to people actually making
progress on the items in that list than to anyone who might be
objecting to them.
This might have something to do with my long experience with
people attempting to talk me out of specific things that their
favorite tools cannot handle.
In the meantime, please feel free to read the papers written by
the people who are actually making progress, for example, this
one~\cite{DinoDistefano2019FBstaticAnalysis}.
}\QuickQuizEnd
\subsection{Automatic Translation}
\label{sec:future:Automatic Translation}
Although Promela and \co{spin}
are invaluable design aids, if you need to formally regression-test
your C-language program, you must hand-translate to Promela each time
you would like to re-verify your code.
If your code happens to be in the Linux kernel, which releases every
60--90 days, you will need to hand-translate from four to six times
each year.
Over time, human error will creep in, which means that the verification
won't match the source code, rendering the verification useless.
Repeated verification clearly requires either that the formal-verification
tooling input your code directly, or that there be bug-free automatic
translation of your code to the form required for verification.
PPCMEM and \co{herd} can in theory directly input assembly language
and C++ code, but these tools work only on very small litmus tests,
which normally means that you must extract the core of your
mechanism---by hand.
As with Promela and \co{spin}, both PPCMEM and \co{herd} are
extremely useful, but they are not well-suited for regression suites.
In contrast, \IXaltacr{\co{cbmc}}{cbmc} and \IX{Nidhugg} can input
C programs of reasonable
(though still quite limited) size, and if their capabilities continue
to grow, could well become excellent additions to regression suites.
The Coverity static-analysis tool also inputs C programs, including
programs of very large size such as the Linux kernel.
Of course, Coverity's static analysis is quite simple compared to that
of \co{cbmc} and Nidhugg.
On the other hand, Coverity had an all-encompassing definition of
``C program'' that posed special challenges~\cite{AlBessey2010BillionLoCLater}.
Amazon Web Services uses a variety of formal-verification tools,
including \co{cbmc}, and applies some of these tools to regression
testing~\cite{ByronCook2018FormalAmazon}.
Google uses a number of relatively simple static analysis tools directly
on large Java code bases, which are arguably less diverse than C code
bases~\cite{CaitlinSadowski2018staticAnalysisGoogle}.
Facebook uses more aggressive forms of formal verification against its
code bases, including analysis of concurrency~\cite{DinoDistefano2019FBstaticAnalysis,PeterWOHearn2019incorrectnessLogic},
though not yet on the Linux kernel.
Finally, Microsoft has long used static analysis on its code
bases~\cite{JamesRLarus2004RightingSoftware}.
Given this list, it is clearly possible to create sophisticated
formal-verification tools that directly consume production-quality
source code.
However, one shortcoming of taking C code as input is that it assumes
that the compiler is correct.
An alternative approach is to take the binary produced by the C compiler
as input, thereby accounting for any relevant compiler bugs.
This approach has been used in a number of verification efforts,
perhaps most notably by the seL4
project~\cite{ThomasSewell2013L4binaryVerification}.
\QuickQuiz{
Given the groundbreaking nature of the various verifiers used
in the seL4 project, why doesn't this chapter cover them in
more depth?
}\QuickQuizAnswer{
There can be no doubt that the verifiers used by the seL4
project are quite capable.
However, seL4 started as a single-CPU project.
And although seL4 has gained multi-processor
capabilities, it is currently using very coarse-grained
locking that is similar to the Linux kernel's old
Big Kernel Lock (BKL)\@.
There will hopefully come a day when it makes sense to add
seL4's verifiers to a book on parallel programming, but
this is not yet that day.
}\QuickQuizEnd
Either way, verifying directly from the source code or from the binary
has the advantage of eliminating human translation errors, which is
critically important for reliable regression testing.
This is not to say that tools with special-purpose languages are useless.
On the contrary, they can be quite helpful for design-time verification,
as was discussed in
\cref{chp:Formal Verification}.
However, such tools are not particularly helpful for automated regression
testing, which is in fact the topic of this section.
\subsection{Environment}
\label{sec:future:Environment}
It is critically important that formal-verification tools correctly
model their environment.
One all-too-common omission is the \IX{memory model}, where a great
many formal-verification tools, including Promela/\co{spin}, are
restricted to \IXalth{sequential consistency}{sequential}{memory consistency}.
The QRCU experience related in
\cref{sec:formal:Is QRCU Really Correct?}
is an important cautionary tale.
Promela and \co{spin} assume sequential consistency, which is not a
good match for modern computer systems, as was seen in
\cref{chp:Advanced Synchronization: Memory Ordering}.
In contrast, one of the great strengths of PPCMEM and \co{herd}
is their detailed modeling of various CPU families' memory models,
including x86, \ARM, Power, and, in the case of \co{herd},
a Linux-kernel memory model~\cite{Alglave:2018:FSC:3173162.3177156},
which was accepted into Linux kernel v4.17.
The \co{cbmc} and Nidhugg tools provide some ability to select
memory models, but do not provide the variety that PPCMEM and
\co{herd} do.
However, it is likely that the larger-scale tools will adopt
a greater variety of memory models as time goes on.
In the longer term, it would be helpful for formal-verification
tools to include I/O~\cite{PaulEMcKenney2016LinuxKernelMMIO},
but it may be some time before this comes to pass.
Nevertheless, tools that fail to match the environment can still
be useful.
For example, a great many concurrency bugs would still be bugs on
a mythical sequentially consistent system, and these bugs could
be located by a tool that over-approximates the system's memory model
with sequential consistency.
On the other hand, these tools will fail to find bugs involving missing
memory-ordering directives, as noted in the aforementioned
cautionary tale of
\cref{sec:formal:Is QRCU Really Correct?}.
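By way of illustration, consider a store-buffering litmus test along
the following lines (a sketch, with a hypothetical test name; the
store-buffering pattern itself is covered in
\cref{chp:Advanced Synchronization: Memory Ordering}):
\begin{verbatim}
C C-SB-no-barriers

{}

P0(int *x0, int *x1)
{
  int r1;

  WRITE_ONCE(*x0, 1);
  /* Missing smp_mb(). */
  r1 = READ_ONCE(*x1);
}

P1(int *x0, int *x1)
{
  int r1;

  WRITE_ONCE(*x1, 1);
  /* Missing smp_mb(). */
  r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0)
\end{verbatim}
A sequential-consistency-only tool will report the \co{exists} clause
unsatisfiable, but \co{herd} running the Linux-kernel memory model
flags the outcome as allowed, and adding \co{smp_mb()} between the
write and the read in each process is what forbids it.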
\subsection{Overhead}
\label{sec:future:Overhead}
Almost all hard-core formal-verification tools are exponential
in nature, which might seem discouraging until you consider that
many of the most interesting software questions are in fact undecidable.
However, there are differences in degree, even among exponentials.
PPCMEM by design is unoptimized, in order to provide greater assurance
that the memory models of interest are accurately represented.
The \co{herd} tool optimizes more aggressively, as described in
\cref{sec:formal:Axiomatic Approaches}, and is thus orders of magnitude
faster than PPCMEM\@.
Nevertheless, both PPCMEM and \co{herd} target very small litmus tests
rather than larger bodies of code.
In contrast, Promela/\co{spin}, \co{cbmc}, and Nidhugg are designed for
(somewhat) larger bodies of code.
Promela/\co{spin} was used to verify the Curiosity rover's
filesystem~\cite{DBLP:journals/amai/GroceHHJX14} and, as noted earlier,
both \co{cbmc} and Nidhugg were applied to Linux-kernel RCU\@.
If advances in heuristics continue at the rate of the past three
decades, we can look forward to large reductions in overhead for
formal verification.
That said, combinatorial explosion is still combinatorial explosion,
which would be expected to sharply limit the size of programs that
could be verified, with or without continued improvements in
heuristics.
However, the flip side of combinatorial explosion is Philip II of
Macedon's timeless advice:
``Divide and rule.''
If a large program can be divided and the pieces verified, the result
can be combinatorial \emph{implosion}~\cite{PaulEMcKenney2011Verico}.
One natural place to divide is on API boundaries, for example, those
of locking primitives.
One verification pass can then verify that the locking implementation
is correct, and additional verification passes can verify correct
use of the locking APIs.
\begin{listing}
\input{CodeSamples/formal/herd/C-SB+l-o-o-u+l-o-o-u-C@whole.fcv}
\caption{Emulating Locking with \tco{cmpxchg_acquire()}}
\label{lst:future:Emulating Locking with cmpxchg}
\end{listing}
\begin{table}
\rowcolors{1}{}{lightgray}
\renewcommand*{\arraystretch}{1.1}
\small
\centering
\begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3]}
\toprule
\multicolumn{1}{c}{\# Threads} & \multicolumn{1}{c}{Locking} &
\multicolumn{1}{c}{\tco{cmpxchg_acquire}} \\
\midrule
2 & 0.004 & 0.022 \\
3 & 0.041 & 0.743 \\
4 & 0.374 & 59.565 \\
5 & 4.905 & \\
\bottomrule
\end{tabular}
\caption{Emulating Locking:
Performance (s)}
\label{tab:future:Emulating Locking: Performance (s)}
\end{table}
The performance benefits of this approach can be demonstrated using
the Linux-kernel memory
model~\cite{Alglave:2018:FSC:3173162.3177156}.
This model provides \co{spin_lock()} and \co{spin_unlock()}
primitives, but these primitives can also be emulated using
\co{cmpxchg_acquire()} and \co{smp_store_release()}, as shown in
\cref{lst:future:Emulating Locking with cmpxchg}
(\path{C-SB+l-o-o-u+l-o-o-*u.litmus} and \path{C-SB+l-o-o-u+l-o-o-u*-C.litmus}).
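In condensed form, ignoring the litmus-test boilerplate and the fact
that the lock variable is a \co{spinlock_t} in one case and an
\co{int} in the other, each process body takes one of the following
two shapes, with register names chosen for illustration:
\begin{verbatim}
/* Verified-primitive form. */
spin_lock(sl);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
spin_unlock(sl);

/* Emulated form. */
r2 = cmpxchg_acquire(sl, 0, 1);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
smp_store_release(sl, 0);

(* At the end of the emulated-form test: *)
filter (0:r2=0 /\ 1:r2=0)
exists (0:r1=0 /\ 1:r1=0)
\end{verbatim}
Because litmus tests cannot contain loops, the emulated form has no
retry loop; instead, the \co{filter} clause discards executions in
which either \co{cmpxchg_acquire()} fails to acquire the lock (that is,
returns nonzero), leaving the \co{exists} clause to ask only about the
critical-section outcome.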
\Cref{tab:future:Emulating Locking: Performance (s)}
compares the performance and scalability of using the model's
\co{spin_lock()} and \co{spin_unlock()} against emulating these
primitives as shown in the listing.
The difference is not insignificant:
At four processes, the model is more than two orders of magnitude
faster than emulation!
\QuickQuiz{
\begin{fcvref}[ln:future:formalregress:C-SB+l-o-o-u+l-o-o-u-C:whole]
Why bother with a separate \co{filter} command on \clnref{filter_} of
\cref{lst:future:Emulating Locking with cmpxchg}
instead of just adding the condition to the \co{exists} clause?
And wouldn't it be simpler to use \co{xchg_acquire()} instead
of \co{cmpxchg_acquire()}?
\end{fcvref}
}\QuickQuizAnswer{
The \co{filter} clause causes the \co{herd} tool to discard
executions at an earlier stage of processing than does
the \co{exists} clause, which provides significant speedups.
\begin{table}
\rowcolors{7}{lightgray}{}
\renewcommand*{\arraystretch}{1.1}
\small
\centering
\begin{tabular}{S[table-format=1.0]S[table-format=1.3]S[table-format=2.3]
S[table-format=3.3]S[table-format=2.3]S[table-format=3.3]}
\toprule
& & \multicolumn{2}{c}{\tco{cmpxchg_acquire()}}
& \multicolumn{2}{c}{\tco{xchg_acquire()}} \\
\cmidrule(l){3-4} \cmidrule(l){5-6}
\multicolumn{1}{c}{\#} & \multicolumn{1}{c}{Lock}
& \multicolumn{1}{c}{\tco{filter}}
& \multicolumn{1}{c}{\tco{exists}}
& \multicolumn{1}{c}{\tco{filter}}
& \multicolumn{1}{c}{\tco{exists}} \\
\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6}
2 & 0.004 & 0.022 & 0.039 & 0.027 & 0.058 \\
3 & 0.041 & 0.743 & 1.653 & 0.968 & 3.203 \\
4 & 0.374 & 59.565 & 151.962 & 74.818 & 500.96 \\
5 & 4.905 & & & & \\
\bottomrule
\end{tabular}
\caption{Emulating Locking:
Performance Comparison (s)}
\label{tab:future:Emulating Locking: Performance Comparison (s)}
\end{table}
As for \co{xchg_acquire()}, this atomic operation will do a
write whether or not lock acquisition succeeds, which means
that a model using \co{xchg_acquire()} will have more operations
than one using \co{cmpxchg_acquire()}, which won't do a write
in the failed-acquisition case.
More writes mean more combinatorial explosion, as shown in
\cref{tab:future:Emulating Locking: Performance Comparison (s)}
(\path{C-SB+l-o-o-u+l-o-o-*u.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-C.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-CE.litmus},
\path{C-SB+l-o-o-u+l-o-o-u*-X.litmus}, and
\path{C-SB+l-o-o-u+l-o-o-u*-XE.litmus}).
This table clearly shows that \co{cmpxchg_acquire()}
outperforms \co{xchg_acquire()} and that use of the
\co{filter} clause outperforms use of the \co{exists} clause.
}\QuickQuizEnd
It would of course be quite useful for tools to automatically divide
up large programs, verify the pieces, and then verify the combinations
of pieces.
In the meantime, verification of large programs will require significant
manual intervention.
This intervention will preferably be mediated by scripting, the better
to reliably carry out repeated verifications on each release, eventually
in a manner well-suited for continuous integration.
And Facebook's Infer tool has taken important steps towards doing just
that, via compositionality and
abstraction~\cite{SamBlackshear2018RacerD,DinoDistefano2019FBstaticAnalysis}.
In any case, we can expect formal-verification capabilities to continue
to increase over time, and any such increases will in turn increase
the applicability of formal verification to regression testing.
\subsection{Locate Bugs}
\label{sec:future:Locate Bugs}
Any software artifact of any size contains bugs.
Therefore, a formal-verification tool that reports only the
presence or absence of bugs is not particularly useful.
What is needed is a tool that gives at least \emph{some} information
as to where the bug is located and the nature of that bug.
The \co{cbmc} output includes a traceback mapping back to the source
code, similar to Promela/\co{spin}'s, as does Nidhugg.
Of course, these tracebacks can be quite long, and analyzing them
can be quite tedious.
However, doing so is usually quite a bit faster
and more pleasant than locating bugs the old-fashioned way.
In addition, one of the simplest tests of formal-verification tools is
bug injection.
After all, not only could any of us write
\co{printf("VERIFIED\\n")}, but the plain fact is that
developers of formal-verification tools are just as bug-prone as
are the rest of us.
Therefore, formal-verification tools that merely proclaim that a
bug exists, without locating it, are fundamentally less trustworthy
because it is more difficult to check their verdicts against
real-world code.
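One simple injection exercise is to weaken a single ordering primitive
in, say, the emulated-locking test of
\cref{lst:future:Emulating Locking with cmpxchg}, and then check that
the tool notices (a sketch, with the injected bug marked):
\begin{verbatim}
r2 = cmpxchg_acquire(sl, 0, 1);
WRITE_ONCE(*x0, 1);
r1 = READ_ONCE(*x1);
/* Injected bug: was smp_store_release(sl, 0). */
WRITE_ONCE(*sl, 0);
\end{verbatim}
With the release store downgraded to a plain store, the formerly
forbidden store-buffering outcome should now be reported as allowed,
and a tool that fails to report it is suspect.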
All that aside, people writing formal-verification tools are
permitted to leverage existing tools.
For example, a tool designed to determine only the presence
or absence of a serious but rare bug might leverage bisection.
If an old version of the program under test did not contain the bug,
but a new version did, then bisection could be used to quickly
locate the commit that inserted the bug, which might be
sufficient information to find and fix the bug.
Of course, this sort of strategy would not work well for common
bugs because in this case bisection would fail due to all commits
having at least one instance of the common bug.
Therefore, the execution traces provided
by many formal-verification tools will continue to be valuable,
particularly for complex and difficult-to-understand bugs.
In addition, recent work applies an \emph{incorrectness-logic}
formalism reminiscent of the traditional Hoare logic used for
full-up correctness proofs, but with the sole purpose of finding
bugs~\cite{PeterWOHearn2019incorrectnessLogic}.
\subsection{Minimal Scaffolding}
\label{sec:future:Minimal Scaffolding}
In the old days, formal-verification researchers demanded a full
specification against which the software would be verified.
Unfortunately, a mathematically rigorous specification might well
be larger than the actual code, and each line of specification
is just as likely to contain bugs as is each line of code.
A formal verification effort proving that the code faithfully implemented
the specification would be a proof of bug-for-bug compatibility between
the two, which might not be all that helpful.
Worse yet, the requirements for a number of software artifacts,
including Linux-kernel RCU, are empirical in
nature~\cite{PaulEMcKenney2015RCUreqts1,PaulEMcKenney2015RCUreqts2,PaulEMcKenney2015RCUreqts3}.\footnote{
Or, in formal-verification parlance, Linux-kernel RCU has an
\emph{incomplete specification}.}
For this common type of software, a complete specification is a
polite fiction.
Nor are complete specifications any less fictional for hardware,
as was made clear by the late-2017 Meltdown and Spectre side-channel
attacks~\cite{JannHorn2018MeltdownSpectre}.
This situation might cause one to give up all hope of formal verification
of real-world software and hardware artifacts, but it turns out that there is
quite a bit that can be done.
For example, design and coding rules can act as a partial specification,
as can assertions contained in the code.
And in fact formal-verification tools such as \co{cbmc} and Nidhugg
check for assertions that can be triggered, implicitly treating
these assertions as part of the specification.
However, the assertions are also part of the code, which makes it less
likely that they will become obsolete, especially if the code is
also subjected to stress tests.\footnote{
And you \emph{do} stress-test your code, don't you?}
The \co{cbmc} tool also checks for array-out-of-bound references,
thus implicitly adding them to the specification.
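As a minimal sketch, using hypothetical names, such an implicit
partial specification might look as follows:
\begin{verbatim}
/* Hypothetical example for illustration only. */
#include <assert.h>

#define N_BUCKETS 16
int buckets[N_BUCKETS];

void record_sample(unsigned int key, int value)
{
  /* Modulus keeps the index in bounds, which
     satisfies the implicit bounds check. */
  unsigned int i = key % N_BUCKETS;

  buckets[i] = value;

  /* Explicit partial specification. */
  assert(buckets[i] == value);
}
\end{verbatim}
Neither the assertion nor the bounds check says anything about what the
rest of the program should do, but both can catch real bugs, and both
live in the code itself.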
The aforementioned incorrectness logic can also be thought of as using
an implicit bugs-not-present
specification~\cite{PeterWOHearn2019incorrectnessLogic}.
This implicit-specification approach makes quite a bit of sense, particularly
if you look at formal verification not as a full proof of correctness,
but rather as an alternative form of validation with a different set of
strengths and weaknesses than the common case, that is, testing.
From this viewpoint, software will always have bugs, and therefore any
tool of any kind that helps to find those bugs is a very good thing
indeed.
\subsection{Relevant Bugs}
\label{sec:future:Relevant Bugs}
Finding bugs---and fixing them---is of course the whole point of any
type of validation effort.
Clearly, false positives are to be avoided.
But even in the absence of false positives, there are bugs and there are bugs.
For example, suppose that a software artifact had exactly 100 remaining
bugs, each of which manifested on average once every million years
of runtime.
Suppose further that an omniscient formal-verification tool located
all 100 bugs, which the developers duly fixed.
What happens to the reliability of this software artifact?
The answer is that the reliability \emph{decreases}.
To see this, keep in mind that historical experience indicates that
about 7\pct\ of fixes introduce a new bug~\cite{RexBlack2012SQA}.
Therefore, fixing the 100 bugs, which had a combined mean time between
failures (MTBF) of about 10,000 years, will introduce seven more bugs.
Historical statistics indicate that each new bug will have an MTBF
much less than 70,000 years.
This in turn suggests that the combined MTBF of these seven new bugs
will most likely be much less than 10,000 years, which in turn means
that the well-intentioned fixing of the original 100 bugs actually
decreased the reliability of the overall software.
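Spelling out the arithmetic, with combined MTBF being the reciprocal
of the summed per-bug failure rates:
\[
  \mathrm{MTBF}_{\mathrm{orig}}
    = \frac{10^6\,\mbox{years}}{100\ \mbox{bugs}}
    = 10^4\,\mbox{years},
  \qquad
  \mathrm{MTBF}_{\mathrm{new}}
    \ll \frac{7 \times 10^4\,\mbox{years}}{7\ \mbox{bugs}}
    = 10^4\,\mbox{years}.
\]
That is, the seven new bugs, taken together, are expected to fail more
often than did the 100 bugs that were just fixed.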
\QuickQuizSeries{%
\QuickQuizB{
How do we know that the MTBFs of known bugs are a good estimate
of the MTBFs of bugs that have not yet been located?
}\QuickQuizAnswerB{
We don't, but it does not matter.
To see this, note that the 7\pct\ figure only applies to injected
bugs that were subsequently located:
It necessarily ignores any injected bugs that were never found.
Therefore, the MTBF statistics of known bugs are likely to be
a good approximation of those of the injected bugs that are
subsequently located.
A key point in this whole section is that we should be more
concerned about bugs that inconvenience users than about
other bugs that never actually manifest.
This of course is \emph{not} to say that we should completely
ignore bugs that have not yet inconvenienced users, just that
we should properly prioritize our efforts so as to fix the
most important and urgent bugs first.
}\QuickQuizEndB
%
\QuickQuizE{
But the formal-verification tools should immediately find all the
bugs introduced by the fixes, so why is this a problem?
}\QuickQuizAnswerE{
It is a problem because real-world formal-verification tools
(as opposed to those that exist only in the imaginations of
the more vociferous proponents of formal verification) are
not omniscient, and thus are only able to locate certain types
of bugs.
For but one example, formal-verification tools are unlikely to
spot a bug corresponding to an omitted assertion or, equivalently,
a bug corresponding to an undiscovered portion of the specification.
}\QuickQuizEndE
}
Worse yet, imagine another software artifact with one bug that fails
once every day on average and 99 more that each fail once every
million years.
Suppose that a formal-verification tool located the 99 million-year
bugs, but failed to find the one-day bug.
Fixing the 99 bugs located will take time and effort, decrease
reliability, and do nothing at all about the pressing once-per-day
failure that is likely causing embarrassment and perhaps much worse besides.
Therefore, it would be best to have a validation tool that
preferentially located the most troublesome bugs.
However, as noted in
\cref{sec:future:Locate Bugs},
it is permissible to leverage additional tools.
One powerful tool is none other than plain old testing.
Given knowledge of the bug, it should be possible to construct
specific tests for it, possibly also using some of the techniques
described in
\cref{sec:debugging:Hunting Heisenbugs}
to increase the probability of the bug manifesting.
These techniques should allow calculation of a rough estimate of the
bug's raw failure rate, which could in turn be used to prioritize
bug-fix efforts.
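One rough back-of-the-envelope formulation (offered here as an
illustration rather than as an established procedure):
if the techniques of
\cref{sec:debugging:Hunting Heisenbugs}
amplify the bug's probability of occurrence by a factor of roughly $A$,
and the bug is then observed $n$ times in $T$ hours of such testing,
the raw failure rate may be estimated as
\[
  \lambda_{\mathrm{raw}} \approx \frac{n}{A\,T},
\]
with the bugs having the largest estimated values of
$\lambda_{\mathrm{raw}}$ being fixed first.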
\QuickQuiz{
But many formal-verification tools can only find one bug at
a time, so that each bug must be fixed before the tool can
locate the next.
How can bug-fix efforts be prioritized given such a tool?
}\QuickQuizAnswer{
One approach is to provide a simple fix that might not be
suitable for a production environment, but which allows
the tool to locate the next bug.
Another approach is to restrict configuration or inputs
so that the bugs located thus far cannot occur.
There are a number of similar approaches, but the common theme
is that fixing the bug from the tool's viewpoint is usually much
easier than constructing and validating a production-quality fix,
and the key point is to prioritize the larger efforts required
to construct and validate the production-quality fixes.
}\QuickQuizEnd
There has been some recent formal-verification work that prioritizes
executions having fewer preemptions, under the reasonable assumption
that smaller numbers of preemptions are more likely.
Identifying relevant bugs might sound like too much to ask, but it is what
is really required if we are to actually increase software reliability.
\subsection{Formal Regression Scorecard}
\label{sec:future:Formal Regression Scorecard}
\begin{table*}
% \rowcolors{6}{}{lightgray}
%\renewcommand*{\arraystretch}{1.1}
\small
\centering
\setlength{\tabcolsep}{2pt}
\begin{tabular}{lcccccccccc}
\toprule
& & Promela & & PPCMEM & & \tco{herd} & & \tco{cbmc} & & Nidhugg \\
\midrule
(1) Automated &
& \cellcolor{red!50} &
& \cellcolor{orange!50} &
& \cellcolor{orange!50} &
& \cellcolor{blue!50} &
& \cellcolor{blue!50} \\
\addlinespace[3pt]
(2) Environment &
& \cellcolor{red!50} (MM) &
& \cellcolor{green!50} &
& \cellcolor{blue!50} &
& \cellcolor{yellow!50} (MM) &
& \cellcolor{orange!50} (MM) \\
\addlinespace[3pt]
(3) Overhead &
& \cellcolor{yellow!50} &
& \cellcolor{red!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} (SAT) &
& \cellcolor{green!50} \\
\addlinespace[3pt]
(4) Locate Bugs &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{green!50} &
& \cellcolor{green!50} \\
\addlinespace[3pt]
(5) Minimal Scaffolding &
& \cellcolor{green!50} &
& \cellcolor{yellow!50} &
& \cellcolor{yellow!50} &
& \cellcolor{blue!50} &
& \cellcolor{blue!50} \\
\addlinespace[3pt]
(6) Relevant Bugs &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? &
& \cellcolor{yellow!50} ??? \\
\bottomrule
\end{tabular}
\caption{Formal Regression Scorecard}
\label{tab:future:Formal Regression Scorecard}
\end{table*}
\Cref{tab:future:Formal Regression Scorecard}
shows a rough-and-ready scorecard for the formal-verification tools
covered in this chapter.
Shorter wavelengths are better than longer wavelengths, that is,
blue is best and red is worst.
Promela requires hand translation and supports only
\IXalth{sequential consistency}{sequential}{memory consistency},
so its first two cells are red.
It has reasonable overhead (for formal verification, anyway)
and provides a traceback, so its next two cells are yellow.
Despite requiring hand translation, Promela handles assertions
in a natural way, so its fifth cell is green.
PPCMEM usually requires hand translation due to the small size of litmus
tests that it supports, so its first cell is orange.
It handles several \IX{memory model}s, so its second cell is green.
Its overhead is quite high, so its third cell is red.
It provides a graphical display of relations among operations, which
is not as helpful as a traceback, but is still quite useful, so its
fourth cell is yellow.
It requires constructing an \co{exists} clause and cannot take
intra-process assertions, so its fifth cell is also yellow.
The \co{herd} tool has size restrictions similar to those of PPCMEM,
so \co{herd}'s first cell is also orange.
It supports a wide variety of memory models, so its second cell is blue.
It has reasonable overhead, so its third cell is yellow.
Its bug-location and assertion capabilities are quite similar to those
of PPCMEM, so \co{herd} also gets yellow for the next two cells.
The \co{cbmc} tool inputs C code directly, so its first cell is blue.
It supports a few memory models, so its second cell is yellow.
It has reasonable overhead, so its third cell is also yellow; however,
SAT-solver performance might well continue improving.
It provides a traceback, so its fourth cell is green.
It takes assertions directly from the C code, so its fifth cell is blue.
Nidhugg also inputs C code directly, so its first cell is also blue.
It supports only a couple of memory models, so its second cell is orange.
Its overhead is quite low (for formal verification, anyway), so its
third cell is green.
It provides a traceback, so its fourth cell is green.
It takes assertions directly from the C code, so its fifth cell is blue.
So what about the sixth and final row?
It is too early to tell how any of the tools do at finding the right bugs,
so they are all yellow with question marks.
\QuickQuizSeries{%
\QuickQuizB{
How would testing stack up in the scorecard shown in
\cref{tab:future:Formal Regression Scorecard}?
}\QuickQuizAnswerB{
It would be blue all the way down, with the possible
exception of the third row (overhead), which might well
be marked down for testing's difficulty in finding
improbable bugs.
On the other hand, improbable bugs are often also
irrelevant bugs, so your mileage may vary.
Much depends on the size of your installed base.
If your code is only ever going to run on (say) 10,000
systems, Murphy can actually be a really nice guy.
Everything that can go wrong, will.
Eventually.
Perhaps in geologic time.
But if your code is running on 20~billion systems,
like the Linux kernel was said to be by late 2017,
Murphy can be a real jerk!
Everything that can go wrong, will, and it can go wrong
really quickly!!!
}\QuickQuizEndB
%
\QuickQuizE{
But aren't there a great many more formal-verification systems
than are shown in
\cref{tab:future:Formal Regression Scorecard}?
}\QuickQuizAnswerE{
Indeed there are!
This table focuses on those that Paul has used, but others are
proving to be useful.
Formal verification has been heavily used in the seL4
project~\cite{ThomasSewell2013L4binaryVerification},
and its tools can now handle modest levels of concurrency.
More recently, Catalin Marinas used Lamport's
TLA tool~\cite{Lamport:2002:SST:579617} to locate some
forward-progress bugs in the Linux kernel's queued spinlock
implementation.
Will Deacon fixed these bugs~\cite{WillDeacon2018qspinlock},
and Catalin verified Will's
fixes~\cite{CatalinMarinas2018qspinlockTLA}.
Lighter-weight formal verification tools have been used heavily
in production~\cite{JamesRLarus2004RightingSoftware,AlBessey2010BillionLoCLater,ByronCook2018FormalAmazon,CaitlinSadowski2018staticAnalysisGoogle,DinoDistefano2019FBstaticAnalysis}.
}\QuickQuizEndE
}
Once again, please note that this table rates these tools for use in
regression testing.
Just because many of them are a poor fit for regression testing does
not at all mean that they are useless; in fact,
many of them have proven their worth many times over.\footnote{
For but one example, Promela was used to verify the file system
of none other than the Curiosity Rover.
Was \emph{your} formal verification tool used on software that
currently runs on Mars???}
Just not for regression testing.
However, this might well change.
After all, formal verification tools made impressive strides in the 2010s.
If that progress continues, formal verification might well become an
indispensable tool in the parallel programmer's validation toolbox.