blob: d6c76ddfadd3e9ac9641ca6d31fcb7d07fad5a03 [file] [log] [blame]
% formal/axiomatic.tex
\section{Axiomatic Approaches}
\label{sec:formal:Axiomatic Approaches}
\OriginallyPublished{Section}{sec:formal:Axiomatic Approaches}{Axiomatic Approaches}{Linux Weekly News}{PaulEMcKenney2014weakaxiom}
%
\epigraph{Theory helps us to bear our ignorance of facts.}
{\emph{George Santayana}}
\begin{listing}[tb]
\begin{linelabel}[ln:formal:IRIW Litmus Test]
\begin{VerbatimL}[commandchars=\%\@\$]
PPC IRIW.litmus
""
(* Traditional IRIW. *)
{
0:r1=1; 0:r2=x;
1:r1=1; 1:r4=y;
2:r2=x; 2:r4=y;
3:r2=x; 3:r4=y;
}
P0 | P1 | P2 | P3 ;
stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
| | sync | sync ;
| | lwz r5,0(r4) | lwz r5,0(r2) ;
exists
(2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
\end{VerbatimL}
\end{linelabel}
\caption{IRIW Litmus Test}
\label{lst:formal:IRIW Litmus Test}
\end{listing}
Although the PPCMEM tool can solve the famous ``independent reads of
independent writes'' (IRIW) litmus test shown in
Listing~\ref{lst:formal:IRIW Litmus Test}, doing so requires no less than
fourteen CPU hours and generates no less than ten gigabytes of state space.
That said, this situation is a great improvement over that before the advent
of PPCMEM, where solving this problem required perusing volumes of
reference manuals, attempting proofs, discussing with experts, and
being unsure of the final answer.
Although fourteen hours can seem like a long time, it is much shorter
than weeks or even months.
However, the time required is a bit surprising given the simplicity
of the litmus test, which has two threads storing to two separate variables
and two other threads loading from these two variables in opposite
orders.
The assertion triggers if the two loading threads disagree on the order
of the two stores.
This litmus test is simple, even by the standards of memory-order litmus
tests.
One reason for the amount of time and space consumed is that PPCMEM does
a trace-based full-state-space search, which means that it must generate
and evaluate all possible orders and combinations of events at the
architectural level.
At this level, both loads and stores correspond to ornate sequences
of events and actions, resulting in a very large state space that must
be completely searched, in turn resulting in large memory and CPU
consumption.
Of course, many of the traces are quite similar to one another, which
suggests that an approach that treated similar traces as one might
improve performace.
One such approach is the axiomatic approach of
Alglave et al.~\cite{Alglave:2014:HCM:2594291.2594347},
which creates a set of axioms to represent the memory model and then
converts litmus tests to theorems that might be proven or disproven
over this set of axioms.
The resulting tool, called ``herd'', conveniently takes as input the
same litmus tests as PPCMEM, including the IRIW litmus test shown in
Listing~\ref{lst:formal:IRIW Litmus Test}.
\begin{listing}[tb]
\begin{linelabel}[ln:formal:Expanded IRIW Litmus Test]
\begin{VerbatimL}[commandchars=\%\@\$]
PPC IRIW5.litmus
""
(* Traditional IRIW, but with five stores instead of *)
(* just one. *)
{
0:r1=1; 0:r2=x;
1:r1=1; 1:r4=y;
2:r2=x; 2:r4=y;
3:r2=x; 3:r4=y;
}
P0 | P1 | P2 | P3 ;
stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
addi r1,r1,1 | addi r1,r1,1 | sync | sync ;
stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ;
addi r1,r1,1 | addi r1,r1,1 | | ;
stw r1,0(r2) | stw r1,0(r4) | | ;
addi r1,r1,1 | addi r1,r1,1 | | ;
stw r1,0(r2) | stw r1,0(r4) | | ;
addi r1,r1,1 | addi r1,r1,1 | | ;
stw r1,0(r2) | stw r1,0(r4) | | ;
exists
(2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
\end{VerbatimL}
\end{linelabel}
\caption{Expanded IRIW Litmus Test}
\label{lst:formal:Expanded IRIW Litmus Test}
\end{listing}
However, where PPCMEM requires 14 CPU hours to solve IRIW, herd does so
in 17 milliseconds, which represents a speedup of more than six orders
of magnitude.
That said, the problem is exponential in nature, so we should expect
herd to exhibit exponential slowdowns for larger problems.
And this is exactly what happens, for example, if we add four more writes
per writing CPU as shown in
Listing~\ref{lst:formal:Expanded IRIW Litmus Test},
herd slows down by a factor of more than 50,000, requiring more than
15 \emph{minutes} of CPU time.
Adding threads also results in exponential
slowdowns~\cite{PaulEMcKenney2014weakaxiom}.
Despite their exponential nature, both PPCMEM and herd have proven quite
useful for checking key parallel algorithms, including the queued-lock
handoff on x86 systems.
The weaknesses of the herd tool are similar to those of PPCMEM, which
were described in
Section~\ref{sec:formal:PPCMEM Discussion}.
There are some obscure (but very real) cases for which the PPCMEM and
herd tools disagree, and as of late 2014 resolving these disagreements
was ongoing.
It would be helpful if the litmus tests could be written in C
(as in Listing~\ref{lst:formal:Meaning of PPCMEM Litmus Test})
rather than assembly
(as in Listing~\ref{lst:formal:PPCMEM Litmus Test}).
This is now possible, as will be described in the following sections.
\subsection{Axiomatic Approaches and Locking}
\label{sec:formal:Axiomatic Approaches and Locking}
\begin{listing}[tb]
\input{CodeSamples/formal/herd/C-Lock1@whole.fcv}
\caption{Locking Example}
\label{lst:formal:Locking Example}
\end{listing}
Axiomatic approaches may also be applied to higher-level
languages and also to higher-level synchronization primitives, as
exemplified by the lock-based litmus test shown in
Listing~\ref{lst:formal:Locking Example} (\path{C-Lock1.litmus}).
This litmus test can be modeled by the Linux kernel memory model
(LKMM)~\cite{Alglave:2018:FSC:3173162.3177156}.
As expected, the \co{herd} tool's output features the string \co{Never},
correctly indicating that \co{P1()} cannot see \co{x} having a value
of one.\footnote{
The output of the \co{herd} tool is compatible with that
of PPCMEM, so feel free to look at
Listings~\ref{lst:formal:PPCMEM Detects an Error}
and~\ref{lst:formal:PPCMEM on Repaired Litmus Test}
for examples showing the output format.}
\QuickQuiz{}
What do you have to do to run \co{herd} on litmus tests like
that shown in Listing~\ref{lst:formal:Locking Example}?
\QuickQuizAnswer{
Get version v4.17 (or later) of the Linux-kernel source code,
then follow the instructions in \path{tools/memory-model/README}
to install the needed tools.
Then follow the further instructions to run these tools on the
litmus test of your choice.
} \QuickQuizEnd
\begin{listing}[tb]
\input{CodeSamples/formal/herd/C-Lock2@whole.fcv}
\caption{Broken Locking Example}
\label{lst:formal:Broken Locking Example}
\end{listing}
Of course, if \co{P0()} and \co{P1()} use different locks, as shown in
Listing~\ref{lst:formal:Broken Locking Example} (\path{C-Lock2.litmus}),
then all bets are off.
And in this case, the \co{herd} tool's output features the string
\co{Sometimes}, correctly indicating that use of different locks allows
\co{P1()} to see \co{x} having a value of one.
\QuickQuiz{}
Why bother modeling locking directly?
Why not simply emulate locking with atomic operations?
\QuickQuizAnswer{
\begin{table}[tb]
\rowcolors{2}{}{lightgray}
\small
\centering
\newcommand{\lockfml}[1]{\multicolumn{1}{c}{\begin{picture}(6,150)(0,0)\rotatebox{90}{#1}\end{picture}}}
\begin{tabular}{rrrrrr}
\lockfml{\# Processes}
& \lockfml{Model}
& \lockfml{Emulate: \tco{cmpxchg_acquire()}, \tco{filter}}
& \lockfml{Emulate: \tco{xchg_acquire()}, \tco{filter}}
& \lockfml{Emulate: \tco{cmpxchg_acquire()}, \tco{exists}}
& \lockfml{Emulate: \tco{xchg_acquire()}, \tco{exists}}
\\
\midrule
2 & 0.004 & 0.022 & 0.027 & 0.039 & 0.058 \\
3 & 0.041 & 0.743 & 0.968 & 1.653 & 3.203 \\
4 & 0.374 & 59.565 & 74.818 & 151.962 & 500.96 \\
5 & 4.905 \\
\end{tabular}
\caption{Locking: Modeling vs. Emulation Time (s)}
\label{tab:formal:Locking: Modeling vs. Emulation Time (s)}
\end{table}
In a word, performance, as can be seen in
Table~\ref{tab:formal:Locking: Modeling vs. Emulation Time (s)}.
The first column shows the number of herd processes modeled.
The second column shows the herd runtime when modeling
\co{spin_lock()} and \co{spin_unlock()} directly in herd's
cat language.
The third column shows the herd runtime when emulating
\co{spin_lock()} with \co{cmpxchg_acquire()} and \co{spin_unlock()}
with \co{smp_store_release()}, using the herd \co{filter}
clause to reject executions that fail to acquire the lock.
The fourth column is like the third, but using \co{xchg_acquire()}
instead of \co{cmpxchg_acquire()}.
The fifth and sixth columns are like the third and fourth,
but instead using the herd \co{exists} clause to reject
executions that fail to acquire the lock.
Note also that use of the \co{filter} clause is about twice
as fast as is use of the \co{exists} clause.
This is no surprise because the \co{filter} clause allows
early abandoning of excluded executions, where the executions
that are excluded are the ones in which the lock is concurrently
held by more than one process.
More important, modeling \co{spin_lock()} and \co{spin_unlock()}
directly ranges from five times faster to more than two orders
of magnitude faster than modeling emulated locking.
This should also be no surprise, as direct modeling raises
the level of abstraction, thus reducing the number of events
that herd must model.
Because almost everything that herd does is of exponential
computational complexity, modest reductions in the number of
events produces exponentially large reductions in runtime.
Thus, in formal verification even more than in parallel
programming itself, divide and conquer!!!
} \QuickQuizEnd
But locking is not the only synchronization primitive that can be
modeled directly:
The next section looks at RCU.
\subsection{Axiomatic Approaches and RCU}
\label{sec:formal:Axiomatic Approaches and RCU}
\begin{listing}[tb]
\input{CodeSamples/formal/herd/C-RCU-remove@whole.fcv}
\caption{Canonical RCU Removal Litmus Test}
\label{lst:formal:Canonical RCU Removal Litmus Test}
\end{listing}
\begin{lineref}[ln:formal:C-RCU-remove:whole]
Axiomatic approaches can also analyze litmus tests involving RCU.
To that end,
Listing~\ref{lst:formal:Canonical RCU Removal Litmus Test}
(\path{C-RCU-remove.litmus})
shows a litmus test corresponding to the canonical RCU-mediated
removal from a linked list.
As with the locking litmus test, this RCU litmus test can be
modeled by LKMM, with similar performance advantages compared
to modeling emulations of RCU.
Line~\lnref{head} shows \co{x} as the list head, initially
referencing \co{y}, which in turn is initialized to the value
\co{2} on line~\lnref{tail:1}.
\co{P0()} on lines~\lnref{P0start}--\lnref{P0end}
removes element \co{y} from the list by replacing it with element \co{z}
(line~\lnref{assignnewtail}),
waits for a grace period (line~\lnref{sync}),
and finally zeroes \co{y} to emulate \co{free()} (line~\lnref{free}).
\co{P1()} on lines~\lnref{P1start}--\lnref{P1end}
executes within an RCU read-side critical section
(lines~\lnref{rl}--\lnref{rul}),
picking up the list head (line~\lnref{rderef}) and then
loading the next element (line~\lnref{read}).
The next element should be non-zero, that is, not yet freed
(line~\lnref{exists_}).
Several other variables are output for debugging purposes
(line~\lnref{locations_}).
The output of the \co{herd} tool when running this litmus test features
\co{Never}, indicating that \co{P0()} never accesses a freed element,
as expected.
Also as expected, removing line~\lnref{sync} results in \co{P0()}
accessing a freed element, as indicated by the \co{Sometimes} in
the \co{herd} output.
\end{lineref}
\begin{listing}[tb]
\input{CodeSamples/formal/herd/C-RomanPenyaev-list-rcu-rr@whole.fcv}
\caption{Complex RCU Litmus Test}
\label{lst:formal:Complex RCU Litmus Test}
\end{listing}
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
A litmus test for a more complex example proposed by
Roman Penyaev~\cite{RomanPenyaev2018rrRCU} is shown in
Listing~\ref{lst:formal:Complex RCU Litmus Test}
(\path{C-RomanPenyaev-list-rcu-rr.litmus}).
In this example, readers (modeled by \co{P0()} on
lines~\lnref{P0start}--\lnref{P0end}) access a linked list
in a round-robin fashion by ``leaking'' a pointer to the last
list element accessed into variable \co{c}.
Updaters (modeled by \co{P1()} on lines~\lnref{P1start}--\lnref{P1end})
remove an element, taking care to avoid disrupting current or future
readers.
\QuickQuiz{}
Wait!!!
Isn't leaking pointers out of an RCU read-side critical
section a critical bug???
\QuickQuizAnswer{
Yes, it usually is a critical bug.
However, in this case, the updater has been cleverly constructed
to properly handle such pointer leaks.
But please don't make a habit of doing this sort of thing, and
especially don't do this without having put a lot of thought
into making some more conventional approach work.
} \QuickQuizEnd
Lines~\lnref{listtail}--\lnref{listhead} define the initial linked
list, tail first.
In the Linux kernel, this would be a doubly linked circular list,
but \co{herd} is currently incapable of modeling such a beast.
The strategy is instead to use a singly linked linear list that
is long enough that the end is never reached.
Line~\lnref{rrcache} defines variable \co{c}, which is used to
cache the list pointer between successive RCU read-side critical
sections.
Again, \co{P0()} on lines~\lnref{P0start}--\lnref{P0end} models readers.
This process models a pair of successive readers traversing round-robin
through the list, with the first reader on lines~\lnref{rl1}--\lnref{rul1}
and the second reader on lines~\lnref{rl2}--\lnref{rul2}.
Line~\lnref{rdcache} fetches the pointer cached in \co{c}, and if
line~\lnref{rdckcache} sees that the pointer was \co{NULL},
line~\lnref{rdinitcache} restarts at the beginning of the list.
In either case, line~\lnref{rdnext} advances to the next list element,
and line~\lnref{rdupdcache} stores a pointer to this element back into
variable \co{c}.
Lines~\lnref{rl2}--\lnref{rul2} repeat this process, but using
registers \co{r3} and \co{r4} instead of \co{r1} and \co{r2}.
As with
Listing~\ref{lst:formal:Canonical RCU Removal Litmus Test},
this litmus test stores zero to emulate \co{free()}, so
line~\lnref{exists_} checks for any of these four registers being
\co{NULL}, also known as zero.
Because \co{P0()} leaks an RCU-protected pointer from its first
RCU read-side critical section to its second, \co{P1()} must carry
out its update (removing \co{x}) very carefully.
Line~\lnref{updremove} removes \co{x} by linking \co{w} to \co{y}.
Line~\lnref{updsync1} waits for readers, after which no subsequent reader
has a path to \co{x} via the linked list.
Line~\lnref{updrdcache} fetches \co{c}, and if line~\lnref{updckcache}
determines that \co{c} references the newly removed \co{x},
line~\lnref{updinitcache} sets \co{c} to \co{NULL}
and line~\lnref{updsync2} again waits for readers, after which no
subsequent reader can fetch \co{x} from \co{c}.
In either case, line~\lnref{updfree} emulates \co{free()} by storing
zero to \co{x}.
\QuickQuiz{}
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
In Listing~\ref{lst:formal:Complex RCU Litmus Test},
why couldn't a reader fetch \co{c} just before \co{P1()}
zeroed it on line~\lnref{updinitcache}, and then later
store this same value back into \co{c} just after it was
zeroed, thus defeating the zeroing operation?
\end{lineref}
\QuickQuizAnswer{
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
Because the reader advances to the next element on
line~\lnref{rdnext}, thus avoiding storing a pointer to the
same element as was fetched.
\end{lineref}
} \QuickQuizEnd
The output of the \co{herd} tool when running this litmus test features
\co{Never}, indicating that \co{P0()} never accesses a freed element,
as expected.
Also as expected, removing either \co{synchronize_rcu()} results
in \co{P1()} accessing a freed element, as indicated by \co{Sometimes}
in the \co{herd} output.
\end{lineref}
\QuickQuiz{}
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
In Listing~\ref{lst:formal:Complex RCU Litmus Test},
why not have just one call to \co{synchronize_rcu()}
immediately before line~\lnref{updfree}?
\end{lineref}
\QuickQuizAnswer{
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
Because this results in \co{P0()} accessing a freed element.
But don't take my word for this, try it out in \co{herd}!
\end{lineref}
} \QuickQuizEnd
\QuickQuiz{}
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
Also in Listing~\ref{lst:formal:Complex RCU Litmus Test},
can't line~\lnref{updfree} be \co{WRITE_ONCE()} instead
of \co{smp_store_release()}?
\end{lineref}
\QuickQuizAnswer{
\begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
That is an excellent question.
As of late 2018, the answer is ``no one knows''.
Much depends on the semantics of ARMv8's conditional-move
instruction.
While awaiting clarity on these semantics, \co{smp_store_release()}
is the safe choice.
\end{lineref}
} \QuickQuizEnd
These sections have shown how axiomatic approaches can successfully
model synchronization primitives such as locking and RCU in C-language
litmus tests.
Longer term, the hope is that the axiomatic approaches will model
even higher-level software artifacts, producing exponential
verification speedups.
This could potentially allow axiomatic verification of much larger
software systems.
Another alternative is to press the axioms of boolean logic into service,
as described in the next section.