formal/axiomatic.tex - pub/scm/linux/kernel/git/paulmck/perfbook - Git at Google

 % formal/axiomatic.tex
 % mainfile: ../perfbook.tex
 % SPDX-License-Identifier: CC-BY-SA-3.0

 \section{Axiomatic Approaches}
 \label{sec:formal:Axiomatic Approaches}
 \OriginallyPublished{sec:formal:Axiomatic Approaches}{Axiomatic Approaches}{Linux Weekly News}{PaulEMcKenney2014weakaxiom}
 %
 \epigraph{Theory helps us to bear our ignorance of facts.}
 	{George Santayana}

 Although the PPCMEM tool can solve the famous ``independent reads of
 independent writes'' (IRIW) litmus test shown in
 \cref{lst:formal:IRIW Litmus Test}, doing so requires no less than
 fourteen CPU hours and generates no less than ten gigabytes of state space.
 That said, this situation is a great improvement over that prior to the advent
 of PPCMEM, where solving this problem required perusing volumes of
 reference manuals, attempting proofs, discussing with experts, and
 then after all that, doubting the final answer.
 Although fourteen hours can seem like a long time, it is much shorter
 than weeks or even months.

 \begin{listing}
 \begin{fcvlabel}[ln:formal:IRIW Litmus Test]
 \begin{VerbatimL}[commandchars=\%\@\$]
 PPC IRIW.litmus
 ""
 (* Traditional IRIW. *)
 {
 0:r1=1; 0:r2=x;
 1:r1=1;         1:r4=y;
         2:r2=x; 2:r4=y;
         3:r2=x; 3:r4=y;
 }
 P0           | P1           | P2           | P3           ;
 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
              |              | sync         | sync         ;
              |              | lwz r5,0(r4) | lwz r5,0(r2) ;

 exists
 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
 \end{VerbatimL}
 \end{fcvlabel}
 \caption{IRIW Litmus Test}
 \label{lst:formal:IRIW Litmus Test}
 \end{listing}

 However, the time required is a bit surprising given the simplicity
 of the litmus test, which has two threads storing to two separate variables
 and two other threads loading from these two variables in opposite
 orders.
 The assertion triggers if the two loading threads disagree on the order
 of the two stores.
 Even by the standards of memory-order litmus tests, this is quite simple.

 One reason for the amount of time and space consumed is that PPCMEM does
 a trace-based full-state-space search, which means that it must generate
 and evaluate all possible orders and combinations of events at the
 architectural level.
 At this level, both loads and stores correspond to ornate sequences
 of events and actions, resulting in a very large state space that must
 be completely searched, in turn resulting in large memory and CPU
 consumption.

 Of course, many of the traces are quite similar to one another, which
 suggests that an approach that treated similar traces as one might
 improve performace.
 One such approach is the axiomatic approach of
 \pplsur{Jade}{Alglave} et al.~\cite{Alglave:2014:HCM:2594291.2594347},
 which creates a set of axioms to represent the \IX{memory model} and then
 converts litmus tests to theorems that might be proven or disproven
 over this set of axioms.
 The resulting tool, called \qco{herd},  conveniently takes as input the
 same litmus tests as PPCMEM, including the IRIW litmus test shown in
 \cref{lst:formal:IRIW Litmus Test}.

 \begin{listing}
 \begin{fcvlabel}[ln:formal:Expanded IRIW Litmus Test]
 \begin{VerbatimL}[commandchars=\%\@\$]
 PPC IRIW5.litmus
 ""
 (* Traditional IRIW, but with five stores instead of *)
 (* just one.                                         *)
 {
 0:r1=1; 0:r2=x;
 1:r1=1;         1:r4=y;
         2:r2=x; 2:r4=y;
         3:r2=x; 3:r4=y;
 }
 P0           | P1           | P2           | P3           ;
 stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ;
 addi r1,r1,1 | addi r1,r1,1 | sync         | sync         ;
 stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ;
 addi r1,r1,1 | addi r1,r1,1 |              |              ;
 stw r1,0(r2) | stw r1,0(r4) |              |              ;
 addi r1,r1,1 | addi r1,r1,1 |              |              ;
 stw r1,0(r2) | stw r1,0(r4) |              |              ;
 addi r1,r1,1 | addi r1,r1,1 |              |              ;
 stw r1,0(r2) | stw r1,0(r4) |              |              ;

 exists
 (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
 \end{VerbatimL}
 \end{fcvlabel}
 \caption{Expanded IRIW Litmus Test}
 \label{lst:formal:Expanded IRIW Litmus Test}
 \end{listing}

 However, where PPCMEM requires 14 CPU hours to solve IRIW, \co{herd} does so
 in 17 milliseconds, which represents a speedup of more than six orders
 of magnitude.
 That said, the problem is exponential in nature, so we should expect
 \co{herd} to exhibit exponential slowdowns for larger problems.
 And this is exactly what happens, for example, if we add four more writes
 per writing CPU as shown in
 \cref{lst:formal:Expanded IRIW Litmus Test},
 \co{herd} slows down by a factor of more than 50,000, requiring more than
 15 \emph{minutes} of CPU time.
 Adding threads also results in exponential
 slowdowns~\cite{PaulEMcKenney2014weakaxiom}.

 Despite their exponential nature, both PPCMEM and \co{herd} have proven quite
 useful for checking key parallel algorithms, including the queued-lock
 handoff on x86 systems.
 The weaknesses of the \co{herd} tool are similar to those of PPCMEM, which
 were described in
 \cref{sec:formal:PPCMEM Discussion}.
 There are some obscure (but very real) cases for which the PPCMEM and
 \co{herd} tools disagree, and as of 2021 many but not all of these disagreements
 was resolved.

 It would be helpful if the litmus tests could be written in C
 (as in \cref{lst:formal:Meaning of PPCMEM Litmus Test})
 rather than assembly
 (as in \cref{lst:formal:PPCMEM Litmus Test}).
 This is now possible, as will be described in the following sections.

 \subsection{Axiomatic Approaches and Locking}
 \label{sec:formal:Axiomatic Approaches and Locking}

 Axiomatic approaches may also be applied to higher-level
 languages and also to higher-level synchronization primitives, as
 exemplified by the lock-based litmus test shown in
 \cref{lst:formal:Locking Example} (\path{C-Lock1.litmus}).
 This litmus test can be modeled by the
 \IXalthmr{\acrf{lkmm}}{Linux kernel}{memory model}~%
 \cite{Alglave:2018:FSC:3173162.3177156,LucMaranget2018lock.cat}.
 As expected, the \co{herd} tool's output features the string \co{Never},
 correctly indicating that \co{P1()} cannot see \co{x} having a value
 of one.\footnote{
 	The output of the \co{herd} tool is compatible with that
 	of PPCMEM, so feel free to look at
 	\cref{lst:formal:PPCMEM Detects an Error,%
 	lst:formal:PPCMEM on Repaired Litmus Test}
 	for examples showing the output format.}

 \begin{listing}
 \input{CodeSamples/formal/herd/C-Lock1@whole.fcv}
 \caption{Locking Example}
 \label{lst:formal:Locking Example}
 \end{listing}

 \QuickQuiz{
 	What do you have to do to run \co{herd} on litmus tests like
 	that shown in \cref{lst:formal:Locking Example}?
 }\QuickQuizAnswer{
 	Get version v4.17 (or later) of the Linux-kernel source code,
 	then follow the instructions in \path{tools/memory-model/README}
 	to install the needed tools.
 	Then follow the further instructions to run these tools on the
 	litmus test of your choice.
 }\QuickQuizEnd

 \begin{listing}
 \input{CodeSamples/formal/herd/C-Lock2@whole.fcv}
 \caption{Broken Locking Example}
 \label{lst:formal:Broken Locking Example}
 \end{listing}

 Of course, if \co{P0()} and \co{P1()} use different locks, as shown in
 \cref{lst:formal:Broken Locking Example} (\path{C-Lock2.litmus}),
 then all bets are off.
 And in this case, the \co{herd} tool's output features the string
 \co{Sometimes}, correctly indicating that use of different locks allows
 \co{P1()} to see \co{x} having a value of one.

 \QuickQuiz{
 	Why bother modeling locking directly?
 	Why not simply emulate locking with atomic operations?
 }\QuickQuizAnswer{
 	In a word, performance, as can be seen in
 	\cref{tab:formal:Locking: Modeling vs. Emulation Time (s)}.
 	The first column shows the number of \co{herd} processes modeled.
 	The second column shows the \co{herd} runtime when modeling
 	\co{spin_lock()} and \co{spin_unlock()} directly in \co{herd}'s
 	cat language.
 	The third column shows the \co{herd} runtime when emulating
 	\co{spin_lock()} with \co{cmpxchg_acquire()} and \co{spin_unlock()}
 	with \co{smp_store_release()}, using the \co{herd} \co{filter}
 	clause to reject executions that fail to acquire the lock.
 	The fourth column is like the third, but using \co{xchg_acquire()}
 	instead of \co{cmpxchg_acquire()}.
 	The fifth and sixth columns are like the third and fourth,
 	but instead using the \co{herd} \co{exists} clause to reject
 	executions that fail to acquire the lock.

 \begin{table}
 \rowcolors{10}{}{lightgray}
 \small
 \centering
 \newcommand{\lockfml}[1]{\multicolumn{1}{c}{\begin{picture}(6,8)(0,0)\rotatebox{90}{#1}\end{picture}}}
 \begin{tabular}{rrrrrr}
 	\toprule
 	& Model & \multicolumn{4}{c}{Emulate} \\
 	\cmidrule(l){2-2} \cmidrule(l){3-6}
 	& & \multicolumn{2}{c}{\tco{filter}} & \multicolumn{2}{c}{\tco{exists}} \\
 	\cmidrule(l){3-4} \cmidrule(l){5-6}
 	\lockfml{\# Proc.}
 	&
 	& \tco{cmpxchg}
 	& \tco{xchg}
 	& \tco{cmpxchg}
 	& \tco{xchg}
 	\\
 	\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6}
 	2 & 0.004 &  0.022 &  0.027 &   0.039 &   0.058 \\
 	3 & 0.041 &  0.743 &  0.968 &   1.653 &   3.203 \\
 	4 & 0.374 & 59.565 & 74.818 & 151.962 & 500.960 \\
 	5 & 4.905 \\
 	\bottomrule
 \end{tabular}
 \caption{Locking:
 		  Modeling vs.\@ Emulation Time (s)}
 \label{tab:formal:Locking: Modeling vs. Emulation Time (s)}
 \end{table}

 	Note also that use of the \co{filter} clause is about twice
 	as fast as is use of the \co{exists} clause.
 	This is no surprise because the \co{filter} clause allows
 	early abandoning of excluded executions, where the executions
 	that are excluded are the ones in which the lock is concurrently
 	held by more than one process.

 	More important, modeling \co{spin_lock()} and \co{spin_unlock()}
 	directly ranges from five times faster to more than two orders
 	of magnitude faster than modeling emulated locking.
 	This should also be no surprise, as direct modeling raises
 	the level of abstraction, thus reducing the number of events
 	that \co{herd} must model.
 	Because almost everything that \co{herd} does is of exponential
 	computational complexity, modest reductions in the number of
 	events produces exponentially large reductions in runtime.

 	Thus, in formal verification even more than in parallel
 	programming itself, divide and conquer!!!
 }\QuickQuizEnd

 But locking is not the only synchronization primitive that can be
 modeled directly:
 The next section looks at RCU\@.

 \subsection{Axiomatic Approaches and RCU}
 \label{sec:formal:Axiomatic Approaches and RCU}

 \begin{fcvref}[ln:formal:C-RCU-remove:whole]
 Axiomatic approaches can also analyze litmus tests involving
 RCU~\cite{Alglave:2018:FSC:3173162.3177156}.
 To that end,
 \cref{lst:formal:Canonical RCU Removal Litmus Test}
 (\path{C-RCU-remove.litmus})
 shows a litmus test corresponding to the canonical RCU-mediated
 removal from a linked list.
 As with the locking litmus test, this RCU litmus test can be
 modeled by LKMM, with similar performance advantages compared
 to modeling emulations of RCU\@.
 \Clnref{head} shows \co{x} as the list head, initially
 referencing \co{y}, which in turn is initialized to the value
 \co{2} on \clnref{tail:1}.

 \begin{listing}
 \input{CodeSamples/formal/herd/C-RCU-remove@whole.fcv}
 \caption{Canonical RCU Removal Litmus Test}
 \label{lst:formal:Canonical RCU Removal Litmus Test}
 \end{listing}

 \co{P0()} on \clnrefrange{P0start}{P0end}
 removes element \co{y} from the list by replacing it with element \co{z}
 (\clnref{assignnewtail}),
 waits for a \IX{grace period} (\clnref{sync}),
 and finally zeroes \co{y} to emulate \co{free()} (\clnref{free}).
 \co{P1()} on \clnrefrange{P1start}{P1end}
 executes within an RCU read-side critical section
 (\clnrefrange{rl}{rul}),
 picking up the list head (\clnref{rderef}) and then
 loading the next element (\clnref{read}).
 The next element should be non-zero, that is, not yet freed
 (\clnref{exists_}).
 Several other variables are output for debugging purposes
 (\clnref{locations_}).

 The output of the \co{herd} tool when running this litmus test features
 \co{Never}, indicating that \co{P0()} never accesses a freed element,
 as expected.
 Also as expected, removing \clnref{sync} results in \co{P0()}
 accessing a freed element, as indicated by the \co{Sometimes} in
 the \co{herd} output.
 \end{fcvref}

 \begin{listing}
 \ebresizeverb{.98}{\input{CodeSamples/formal/herd/C-RomanPenyaev-list-rcu-rr@whole.fcv}}
 \caption{Complex RCU Litmus Test}
 \label{lst:formal:Complex RCU Litmus Test}
 \end{listing}

 \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 A litmus test for a more complex example proposed by
 \ppl{Roman}{Penyaev}~\cite{RomanPenyaev2018rrRCU} is shown in
 \cref{lst:formal:Complex RCU Litmus Test}
 (\path{C-RomanPenyaev-list-rcu-rr.litmus}).
 In this example, readers (modeled by \co{P0()} on
 \clnrefrange{P0start}{P0end}) access a linked list
 in a round-robin fashion by ``leaking'' a pointer to the last
 list element accessed into variable \co{c}.
 Updaters (modeled by \co{P1()} on \clnrefrange{P1start}{P1end})
 remove an element, taking care to avoid disrupting current or future
 readers.

 \QuickQuiz{
 	Wait!!!
 	Isn't leaking pointers out of an RCU read-side critical
 	section a critical bug???
 }\QuickQuizAnswer{
 	Yes, it usually is a critical bug.
 	However, in this case, the updater has been cleverly constructed
 	to properly handle such pointer leaks.
 	But please don't make a habit of doing this sort of thing, and
 	especially don't do this without having put a lot of thought
 	into making some more conventional approach work.
 }\QuickQuizEnd

 \Clnrefrange{listtail}{listhead} define the initial linked
 list, tail first.
 In the Linux kernel, this would be a doubly linked circular list,
 but \co{herd} is currently incapable of modeling such a beast.
 The strategy is instead to use a singly linked linear list that
 is long enough that the end is never reached.
 \Clnref{rrcache} defines variable \co{c}, which is used to
 cache the list pointer between successive RCU read-side critical
 sections.

 Again, \co{P0()} on \clnrefrange{P0start}{P0end} models readers.
 This process models a pair of successive readers traversing round-robin
 through the list, with the first reader on \clnrefrange{rl1}{rul1}
 and the second reader on \clnrefrange{rl2}{rul2}.
 \Clnref{rdcache} fetches the pointer cached in \co{c}, and if
 \clnref{rdckcache} sees that the pointer was \co{NULL},
 \clnref{rdinitcache} restarts at the beginning of the list.
 In either case, \clnref{rdnext} advances to the next list element,
 and \clnref{rdupdcache} stores a pointer to this element back into
 variable \co{c}.
 \Clnrefrange{rl2}{rul2} repeat this process, but using
 registers \co{r3} and \co{r4} instead of \co{r1} and \co{r2}.
 As with
 \cref{lst:formal:Canonical RCU Removal Litmus Test},
 this litmus test stores zero to emulate \co{free()}, so
 \clnref{exists_} checks for any of these four registers being
 \co{NULL}, also known as zero.

 Because \co{P0()} leaks an RCU-protected pointer from its first
 RCU read-side critical section to its second, \co{P1()} must carry
 out its update (removing \co{x}) very carefully.
 \Clnref{updremove} removes \co{x} by linking \co{w} to \co{y}.
 \Clnref{updsync1} waits for readers, after which no subsequent reader
 has a path to \co{x} via the linked list.
 \Clnref{updrdcache} fetches \co{c}, and if \clnref{updckcache}
 determines that \co{c} references the newly removed \co{x},
 \clnref{updinitcache} sets \co{c} to \co{NULL}
 and \clnref{updsync2} again waits for readers, after which no
 subsequent reader can fetch \co{x} from \co{c}.
 In either case, \clnref{updfree} emulates \co{free()} by storing
 zero to \co{x}.

 \QuickQuiz{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	In \cref{lst:formal:Complex RCU Litmus Test},
 	why couldn't a reader fetch \co{c} just before \co{P1()}
 	zeroed it on \clnref{updinitcache}, and then later
 	store this same value back into \co{c} just after it was
 	zeroed, thus defeating the zeroing operation?
 	\end{fcvref}
 }\QuickQuizAnswer{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	Because the reader advances to the next element on
 	\clnref{rdnext}, thus avoiding storing a pointer to the
 	same element as was fetched.
 	\end{fcvref}
 }\QuickQuizEnd

 The output of the \co{herd} tool when running this litmus test features
 \co{Never}, indicating that \co{P0()} never accesses a freed element,
 as expected.
 Also as expected, removing either \co{synchronize_rcu()} results
 in \co{P1()} accessing a freed element, as indicated by \co{Sometimes}
 in the \co{herd} output.
 \end{fcvref}

 \QuickQuizSeries{%
 \QuickQuizB{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	In \cref{lst:formal:Complex RCU Litmus Test},
 	why not have just one call to \co{synchronize_rcu()}
 	immediately before \clnref{updfree}?
 	\end{fcvref}
 }\QuickQuizAnswerB{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	Because this results in \co{P0()} accessing a freed element.
 	But don't take my word for this, try it out in \co{herd}!
 	\end{fcvref}
 }\QuickQuizEndB
 %
 \QuickQuizE{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	Also in \cref{lst:formal:Complex RCU Litmus Test},
 	can't \clnref{updfree} be \co{WRITE_ONCE()} instead
 	of \co{smp_store_release()}?
 	\end{fcvref}
 }\QuickQuizAnswerE{
 	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
 	That is an excellent question.
 	As of late 2021, the answer is ``no one knows''.
 	Much depends on the semantics of \ARMv8's conditional-move
 	instruction.
 	While awaiting clarity on these semantics, \co{smp_store_release()}
 	is the safe choice.
 	\end{fcvref}
 }\QuickQuizEndE
 }

 These sections have shown how axiomatic approaches can successfully
 model synchronization primitives such as locking and RCU in C-language
 litmus tests.
 Longer term, the hope is that the axiomatic approaches will model
 even higher-level software artifacts, producing exponential
 verification speedups.
 This could potentially allow axiomatic verification of much larger
 software systems, perhaps incorporating spatial-synchronization techniques
 from separation
 logic~\cite{AlexeyGotsman2013ESOPRCU,PeterWOHearn2001SeparationLogic}.
 Another alternative is to press the axioms of boolean logic into service,
 as described in the next section.
	% formal/axiomatic.tex
	% mainfile: ../perfbook.tex
	% SPDX-License-Identifier: CC-BY-SA-3.0

	\section{Axiomatic Approaches}
	\label{sec:formal:Axiomatic Approaches}
	\OriginallyPublished{sec:formal:Axiomatic Approaches}{Axiomatic Approaches}{Linux Weekly News}{PaulEMcKenney2014weakaxiom}
	%
	\epigraph{Theory helps us to bear our ignorance of facts.}
	{George Santayana}

	Although the PPCMEM tool can solve the famous ``independent reads of
	independent writes'' (IRIW) litmus test shown in
	\cref{lst:formal:IRIW Litmus Test}, doing so requires no less than
	fourteen CPU hours and generates no less than ten gigabytes of state space.
	That said, this situation is a great improvement over that prior to the advent
	of PPCMEM, where solving this problem required perusing volumes of
	reference manuals, attempting proofs, discussing with experts, and
	then after all that, doubting the final answer.
	Although fourteen hours can seem like a long time, it is much shorter
	than weeks or even months.

	\begin{listing}
	\begin{fcvlabel}[ln:formal:IRIW Litmus Test]
	\begin{VerbatimL}[commandchars=\%\@\$]
	PPC IRIW.litmus
	""
	(* Traditional IRIW. *)
	{
	0:r1=1; 0:r2=x;
	1:r1=1; 1:r4=y;
	2:r2=x; 2:r4=y;
	3:r2=x; 3:r4=y;
	}
	P0 \| P1 \| P2 \| P3 ;
	stw r1,0(r2) \| stw r1,0(r4) \| lwz r3,0(r2) \| lwz r3,0(r4) ;
	\| \| sync \| sync ;
	\| \| lwz r5,0(r4) \| lwz r5,0(r2) ;

	exists
	(2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
	\end{VerbatimL}
	\end{fcvlabel}
	\caption{IRIW Litmus Test}
	\label{lst:formal:IRIW Litmus Test}
	\end{listing}

	However, the time required is a bit surprising given the simplicity
	of the litmus test, which has two threads storing to two separate variables
	and two other threads loading from these two variables in opposite
	orders.
	The assertion triggers if the two loading threads disagree on the order
	of the two stores.
	Even by the standards of memory-order litmus tests, this is quite simple.

	One reason for the amount of time and space consumed is that PPCMEM does
	a trace-based full-state-space search, which means that it must generate
	and evaluate all possible orders and combinations of events at the
	architectural level.
	At this level, both loads and stores correspond to ornate sequences
	of events and actions, resulting in a very large state space that must
	be completely searched, in turn resulting in large memory and CPU
	consumption.

	Of course, many of the traces are quite similar to one another, which
	suggests that an approach that treated similar traces as one might
	improve performace.
	One such approach is the axiomatic approach of
	\pplsur{Jade}{Alglave} et al.~\cite{Alglave:2014:HCM:2594291.2594347},
	which creates a set of axioms to represent the \IX{memory model} and then
	converts litmus tests to theorems that might be proven or disproven
	over this set of axioms.
	The resulting tool, called \qco{herd}, conveniently takes as input the
	same litmus tests as PPCMEM, including the IRIW litmus test shown in
	\cref{lst:formal:IRIW Litmus Test}.

	\begin{listing}
	\begin{fcvlabel}[ln:formal:Expanded IRIW Litmus Test]
	\begin{VerbatimL}[commandchars=\%\@\$]
	PPC IRIW5.litmus
	""
	(* Traditional IRIW, but with five stores instead of *)
	(* just one. *)
	{
	0:r1=1; 0:r2=x;
	1:r1=1; 1:r4=y;
	2:r2=x; 2:r4=y;
	3:r2=x; 3:r4=y;
	}
	P0 \| P1 \| P2 \| P3 ;
	stw r1,0(r2) \| stw r1,0(r4) \| lwz r3,0(r2) \| lwz r3,0(r4) ;
	addi r1,r1,1 \| addi r1,r1,1 \| sync \| sync ;
	stw r1,0(r2) \| stw r1,0(r4) \| lwz r5,0(r4) \| lwz r5,0(r2) ;
	addi r1,r1,1 \| addi r1,r1,1 \| \| ;
	stw r1,0(r2) \| stw r1,0(r4) \| \| ;
	addi r1,r1,1 \| addi r1,r1,1 \| \| ;
	stw r1,0(r2) \| stw r1,0(r4) \| \| ;
	addi r1,r1,1 \| addi r1,r1,1 \| \| ;
	stw r1,0(r2) \| stw r1,0(r4) \| \| ;

	exists
	(2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0)
	\end{VerbatimL}
	\end{fcvlabel}
	\caption{Expanded IRIW Litmus Test}
	\label{lst:formal:Expanded IRIW Litmus Test}
	\end{listing}

	However, where PPCMEM requires 14 CPU hours to solve IRIW, \co{herd} does so
	in 17 milliseconds, which represents a speedup of more than six orders
	of magnitude.
	That said, the problem is exponential in nature, so we should expect
	\co{herd} to exhibit exponential slowdowns for larger problems.
	And this is exactly what happens, for example, if we add four more writes
	per writing CPU as shown in
	\cref{lst:formal:Expanded IRIW Litmus Test},
	\co{herd} slows down by a factor of more than 50,000, requiring more than
	15 \emph{minutes} of CPU time.
	Adding threads also results in exponential
	slowdowns~\cite{PaulEMcKenney2014weakaxiom}.

	Despite their exponential nature, both PPCMEM and \co{herd} have proven quite
	useful for checking key parallel algorithms, including the queued-lock
	handoff on x86 systems.
	The weaknesses of the \co{herd} tool are similar to those of PPCMEM, which
	were described in
	\cref{sec:formal:PPCMEM Discussion}.
	There are some obscure (but very real) cases for which the PPCMEM and
	\co{herd} tools disagree, and as of 2021 many but not all of these disagreements
	was resolved.

	It would be helpful if the litmus tests could be written in C
	(as in \cref{lst:formal:Meaning of PPCMEM Litmus Test})
	rather than assembly
	(as in \cref{lst:formal:PPCMEM Litmus Test}).
	This is now possible, as will be described in the following sections.

	\subsection{Axiomatic Approaches and Locking}
	\label{sec:formal:Axiomatic Approaches and Locking}

	Axiomatic approaches may also be applied to higher-level
	languages and also to higher-level synchronization primitives, as
	exemplified by the lock-based litmus test shown in
	\cref{lst:formal:Locking Example} (\path{C-Lock1.litmus}).
	This litmus test can be modeled by the
	\IXalthmr{\acrf{lkmm}}{Linux kernel}{memory model}~%
	\cite{Alglave:2018:FSC:3173162.3177156,LucMaranget2018lock.cat}.
	As expected, the \co{herd} tool's output features the string \co{Never},
	correctly indicating that \co{P1()} cannot see \co{x} having a value
	of one.\footnote{
	The output of the \co{herd} tool is compatible with that
	of PPCMEM, so feel free to look at
	\cref{lst:formal:PPCMEM Detects an Error,%
	lst:formal:PPCMEM on Repaired Litmus Test}
	for examples showing the output format.}

	\begin{listing}
	\input{CodeSamples/formal/herd/C-Lock1@whole.fcv}
	\caption{Locking Example}
	\label{lst:formal:Locking Example}
	\end{listing}

	\QuickQuiz{
	What do you have to do to run \co{herd} on litmus tests like
	that shown in \cref{lst:formal:Locking Example}?
	}\QuickQuizAnswer{
	Get version v4.17 (or later) of the Linux-kernel source code,
	then follow the instructions in \path{tools/memory-model/README}
	to install the needed tools.
	Then follow the further instructions to run these tools on the
	litmus test of your choice.
	}\QuickQuizEnd

	\begin{listing}
	\input{CodeSamples/formal/herd/C-Lock2@whole.fcv}
	\caption{Broken Locking Example}
	\label{lst:formal:Broken Locking Example}
	\end{listing}

	Of course, if \co{P0()} and \co{P1()} use different locks, as shown in
	\cref{lst:formal:Broken Locking Example} (\path{C-Lock2.litmus}),
	then all bets are off.
	And in this case, the \co{herd} tool's output features the string
	\co{Sometimes}, correctly indicating that use of different locks allows
	\co{P1()} to see \co{x} having a value of one.

	\QuickQuiz{
	Why bother modeling locking directly?
	Why not simply emulate locking with atomic operations?
	}\QuickQuizAnswer{
	In a word, performance, as can be seen in
	\cref{tab:formal:Locking: Modeling vs. Emulation Time (s)}.
	The first column shows the number of \co{herd} processes modeled.
	The second column shows the \co{herd} runtime when modeling
	\co{spin_lock()} and \co{spin_unlock()} directly in \co{herd}'s
	cat language.
	The third column shows the \co{herd} runtime when emulating
	\co{spin_lock()} with \co{cmpxchg_acquire()} and \co{spin_unlock()}
	with \co{smp_store_release()}, using the \co{herd} \co{filter}
	clause to reject executions that fail to acquire the lock.
	The fourth column is like the third, but using \co{xchg_acquire()}
	instead of \co{cmpxchg_acquire()}.
	The fifth and sixth columns are like the third and fourth,
	but instead using the \co{herd} \co{exists} clause to reject
	executions that fail to acquire the lock.

	\begin{table}
	\rowcolors{10}{}{lightgray}
	\small
	\centering
	\newcommand{\lockfml}[1]{\multicolumn{1}{c}{\begin{picture}(6,8)(0,0)\rotatebox{90}{#1}\end{picture}}}
	\begin{tabular}{rrrrrr}
	\toprule
	& Model & \multicolumn{4}{c}{Emulate} \\
	\cmidrule(l){2-2} \cmidrule(l){3-6}
	& & \multicolumn{2}{c}{\tco{filter}} & \multicolumn{2}{c}{\tco{exists}} \\
	\cmidrule(l){3-4} \cmidrule(l){5-6}
	\lockfml{\# Proc.}
	&
	& \tco{cmpxchg}
	& \tco{xchg}
	& \tco{cmpxchg}
	& \tco{xchg}
	\\
	\cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6}
	2 & 0.004 & 0.022 & 0.027 & 0.039 & 0.058 \\
	3 & 0.041 & 0.743 & 0.968 & 1.653 & 3.203 \\
	4 & 0.374 & 59.565 & 74.818 & 151.962 & 500.960 \\
	5 & 4.905 \\
	\bottomrule
	\end{tabular}
	\caption{Locking:
	Modeling vs.\@ Emulation Time (s)}
	\label{tab:formal:Locking: Modeling vs. Emulation Time (s)}
	\end{table}

	Note also that use of the \co{filter} clause is about twice
	as fast as is use of the \co{exists} clause.
	This is no surprise because the \co{filter} clause allows
	early abandoning of excluded executions, where the executions
	that are excluded are the ones in which the lock is concurrently
	held by more than one process.

	More important, modeling \co{spin_lock()} and \co{spin_unlock()}
	directly ranges from five times faster to more than two orders
	of magnitude faster than modeling emulated locking.
	This should also be no surprise, as direct modeling raises
	the level of abstraction, thus reducing the number of events
	that \co{herd} must model.
	Because almost everything that \co{herd} does is of exponential
	computational complexity, modest reductions in the number of
	events produces exponentially large reductions in runtime.

	Thus, in formal verification even more than in parallel
	programming itself, divide and conquer!!!
	}\QuickQuizEnd

	But locking is not the only synchronization primitive that can be
	modeled directly:
	The next section looks at RCU\@.

	\subsection{Axiomatic Approaches and RCU}
	\label{sec:formal:Axiomatic Approaches and RCU}

	\begin{fcvref}[ln:formal:C-RCU-remove:whole]
	Axiomatic approaches can also analyze litmus tests involving
	RCU~\cite{Alglave:2018:FSC:3173162.3177156}.
	To that end,
	\cref{lst:formal:Canonical RCU Removal Litmus Test}
	(\path{C-RCU-remove.litmus})
	shows a litmus test corresponding to the canonical RCU-mediated
	removal from a linked list.
	As with the locking litmus test, this RCU litmus test can be
	modeled by LKMM, with similar performance advantages compared
	to modeling emulations of RCU\@.
	\Clnref{head} shows \co{x} as the list head, initially
	referencing \co{y}, which in turn is initialized to the value
	\co{2} on \clnref{tail:1}.

	\begin{listing}
	\input{CodeSamples/formal/herd/C-RCU-remove@whole.fcv}
	\caption{Canonical RCU Removal Litmus Test}
	\label{lst:formal:Canonical RCU Removal Litmus Test}
	\end{listing}

	\co{P0()} on \clnrefrange{P0start}{P0end}
	removes element \co{y} from the list by replacing it with element \co{z}
	(\clnref{assignnewtail}),
	waits for a \IX{grace period} (\clnref{sync}),
	and finally zeroes \co{y} to emulate \co{free()} (\clnref{free}).
	\co{P1()} on \clnrefrange{P1start}{P1end}
	executes within an RCU read-side critical section
	(\clnrefrange{rl}{rul}),
	picking up the list head (\clnref{rderef}) and then
	loading the next element (\clnref{read}).
	The next element should be non-zero, that is, not yet freed
	(\clnref{exists_}).
	Several other variables are output for debugging purposes
	(\clnref{locations_}).

	The output of the \co{herd} tool when running this litmus test features
	\co{Never}, indicating that \co{P0()} never accesses a freed element,
	as expected.
	Also as expected, removing \clnref{sync} results in \co{P0()}
	accessing a freed element, as indicated by the \co{Sometimes} in
	the \co{herd} output.
	\end{fcvref}

	\begin{listing}
	\ebresizeverb{.98}{\input{CodeSamples/formal/herd/C-RomanPenyaev-list-rcu-rr@whole.fcv}}
	\caption{Complex RCU Litmus Test}
	\label{lst:formal:Complex RCU Litmus Test}
	\end{listing}

	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	A litmus test for a more complex example proposed by
	\ppl{Roman}{Penyaev}~\cite{RomanPenyaev2018rrRCU} is shown in
	\cref{lst:formal:Complex RCU Litmus Test}
	(\path{C-RomanPenyaev-list-rcu-rr.litmus}).
	In this example, readers (modeled by \co{P0()} on
	\clnrefrange{P0start}{P0end}) access a linked list
	in a round-robin fashion by ``leaking'' a pointer to the last
	list element accessed into variable \co{c}.
	Updaters (modeled by \co{P1()} on \clnrefrange{P1start}{P1end})
	remove an element, taking care to avoid disrupting current or future
	readers.

	\QuickQuiz{
	Wait!!!
	Isn't leaking pointers out of an RCU read-side critical
	section a critical bug???
	}\QuickQuizAnswer{
	Yes, it usually is a critical bug.
	However, in this case, the updater has been cleverly constructed
	to properly handle such pointer leaks.
	But please don't make a habit of doing this sort of thing, and
	especially don't do this without having put a lot of thought
	into making some more conventional approach work.
	}\QuickQuizEnd

	\Clnrefrange{listtail}{listhead} define the initial linked
	list, tail first.
	In the Linux kernel, this would be a doubly linked circular list,
	but \co{herd} is currently incapable of modeling such a beast.
	The strategy is instead to use a singly linked linear list that
	is long enough that the end is never reached.
	\Clnref{rrcache} defines variable \co{c}, which is used to
	cache the list pointer between successive RCU read-side critical
	sections.

	Again, \co{P0()} on \clnrefrange{P0start}{P0end} models readers.
	This process models a pair of successive readers traversing round-robin
	through the list, with the first reader on \clnrefrange{rl1}{rul1}
	and the second reader on \clnrefrange{rl2}{rul2}.
	\Clnref{rdcache} fetches the pointer cached in \co{c}, and if
	\clnref{rdckcache} sees that the pointer was \co{NULL},
	\clnref{rdinitcache} restarts at the beginning of the list.
	In either case, \clnref{rdnext} advances to the next list element,
	and \clnref{rdupdcache} stores a pointer to this element back into
	variable \co{c}.
	\Clnrefrange{rl2}{rul2} repeat this process, but using
	registers \co{r3} and \co{r4} instead of \co{r1} and \co{r2}.
	As with
	\cref{lst:formal:Canonical RCU Removal Litmus Test},
	this litmus test stores zero to emulate \co{free()}, so
	\clnref{exists_} checks for any of these four registers being
	\co{NULL}, also known as zero.

	Because \co{P0()} leaks an RCU-protected pointer from its first
	RCU read-side critical section to its second, \co{P1()} must carry
	out its update (removing \co{x}) very carefully.
	\Clnref{updremove} removes \co{x} by linking \co{w} to \co{y}.
	\Clnref{updsync1} waits for readers, after which no subsequent reader
	has a path to \co{x} via the linked list.
	\Clnref{updrdcache} fetches \co{c}, and if \clnref{updckcache}
	determines that \co{c} references the newly removed \co{x},
	\clnref{updinitcache} sets \co{c} to \co{NULL}
	and \clnref{updsync2} again waits for readers, after which no
	subsequent reader can fetch \co{x} from \co{c}.
	In either case, \clnref{updfree} emulates \co{free()} by storing
	zero to \co{x}.

	\QuickQuiz{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	In \cref{lst:formal:Complex RCU Litmus Test},
	why couldn't a reader fetch \co{c} just before \co{P1()}
	zeroed it on \clnref{updinitcache}, and then later
	store this same value back into \co{c} just after it was
	zeroed, thus defeating the zeroing operation?
	\end{fcvref}
	}\QuickQuizAnswer{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	Because the reader advances to the next element on
	\clnref{rdnext}, thus avoiding storing a pointer to the
	same element as was fetched.
	\end{fcvref}
	}\QuickQuizEnd

	The output of the \co{herd} tool when running this litmus test features
	\co{Never}, indicating that \co{P0()} never accesses a freed element,
	as expected.
	Also as expected, removing either \co{synchronize_rcu()} results
	in \co{P1()} accessing a freed element, as indicated by \co{Sometimes}
	in the \co{herd} output.
	\end{fcvref}

	\QuickQuizSeries{%
	\QuickQuizB{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	In \cref{lst:formal:Complex RCU Litmus Test},
	why not have just one call to \co{synchronize_rcu()}
	immediately before \clnref{updfree}?
	\end{fcvref}
	}\QuickQuizAnswerB{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	Because this results in \co{P0()} accessing a freed element.
	But don't take my word for this, try it out in \co{herd}!
	\end{fcvref}
	}\QuickQuizEndB
	%
	\QuickQuizE{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	Also in \cref{lst:formal:Complex RCU Litmus Test},
	can't \clnref{updfree} be \co{WRITE_ONCE()} instead
	of \co{smp_store_release()}?
	\end{fcvref}
	}\QuickQuizAnswerE{
	\begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole]
	That is an excellent question.
	As of late 2021, the answer is ``no one knows''.
	Much depends on the semantics of \ARMv8's conditional-move
	instruction.
	While awaiting clarity on these semantics, \co{smp_store_release()}
	is the safe choice.
	\end{fcvref}
	}\QuickQuizEndE
	}

	These sections have shown how axiomatic approaches can successfully
	model synchronization primitives such as locking and RCU in C-language
	litmus tests.
	Longer term, the hope is that the axiomatic approaches will model
	even higher-level software artifacts, producing exponential
	verification speedups.
	This could potentially allow axiomatic verification of much larger
	software systems, perhaps incorporating spatial-synchronization techniques
	from separation
	logic~\cite{AlexeyGotsman2013ESOPRCU,PeterWOHearn2001SeparationLogic}.
	Another alternative is to press the axioms of boolean logic into service,
	as described in the next section.