| % formal/axiomatic.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \section{Axiomatic Approaches} |
| \label{sec:formal:Axiomatic Approaches} |
| \OriginallyPublished{sec:formal:Axiomatic Approaches}{Axiomatic Approaches}{Linux Weekly News}{PaulEMcKenney2014weakaxiom} |
| % |
| \epigraph{Theory helps us to bear our ignorance of facts.} |
| {George Santayana} |
| |
| Although the PPCMEM tool can solve the famous ``independent reads of |
| independent writes'' (IRIW) litmus test shown in |
| \cref{lst:formal:IRIW Litmus Test}, doing so requires no less than |
| fourteen CPU hours and generates no less than ten gigabytes of state space. |
| That said, this situation is a great improvement over that prior to the advent |
| of PPCMEM, where solving this problem required perusing volumes of |
| reference manuals, attempting proofs, discussing with experts, and |
| then after all that, doubting the final answer. |
| Although fourteen hours can seem like a long time, it is much shorter |
| than weeks or even months. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:formal:IRIW Litmus Test] |
| \begin{VerbatimL}[commandchars=\%\@\$] |
| PPC IRIW.litmus |
| "" |
| (* Traditional IRIW. *) |
| { |
| 0:r1=1; 0:r2=x; |
| 1:r1=1; 1:r4=y; |
| 2:r2=x; 2:r4=y; |
| 3:r2=x; 3:r4=y; |
| } |
| P0 | P1 | P2 | P3 ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ; |
| | | sync | sync ; |
| | | lwz r5,0(r4) | lwz r5,0(r2) ; |
| |
| exists |
| (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0) |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{IRIW Litmus Test} |
| \label{lst:formal:IRIW Litmus Test} |
| \end{listing} |
| |
| However, the time required is a bit surprising given the simplicity |
| of the litmus test, which has two threads storing to two separate variables |
| and two other threads loading from these two variables in opposite |
| orders. |
| The assertion triggers if the two loading threads disagree on the order |
| of the two stores. |
| Even by the standards of memory-order litmus tests, this is quite simple. |
| |
| One reason for the amount of time and space consumed is that PPCMEM does |
| a trace-based full-state-space search, which means that it must generate |
| and evaluate all possible orders and combinations of events at the |
| architectural level. |
| At this level, both loads and stores correspond to ornate sequences |
| of events and actions, resulting in a very large state space that must |
| be completely searched, in turn resulting in large memory and CPU |
| consumption. |
| |
| Of course, many of the traces are quite similar to one another, which |
| suggests that an approach that treated similar traces as one might |
| improve performace. |
| One such approach is the axiomatic approach of |
| \pplsur{Jade}{Alglave} et al.~\cite{Alglave:2014:HCM:2594291.2594347}, |
| which creates a set of axioms to represent the \IX{memory model} and then |
| converts litmus tests to theorems that might be proven or disproven |
| over this set of axioms. |
| The resulting tool, called \qco{herd}, conveniently takes as input the |
| same litmus tests as PPCMEM, including the IRIW litmus test shown in |
| \cref{lst:formal:IRIW Litmus Test}. |
| |
| \begin{listing} |
| \begin{fcvlabel}[ln:formal:Expanded IRIW Litmus Test] |
| \begin{VerbatimL}[commandchars=\%\@\$] |
| PPC IRIW5.litmus |
| "" |
| (* Traditional IRIW, but with five stores instead of *) |
| (* just one. *) |
| { |
| 0:r1=1; 0:r2=x; |
| 1:r1=1; 1:r4=y; |
| 2:r2=x; 2:r4=y; |
| 3:r2=x; 3:r4=y; |
| } |
| P0 | P1 | P2 | P3 ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ; |
| addi r1,r1,1 | addi r1,r1,1 | sync | sync ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| |
| exists |
| (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0) |
| \end{VerbatimL} |
| \end{fcvlabel} |
| \caption{Expanded IRIW Litmus Test} |
| \label{lst:formal:Expanded IRIW Litmus Test} |
| \end{listing} |
| |
| However, where PPCMEM requires 14 CPU hours to solve IRIW, \co{herd} does so |
| in 17 milliseconds, which represents a speedup of more than six orders |
| of magnitude. |
| That said, the problem is exponential in nature, so we should expect |
| \co{herd} to exhibit exponential slowdowns for larger problems. |
| And this is exactly what happens, for example, if we add four more writes |
| per writing CPU as shown in |
| \cref{lst:formal:Expanded IRIW Litmus Test}, |
| \co{herd} slows down by a factor of more than 50,000, requiring more than |
| 15 \emph{minutes} of CPU time. |
| Adding threads also results in exponential |
| slowdowns~\cite{PaulEMcKenney2014weakaxiom}. |
| |
| Despite their exponential nature, both PPCMEM and \co{herd} have proven quite |
| useful for checking key parallel algorithms, including the queued-lock |
| handoff on x86 systems. |
| The weaknesses of the \co{herd} tool are similar to those of PPCMEM, which |
| were described in |
| \cref{sec:formal:PPCMEM Discussion}. |
| There are some obscure (but very real) cases for which the PPCMEM and |
| \co{herd} tools disagree, and as of 2021 many but not all of these disagreements |
| was resolved. |
| |
| It would be helpful if the litmus tests could be written in C |
| (as in \cref{lst:formal:Meaning of PPCMEM Litmus Test}) |
| rather than assembly |
| (as in \cref{lst:formal:PPCMEM Litmus Test}). |
| This is now possible, as will be described in the following sections. |
| |
| \subsection{Axiomatic Approaches and Locking} |
| \label{sec:formal:Axiomatic Approaches and Locking} |
| |
| Axiomatic approaches may also be applied to higher-level |
| languages and also to higher-level synchronization primitives, as |
| exemplified by the lock-based litmus test shown in |
| \cref{lst:formal:Locking Example} (\path{C-Lock1.litmus}). |
| This litmus test can be modeled by the |
| \IXalthmr{\acrf{lkmm}}{Linux kernel}{memory model}~% |
| \cite{Alglave:2018:FSC:3173162.3177156,LucMaranget2018lock.cat}. |
| As expected, the \co{herd} tool's output features the string \co{Never}, |
| correctly indicating that \co{P1()} cannot see \co{x} having a value |
| of one.\footnote{ |
| The output of the \co{herd} tool is compatible with that |
| of PPCMEM, so feel free to look at |
| \cref{lst:formal:PPCMEM Detects an Error,% |
| lst:formal:PPCMEM on Repaired Litmus Test} |
| for examples showing the output format.} |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock1@whole.fcv} |
| \caption{Locking Example} |
| \label{lst:formal:Locking Example} |
| \end{listing} |
| |
| \QuickQuiz{ |
| What do you have to do to run \co{herd} on litmus tests like |
| that shown in \cref{lst:formal:Locking Example}? |
| }\QuickQuizAnswer{ |
| Get version v4.17 (or later) of the Linux-kernel source code, |
| then follow the instructions in \path{tools/memory-model/README} |
| to install the needed tools. |
| Then follow the further instructions to run these tools on the |
| litmus test of your choice. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-Lock2@whole.fcv} |
| \caption{Broken Locking Example} |
| \label{lst:formal:Broken Locking Example} |
| \end{listing} |
| |
| Of course, if \co{P0()} and \co{P1()} use different locks, as shown in |
| \cref{lst:formal:Broken Locking Example} (\path{C-Lock2.litmus}), |
| then all bets are off. |
| And in this case, the \co{herd} tool's output features the string |
| \co{Sometimes}, correctly indicating that use of different locks allows |
| \co{P1()} to see \co{x} having a value of one. |
| |
| \QuickQuiz{ |
| Why bother modeling locking directly? |
| Why not simply emulate locking with atomic operations? |
| }\QuickQuizAnswer{ |
| In a word, performance, as can be seen in |
| \cref{tab:formal:Locking: Modeling vs. Emulation Time (s)}. |
| The first column shows the number of \co{herd} processes modeled. |
| The second column shows the \co{herd} runtime when modeling |
| \co{spin_lock()} and \co{spin_unlock()} directly in \co{herd}'s |
| cat language. |
| The third column shows the \co{herd} runtime when emulating |
| \co{spin_lock()} with \co{cmpxchg_acquire()} and \co{spin_unlock()} |
| with \co{smp_store_release()}, using the \co{herd} \co{filter} |
| clause to reject executions that fail to acquire the lock. |
| The fourth column is like the third, but using \co{xchg_acquire()} |
| instead of \co{cmpxchg_acquire()}. |
| The fifth and sixth columns are like the third and fourth, |
| but instead using the \co{herd} \co{exists} clause to reject |
| executions that fail to acquire the lock. |
| |
| \begin{table} |
| \rowcolors{10}{}{lightgray} |
| \small |
| \centering |
| \newcommand{\lockfml}[1]{\multicolumn{1}{c}{\begin{picture}(6,8)(0,0)\rotatebox{90}{#1}\end{picture}}} |
| \begin{tabular}{rrrrrr} |
| \toprule |
| & Model & \multicolumn{4}{c}{Emulate} \\ |
| \cmidrule(l){2-2} \cmidrule(l){3-6} |
| & & \multicolumn{2}{c}{\tco{filter}} & \multicolumn{2}{c}{\tco{exists}} \\ |
| \cmidrule(l){3-4} \cmidrule(l){5-6} |
| \lockfml{\# Proc.} |
| & |
| & \tco{cmpxchg} |
| & \tco{xchg} |
| & \tco{cmpxchg} |
| & \tco{xchg} |
| \\ |
| \cmidrule{1-1} \cmidrule(l){2-2} \cmidrule(l){3-4} \cmidrule(l){5-6} |
| 2 & 0.004 & 0.022 & 0.027 & 0.039 & 0.058 \\ |
| 3 & 0.041 & 0.743 & 0.968 & 1.653 & 3.203 \\ |
| 4 & 0.374 & 59.565 & 74.818 & 151.962 & 500.960 \\ |
| 5 & 4.905 \\ |
| \bottomrule |
| \end{tabular} |
| \caption{Locking: |
| Modeling vs.\@ Emulation Time (s)} |
| \label{tab:formal:Locking: Modeling vs. Emulation Time (s)} |
| \end{table} |
| |
| Note also that use of the \co{filter} clause is about twice |
| as fast as is use of the \co{exists} clause. |
| This is no surprise because the \co{filter} clause allows |
| early abandoning of excluded executions, where the executions |
| that are excluded are the ones in which the lock is concurrently |
| held by more than one process. |
| |
| More important, modeling \co{spin_lock()} and \co{spin_unlock()} |
| directly ranges from five times faster to more than two orders |
| of magnitude faster than modeling emulated locking. |
| This should also be no surprise, as direct modeling raises |
| the level of abstraction, thus reducing the number of events |
| that \co{herd} must model. |
| Because almost everything that \co{herd} does is of exponential |
| computational complexity, modest reductions in the number of |
| events produces exponentially large reductions in runtime. |
| |
| Thus, in formal verification even more than in parallel |
| programming itself, divide and conquer!!! |
| }\QuickQuizEnd |
| |
| But locking is not the only synchronization primitive that can be |
| modeled directly: |
| The next section looks at RCU\@. |
| |
| \subsection{Axiomatic Approaches and RCU} |
| \label{sec:formal:Axiomatic Approaches and RCU} |
| |
| \begin{fcvref}[ln:formal:C-RCU-remove:whole] |
| Axiomatic approaches can also analyze litmus tests involving |
| RCU~\cite{Alglave:2018:FSC:3173162.3177156}. |
| To that end, |
| \cref{lst:formal:Canonical RCU Removal Litmus Test} |
| (\path{C-RCU-remove.litmus}) |
| shows a litmus test corresponding to the canonical RCU-mediated |
| removal from a linked list. |
| As with the locking litmus test, this RCU litmus test can be |
| modeled by LKMM, with similar performance advantages compared |
| to modeling emulations of RCU\@. |
| \Clnref{head} shows \co{x} as the list head, initially |
| referencing \co{y}, which in turn is initialized to the value |
| \co{2} on \clnref{tail:1}. |
| |
| \begin{listing} |
| \input{CodeSamples/formal/herd/C-RCU-remove@whole.fcv} |
| \caption{Canonical RCU Removal Litmus Test} |
| \label{lst:formal:Canonical RCU Removal Litmus Test} |
| \end{listing} |
| |
| \co{P0()} on \clnrefrange{P0start}{P0end} |
| removes element \co{y} from the list by replacing it with element \co{z} |
| (\clnref{assignnewtail}), |
| waits for a \IX{grace period} (\clnref{sync}), |
| and finally zeroes \co{y} to emulate \co{free()} (\clnref{free}). |
| \co{P1()} on \clnrefrange{P1start}{P1end} |
| executes within an RCU read-side critical section |
| (\clnrefrange{rl}{rul}), |
| picking up the list head (\clnref{rderef}) and then |
| loading the next element (\clnref{read}). |
| The next element should be non-zero, that is, not yet freed |
| (\clnref{exists_}). |
| Several other variables are output for debugging purposes |
| (\clnref{locations_}). |
| |
| The output of the \co{herd} tool when running this litmus test features |
| \co{Never}, indicating that \co{P0()} never accesses a freed element, |
| as expected. |
| Also as expected, removing \clnref{sync} results in \co{P0()} |
| accessing a freed element, as indicated by the \co{Sometimes} in |
| the \co{herd} output. |
| \end{fcvref} |
| |
| \begin{listing} |
| \ebresizeverb{.98}{\input{CodeSamples/formal/herd/C-RomanPenyaev-list-rcu-rr@whole.fcv}} |
| \caption{Complex RCU Litmus Test} |
| \label{lst:formal:Complex RCU Litmus Test} |
| \end{listing} |
| |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| A litmus test for a more complex example proposed by |
| \ppl{Roman}{Penyaev}~\cite{RomanPenyaev2018rrRCU} is shown in |
| \cref{lst:formal:Complex RCU Litmus Test} |
| (\path{C-RomanPenyaev-list-rcu-rr.litmus}). |
| In this example, readers (modeled by \co{P0()} on |
| \clnrefrange{P0start}{P0end}) access a linked list |
| in a round-robin fashion by ``leaking'' a pointer to the last |
| list element accessed into variable \co{c}. |
| Updaters (modeled by \co{P1()} on \clnrefrange{P1start}{P1end}) |
| remove an element, taking care to avoid disrupting current or future |
| readers. |
| |
| \QuickQuiz{ |
| Wait!!! |
| Isn't leaking pointers out of an RCU read-side critical |
| section a critical bug??? |
| }\QuickQuizAnswer{ |
| Yes, it usually is a critical bug. |
| However, in this case, the updater has been cleverly constructed |
| to properly handle such pointer leaks. |
| But please don't make a habit of doing this sort of thing, and |
| especially don't do this without having put a lot of thought |
| into making some more conventional approach work. |
| }\QuickQuizEnd |
| |
| \Clnrefrange{listtail}{listhead} define the initial linked |
| list, tail first. |
| In the Linux kernel, this would be a doubly linked circular list, |
| but \co{herd} is currently incapable of modeling such a beast. |
| The strategy is instead to use a singly linked linear list that |
| is long enough that the end is never reached. |
| \Clnref{rrcache} defines variable \co{c}, which is used to |
| cache the list pointer between successive RCU read-side critical |
| sections. |
| |
| Again, \co{P0()} on \clnrefrange{P0start}{P0end} models readers. |
| This process models a pair of successive readers traversing round-robin |
| through the list, with the first reader on \clnrefrange{rl1}{rul1} |
| and the second reader on \clnrefrange{rl2}{rul2}. |
| \Clnref{rdcache} fetches the pointer cached in \co{c}, and if |
| \clnref{rdckcache} sees that the pointer was \co{NULL}, |
| \clnref{rdinitcache} restarts at the beginning of the list. |
| In either case, \clnref{rdnext} advances to the next list element, |
| and \clnref{rdupdcache} stores a pointer to this element back into |
| variable \co{c}. |
| \Clnrefrange{rl2}{rul2} repeat this process, but using |
| registers \co{r3} and \co{r4} instead of \co{r1} and \co{r2}. |
| As with |
| \cref{lst:formal:Canonical RCU Removal Litmus Test}, |
| this litmus test stores zero to emulate \co{free()}, so |
| \clnref{exists_} checks for any of these four registers being |
| \co{NULL}, also known as zero. |
| |
| Because \co{P0()} leaks an RCU-protected pointer from its first |
| RCU read-side critical section to its second, \co{P1()} must carry |
| out its update (removing \co{x}) very carefully. |
| \Clnref{updremove} removes \co{x} by linking \co{w} to \co{y}. |
| \Clnref{updsync1} waits for readers, after which no subsequent reader |
| has a path to \co{x} via the linked list. |
| \Clnref{updrdcache} fetches \co{c}, and if \clnref{updckcache} |
| determines that \co{c} references the newly removed \co{x}, |
| \clnref{updinitcache} sets \co{c} to \co{NULL} |
| and \clnref{updsync2} again waits for readers, after which no |
| subsequent reader can fetch \co{x} from \co{c}. |
| In either case, \clnref{updfree} emulates \co{free()} by storing |
| zero to \co{x}. |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| In \cref{lst:formal:Complex RCU Litmus Test}, |
| why couldn't a reader fetch \co{c} just before \co{P1()} |
| zeroed it on \clnref{updinitcache}, and then later |
| store this same value back into \co{c} just after it was |
| zeroed, thus defeating the zeroing operation? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Because the reader advances to the next element on |
| \clnref{rdnext}, thus avoiding storing a pointer to the |
| same element as was fetched. |
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| The output of the \co{herd} tool when running this litmus test features |
| \co{Never}, indicating that \co{P0()} never accesses a freed element, |
| as expected. |
| Also as expected, removing either \co{synchronize_rcu()} results |
| in \co{P1()} accessing a freed element, as indicated by \co{Sometimes} |
| in the \co{herd} output. |
| \end{fcvref} |
| |
| \QuickQuizSeries{% |
| \QuickQuizB{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| In \cref{lst:formal:Complex RCU Litmus Test}, |
| why not have just one call to \co{synchronize_rcu()} |
| immediately before \clnref{updfree}? |
| \end{fcvref} |
| }\QuickQuizAnswerB{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Because this results in \co{P0()} accessing a freed element. |
| But don't take my word for this, try it out in \co{herd}! |
| \end{fcvref} |
| }\QuickQuizEndB |
| % |
| \QuickQuizE{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Also in \cref{lst:formal:Complex RCU Litmus Test}, |
| can't \clnref{updfree} be \co{WRITE_ONCE()} instead |
| of \co{smp_store_release()}? |
| \end{fcvref} |
| }\QuickQuizAnswerE{ |
| \begin{fcvref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| That is an excellent question. |
| As of late 2021, the answer is ``no one knows''. |
| Much depends on the semantics of \ARMv8's conditional-move |
| instruction. |
| While awaiting clarity on these semantics, \co{smp_store_release()} |
| is the safe choice. |
| \end{fcvref} |
| }\QuickQuizEndE |
| } |
| |
| These sections have shown how axiomatic approaches can successfully |
| model synchronization primitives such as locking and RCU in C-language |
| litmus tests. |
| Longer term, the hope is that the axiomatic approaches will model |
| even higher-level software artifacts, producing exponential |
| verification speedups. |
| This could potentially allow axiomatic verification of much larger |
| software systems, perhaps incorporating spatial-synchronization techniques |
| from separation |
| logic~\cite{AlexeyGotsman2013ESOPRCU,PeterWOHearn2001SeparationLogic}. |
| Another alternative is to press the axioms of boolean logic into service, |
| as described in the next section. |