| % formal/axiomatic.tex |
| |
| \section{Axiomatic Approaches} |
| \label{sec:formal:Axiomatic Approaches} |
| \OriginallyPublished{Section}{sec:formal:Axiomatic Approaches}{Axiomatic Approaches}{Linux Weekly News}{PaulEMcKenney2014weakaxiom} |
| % |
| \epigraph{Theory helps us to bear our ignorance of facts.} |
| {\emph{George Santayana}} |
| |
| \begin{listing}[tb] |
| \begin{linelabel}[ln:formal:IRIW Litmus Test] |
| \begin{VerbatimL}[commandchars=\%\@\$] |
| PPC IRIW.litmus |
| "" |
| (* Traditional IRIW. *) |
| { |
| 0:r1=1; 0:r2=x; |
| 1:r1=1; 1:r4=y; |
| 2:r2=x; 2:r4=y; |
| 3:r2=x; 3:r4=y; |
| } |
| P0 | P1 | P2 | P3 ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ; |
| | | sync | sync ; |
| | | lwz r5,0(r4) | lwz r5,0(r2) ; |
| |
| exists |
| (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0) |
| \end{VerbatimL} |
| \end{linelabel} |
| \caption{IRIW Litmus Test} |
| \label{lst:formal:IRIW Litmus Test} |
| \end{listing} |
| |
| Although the PPCMEM tool can solve the famous ``independent reads of |
| independent writes'' (IRIW) litmus test shown in |
| Listing~\ref{lst:formal:IRIW Litmus Test}, doing so requires no less than |
| fourteen CPU hours and generates no less than ten gigabytes of state space. |
| That said, this situation is a great improvement over that before the advent |
| of PPCMEM, where solving this problem required perusing volumes of |
| reference manuals, attempting proofs, discussing with experts, and |
| being unsure of the final answer. |
| Although fourteen hours can seem like a long time, it is much shorter |
| than weeks or even months. |
| |
| However, the time required is a bit surprising given the simplicity |
| of the litmus test, which has two threads storing to two separate variables |
| and two other threads loading from these two variables in opposite |
| orders. |
| The assertion triggers if the two loading threads disagree on the order |
| of the two stores. |
| This litmus test is simple, even by the standards of memory-order litmus |
| tests. |
| |
| One reason for the amount of time and space consumed is that PPCMEM does |
| a trace-based full-state-space search, which means that it must generate |
| and evaluate all possible orders and combinations of events at the |
| architectural level. |
| At this level, both loads and stores correspond to ornate sequences |
| of events and actions, resulting in a very large state space that must |
| be completely searched, in turn resulting in large memory and CPU |
| consumption. |
| |
| Of course, many of the traces are quite similar to one another, which |
| suggests that an approach that treated similar traces as one might |
| improve performace. |
| One such approach is the axiomatic approach of |
| Alglave et al.~\cite{Alglave:2014:HCM:2594291.2594347}, |
| which creates a set of axioms to represent the memory model and then |
| converts litmus tests to theorems that might be proven or disproven |
| over this set of axioms. |
| The resulting tool, called ``herd'', conveniently takes as input the |
| same litmus tests as PPCMEM, including the IRIW litmus test shown in |
| Listing~\ref{lst:formal:IRIW Litmus Test}. |
| |
| \begin{listing}[tb] |
| \begin{linelabel}[ln:formal:Expanded IRIW Litmus Test] |
| \begin{VerbatimL}[commandchars=\%\@\$] |
| PPC IRIW5.litmus |
| "" |
| (* Traditional IRIW, but with five stores instead of *) |
| (* just one. *) |
| { |
| 0:r1=1; 0:r2=x; |
| 1:r1=1; 1:r4=y; |
| 2:r2=x; 2:r4=y; |
| 3:r2=x; 3:r4=y; |
| } |
| P0 | P1 | P2 | P3 ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r3,0(r2) | lwz r3,0(r4) ; |
| addi r1,r1,1 | addi r1,r1,1 | sync | sync ; |
| stw r1,0(r2) | stw r1,0(r4) | lwz r5,0(r4) | lwz r5,0(r2) ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| addi r1,r1,1 | addi r1,r1,1 | | ; |
| stw r1,0(r2) | stw r1,0(r4) | | ; |
| |
| exists |
| (2:r3=1 /\ 2:r5=0 /\ 3:r3=1 /\ 3:r5=0) |
| \end{VerbatimL} |
| \end{linelabel} |
| \caption{Expanded IRIW Litmus Test} |
| \label{lst:formal:Expanded IRIW Litmus Test} |
| \end{listing} |
| |
| However, where PPCMEM requires 14 CPU hours to solve IRIW, herd does so |
| in 17 milliseconds, which represents a speedup of more than six orders |
| of magnitude. |
| That said, the problem is exponential in nature, so we should expect |
| herd to exhibit exponential slowdowns for larger problems. |
| And this is exactly what happens, for example, if we add four more writes |
| per writing CPU as shown in |
| Listing~\ref{lst:formal:Expanded IRIW Litmus Test}, |
| herd slows down by a factor of more than 50,000, requiring more than |
| 15 \emph{minutes} of CPU time. |
| Adding threads also results in exponential |
| slowdowns~\cite{PaulEMcKenney2014weakaxiom}. |
| |
| Despite their exponential nature, both PPCMEM and herd have proven quite |
| useful for checking key parallel algorithms, including the queued-lock |
| handoff on x86 systems. |
| The weaknesses of the herd tool are similar to those of PPCMEM, which |
| were described in |
| Section~\ref{sec:formal:PPCMEM Discussion}. |
| There are some obscure (but very real) cases for which the PPCMEM and |
| herd tools disagree, and as of late 2014 resolving these disagreements |
| was ongoing. |
| |
| It would be helpful if the litmus tests could be written in C |
| (as in Listing~\ref{lst:formal:Meaning of PPCMEM Litmus Test}) |
| rather than assembly |
| (as in Listing~\ref{lst:formal:PPCMEM Litmus Test}). |
| This is now possible, as will be described in the following sections. |
| |
| \subsection{Axiomatic Approaches and Locking} |
| \label{sec:formal:Axiomatic Approaches and Locking} |
| |
| \begin{listing}[tb] |
| \input{CodeSamples/formal/herd/C-Lock1@whole.fcv} |
| \caption{Locking Example} |
| \label{lst:formal:Locking Example} |
| \end{listing} |
| |
| Axiomatic approaches may also be applied to higher-level |
| languages and also to higher-level synchronization primitives, as |
| exemplified by the lock-based litmus test shown in |
| Listing~\ref{lst:formal:Locking Example} (\path{C-Lock1.litmus}). |
| This litmus test can be modeled by the Linux kernel memory model |
| (LKMM)~\cite{Alglave:2018:FSC:3173162.3177156}. |
| As expected, the \co{herd} tool's output features the string \co{Never}, |
| correctly indicating that \co{P1()} cannot see \co{x} having a value |
| of one.\footnote{ |
| The output of the \co{herd} tool is compatible with that |
| of PPCMEM, so feel free to look at |
| Listings~\ref{lst:formal:PPCMEM Detects an Error} |
| and~\ref{lst:formal:PPCMEM on Repaired Litmus Test} |
| for examples showing the output format.} |
| |
| \QuickQuiz{} |
| What do you have to do to run \co{herd} on litmus tests like |
| that shown in Listing~\ref{lst:formal:Locking Example}? |
| \QuickQuizAnswer{ |
| Get version v4.17 (or later) of the Linux-kernel source code, |
| then follow the instructions in \path{tools/memory-model/README} |
| to install the needed tools. |
| Then follow the further instructions to run these tools on the |
| litmus test of your choice. |
| } \QuickQuizEnd |
| |
| \begin{listing}[tb] |
| \input{CodeSamples/formal/herd/C-Lock2@whole.fcv} |
| \caption{Broken Locking Example} |
| \label{lst:formal:Broken Locking Example} |
| \end{listing} |
| |
| Of course, if \co{P0()} and \co{P1()} use different locks, as shown in |
| Listing~\ref{lst:formal:Broken Locking Example} (\path{C-Lock2.litmus}), |
| then all bets are off. |
| And in this case, the \co{herd} tool's output features the string |
| \co{Sometimes}, correctly indicating that use of different locks allows |
| \co{P1()} to see \co{x} having a value of one. |
| |
| \QuickQuiz{} |
| Why bother modeling locking directly? |
| Why not simply emulate locking with atomic operations? |
| \QuickQuizAnswer{ |
| \begin{table}[tb] |
| \rowcolors{2}{}{lightgray} |
| \small |
| \centering |
| \newcommand{\lockfml}[1]{\multicolumn{1}{c}{\begin{picture}(6,150)(0,0)\rotatebox{90}{#1}\end{picture}}} |
| \begin{tabular}{rrrrrr} |
| \lockfml{\# Processes} |
| & \lockfml{Model} |
| & \lockfml{Emulate: \tco{cmpxchg_acquire()}, \tco{filter}} |
| & \lockfml{Emulate: \tco{xchg_acquire()}, \tco{filter}} |
| & \lockfml{Emulate: \tco{cmpxchg_acquire()}, \tco{exists}} |
| & \lockfml{Emulate: \tco{xchg_acquire()}, \tco{exists}} |
| \\ |
| \midrule |
| 2 & 0.004 & 0.022 & 0.027 & 0.039 & 0.058 \\ |
| 3 & 0.041 & 0.743 & 0.968 & 1.653 & 3.203 \\ |
| 4 & 0.374 & 59.565 & 74.818 & 151.962 & 500.96 \\ |
| 5 & 4.905 \\ |
| \end{tabular} |
| \caption{Locking: Modeling vs. Emulation Time (s)} |
| \label{tab:formal:Locking: Modeling vs. Emulation Time (s)} |
| \end{table} |
| In a word, performance, as can be seen in |
| Table~\ref{tab:formal:Locking: Modeling vs. Emulation Time (s)}. |
| The first column shows the number of herd processes modeled. |
| The second column shows the herd runtime when modeling |
| \co{spin_lock()} and \co{spin_unlock()} directly in herd's |
| cat language. |
| The third column shows the herd runtime when emulating |
| \co{spin_lock()} with \co{cmpxchg_acquire()} and \co{spin_unlock()} |
| with \co{smp_store_release()}, using the herd \co{filter} |
| clause to reject executions that fail to acquire the lock. |
| The fourth column is like the third, but using \co{xchg_acquire()} |
| instead of \co{cmpxchg_acquire()}. |
| The fifth and sixth columns are like the third and fourth, |
| but instead using the herd \co{exists} clause to reject |
| executions that fail to acquire the lock. |
| |
| Note also that use of the \co{filter} clause is about twice |
| as fast as is use of the \co{exists} clause. |
| This is no surprise because the \co{filter} clause allows |
| early abandoning of excluded executions, where the executions |
| that are excluded are the ones in which the lock is concurrently |
| held by more than one process. |
| |
| More important, modeling \co{spin_lock()} and \co{spin_unlock()} |
| directly ranges from five times faster to more than two orders |
| of magnitude faster than modeling emulated locking. |
| This should also be no surprise, as direct modeling raises |
| the level of abstraction, thus reducing the number of events |
| that herd must model. |
| Because almost everything that herd does is of exponential |
| computational complexity, modest reductions in the number of |
| events produces exponentially large reductions in runtime. |
| |
| Thus, in formal verification even more than in parallel |
| programming itself, divide and conquer!!! |
| } \QuickQuizEnd |
| |
| But locking is not the only synchronization primitive that can be |
| modeled directly: |
| The next section looks at RCU. |
| |
| \subsection{Axiomatic Approaches and RCU} |
| \label{sec:formal:Axiomatic Approaches and RCU} |
| |
| \begin{listing}[tb] |
| \input{CodeSamples/formal/herd/C-RCU-remove@whole.fcv} |
| \caption{Canonical RCU Removal Litmus Test} |
| \label{lst:formal:Canonical RCU Removal Litmus Test} |
| \end{listing} |
| |
| \begin{lineref}[ln:formal:C-RCU-remove:whole] |
| Axiomatic approaches can also analyze litmus tests involving RCU. |
| To that end, |
| Listing~\ref{lst:formal:Canonical RCU Removal Litmus Test} |
| (\path{C-RCU-remove.litmus}) |
| shows a litmus test corresponding to the canonical RCU-mediated |
| removal from a linked list. |
| As with the locking litmus test, this RCU litmus test can be |
| modeled by LKMM, with similar performance advantages compared |
| to modeling emulations of RCU. |
| Line~\lnref{head} shows \co{x} as the list head, initially |
| referencing \co{y}, which in turn is initialized to the value |
| \co{2} on line~\lnref{tail:1}. |
| |
| \co{P0()} on lines~\lnref{P0start}--\lnref{P0end} |
| removes element \co{y} from the list by replacing it with element \co{z} |
| (line~\lnref{assignnewtail}), |
| waits for a grace period (line~\lnref{sync}), |
| and finally zeroes \co{y} to emulate \co{free()} (line~\lnref{free}). |
| \co{P1()} on lines~\lnref{P1start}--\lnref{P1end} |
| executes within an RCU read-side critical section |
| (lines~\lnref{rl}--\lnref{rul}), |
| picking up the list head (line~\lnref{rderef}) and then |
| loading the next element (line~\lnref{read}). |
| The next element should be non-zero, that is, not yet freed |
| (line~\lnref{exists_}). |
| Several other variables are output for debugging purposes |
| (line~\lnref{locations_}). |
| |
| The output of the \co{herd} tool when running this litmus test features |
| \co{Never}, indicating that \co{P0()} never accesses a freed element, |
| as expected. |
| Also as expected, removing line~\lnref{sync} results in \co{P0()} |
| accessing a freed element, as indicated by the \co{Sometimes} in |
| the \co{herd} output. |
| \end{lineref} |
| |
| \begin{listing}[tb] |
| \input{CodeSamples/formal/herd/C-RomanPenyaev-list-rcu-rr@whole.fcv} |
| \caption{Complex RCU Litmus Test} |
| \label{lst:formal:Complex RCU Litmus Test} |
| \end{listing} |
| |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| A litmus test for a more complex example proposed by |
| Roman Penyaev~\cite{RomanPenyaev2018rrRCU} is shown in |
| Listing~\ref{lst:formal:Complex RCU Litmus Test} |
| (\path{C-RomanPenyaev-list-rcu-rr.litmus}). |
| In this example, readers (modeled by \co{P0()} on |
| lines~\lnref{P0start}--\lnref{P0end}) access a linked list |
| in a round-robin fashion by ``leaking'' a pointer to the last |
| list element accessed into variable \co{c}. |
| Updaters (modeled by \co{P1()} on lines~\lnref{P1start}--\lnref{P1end}) |
| remove an element, taking care to avoid disrupting current or future |
| readers. |
| |
| \QuickQuiz{} |
| Wait!!! |
| Isn't leaking pointers out of an RCU read-side critical |
| section a critical bug??? |
| \QuickQuizAnswer{ |
| Yes, it usually is a critical bug. |
| However, in this case, the updater has been cleverly constructed |
| to properly handle such pointer leaks. |
| But please don't make a habit of doing this sort of thing, and |
| especially don't do this without having put a lot of thought |
| into making some more conventional approach work. |
| } \QuickQuizEnd |
| |
| Lines~\lnref{listtail}--\lnref{listhead} define the initial linked |
| list, tail first. |
| In the Linux kernel, this would be a doubly linked circular list, |
| but \co{herd} is currently incapable of modeling such a beast. |
| The strategy is instead to use a singly linked linear list that |
| is long enough that the end is never reached. |
| Line~\lnref{rrcache} defines variable \co{c}, which is used to |
| cache the list pointer between successive RCU read-side critical |
| sections. |
| |
| Again, \co{P0()} on lines~\lnref{P0start}--\lnref{P0end} models readers. |
| This process models a pair of successive readers traversing round-robin |
| through the list, with the first reader on lines~\lnref{rl1}--\lnref{rul1} |
| and the second reader on lines~\lnref{rl2}--\lnref{rul2}. |
| Line~\lnref{rdcache} fetches the pointer cached in \co{c}, and if |
| line~\lnref{rdckcache} sees that the pointer was \co{NULL}, |
| line~\lnref{rdinitcache} restarts at the beginning of the list. |
| In either case, line~\lnref{rdnext} advances to the next list element, |
| and line~\lnref{rdupdcache} stores a pointer to this element back into |
| variable \co{c}. |
| Lines~\lnref{rl2}--\lnref{rul2} repeat this process, but using |
| registers \co{r3} and \co{r4} instead of \co{r1} and \co{r2}. |
| As with |
| Listing~\ref{lst:formal:Canonical RCU Removal Litmus Test}, |
| this litmus test stores zero to emulate \co{free()}, so |
| line~\lnref{exists_} checks for any of these four registers being |
| \co{NULL}, also known as zero. |
| |
| Because \co{P0()} leaks an RCU-protected pointer from its first |
| RCU read-side critical section to its second, \co{P1()} must carry |
| out its update (removing \co{x}) very carefully. |
| Line~\lnref{updremove} removes \co{x} by linking \co{w} to \co{y}. |
| Line~\lnref{updsync1} waits for readers, after which no subsequent reader |
| has a path to \co{x} via the linked list. |
| Line~\lnref{updrdcache} fetches \co{c}, and if line~\lnref{updckcache} |
| determines that \co{c} references the newly removed \co{x}, |
| line~\lnref{updinitcache} sets \co{c} to \co{NULL} |
| and line~\lnref{updsync2} again waits for readers, after which no |
| subsequent reader can fetch \co{x} from \co{c}. |
| In either case, line~\lnref{updfree} emulates \co{free()} by storing |
| zero to \co{x}. |
| |
| \QuickQuiz{} |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| In Listing~\ref{lst:formal:Complex RCU Litmus Test}, |
| why couldn't a reader fetch \co{c} just before \co{P1()} |
| zeroed it on line~\lnref{updinitcache}, and then later |
| store this same value back into \co{c} just after it was |
| zeroed, thus defeating the zeroing operation? |
| \end{lineref} |
| \QuickQuizAnswer{ |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Because the reader advances to the next element on |
| line~\lnref{rdnext}, thus avoiding storing a pointer to the |
| same element as was fetched. |
| \end{lineref} |
| } \QuickQuizEnd |
| |
| The output of the \co{herd} tool when running this litmus test features |
| \co{Never}, indicating that \co{P0()} never accesses a freed element, |
| as expected. |
| Also as expected, removing either \co{synchronize_rcu()} results |
| in \co{P1()} accessing a freed element, as indicated by \co{Sometimes} |
| in the \co{herd} output. |
| \end{lineref} |
| |
| \QuickQuiz{} |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| In Listing~\ref{lst:formal:Complex RCU Litmus Test}, |
| why not have just one call to \co{synchronize_rcu()} |
| immediately before line~\lnref{updfree}? |
| \end{lineref} |
| \QuickQuizAnswer{ |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Because this results in \co{P0()} accessing a freed element. |
| But don't take my word for this, try it out in \co{herd}! |
| \end{lineref} |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| Also in Listing~\ref{lst:formal:Complex RCU Litmus Test}, |
| can't line~\lnref{updfree} be \co{WRITE_ONCE()} instead |
| of \co{smp_store_release()}? |
| \end{lineref} |
| \QuickQuizAnswer{ |
| \begin{lineref}[ln:formal:C-RomanPenyaev-list-rcu-rr:whole] |
| That is an excellent question. |
| As of late 2018, the answer is ``no one knows''. |
| Much depends on the semantics of ARMv8's conditional-move |
| instruction. |
| While awaiting clarity on these semantics, \co{smp_store_release()} |
| is the safe choice. |
| \end{lineref} |
| } \QuickQuizEnd |
| |
| These sections have shown how axiomatic approaches can successfully |
| model synchronization primitives such as locking and RCU in C-language |
| litmus tests. |
| Longer term, the hope is that the axiomatic approaches will model |
| even higher-level software artifacts, producing exponential |
| verification speedups. |
| This could potentially allow axiomatic verification of much larger |
| software systems. |
| Another alternative is to press the axioms of boolean logic into service, |
| as described in the next section. |