blob: 87a704001b83fac4a08fc1dd215846e6fb6edbb0 [file] [log] [blame]
% formal/spinhint.html
\section{General-Purpose State-Space Search}
\label{sec:formal:General-Purpose State-Space Search}
This section features the general-purpose Promela and spin tools,
which may be used to carry out a full
state-space search of many types of multi-threaded code.
They are also quite useful for verifying data communication protocols.
Section~\ref{sec:formal:Promela and Spin}
introduces Promela and spin, including a couple of warm-up exercises
verifying both non-atomic and atomic increment.
Section~\ref{sec:formal:How to Use Promela}
describes use of Promela, including example command lines and a
comparison of Promela syntax to that of C.
Section~\ref{sec:formal:Promela Example: Locking}
shows how Promela may be used to verify locking,
\ref{sec:formal:Promela Example: QRCU}
uses Promela to verify an unusual implementation of RCU named ``QRCU'',
and finally
Section~\ref{sec:formal:Promela Parable: dynticks and Preemptible RCU}
applies Promela to RCU's dyntick-idle implementation.
\subsection{Promela and Spin}
\label{sec:formal:Promela and Spin}
Promela is a language designed to help verify protocols, but which
can also be used to verify small parallel algorithms.
You recode your algorithm and correctness constraints in the C-like
language Promela, and then use Spin to translate it into a C program
that you can compile and run.
The resulting program conducts a full state-space search of your
algorithm, either verifying or finding counter-examples for
assertions that you can include in your Promela program.
This full-state search can be extremely powerful, but can also be a two-edged
sword.
If your algorithm is too complex or your Promela implementation is
careless, there might be more states than fit in memory.
Furthermore, even given sufficient memory, the state-space search might
well run for longer than the expected lifetime of the universe.
Therefore, use this tool for compact but complex parallel algorithms.
Attempts to naively apply it to even moderate-scale algorithms (let alone
the full Linux kernel) will end badly.
Promela and Spin may be downloaded from
\url{http://spinroot.com/spin/whatispin.html}.
The above site also gives links to Gerard Holzmann's excellent
book~\cite{Holzmann03a} on Promela and Spin,
as well as searchable online references starting at:
\url{http://www.spinroot.com/spin/Man/index.html}.
The remainder of this section describes how to use Promela to debug
parallel algorithms, starting with simple examples and progressing to
more complex uses.
\subsubsection{Promela Warm-Up: Non-Atomic Increment}
\label{sec:formal:Promela Warm-Up: Non-Atomic Increment}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 #define NUMPROCS 2
2
3 byte counter = 0;
4 byte progress[NUMPROCS];
5
6 proctype incrementer(byte me)
7 {
8 int temp;
9
10 temp = counter;
11 counter = temp + 1;
12 progress[me] = 1;
13 }
14
15 init {
16 int i = 0;
17 int sum = 0;
18
19 atomic {
20 i = 0;
21 do
22 :: i < NUMPROCS ->
23 progress[i] = 0;
24 run incrementer(i);
25 i++
26 :: i >= NUMPROCS -> break
27 od;
28 }
29 atomic {
30 i = 0;
31 sum = 0;
32 do
33 :: i < NUMPROCS ->
34 sum = sum + progress[i];
35 i++
36 :: i >= NUMPROCS -> break
37 od;
38 assert(sum < NUMPROCS || counter == NUMPROCS)
39 }
40 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Promela Code for Non-Atomic Increment}
\label{fig:analysis:Promela Code for Non-Atomic Increment}
\end{figure}
Figure~\ref{fig:analysis:Promela Code for Non-Atomic Increment}
demonstrates the textbook race condition
resulting from non-atomic increment.
Line~1 defines the number of processes to run (we will vary this
to see the effect on state space), line~3 defines the counter,
and line~4 is used to implement the assertion that appears on
lines~29-39.
Lines~6-13 define a process that increments the counter non-atomically.
The argument \co{me} is the process number, set by the initialization
block later in the code.
Because simple Promela statements are each assumed atomic, we must
break the increment into the two statements on lines~10-11.
The assignment on line~12 marks the process's completion.
Because the Spin system will fully search the state space, including
all possible sequences of states, there is no need for the loop
that would be used for conventional testing.
Lines~15-40 are the initialization block, which is executed first.
Lines~19-28 actually do the initialization, while lines~29-39
perform the assertion.
Both are atomic blocks in order to avoid unnecessarily increasing
the state space: because they are not part of the algorithm proper,
we lose no verification coverage by making them atomic.
The do-od construct on lines~21-27 implements a Promela loop,
which can be thought of as a C {\tt for (;;)} loop containing a
\co{switch} statement that allows expressions in case labels.
The condition blocks (prefixed by {\tt ::})
are scanned non-deterministically,
though in this case only one of the conditions can possibly hold at a given
time.
The first block of the do-od from lines~22-25 initializes the i-th
incrementer's progress cell, runs the i-th incrementer's process, and
then increments the variable \co{i}.
The second block of the do-od on line~26 exits the loop once
these processes have been started.
The atomic block on lines~29-39 also contains a similar do-od
loop that sums up the progress counters.
The {\tt assert()} statement on line~38 verifies that if all processes
have been completed, then all counts have been correctly recorded.
You can build and run this program as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin -a increment.spin # Translate the model to C
cc -DSAFETY -o pan pan.c # Compile the model
./pan # Run the model
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{figure*}[tbp]
{ \scriptsize
\begin{verbbox}
pan: assertion violated ((sum<2)||(counter==2)) (at depth 20)
pan: wrote increment.spin.trail
(Spin Version 4.2.5 -- 2 April 2005)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 40 byte, depth reached 22, errors: 1
45 states, stored
13 states, matched
58 transitions (= stored+matched)
51 atomic steps
hash conflicts: 0 (resolved)
2.622 memory usage (Mbyte)
\end{verbbox}
}
\centering
\theverbbox
\caption{Non-Atomic Increment spin Output}
\label{fig:analysis:Non-Atomic Increment spin Output}
\end{figure*}
This will produce output as shown in
Figure~\ref{fig:analysis:Non-Atomic Increment spin Output}.
The first line tells us that our assertion was violated (as expected
given the non-atomic increment!).
The second line that a \co{trail} file was written describing how the
assertion was violated.
The ``Warning'' line reiterates that all was not well with our model.
The second paragraph describes the type of state-search being carried out,
in this case for assertion violations and invalid end states.
The third paragraph gives state-size statistics: this small model had only
45 states.
The final line shows memory usage.
The \co{trail} file may be rendered human-readable as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin -t -p increment.spin
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{figure*}[htbp]
{ \scriptsize
\begin{verbbox}
Starting :init: with pid 0
1: proc 0 (:init:) line 20 "increment.spin" (state 1) [i = 0]
2: proc 0 (:init:) line 22 "increment.spin" (state 2) [((i<2))]
2: proc 0 (:init:) line 23 "increment.spin" (state 3) [progress[i] = 0]
Starting incrementer with pid 1
3: proc 0 (:init:) line 24 "increment.spin" (state 4) [(run incrementer(i))]
3: proc 0 (:init:) line 25 "increment.spin" (state 5) [i = (i+1)]
4: proc 0 (:init:) line 22 "increment.spin" (state 2) [((i<2))]
4: proc 0 (:init:) line 23 "increment.spin" (state 3) [progress[i] = 0]
Starting incrementer with pid 2
5: proc 0 (:init:) line 24 "increment.spin" (state 4) [(run incrementer(i))]
5: proc 0 (:init:) line 25 "increment.spin" (state 5) [i = (i+1)]
6: proc 0 (:init:) line 26 "increment.spin" (state 6) [((i>=2))]
7: proc 0 (:init:) line 21 "increment.spin" (state 10) [break]
8: proc 2 (incrementer) line 10 "increment.spin" (state 1) [temp = counter]
9: proc 1 (incrementer) line 10 "increment.spin" (state 1) [temp = counter]
10: proc 2 (incrementer) line 11 "increment.spin" (state 2) [counter = (temp+1)]
11: proc 2 (incrementer) line 12 "increment.spin" (state 3) [progress[me] = 1]
12: proc 2 terminates
13: proc 1 (incrementer) line 11 "increment.spin" (state 2) [counter = (temp+1)]
14: proc 1 (incrementer) line 12 "increment.spin" (state 3) [progress[me] = 1]
15: proc 1 terminates
16: proc 0 (:init:) line 30 "increment.spin" (state 12) [i = 0]
16: proc 0 (:init:) line 31 "increment.spin" (state 13) [sum = 0]
17: proc 0 (:init:) line 33 "increment.spin" (state 14) [((i<2))]
17: proc 0 (:init:) line 34 "increment.spin" (state 15) [sum = (sum+progress[i])]
17: proc 0 (:init:) line 35 "increment.spin" (state 16) [i = (i+1)]
18: proc 0 (:init:) line 33 "increment.spin" (state 14) [((i<2))]
18: proc 0 (:init:) line 34 "increment.spin" (state 15) [sum = (sum+progress[i])]
18: proc 0 (:init:) line 35 "increment.spin" (state 16) [i = (i+1)]
19: proc 0 (:init:) line 36 "increment.spin" (state 17) [((i>=2))]
20: proc 0 (:init:) line 32 "increment.spin" (state 21) [break]
spin: line 38 "increment.spin", Error: assertion violated
spin: text of failed assertion: assert(((sum<2)||(counter==2)))
21: proc 0 (:init:) line 38 "increment.spin" (state 22) [assert(((sum<2)||(counter==2)))]
spin: trail ends after 21 steps
#processes: 1
counter = 1
progress[0] = 1
progress[1] = 1
21: proc 0 (:init:) line 40 "increment.spin" (state 24) <valid end state>
3 processes created
\end{verbbox}
}
\centering
\theverbbox
\caption{Non-Atomic Increment Error Trail}
\label{fig:analysis:Non-Atomic Increment Error Trail}
\end{figure*}
This gives the output shown in
Figure~\ref{fig:analysis:Non-Atomic Increment Error Trail}.
As can be seen, the first portion of the init block created both
incrementer processes, both of which first fetched the counter,
then both incremented and stored it, losing a count.
The assertion then triggered, after which the global state is displayed.
\subsubsection{Promela Warm-Up: Atomic Increment}
\label{sec:formal:Promela Warm-Up: Atomic Increment}
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 proctype incrementer(byte me)
2 {
3 int temp;
4
5 atomic {
6 temp = counter;
7 counter = temp + 1;
8 }
9 progress[me] = 1;
10 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Promela Code for Atomic Increment}
\label{fig:analysis:Promela Code for Atomic Increment}
\end{figure}
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
(Spin Version 4.2.5 -- 2 April 2005)
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 40 byte, depth reached 20, errors: 0
52 states, stored
21 states, matched
73 transitions (= stored+matched)
66 atomic steps
hash conflicts: 0 (resolved)
2.622 memory usage (Mbyte)
unreached in proctype incrementer
(0 of 5 states)
unreached in proctype :init:
(0 of 24 states)
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Increment spin Output}
\label{fig:analysis:Atomic Increment spin Output}
\end{figure}
It is easy to fix this example by placing the body of the incrementer
processes in an atomic blocks as shown in
Figure~\ref{fig:analysis:Promela Code for Atomic Increment}.
One could also have simply replaced the pair of statements with
{\tt counter = counter + 1}, because Promela statements are
atomic.
Either way, running this modified model gives us an error-free traversal
of the state space, as shown in
Figure~\ref{fig:analysis:Atomic Increment spin Output}.
\begin{table}
\footnotesize
\centering
\begin{tabular}{c|r|r}
\# incrementers & \# states & megabytes \\
\hline
\hline
1 & 11 & 2.6 \\
\hline
2 & 52 & 2.6 \\
\hline
3 & 372 & 2.6 \\
\hline
4 & 3,496 & 2.7 \\
\hline
5 & 40,221 & 5.0 \\
\hline
6 & 545,720 & 40.5 \\
\hline
7 & 8,521,450 & 652.7 \\
\end{tabular}
\caption{Memory Usage of Increment Model}
\label{tab:advsync:Memory Usage of Increment Model}
\end{table}
Table~\ref{tab:advsync:Memory Usage of Increment Model}
shows the number of states and memory consumed
as a function of number of incrementers modeled
(by redefining {\tt NUMPROCS}):
Running unnecessarily large models is thus subtly discouraged, although
652MB is well within the limits of modern desktop and laptop machines.
With this example under our belt, let's take a closer look at the
commands used to analyze Promela models and then look at more
elaborate examples.
\subsection{How to Use Promela}
\label{sec:formal:How to Use Promela}
Given a source file \path{qrcu.spin}, one can use the following commands:
\begin{description}[style=nextline]
\item [\tco{spin -a qrcu.spin}]
Create a file \path{pan.c} that fully searches the state machine.
\item [\tco{cc -DSAFETY -o pan pan.c}]
Compile the generated state-machine search. The \co{-DSAFETY}
generates optimizations that are appropriate if you have only
assertions (and perhaps \co{never} statements). If you have
liveness, fairness, or forward-progress checks, you may need
to compile without \co{-DSAFETY}. If you leave off \co{-DSAFETY}
when you could have used it, the program will let you know.
The optimizations produced by \co{-DSAFETY} greatly speed things
up, so you should use it when you can.
An example situation where you cannot use \co{-DSAFETY} is
when checking for livelocks (AKA ``non-progress cycles'')
via \co{-DNP}.
\item [\tco{./pan}]
This actually searches the state space. The number of states
can reach into the tens of millions with very small state
machines, so you will need a machine with large memory.
For example, \path{qrcu.spin} with 3 readers and 2 updaters required
2.7GB of memory.
If you aren't sure whether your machine has enough memory,
run \co{top} in one window and \co{./pan} in another. Keep the
focus on the \co{./pan} window so that you can quickly kill
execution if need be. As soon as CPU time drops much below
100\%, kill \co{./pan}. If you have removed focus from the
window running \co{./pan}, you may wait a long time for the
windowing system to grab enough memory to do anything for
you.
Don't forget to capture the output, especially
if you are working on a remote machine.
If your model includes forward-progress checks, you will likely
need to enable ``weak fairness'' via the \co{-f} command-line
argument to \co{./pan}.
If your forward-progress checks involve \co{accept} labels,
you will also need the \co{-a} argument.
% forward reference to model: formal.2009.02.19a in
% /home/linux/git/userspace-rcu/formal-model.
\item [\tco{spin -t -p qrcu.spin}]
Given \co{trail} file output by a run that encountered an
error, output the sequence of steps leading to that error.
The \co{-g} flag will also include the values of changed
global variables, and the \co{-l} flag will also include
the values of changed local variables.
\end{description}
\subsubsection{Promela Peculiarities}
\label{sec:formal:Promela Peculiarities}
Although all computer languages have underlying similarities,
Promela will provide some surprises to people used to coding in C,
C++, or Java.
\begin{enumerate}
\item In C, ``\co{;}'' terminates statements. In Promela it separates them.
Fortunately, more recent versions of Spin have become
much more forgiving of ``extra'' semicolons.
\item Promela's looping construct, the \co{do} statement, takes
conditions.
This \co{do} statement closely resembles a looping if-then-else
statement.
\item In C's \co{switch} statement, if there is no matching case, the whole
statement is skipped. In Promela's equivalent, confusingly called
\co{if}, if there is no matching guard expression, you get an error
without a recognizable corresponding error message.
So, if the error output indicates an innocent line of code,
check to see if you left out a condition from an \co{if} or \co{do}
statement.
\item When creating stress tests in C, one usually races suspect operations
against each other repeatedly. In Promela, one instead sets up
a single race, because Promela will search out all the possible
outcomes from that single race. Sometimes you do need to loop
in Promela, for example, if multiple operations overlap, but
doing so greatly increases the size of your state space.
\item In C, the easiest thing to do is to maintain a loop counter to track
progress and terminate the loop. In Promela, loop counters
must be avoided like the plague because they cause the state
space to explode. On the other hand, there is no penalty for
infinite loops in Promela as long as none of the variables
monotonically increase or decrease---Promela will figure out
how many passes through the loop really matter, and automatically
prune execution beyond that point.
\item In C torture-test code, it is often wise to keep per-task control
variables. They are cheap to read, and greatly aid in debugging the
test code. In Promela, per-task control variables should be used
only when there is no other alternative. To see this, consider
a 5-task verification with one bit each to indicate completion.
This gives 32 states. In contrast, a simple counter would have
only six states, more than a five-fold reduction. That factor
of five might not seem like a problem, at least not until you
are struggling with a verification program possessing more than
150 million states consuming more than 10GB of memory!
\item One of the most challenging things both in C torture-test code and
in Promela is formulating good assertions. Promela also allows
\co{never} claims that act sort of like an assertion replicated
between every line of code.
\item Dividing and conquering is extremely helpful in Promela in keeping
the state space under control. Splitting a large model into two
roughly equal halves will result in the state space of each
half being roughly the square root of the whole.
For example, a million-state combined model might reduce to a
pair of thousand-state models.
Not only will Promela handle the two smaller models much more
quickly with much less memory, but the two smaller algorithms
are easier for people to understand.
\end{enumerate}
\subsubsection{Promela Coding Tricks}
\label{sec:formal:Promela Coding Tricks}
Promela was designed to analyze protocols, so using it on parallel programs
is a bit abusive.
The following tricks can help you to abuse Promela safely:
\begin{enumerate}
\item Memory reordering. Suppose you have a pair of statements
copying globals x and y to locals r1 and r2, where ordering
matters (e.g., unprotected by locks), but where you have
no memory barriers. This can be modeled in Promela as follows:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
1 if
2 :: 1 -> r1 = x;
3 r2 = y
4 :: 1 -> r2 = y;
5 r1 = x
6 fi
\end{verbatim}
\end{minipage}
\vspace{5pt}
The two branches of the \co{if} statement will be selected
nondeterministically, since they both are available.
Because the full state space is searched, \emph{both} choices
will eventually be made in all cases.
Of course, this trick will cause your state space to explode
if used too heavily.
In addition, it requires you to anticipate possible reorderings.
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 i = 0;
2 sum = 0;
3 do
4 :: i < N_QRCU_READERS ->
5 sum = sum + (readerstart[i] == 1 &&
6 readerprogress[i] == 1);
7 i++
8 :: i >= N_QRCU_READERS ->
9 assert(sum == 0);
10 break
11 od
\end{verbbox}
}
\centering
\theverbbox
\caption{Complex Promela Assertion}
\label{fig:analysis:Complex Promela Assertion}
\end{figure}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 atomic {
2 i = 0;
3 sum = 0;
4 do
5 :: i < N_QRCU_READERS ->
6 sum = sum + (readerstart[i] == 1 &&
7 readerprogress[i] == 1);
8 i++
9 :: i >= N_QRCU_READERS ->
10 assert(sum == 0);
11 break
12 od
13 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Atomic Block for Complex Promela Assertion}
\label{fig:analysis:Atomic Block for Complex Promela Assertion}
\end{figure}
\item State reduction. If you have complex assertions, evaluate
them under \co{atomic}. After all, they are not part of the
algorithm. One example of a complex assertion (to be discussed
in more detail later) is as shown in
Figure~\ref{fig:analysis:Complex Promela Assertion}.
There is no reason to evaluate this assertion
non-atomically, since it is not actually part of the algorithm.
Because each statement contributes to state, we can reduce
the number of useless states by enclosing it in an \co{atomic}
block as shown in
Figure~\ref{fig:analysis:Atomic Block for Complex Promela Assertion}.
\item Promela does not provide functions.
You must instead use C preprocessor macros.
However, you must use them carefully in order to avoid
combinatorial explosion.
\end{enumerate}
Now we are ready for more complex examples.
\subsection{Promela Example: Locking}
\label{sec:formal:Promela Example: Locking}
\begin{figure}[tbp]
{ \scriptsize
\begin{verbbox}
1 #define spin_lock(mutex) \
2 do \
3 :: 1 -> atomic { \
4 if \
5 :: mutex == 0 -> \
6 mutex = 1; \
7 break \
8 :: else -> skip \
9 fi \
10 } \
11 od
12
13 #define spin_unlock(mutex) \
14 mutex = 0
\end{verbbox}
}
\centering
\theverbbox
\caption{Promela Code for Spinlock}
\label{fig:analysis:Promela Code for Spinlock}
\end{figure}
Since locks are generally useful, \co{spin_lock()} and
\co{spin_unlock()}
macros are provided in \path{lock.h}, which may be included from
multiple Promela models, as shown in
Figure~\ref{fig:analysis:Promela Code for Spinlock}.
The \co{spin_lock()} macro contains an infinite do-od loop
spanning lines~2-11,
courtesy of the single guard expression of ``1'' on line~3.
The body of this loop is a single atomic block that contains
an if-fi statement.
The if-fi construct is similar to the do-od construct, except
that it takes a single pass rather than looping.
If the lock is not held on line~5, then line~6 acquires it and
line~7 breaks out of the enclosing do-od loop (and also exits
the atomic block).
On the other hand, if the lock is already held on line~8,
we do nothing (\co{skip}), and fall out of the if-fi and the
atomic block so as to take another pass through the outer
loop, repeating until the lock is available.
The \co{spin_unlock()} macro simply marks the lock as no
longer held.
Note that memory barriers are not needed because Promela assumes
full ordering.
In any given Promela state, all processes agree on both the current
state and the order of state changes that caused us to arrive at
the current state.
This is analogous to the ``sequentially consistent'' memory model
used by a few computer systems (such as MIPS and PA-RISC).
As noted earlier, and as will be seen in a later example,
weak memory ordering must be explicitly coded.
\begin{figure}[tb]
{ \scriptsize
\begin{verbbox}
1 #include "lock.h"
2
3 #define N_LOCKERS 3
4
5 bit mutex = 0;
6 bit havelock[N_LOCKERS];
7 int sum;
8
9 proctype locker(byte me)
10 {
11 do
12 :: 1 ->
13 spin_lock(mutex);
14 havelock[me] = 1;
15 havelock[me] = 0;
16 spin_unlock(mutex)
17 od
18 }
19
20 init {
21 int i = 0;
22 int j;
23
24 end: do
25 :: i < N_LOCKERS ->
26 havelock[i] = 0;
27 run locker(i);
28 i++
29 :: i >= N_LOCKERS ->
30 sum = 0;
31 j = 0;
32 atomic {
33 do
34 :: j < N_LOCKERS ->
35 sum = sum + havelock[j];
36 j = j + 1
37 :: j >= N_LOCKERS ->
38 break
39 od
40 }
41 assert(sum <= 1);
42 break
43 od
44 }
\end{verbbox}
}
\centering
\theverbbox
\caption{Promela Code to Test Spinlocks}
\label{fig:analysis:Promela Code to Test Spinlocks}
\end{figure}
These macros are tested by the Promela code shown in
Figure~\ref{fig:analysis:Promela Code to Test Spinlocks}.
This code is similar to that used to test the increments,
with the number of locking processes defined by the \co{N_LOCKERS}
macro definition on line~3.
The mutex itself is defined on line~5, an array to track the lock owner
on line~6, and line~7 is used by assertion
code to verify that only one process holds the lock.
The locker process is on lines~9-18, and simply loops forever
acquiring the lock on line~13, claiming it on line~14,
unclaiming it on line~15, and releasing it on line~16.
The init block on lines~20-44 initializes the current locker's
havelock array entry on line~26, starts the current locker on
line~27, and advances to the next locker on line~28.
Once all locker processes are spawned, the do-od loop
moves to line~29, which checks the assertion.
Lines~30 and~31 initialize the control variables,
lines~32-40 atomically sum the havelock array entries,
line~41 is the assertion, and line~42 exits the loop.
We can run this model by placing the above two code fragments into
files named \path{lock.h} and \path{lock.spin}, respectively, and then running
the following commands:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin -a lock.spin
cc -DSAFETY -o pan pan.c
./pan
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
(Spin Version 4.2.5 -- 2 April 2005)
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 40 byte, depth reached 357, errors: 0
564 states, stored
929 states, matched
1493 transitions (= stored+matched)
368 atomic steps
hash conflicts: 0 (resolved)
2.622 memory usage (Mbyte)
unreached in proctype locker
line 18, state 20, "-end-"
(1 of 20 states)
unreached in proctype :init:
(0 of 22 states)
\end{verbbox}
}
\centering
\theverbbox
\caption{Output for Spinlock Test}
\label{fig:analysis:Output for Spinlock Test}
\end{figure}
The output will look something like that shown in
Figure~\ref{fig:analysis:Output for Spinlock Test}.
As expected, this run has no assertion failures (``errors: 0'').
\QuickQuiz{}
Why is there an unreached statement in
locker? After all, isn't this a \emph{full} state-space
search?
\QuickQuizAnswer{
The locker process is an infinite loop, so control
never reaches the end of this process.
However, since there are no monotonically increasing variables,
Promela is able to model this infinite loop with a small
number of states.
} \QuickQuizEnd
\QuickQuiz{}
What are some Promela code-style issues with this example?
\QuickQuizAnswer{
There are several:
\begin{enumerate}
\item The declaration of {\tt sum} should be moved to within
the init block, since it is not used anywhere else.
\item The assertion code should be moved outside of the
initialization loop. The initialization loop can
then be placed in an atomic block, greatly reducing
the state space (by how much?).
\item The atomic block covering the assertion code should
be extended to include the initialization of {\tt sum}
and {\tt j}, and also to cover the assertion.
This also reduces the state space (again, by how
much?).
\end{enumerate}
} \QuickQuizEnd
\subsection{Promela Example: QRCU}
\label{sec:formal:Promela Example: QRCU}
This final example demonstrates a real-world use of Promela on Oleg
Nesterov's
QRCU~\cite{OlegNesterov2006QRCU,OlegNesterov2006aQRCU},
but modified to speed up the \co{synchronize_qrcu()}
fastpath.
But first, what is QRCU?
QRCU is a variant of SRCU~\cite{PaulEMcKenney2006c}
that trades somewhat higher read overhead
(atomic increment and decrement on a global variable) for extremely
low grace-period latencies.
If there are no readers, the grace period will be detected in less
than a microsecond, compared to the multi-millisecond grace-period
latencies of most other RCU implementations.
\begin{enumerate}
\item There is a \co{qrcu_struct} that defines a QRCU domain.
Like SRCU (and unlike other variants of RCU) QRCU's action
is not global, but instead focused on the specified
\co{qrcu_struct}.
\item There are \co{qrcu_read_lock()} and \co{qrcu_read_unlock()}
primitives that delimit QRCU read-side critical sections.
The corresponding \co{qrcu_struct} must be passed into
these primitives, and the return value from \co{rcu_read_lock()}
must be passed to \co{rcu_read_unlock()}.
For example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
idx = qrcu_read_lock(&my_qrcu_struct);
/* read-side critical section. */
qrcu_read_unlock(&my_qrcu_struct, idx);
\end{verbatim}
\end{minipage}
\vspace{5pt}
\item There is a \co{synchronize_qrcu()} primitive that blocks until
all pre-existing QRCU read-side critical sections complete,
but, like SRCU's \co{synchronize_srcu()}, QRCU's
\co{synchronize_qrcu()} need wait only for those read-side
critical sections that are using the same \co{qrcu_struct}.
For example, \co{synchronize_qrcu(&your_qrcu_struct)}
would \emph{not} need to wait on the earlier QRCU read-side
critical section.
In contrast, \co{synchronize_qrcu(&my_qrcu_struct)}
\emph{would} need to wait, since it shares the same
\co{qrcu_struct}.
\end{enumerate}
A Linux-kernel patch for QRCU has been
produced~\cite{PaulMcKenney2007QRCUpatch},
but has not yet been included in the Linux kernel as of
April 2008.
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 #include "lock.h"
2
3 #define N_QRCU_READERS 2
4 #define N_QRCU_UPDATERS 2
5
6 bit idx = 0;
7 byte ctr[2];
8 byte readerprogress[N_QRCU_READERS];
9 bit mutex = 0;
\end{verbbox}
}
\centering
\theverbbox
\caption{QRCU Global Variables}
\label{fig:analysis:QRCU Global Variables}
\end{figure}
Returning to the Promela code for QRCU, the global variables are as shown in
Figure~\ref{fig:analysis:QRCU Global Variables}.
This example uses locking, hence including \path{lock.h}.
Both the number of readers and writers can be varied using the
two \co{#define} statements, giving us not one but two ways to create
combinatorial explosion.
The \co{idx} variable controls which of the two elements of the \co{ctr}
array will be used by readers, and the \co{readerprogress} variable
allows an assertion to determine when all the readers are finished
(since a QRCU update cannot be permitted to complete until all
pre-existing readers have completed their QRCU read-side critical
sections).
The \co{readerprogress} array elements have values as follows,
indicating the state of the corresponding reader:
\begin{enumerate}
\item 0: not yet started.
\item 1: within QRCU read-side critical section.
\item 2: finished with QRCU read-side critical section.
\end{enumerate}
Finally, the \co{mutex} variable is used to serialize updaters' slowpaths.
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 proctype qrcu_reader(byte me)
2 {
3 int myidx;
4
5 do
6 :: 1 ->
7 myidx = idx;
8 atomic {
9 if
10 :: ctr[myidx] > 0 ->
11 ctr[myidx]++;
12 break
13 :: else -> skip
14 fi
15 }
16 od;
17 readerprogress[me] = 1;
18 readerprogress[me] = 2;
19 atomic { ctr[myidx]-- }
20 }
\end{verbbox}
}
\centering
\theverbbox
\caption{QRCU Reader Process}
\label{fig:analysis:QRCU Reader Process}
\end{figure}
QRCU readers are modeled by the \co{qrcu_reader()} process shown in
Figure~\ref{fig:analysis:QRCU Reader Process}.
A do-od loop spans lines~5-16, with a single guard of ``1''
on line~6 that makes it an infinite loop.
Line~7 captures the current value of the global index, and lines~8-15
atomically increment it (and break from the infinite loop)
if its value was non-zero (\co{atomic_inc_not_zero()}).
Line~17 marks entry into the RCU read-side critical section, and
line~18 marks exit from this critical section, both lines for the benefit of
the {\tt assert()} statement that we shall encounter later.
Line~19 atomically decrements the same counter that we incremented,
thereby exiting the RCU read-side critical section.
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 #define sum_unordered \
2 atomic { \
3 do \
4 :: 1 -> \
5 sum = ctr[0]; \
6 i = 1; \
7 break \
8 :: 1 -> \
9 sum = ctr[1]; \
10 i = 0; \
11 break \
12 od; \
13 } \
14 sum = sum + ctr[i]
\end{verbbox}
}
\centering
\theverbbox
\caption{QRCU Unordered Summation}
\label{fig:analysis:QRCU Unordered Summation}
\end{figure}
The C-preprocessor macro shown in
Figure~\ref{fig:analysis:QRCU Unordered Summation}
sums the pair of counters so as to emulate weak memory ordering.
Lines~2-13 fetch one of the counters, and line~14 fetches the other
of the pair and sums them.
The atomic block consists of a single do-od statement.
This do-od statement (spanning lines~3-12) is unusual in that
it contains two unconditional
branches with guards on lines~4 and~8, which causes Promela to
non-deterministically choose one of the two (but again, the full
state-space search causes Promela to eventually make all possible
choices in each applicable situation).
The first branch fetches the zero-th counter and sets \co{i} to 1 (so
that line~14 will fetch the first counter), while the second
branch does the opposite, fetching the first counter and setting \co{i}
to 0 (so that line~14 will fetch the second counter).
\QuickQuiz{}
Is there a more straightforward way to code the do-od statement?
\QuickQuizAnswer{
Yes.
Replace it with {\tt if-fi} and remove the two {\tt break} statements.
} \QuickQuizEnd
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 proctype qrcu_updater(byte me)
2 {
3 int i;
4 byte readerstart[N_QRCU_READERS];
5 int sum;
6
7 do
8 :: 1 ->
9
10 /* Snapshot reader state. */
11
12 atomic {
13 i = 0;
14 do
15 :: i < N_QRCU_READERS ->
16 readerstart[i] = readerprogress[i];
17 i++
18 :: i >= N_QRCU_READERS ->
19 break
20 od
21 }
22
23 sum_unordered;
24 if
25 :: sum <= 1 -> sum_unordered
26 :: else -> skip
27 fi;
28 if
29 :: sum > 1 ->
30 spin_lock(mutex);
31 atomic { ctr[!idx]++ }
32 idx = !idx;
33 atomic { ctr[!idx]-- }
34 do
35 :: ctr[!idx] > 0 -> skip
36 :: ctr[!idx] == 0 -> break
37 od;
38 spin_unlock(mutex);
39 :: else -> skip
40 fi;
41
42 /* Verify reader progress. */
43
44 atomic {
45 i = 0;
46 sum = 0;
47 do
48 :: i < N_QRCU_READERS ->
49 sum = sum + (readerstart[i] == 1 &&
50 readerprogress[i] == 1);
51 i++
52 :: i >= N_QRCU_READERS ->
53 assert(sum == 0);
54 break
55 od
56 }
57 od
58 }
\end{verbbox}
}
\centering
\theverbbox
\caption{QRCU Updater Process}
\label{fig:analysis:QRCU Updater Process}
\end{figure}
With the \co{sum_unordered} macro in place, we can now proceed
to the update-side process shown in
Figure~\ref{fig:analysis:QRCU Updater Process}.
The update-side process repeats indefinitely, with the corresponding
do-od loop ranging over lines~7-57.
Each pass through the loop first snapshots the global {\tt readerprogress}
array into the local {\tt readerstart} array on lines~12-21.
This snapshot will be used for the assertion on line~53.
Line~23 invokes \co{sum_unordered}, and then lines~24-27
re-invoke \co{sum_unordered} if the fastpath is potentially
usable.
Lines~28-40 execute the slowpath code if need be, with
lines~30 and~38 acquiring and releasing the update-side lock,
lines~31-33 flipping the index, and lines~34-37 waiting for
all pre-existing readers to complete.
Lines~44-56 then compare the current values in the {\tt readerprogress}
array to those collected in the {\tt readerstart} array,
forcing an assertion failure should any readers that started before
this update still be in progress.
\QuickQuiz{}
Why are there atomic blocks at lines~12-21
and lines~44-56, when the operations within those atomic
blocks have no atomic implementation on any current
production microprocessor?
\QuickQuizAnswer{
Because those operations are for the benefit of the
assertion only. They are not part of the algorithm itself.
There is therefore no harm in marking them atomic, and
so marking them greatly reduces the state space that must
be searched by the Promela model.
} \QuickQuizEnd
\QuickQuiz{}
Is the re-summing of the counters on lines~24-27
\emph{really} necessary?
\QuickQuizAnswer{
Yes. To see this, delete these lines and run the model.
Alternatively, consider the following sequence of steps:
\begin{enumerate}
\item One process is within its RCU read-side critical
section, so that the value of {\tt ctr[0]} is zero and
the value of {\tt ctr[1]} is two.
\item An updater starts executing, and sees that the sum of
the counters is two so that the fastpath cannot be
executed. It therefore acquires the lock.
\item A second updater starts executing, and fetches the value
of {\tt ctr[0]}, which is zero.
\item The first updater adds one to {\tt ctr[0]}, flips
the index (which now becomes zero), then subtracts
one from {\tt ctr[1]} (which now becomes one).
\item The second updater fetches the value of {\tt ctr[1]},
which is now one.
\item The second updater now incorrectly concludes that it
is safe to proceed on the fastpath, despite the fact
that the original reader has not yet completed.
\end{enumerate}
} \QuickQuizEnd
\begin{figure}[htbp]
{ \scriptsize
\begin{verbbox}
1 init {
2 int i;
3
4 atomic {
5 ctr[idx] = 1;
6 ctr[!idx] = 0;
7 i = 0;
8 do
9 :: i < N_QRCU_READERS ->
10 readerprogress[i] = 0;
11 run qrcu_reader(i);
12 i++
13 :: i >= N_QRCU_READERS -> break
14 od;
15 i = 0;
16 do
17 :: i < N_QRCU_UPDATERS ->
18 run qrcu_updater(i);
19 i++
20 :: i >= N_QRCU_UPDATERS -> break
21 od
22 }
23 }
\end{verbbox}
}
\centering
\theverbbox
\caption{QRCU Initialization Process}
\label{fig:analysis:QRCU Initialization Process}
\end{figure}
All that remains is the initialization block shown in
Figure~\ref{fig:analysis:QRCU Initialization Process}.
This block simply initializes the counter pair on lines~5-6,
spawns the reader processes on lines~7-14, and spawns the updater
processes on lines~15-21.
This is all done within an atomic block to reduce state space.
\subsubsection{Running the QRCU Example}
\label{sec:formal:Running the QRCU Example}
To run the QRCU example, combine the code fragments in the previous
section into a single file named \path{qrcu.spin}, and place the definitions
for \co{spin_lock()} and \co{spin_unlock()} into a file named
\path{lock.h}.
Then use the following commands to build and run the QRCU model:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
spin -a qrcu.spin
cc -DSAFETY -o pan pan.c
./pan
\end{verbatim}
\end{minipage}
\vspace{5pt}
\begin{table}
\footnotesize
\centering
\begin{tabular}{c|r|r|r}
updaters &
readers &
\# states & MB \\
\hline
1 & 1 & 376 & 2.6 \\
\hline
1 & 2 & 6,177 & 2.9 \\
\hline
1 & 3 & 82,127 & 7.5 \\
\hline
2 & 1 & 29,399 & 4.5 \\
\hline
2 & 2 & 1,071,180 & 75.4 \\
\hline
2 & 3 & 33,866,700 & 2,715.2 \\
\hline
3 & 1 & 258,605 & 22.3 \\
\hline
3 & 2 & 169,533,000 & 14,979.9 \\
\end{tabular}
\caption{Memory Usage of QRCU Model}
\label{tab:advsync:Memory Usage of QRCU Model}
\end{table}
The resulting output shows that this model passes all of the cases
shown in
Table~\ref{tab:advsync:Memory Usage of QRCU Model}.
Now, it would be nice to run the case with three readers and three
updaters, however, simple extrapolation indicates that this will
require on the order of a terabyte of memory best case.
So, what to do?
Here are some possible approaches:
\begin{enumerate}
\item See whether a smaller number of readers and updaters suffice
to prove the general case.
\item Manually construct a proof of correctness.
\item Use a more capable tool.
\item Divide and conquer.
\end{enumerate}
The following sections discuss each of these approaches.
\subsubsection{How Many Readers and Updaters Are Really Needed?}
\label{sec:formal:How Many Readers and Updaters Are Really Needed?}
One approach is to look carefully at the Promela code for
\co{qrcu_updater()} and notice that the only global state
change is happening under the lock.
Therefore, only one updater at a time can possibly be modifying
state visible to either readers or other updaters.
This means that any sequences of state changes can be carried
out serially by a single updater due to the fact that Promela does a full
state-space search.
Therefore, at most two updaters are required: one to change state
and a second to become confused.
The situation with the readers is less clear-cut, as each reader
does only a single read-side critical section then terminates.
It is possible to argue that the useful number of readers is limited,
due to the fact that the fastpath must see at most a zero and a one
in the counters.
This is a fruitful avenue of investigation, in fact, it leads to
the full proof of correctness described in the next section.
\subsubsection{Alternative Approach: Proof of Correctness}
\label{sec:formal:Alternative Approach: Proof of Correctness}
An informal proof~\cite{PaulMcKenney2007QRCUpatch}
follows:
\begin{enumerate}
\item For \co{synchronize_qrcu()} to exit too early, then
by definition there must have been at least one reader
present during \co{synchronize_qrcu()}'s full
execution.
\item The counter corresponding to this reader will have been
at least 1 during this time interval.
\item The \co{synchronize_qrcu()} code forces at least one
of the counters to be at least 1 at all times.
\item Therefore, at any given point in time, either one of the
counters will be at least 2, or both of the counters will
be at least one.
\item However, the \co{synchronize_qrcu()} fastpath code
can read only one of the counters at a given time.
It is therefore possible for the fastpath code to fetch
the first counter while zero, but to race with a counter
flip so that the second counter is seen as one.
\item There can be at most one reader persisting through such
a race condition, as otherwise the sum would be two or
greater, which would cause the updater to take the slowpath.
\item But if the race occurs on the fastpath's first read of the
counters, and then again on its second read, there have
to have been two counter flips.
\item Because a given updater flips the counter only once, and
because the update-side lock prevents a pair of updaters
from concurrently flipping the counters, the only way that
the fastpath code can race with a flip twice is if the
first updater completes.
\item But the first updater will not complete until after all
pre-existing readers have completed.
\item Therefore, if the fastpath races with a counter flip
twice in succession, all pre-existing readers must have
completed, so that it is safe to take the fastpath.
\end{enumerate}
Of course, not all parallel algorithms have such simple proofs.
In such cases, it may be necessary to enlist more capable tools.
\subsubsection{Alternative Approach: More Capable Tools}
\label{sec:formal:Alternative Approach: More Capable Tools}
Although Promela and Spin are quite useful,
much more capable tools are available, particularly for verifying
hardware.
This means that if it is possible to translate your algorithm
to the hardware-design VHDL language, as it often will be for
low-level parallel algorithms, then it is possible to apply these
tools to your code (for example, this was done for the first
realtime RCU algorithm).
However, such tools can be quite expensive.
Although the advent of commodity multiprocessing
might eventually result in powerful free-software model-checkers
featuring fancy state-space-reduction capabilities,
this does not help much in the here and now.
As an aside, there are Spin features that support approximate searches
that require fixed amounts of memory, however, I have never been able
to bring myself to trust approximations when verifying parallel
algorithms.
Another approach might be to divide and conquer.
\subsubsection{Alternative Approach: Divide and Conquer}
\label{sec:formal:Alternative Approach: Divide and Conquer}
It is often possible to break down a larger parallel algorithm into
smaller pieces, which can then be proven separately.
For example, a 10-billion-state model might be broken into a pair
of 100,000-state models.
Taking this approach not only makes it easier for tools such as
Promela to verify your algorithms, it can also make your algorithms
easier to understand.
\subsubsection{Is QRCU Really Correct?}
\label{sec:formal:Is QRCU Really Correct?}
Is QRCU really correct?
We have a Promela-based mechanical proof and a by-hand proof that both
say that it is.
However, a recent paper by Alglave et al.~\cite{JadeAlglave2013-cav}
says otherwise (see Section~5.1 of the paper at the bottom of page~12).
Which is it?
It turns out that both are correct!
When QRCU was added to a suite of formal-verification benchmarks,
its memory barriers were omitted, thus resulting in a buggy version
of QRCU.
So the real news here is that a number of formal-verification tools
incorrectly proved this buggy QRCU correct.
And this is why formal-verification tools themselves should be tested
using bug-injected versions of the code being verified.
If a given tool cannot find the injected bugs, then that tool is
clearly untrustworthy.
\QuickQuiz{}
But different formal-verification tools are often designed to
locate particular classes of bugs.
For example, very few formal-verification tools will find
an error in the specification.
So isn't this ``clearly untrustworthy'' judgment a bit harsh?
\QuickQuizAnswer{
It is certainly true that many formal-verification tools are
specialized in some way.
For example, Promela does not handle realistic memory models
(though they can be programmed into
Promela~\cite{Desnoyers:2013:MSM:2506164.2506174}),
CBMC~\cite{EdmundClarke2004CBMC} does not detect probabilistic
hangs and deadlocks, and
Nidhugg~\cite{CarlLeonardsson2014Nidhugg} does not detect
bugs involving data nondeterminism.
But this means that that these tools cannot be trusted to find
bugs that they are not designed to locate.
And therefore people creating formal-verification tools should
``tell the truth on the label'', clearly calling out what
classes of bugs their tools can and cannot detect.
Otherwise, the first time a practitioner finds a tool
failing to detect a bug, that practitioner is likely to
make extremely harsh and extremely public denunciations
of that tool.
Yes, yes, there is something to be said for putting your
best foot forward, but putting it too far forward without
appropriate disclaimers can easily trigger a land mine of
negative reaction that your tool might or might not be able
to recover from.
You have been warned!
} \QuickQuizEnd
Therefore, if you do intend to use QRCU, please take care.
Its proofs of correctness might or might not themselves be correct.
Which is one reason why formal verification is unlikely to
completely replace testing, as Donald Knuth pointed out so long ago.
\QuickQuiz{}
Given that we have two independent proofs of correctness for
the QRCU algorithm described herein, and given that the
proof of incorrectness covers what is known to be a different
algorithm, why is there any room for doubt?
\QuickQuizAnswer{
There is always room for doubt.
In this case, it is important to keep in mind that the two proofs
of correctness preceded the formalization of real-world memory
models, raising the possibility that these two proofs are based
on incorrect memory-ordering assumptions.
Furthermore, since both proofs were constructed by the same person,
it is quite possible that they contain a common error.
Again, there is always room for doubt.
} \QuickQuizEnd