% future/tm.tex
\section{Transactional Memory}
\label{sec:future:Transactional Memory}
The idea of using transactions outside of databases goes back many
decades~\cite{DBLomet1977SIGSOFT}, with the key difference between
database and non-database transactions being that non-database transactions
drop the ``D'' in the ``ACID'' properties defining database transactions.
The idea of supporting memory-based transactions, or ``transactional memory''
(TM), in hardware
is more recent~\cite{Herlihy93a}, but unfortunately, support for such
transactions in commodity hardware was not immediately forthcoming,
despite other somewhat similar proposals being put forward~\cite{JMStone93}.
Not long after, Shavit and Touitou proposed a software-only implementation
of transactional memory (STM) that was capable of running on commodity
hardware, give or take memory-ordering issues.
This proposal languished for many years, perhaps due to the fact that
the research community's attention was absorbed by non-blocking
synchronization (see Section~\ref{sec:advsync:Non-Blocking Synchronization}).
But by the turn of the century, TM started receiving
more attention~\cite{Martinez01a,Rajwar01a}, and by the middle of the
decade, the level of interest could only be termed
``incandescent''~\cite{MauriceHerlihy2005-TM-manifesto.pldi,
DanGrossman2007TMGCAnalogy}, despite a few voices of
caution~\cite{Blundell2005DebunkTM,McKenney2007PLOSTM}.
The basic idea behind TM is to execute a section of
code atomically, so that other threads see no intermediate state.
As such, the semantics of TM could be implemented
by simply replacing each transaction with the acquisition and release
of a single recursively acquirable global lock, albeit with abysmal
performance and scalability.
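For example, here is a minimal sketch of this global-lock implementation
of transactional semantics, assuming a hypothetical \co{trans_init()},
\co{begin_trans()}, and \co{end_trans()} API (the \co{begin_trans()} and
\co{end_trans()} names reappear in an example later in this section):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 pthread_mutex_t trans_lock;
 2
 3 void trans_init(void)  /* run once */
 4 {
 5   pthread_mutexattr_t attr;
 6
 7   pthread_mutexattr_init(&attr);
 8   pthread_mutexattr_settype(&attr,
 9       PTHREAD_MUTEX_RECURSIVE);
10   pthread_mutex_init(&trans_lock, &attr);
11 }
12
13 void begin_trans(void)
14 {
15   pthread_mutex_lock(&trans_lock);
16 }
17
18 void end_trans(void)
19 {
20   pthread_mutex_unlock(&trans_lock);
21 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because only one ``transaction'' can be in flight at a time, other
threads see no intermediate state, which trivially provides atomicity,
but at the cost of fully serialized execution.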
Much of the complexity inherent in TM implementations, whether hardware
or software, lies in efficiently detecting when concurrent transactions
can safely run in parallel.
Because this detection is done dynamically, conflicting transactions
can be aborted or ``rolled back'', and in some implementations, this
failure mode is visible to the programmer.
Because transaction roll-back is increasingly unlikely as transaction
size decreases, TM might become quite attractive for small memory-based
operations,
such as linked-list manipulations used for stacks, queues, hash tables,
and search trees.
However, it is currently much more difficult to make the case for large
transactions, particularly those containing non-memory operations such
as I/O and process creation.
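For example, a sketch of one such small memory-only transaction, written
using GCC's \co{-fgnu-tm} extension (the \co{__transaction_atomic}
keyword), might push onto and pop from a linked stack as follows; the
\co{node} structure and \co{top} pointer are illustrative assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct node {
 2   struct node *next;
 3   int val;
 4 };
 5 struct node *top;
 6
 7 void push(struct node *p)
 8 {
 9   __transaction_atomic {
10     p->next = top;
11     top = p;
12   }
13 }
14
15 struct node *pop(void)
16 {
17   struct node *p;
18
19   __transaction_atomic {
20     p = top;
21     if (p)
22       top = p->next;
23   }
24   return p;
25 }
\end{verbatim}
\end{minipage}
\vspace{5pt}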
The following sections look at current challenges to the grand vision of
``Transactional Memory Everywhere''~\cite{PaulEMcKenney2009TMeverywhere}.
Section~\ref{sec:future:Outside World} examines the challenges faced
when interacting with the outside world,
Section~\ref{sec:future:Process Modification} looks at interactions
with process modification primitives,
Section~\ref{sec:future:Synchronization} explores interactions with
other synchronization primitives, and finally
Section~\ref{sec:future:Discussion} closes with some discussion.
\subsection{Outside World}
\label{sec:future:Outside World}
In the words of Donald Knuth:
\begin{quote}
Many computer users feel that input and output are not actually part
of ``real programming,'' they are merely things that (unfortunately)
must be done in order to get information in and out of the machine.
\end{quote}
Whether we believe that input and output are ``real programming'' or
not, the fact is that for most computer systems, interaction with the
outside world is a first-class requirement.
This section therefore critiques transactional memory's ability to
so interact, whether via I/O operations, time delays, or persistent
storage.
\subsubsection{I/O Operations}
\label{sec:future:I/O Operations}
One can execute I/O operations within a lock-based critical section,
and, at least in principle, from within an RCU read-side critical section.
What happens when you attempt to execute an I/O operation from within
a transaction?
The underlying problem is that transactions may be rolled back, for
example, due to conflicts.
Roughly speaking, this requires that all operations within any given
transaction be revocable, and, because an aborted transaction is
normally retried, that executing an operation twice have the same
effect as executing it once.
Unfortunately, I/O is in general the prototypical irrevocable
operation, making it difficult to include general I/O operations in
transactions:
Once you have pushed the button launching the nuclear warheads, there
is no turning back.
Here are some options for handling of I/O within transactions:
\begin{enumerate}
\item Restrict I/O within transactions to buffered I/O with in-memory
buffers.
These buffers may then be included in the transaction in the
same way that any other memory location might be included.
This seems to be the mechanism of choice, and it does work
well in many common situations, such as stream I/O and
mass-storage I/O (a sketch of this approach follows this list).
However, special handling is required in cases where multiple
record-oriented output streams are merged onto a single file
from multiple processes, as might be done using the ``a+''
option to \co{fopen()} or the \co{O_APPEND} flag to \co{open()}.
In addition, as will be seen in the next section, common
networking operations cannot be handled via buffering.
\item Prohibit I/O within transactions, so that any attempt to execute
an I/O operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
This approach seems to be the conventional TM approach for
unbuffered I/O, but requires that TM interoperate with other
synchronization primitives that do tolerate I/O.
\item Prohibit I/O within transactions, but enlist the compiler's aid
in enforcing this prohibition.
\item Permit only one special
\emph{irrevocable} transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed
at any given time, thus allowing irrevocable transactions to
contain I/O operations.\footnote{
In earlier literature, irrevocable transactions are
termed \emph{inevitable} transactions.}
This works in general, but severely limits the scalability and
performance of I/O operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Worse yet, use of irrevocability to tolerate I/O operations
seems to prohibit use of manual transaction-abort operations.\footnote{
This difficulty was pointed out by Michael Factor.}
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Create new hardware and protocols such that I/O operations can
be pulled into the transactional substrate.
In the case of input operations, the hardware would need to
correctly predict the result of the operation, and to abort the
transaction if the prediction failed.
\end{enumerate}
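As a rough sketch of the first of these options, output might be
accumulated in an in-memory buffer that participates in the transaction,
with the irrevocable \co{write()} deferred until after a successful
commit.
The \co{txbuf} structure is an illustrative assumption, and the
\co{begin_trans()}/\co{end_trans()} notation matches that used elsewhere
in this section:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct txbuf {
 2   size_t len;
 3   char data[4096];
 4 };
 5
 6 void tx_write(struct txbuf *b,
 7               const void *p, size_t n)
 8 {
 9   /* Transactional: memory only.  */
10   /* Assumes n fits in the buffer. */
11   memcpy(&b->data[b->len], p, n);
12   b->len += n;
13 }
14
15 void example(struct txbuf *b, int fd)
16 {
17   begin_trans();
18   tx_write(b, "hello\n", 6);
19   end_trans();
20   /* Committed: irrevocable I/O is safe. */
21   write(fd, b->data, b->len);
22   b->len = 0;
23 }
\end{verbatim}
\end{minipage}
\vspace{5pt}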
I/O operations are a well-known weakness of TM, and it is not clear
that the problem of supporting I/O in transactions has a reasonable
general solution, at least if ``reasonable'' is to include usable
performance and scalability.
Nevertheless, continued time and attention to this problem will likely
produce additional progress.
\subsubsection{RPC Operations}
\label{sec:future:RPC Operations}
One can execute RPCs within a lock-based critical section, as well as
from within an RCU read-side critical section. What happens when you
attempt to execute an RPC from within a transaction?
If both the RPC request and its response are to be contained within the
transaction, and if some part of the transaction depends on the result
returned by the response, then it is not possible to use the memory-buffer
tricks that can be used in the case of buffered I/O.
Any attempt to
take this buffering approach would deadlock the transaction, as the
request could not be transmitted until the transaction was guaranteed
to succeed, but the transaction's success might not be knowable until
after the response is received, as is the case in the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 begin_trans();
2 rpc_request();
3 i = rpc_response();
4 a[i]++;
5 end_trans();
\end{verbatim}
\end{minipage}
\vspace{5pt}
The transaction's memory footprint cannot be determined until after the
RPC response is received, and until the transaction's memory footprint
can be determined, it is impossible to determine whether the transaction
can be allowed to commit.
The only action consistent with transactional semantics is therefore to
unconditionally abort the transaction, which is, to say the least,
unhelpful.
Here are some options available to TM:
\begin{enumerate}
\item Prohibit RPC within transactions, so that any attempt to execute
an RPC operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
Alternatively, enlist the compiler to enforce RPC-free
transactions.
This approach does work, but will require TM to
interact with other synchronization primitives.
\item Permit only one special
irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed at any given time, thus allowing irrevocable
transactions to contain RPC operations.
This works in general, but severely limits the scalability and
performance of RPC operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Furthermore, use of irrevocable transactions to permit RPC
operations rules out manual transaction-abort operations
once the RPC operation has started.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Identify special cases where the success of the transaction may
be determined before the RPC response is received, and
automatically convert these to irrevocable transactions immediately
before sending the RPC request.
Of course, if several concurrent transactions attempt RPC calls
in this manner, it might be necessary to roll all but one of them
back, with consequent degradation of performance and scalability.
This approach nevertheless might be valuable given long-running
transactions ending with an RPC.
This approach still has problems with manual transaction-abort
operations.
\item Identify special cases where the RPC response may be moved out
of the transaction, and then proceed using techniques similar
to those used for buffered I/O.
\item Extend the transactional substrate to include the RPC server as
well as its client.
This is in theory possible, as has been demonstrated by
distributed databases.
However, it is unclear whether the requisite performance and
scalability requirements can be met by distributed-database
techniques, given that memory-based TM cannot hide such latencies
behind those of slow disk drives.
Of course, given the advent of solid-state disks, it is also unclear
how much longer databases will be permitted to hide their latencies
behind those of disk drives.
\end{enumerate}
As noted in the prior section, I/O is a known weakness of TM, and RPC
is simply an especially problematic case of I/O.
\subsubsection{Time Delays}
\label{sec:future:Time Delays}
An important special case of interaction with extra-transactional accesses
involves explicit time delays within a transaction.
Of course, the idea of a time delay within a transaction flies in the
face of TM's atomicity property, but one can argue that this sort of
thing is what weak atomicity is all about.
Furthermore, correct interaction with memory-mapped I/O sometimes requires
carefully controlled timing, and applications often use time delays
for varied purposes.
So, what can TM do about time delays within transactions?
\begin{enumerate}
\item Ignore time delays within transactions.
This has an appearance of elegance, but like too many other
``elegant'' solutions, fails to survive first contact with
legacy code.
Such code, which might well have important time delays in critical
sections, would fail upon being transactionalized.
\item Abort transactions upon encountering a time-delay operation.
This is attractive, but it is unfortunately not always possible
to automatically detect a time-delay operation.
Is that tight loop computing something important, or is it
instead waiting for time to elapse?
\item Enlist the compiler to prohibit time delays within transactions.
\item Let the time delays execute normally.
Unfortunately, some TM implementations publish modifications only
at commit time, which would in many cases defeat the purpose of
the time delay.
\end{enumerate}
It is not clear that there is a single correct answer.
TM implementations featuring weak atomicity that publish changes
immediately within the transaction (rolling these changes back upon abort)
might be reasonably well served by the last alternative.
Even in this case, the code (or possibly even hardware) at the other
end of the transaction may require a substantial redesign to tolerate
aborted transactions.
This need for redesign would make it more difficult to apply transactional
memory to legacy code.
\subsubsection{Persistence}
\label{sec:future:Persistence}
There are many different types of locking primitives.
One interesting distinction is persistence, in other words, whether the
lock can exist independently of the address space of the process using
the lock.
Non-persistent locks include \co{pthread_mutex_lock()},
\co{pthread_rwlock_rdlock()}, and most kernel-level locking primitives.
If the memory locations instantiating a non-persistent lock's data
structures disappear, so does the lock.
For typical use of \co{pthread_mutex_lock()}, this means that when the
process exits, all of its locks vanish.
This property can be exploited in order to trivialize lock cleanup
at program shutdown time, but makes it more difficult for unrelated
applications to share locks, as such sharing requires the applications
to share memory.
Persistent locks help avoid the need to share memory among unrelated
applications.
Persistent locking APIs include the flock family, \co{lockf()}, System
V semaphores, and the \co{O_CREAT} flag to \co{open()}.
These persistent APIs can be used to protect large-scale operations
spanning runs of multiple applications, and, in the case of
\co{O_CREAT}, even surviving operating-system reboot.
If need be, locks can even span multiple computer systems via distributed
lock managers and distributed filesystems---and persist across reboots
of any or all of these computer systems.
Persistent locks can be used by any application, including applications
written using multiple languages and software environments.
In fact, a persistent lock might well be acquired by an application written
in C and released by an application written in Python.
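For reference, here is a minimal sketch of an \co{O_CREAT}-based
persistent lock; the lock-file path is an illustrative assumption, and
any process, written in any language, may remove the file to release
the lock:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 int persistent_trylock(const char *path)
 2 {
 3   int fd;
 4
 5   /* Fails if the file already exists. */
 6   fd = open(path,
 7             O_CREAT | O_EXCL | O_RDWR,
 8             0644);
 9   if (fd < 0)
10     return -1; /* lock already held */
11   close(fd);   /* lock file persists */
12   return 0;
13 }
14
15 void persistent_unlock(const char *path)
16 {
17   unlink(path);
18 }
\end{verbatim}
\end{minipage}
\vspace{5pt}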
How could a similar persistent functionality be provided for TM?
\begin{enumerate}
\item Restrict persistent transactions to special-purpose environments
designed to support them, for example, SQL.
This clearly works, given the decades-long history of database
systems, but does not provide the same degree of flexibility
provided by persistent locks.
\item Use snapshot facilities provided by some storage devices and/or
filesystems.
Unfortunately, this does not handle network communication,
nor does it handle I/O to devices that do not provide snapshot
capabilities, for example, memory sticks.
\item Build a time machine.
\end{enumerate}
Of course, the fact that it is called transactional \emph{memory}
should give us pause, as the name itself conflicts with the concept of
a persistent transaction.
It is nevertheless worthwhile to consider this possibility as an important
test case probing the inherent limitations of transactional memory.
\subsection{Process Modification}
\label{sec:future:Process Modification}
Processes are not eternal:
They are created and destroyed, their memory mappings are modified,
they are linked to dynamic libraries, and they are debugged.
These sections look at how transactional memory can handle an
ever-changing execution environment.
\subsubsection{Multithreaded Transactions}
\label{sec:future:Multithreaded Transactions}
It is perfectly legal to create processes and threads while holding
a lock or, for that matter, from within an RCU read-side critical
section.
Not only is it legal, but it is quite simple, as can be seen from the
following code fragment:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 pthread_mutex_lock(...);
2 for (i = 0; i < ncpus; i++)
3 pthread_create(&tid[i], ...);
4 for (i = 0; i < ncpus; i++)
5 pthread_join(tid[i], ...);
6 pthread_mutex_unlock(...);
\end{verbatim}
\end{minipage}
\vspace{5pt}
This pseudo-code fragment uses \co{pthread_create()} to spawn one thread
per CPU, then uses \co{pthread_join()} to wait for each to complete,
all under the protection of \co{pthread_mutex_lock()}.
The effect is to execute a lock-based critical section in parallel,
and one could obtain a similar effect using \co{fork()} and \co{wait()}.
Of course, the critical section would need to be quite large to justify
the thread-spawning overhead, but there are many examples of large
critical sections in production software.
What might TM do about thread spawning within a transaction?
\begin{enumerate}
\item Declare \co{pthread_create()} to be illegal within transactions,
resulting in transaction abort (preferred) or undefined
behavior. Alternatively, enlist the compiler to enforce
\co{pthread_create()}-free transactions.
\item Permit \co{pthread_create()} to be executed within a
transaction, but only the parent thread will be considered to
be part of the transaction.
This approach seems to be reasonably compatible with existing and
posited TM implementations, but seems to be a trap for the unwary.
This approach raises further questions, such as how to handle
conflicting child-thread accesses.
\item Convert the \co{pthread_create()}s to function calls.
This approach is also an attractive nuisance, as it does not
handle the not-uncommon cases where the child threads communicate
with one another.
In addition, it does not permit parallel execution of the body
of the transaction.
\item Extend the transaction to cover the parent and all child threads.
This approach raises interesting questions about the nature of
conflicting accesses, given that the parent and children are
presumably permitted to conflict with each other, but not with
other threads.
It also raises interesting questions as to what should happen
if the parent thread does not wait for its children before
committing the transaction.
Even more interesting, what happens if the parent conditionally
executes \co{pthread_join()} based on the values of variables
participating in the transaction?
The answers to these questions are reasonably straightforward
in the case of locking.
The answers for TM are left as an exercise for the reader.
\end{enumerate}
Given that parallel execution of transactions is commonplace in the
database world, it is perhaps surprising that current TM proposals do
not provide for it.
On the other hand, the example above is a fairly sophisticated use
of locking that is not normally found in simple textbook examples,
so perhaps its omission is to be expected.
That said, there are rumors that some TM researchers are investigating
fork/join parallelism within transactions, so perhaps this topic will
soon be addressed more thoroughly.
\subsubsection{The \tco{exec()} System Call}
\label{sec:future:The exec System Call}
One can execute an \co{exec()} system call while holding a lock, and
also from within an RCU read-side critical section.
The exact semantics depends on the type of primitive.
In the case of non-persistent primitives (including
\co{pthread_mutex_lock()}, \co{pthread_rwlock_rdlock()}, and RCU),
if the \co{exec()} succeeds, the whole address space vanishes, along
with any locks being held.
Of course, if the \co{exec()} fails, the address space still lives,
so any associated locks would also still live.
A bit strange perhaps, but reasonably well defined.
On the other hand, persistent primitives (including the flock family,
\co{lockf()}, System V semaphores, and the \co{O_CREAT} flag to
\co{open()}) would survive regardless of whether the \co{exec()}
succeeded or failed, so that the \co{exec()}ed program might well
release them.
\QuickQuiz{}
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
\QuickQuizAnswer{
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
} \QuickQuizEnd
What happens when you attempt to execute an \co{exec()} system call
from within a transaction?
\begin{enumerate}
\item Disallow \co{exec()} within transactions, so that the enclosing
transactions abort upon encountering the \co{exec()}.
This is well defined, but clearly requires non-TM synchronization
primitives for use in conjunction with \co{exec()}.
\item Disallow \co{exec()} within transactions, with the compiler
enforcing this prohibition.
There is a draft specification for TM in C++ that takes
this approach, allowing functions to be decorated with
the \co{transaction_safe} and \co{transaction_unsafe}
attributes.\footnote{
Thanks to Mark Moir for pointing me at this spec, and
to Michael Wong for having pointed me at an earlier
revision some time back.}
This approach has some advantages over aborting the transaction
at runtime, but again requires non-TM synchronization primitives
for use in conjunction with \co{exec()}.
\item Treat the transaction in a manner similar to non-persistent
locking primitives, so that the transaction survives if \co{exec()}
fails, and silently commits if the \co{exec()} succeeds.
The case where some of the variables affected by the transaction
reside in \co{mmap()}ed memory (and thus could survive a successful
\co{exec()} system call) is left as an exercise for the reader.
\item Abort the transaction (and the \co{exec()} system call) if the
\co{exec()} system call would have succeeded, but allow the
transaction to continue if the \co{exec()} system call would
fail.
This is in some sense the ``correct'' approach, but it would
require considerable work for a rather unsatisfying result.
\end{enumerate}
The \co{exec()} system call is perhaps the strangest example of an
obstacle to universal TM applicability, as it is not completely clear
what approach makes sense, and some might argue that this is merely a
reflection of the perils of interacting with execs in real life.
That said, the two options prohibiting \co{exec()} within transactions
are perhaps the most logical of the group.
Similar issues surround the \co{exit()} and \co{kill()} system calls.
\subsubsection{Dynamic Linking and Loading}
\label{sec:future:Dynamic Linking and Loading}
Both lock-based critical sections and RCU read-side critical sections
can legitimately contain code that invokes dynamically linked and loaded
functions, including C/C++ shared libraries and Java class libraries.
Of course, the code contained in these libraries is by definition
unknowable at compile time.
So, what happens if a dynamically loaded function is invoked within
a transaction?
This question has two parts: (a)~how do you dynamically link and load a
function within a transaction and (b)~what do you do about the unknowable
nature of the code within this function?
To be fair, item (b) poses some challenges for locking and RCU as well,
at least in theory.
For example, the dynamically linked function might introduce a deadlock
for locking or might (erroneously) introduce a quiescent state into an
RCU read-side critical section.
The difference is that while the class of operations permitted in locking
and RCU critical sections is well-understood, there appears to still be
considerable uncertainty in the case of TM.
In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
Options for part (a), the actual loading of the code, include the following:
\begin{enumerate}
\item Treat the dynamic linking and loading in a manner similar to a
page fault, so that the function is loaded and linked, possibly
aborting the transaction in the process.
If the transaction is aborted, the retry will find the function
already present, and the transaction can thus be expected to
proceed normally.
\item Disallow dynamic linking and loading of functions from within
transactions.
\end{enumerate}
For part (b), the inability to detect TM-unfriendly operations
in a not-yet-loaded function, possibilities include the following:
\begin{enumerate}
\item Just execute the code: if there are any TM-unfriendly operations
in the function, simply abort the transaction.
Unfortunately, this approach makes it impossible for the compiler
to determine whether a given group of transactions may be safely
composed.
One way to permit composability regardless is irrevocable
transactions; however, current implementations permit only a
single irrevocable transaction to proceed at any given time,
which can severely limit performance and scalability.
Irrevocable transactions also seem to rule out use of manual
transaction-abort operations.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Decorate the function declarations indicating which functions
are TM-friendly.
These decorations can then be enforced by the compiler's type
system (a sketch of this approach follows this list).
Of course, for many languages, this requires language extensions
to be proposed, standardized, and implemented, with the
corresponding time delays.
That said, the standardization effort is already in
progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM}.
\item As above, disallow dynamic linking and loading of functions from
within transactions.
\end{enumerate}
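As a sketch of the decoration-based option, GCC's \co{-fgnu-tm}
extension allows C function declarations to be annotated so that the
compiler can reject TM-unfriendly calls at build time; the attribute
names are those of GCC's TM implementation, and the \co{do_math()} and
\co{do_io()} functions are illustrative assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 int do_math(int x)
 2   __attribute__((transaction_safe));
 3 void do_io(const char *s)
 4   __attribute__((transaction_unsafe));
 5
 6 int example(int x)
 7 {
 8   int y;
 9
10   __transaction_atomic {
11     y = do_math(x); /* declared safe */
12     /* Calling do_io() here would be
13        rejected by the compiler. */
14   }
15   return y;
16 }
\end{verbatim}
\end{minipage}
\vspace{5pt}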
I/O operations are of course a known weakness of TM, and dynamic linking
and loading can be thought of as yet another special case of I/O.
Nevertheless, the proponents of TM must either solve this problem, or
resign themselves to a world where TM is but one tool of several in the
parallel programmer's toolbox.
(To be fair, a number of TM proponents have long since resigned themselves
to a world containing more than just TM.)
\subsubsection{Memory-Mapping Operations}
\label{sec:future:Memory-Mapping Operations}
It is perfectly legal to execute memory-mapping operations (including
\co{mmap()}, \co{shmat()}, and \co{munmap()}~\cite{TheOpenGroup1997SUS})
within a lock-based critical section,
and, at least in principle, from within an RCU read-side critical section.
What happens when you attempt to execute such an operation from within
a transaction?
More to the point, what happens if the memory region being remapped
contains some variables participating in the current thread's transaction?
And what if this memory region contains variables participating in some
other thread's transaction?
It should not be necessary to consider cases where the TM system's
metadata is remapped, given that most locking primitives do not define
the outcome of remapping their lock variables.
Here are some memory-mapping options available to TM:
\begin{enumerate}
\item Memory remapping is illegal within a transaction, and will result
in all enclosing transactions being aborted.
This does simplify things somewhat, but also requires that TM
interoperate with synchronization primitives that do tolerate
remapping from within their critical sections.
\item Memory remapping is illegal within a transaction, and the
compiler is enlisted to enforce this prohibition.
\item Memory mapping is legal within a transaction, but aborts all
other transactions having variables in the region mapped over.
\item Memory mapping is legal within a transaction, but the mapping
operation will fail if the region being mapped overlaps with
the current transaction's footprint.
\item All memory-mapping operations, whether within or outside a
transaction, check the region being mapped against the memory
footprint of all transactions in the system.
If there is overlap, then the memory-mapping operation fails.
\item The effect of memory-mapping operations that overlap the memory
footprint of any transaction in the system is determined by the
TM conflict manager, which might dynamically determine whether
to fail the memory-mapping operation or abort any conflicting
transactions.
\end{enumerate}
It is interesting to note that \co{munmap()} leaves the relevant region
of memory unmapped, which could have additional interesting
implications.\footnote{
This difference between mapping and unmapping was noted by
Josh Triplett.}
\subsubsection{Debugging}
\label{sec:future:Debugging}
The usual debugging operations such as breakpoints work normally within
lock-based critical sections and from RCU read-side critical sections.
However, in initial transactional-memory hardware
implementations~\cite{DaveDice2009ASPLOSRockHTM}, an exception within
a transaction will abort that transaction, which in turn means that
breakpoints abort all enclosing transactions.
So how can transactions be debugged?
\begin{enumerate}
\item Use software emulation techniques within transactions containing
breakpoints.
Of course, it might be necessary to emulate all transactions
any time a breakpoint is set within the scope of any transaction,
and if the runtime system is unable to determine whether a
given breakpoint is within such a scope, it might be necessary
to emulate all transactions just to be on the safe side.
However, this approach might impose significant overhead, which
might in turn obscure the bug being pursued.
\item Use only hardware TM implementations that are capable of
handling breakpoint exceptions.
Unfortunately, as of this writing (September 2008), all such
implementations are strictly research prototypes.
\item Use only software TM implementations, which are
(very roughly speaking) more tolerant of exceptions than are
the simpler of the hardware TM implementations.
Of course, software TM tends to have higher overhead than hardware
TM, so this approach may not be acceptable in all situations.
\item Program more carefully, so as to avoid having bugs in the
transactions in the first place.
As soon as you figure out how to do this, please do let everyone
know the secret!
\end{enumerate}
There is some reason to believe that transactional memory will deliver
productivity improvements compared to other synchronization mechanisms,
but it does seem quite possible that these improvements could easily
be lost if traditional debugging techniques cannot be applied to
transactions.
This seems especially true if transactional memory is to be used by
novices on large transactions.
In contrast, macho ``top-gun'' programmers might be able to dispense with
such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity
promises to novice programmers, the debugging problem does need to
be solved.
\subsection{Synchronization}
\label{sec:future:Synchronization}
If transactional memory someday proves that it can be everything to everyone,
it will not need to interact with any other synchronization mechanism.
Until then, it will need to work with synchronization mechanisms that
can do what it cannot, or that work more naturally in a given situation.
The following sections outline the current challenges in this area.
\subsubsection{Locking}
\label{sec:future:Locking}
It is commonplace to acquire locks while holding other locks, which works
quite well, at least as long as the usual well-known software-engineering
techniques are employed to avoid deadlock.
It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives
cannot participate in lock-based deadlock cycles.
But what happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial: simply manipulate the data structure
representing the lock as part of the transaction, and everything works
out perfectly.
In practice, a number of non-obvious complications~\cite{Volos2008TRANSACT}
can arise, depending on implementation details of the TM system.
These complications can be resolved, but at the cost of a 45\% increase in
overhead for locks acquired outside of transactions and a 300\% increase
in overhead for locks acquired within transactions.
Although these overheads might be acceptable for transactional
programs containing small amounts of locking, they are often completely
unacceptable for production-quality lock-based programs wishing to use
the occasional transaction.
Here are some options available to TM:
\begin{enumerate}
\item Use only locking-friendly TM implementations.
Unfortunately, the locking-unfriendly implementations have some
attractive properties, including low overhead for successful
transactions and the ability to accommodate extremely large
transactions.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby accommodating the limitations of
locking-friendly TM implementations.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that all the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
\item Strive to reduce the overhead imposed on locking primitives.
\end{enumerate}
The fact that there could possibly be a problem interfacing TM and locking
came as a surprise to many, which underscores the need to try out new
mechanisms and primitives in real-world production software.
Fortunately, the advent of open source means that a huge quantity of
such software is now freely available to everyone, including researchers.
\subsubsection{Reader-Writer Locking}
\label{sec:future:Reader-Writer Locking}
It is commonplace to read-acquire reader-writer locks while holding
other locks, which just works, at least as long as the usual well-known
software-engineering techniques are employed to avoid deadlock.
Read-acquiring reader-writer locks from within RCU read-side critical
sections also works, and doing so eases deadlock concerns because RCU
read-side primitives cannot participate in lock-based deadlock cycles.
But what happens when you attempt to read-acquire a reader-writer lock
from within a transaction?
Unfortunately, the straightforward approach to read-acquiring the
traditional counter-based reader-writer lock within a transaction defeats
the purpose of the reader-writer lock.
To see this, consider a pair of transactions concurrently attempting to
read-acquire the same reader-writer lock.
Because read-acquisition involves modifying the reader-writer lock's
data structures, a conflict will result, which will roll back one of
the two transactions.
This behavior is completely inconsistent with the reader-writer lock's
goal of allowing concurrent readers.
Here are some options available to TM:
\begin{enumerate}
\item Use per-CPU or per-thread reader-writer
locking~\cite{WilsonCHsieh92a}, which allows a
given CPU (or thread, respectively) to manipulate only local
data when read-acquiring the lock.
This would avoid the conflict between the two transactions
concurrently read-acquiring the lock, permitting both to proceed
as intended (a sketch of this approach follows this list).
Unfortunately, (1)~the write-acquisition overhead of
per-CPU/thread locking can be extremely high, (2)~the memory
overhead of per-CPU/thread locking can be prohibitive, and
(3)~this transformation is available only when you have access to
the source code in question.
Other more-recent scalable
reader-writer locks~\cite{YossiLev2009SNZIrwlock}
might avoid some or all of these problems.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby avoiding read-acquiring reader-writer locks
from within transactions.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that \emph{all} the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
Furthermore, this approach can result in unnecessary transaction
rollbacks when multiple transactions attempt to read-acquire
the same lock.
\end{enumerate}
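Here is a sketch of the per-thread approach from the first of these
options; \co{NR_THREADS} and the caller-supplied \co{tid} are
illustrative assumptions, and initialization of the lock array is
omitted:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 pthread_mutex_t brlock[NR_THREADS];
 2
 3 void br_read_lock(int tid)
 4 {
 5   /* Touches only this thread's lock. */
 6   pthread_mutex_lock(&brlock[tid]);
 7 }
 8
 9 void br_read_unlock(int tid)
10 {
11   pthread_mutex_unlock(&brlock[tid]);
12 }
13
14 void br_write_lock(void)
15 {
16   int i;
17
18   /* Expensive: acquire every lock. */
19   for (i = 0; i < NR_THREADS; i++)
20     pthread_mutex_lock(&brlock[i]);
21 }
22
23 void br_write_unlock(void)
24 {
25   int i;
26
27   for (i = 0; i < NR_THREADS; i++)
28     pthread_mutex_unlock(&brlock[i]);
29 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because \co{br_read_lock()} writes only a per-thread location, two
transactions read-acquiring this lock need not conflict with each
other.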
Of course, there might well be other non-obvious issues surrounding
combining TM with reader-writer locking, as there in fact were with
exclusive locking.
\subsubsection{RCU}
\label{sec:future:RCU}
Because read-copy update (RCU) finds its main use in the Linux kernel,
one might be forgiven for assuming that there had been no academic work
on combining RCU and TM.\footnote{
However, the in-kernel excuse is wearing thin with the advent
of user-space RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}.}
However, the TxLinux group from the University of Texas at Austin had
no choice~\cite{ChistopherJRossbach2007a}.
The fact that they applied TM to the Linux 2.6 kernel, which uses RCU,
forced them to integrate TM and RCU, with TM taking the place of locking
for RCU updates.
Unfortunately, although the paper does state that the RCU implementation's
locks (e.g., \co{rcu_ctrlblk.lock}) were converted to transactions,
it is silent about what happened to locks used in RCU-based updates
(e.g., \co{dcache_lock}).
It is important to note that RCU permits readers and updaters to run
concurrently, further permitting RCU readers to access data that is in
the act of being updated.
Of course, this property of RCU, whatever its performance, scalability,
and real-time-response benefits might be, flies in the face of the
underlying atomicity properties of TM.
So how should TM-based updates interact with concurrent RCU readers?
Some possibilities are as follows:
\begin{enumerate}
\item RCU readers abort concurrent conflicting TM updates.
This is in fact the approach taken by the TxLinux project.
This approach does preserve RCU semantics, and also preserves
RCU's read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of
unnecessarily aborting conflicting updates.
In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result
in system hangs.
In addition, not all TM implementations offer the strong atomicity
required to implement this approach.
\item RCU readers that run concurrently with conflicting TM updates
get old (pre-transaction) values from any conflicting RCU loads.
This preserves RCU semantics and performance, and also prevents
RCU-update starvation.
However, not all TM implementations can provide timely access
to old values of variables that have been tentatively updated
by an in-flight transaction.
In particular, log-based TM implementations that maintain
old values in the log (thus making for excellent TM commit
performance) are not likely to be happy with this approach.
Perhaps the \co{rcu_dereference()} primitive can be leveraged
to permit RCU to access the old values within a greater range
of TM implementations, though performance might still be an issue.
Nevertheless, there are popular TM implementations that can
be easily and efficiently integrated with RCU in this
manner~\cite{DonaldEPorter2007TRANSACT,PhilHoward2011RCUTMRBTree,
PhilipWHoward2013RCUrbtree}.
\item If an RCU reader executes an access that conflicts with an
in-flight transaction, then that RCU access is delayed until
the conflicting transaction either commits or aborts.
This approach preserves RCU semantics, but not RCU's performance
or real-time response, particularly in presence of long-running
transactions.
In addition, not all TM implementations are capable of delaying
conflicting accesses.
That said, this approach seems eminently reasonable for hardware
TM implementations that support only small transactions.
\item RCU readers are converted to transactions.
This approach pretty much guarantees that RCU is compatible with
any TM implementation, but it also imposes TM's rollbacks on RCU
read-side critical sections, destroying RCU's real-time response
guarantees, and also degrading RCU's read-side performance.
Furthermore, this approach is infeasible in cases where any of
the RCU read-side critical sections contains operations that
the TM implementation in question is incapable of handling.
\item Many update-side uses of RCU modify a single pointer to publish
a new data structure.
In some of these cases, RCU can safely be permitted to see a
transactional pointer update that is subsequently rolled back,
as long as the transaction respects memory ordering and as long
as the roll-back process uses \co{call_rcu()} to free up the
corresponding structure (a sketch of this pattern follows this list).
Unfortunately, not all TM implementations respect memory barriers
within a transaction.
Apparently, the thought is that because transactions are supposed
to be atomic, the ordering of the accesses within the transaction
is not supposed to matter.
\item Prohibit use of TM in RCU updates.
This is guaranteed to work, but seems a bit restrictive.
\end{enumerate}
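For concreteness, here is a sketch of the single-pointer publication
pattern from the \co{call_rcu()} option above, written against a
userspace-RCU-style API; \co{struct foo} and its fields are
illustrative assumptions, and \co{gp} is assumed to be
non-\co{NULL}:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct foo {
 2   struct rcu_head rh;
 3   int data;
 4 };
 5 struct foo *gp; /* RCU-protected */
 6
 7 void free_foo_cb(struct rcu_head *rhp)
 8 {
 9   free(caa_container_of(rhp,
10                         struct foo, rh));
11 }
12
13 int reader(void)
14 {
15   int d;
16
17   rcu_read_lock();
18   d = rcu_dereference(gp)->data;
19   rcu_read_unlock();
20   return d;
21 }
22
23 void updater(struct foo *newp)
24 {
25   struct foo *oldp = gp;
26
27   rcu_assign_pointer(gp, newp);
28   /* Defer the free until pre-existing
29      readers complete; a rolled-back
30      update would do likewise. */
31   call_rcu(&oldp->rh, free_foo_cb);
32 }
\end{verbatim}
\end{minipage}
\vspace{5pt}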
It seems likely that additional approaches will be uncovered, especially
given the advent of user-level RCU implementations.\footnote{
Kudos to the TxLinux group, Maged Michael, and Josh Triplett
for coming up with a number of the above alternatives.}
\subsubsection{Extra-Transactional Accesses}
\label{sec:future:Extra-Transactional Accesses}
Within a lock-based critical section, it is perfectly legal to manipulate
variables that are concurrently accessed or even modified outside that
lock's critical section, with one common example being statistical
counters.
The same thing is possible within RCU read-side critical
sections, and is in fact the common case.
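For example, a statistical counter might be incremented with no lock
held at all, with a lock-based critical section merely summing a
possibly stale snapshot.
In this sketch, \co{NR_THREADS}, the \co{counterp[]} registry (filled
in at thread-creation time), and \co{__thread} storage are illustrative
assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 __thread unsigned long counter;
 2 unsigned long *counterp[NR_THREADS];
 3 pthread_mutex_t sum_lock =
 4   PTHREAD_MUTEX_INITIALIZER;
 5
 6 void inc_count(void)
 7 {
 8   counter++; /* no lock needed */
 9 }
10
11 unsigned long read_count(void)
12 {
13   int i;
14   unsigned long sum = 0;
15
16   pthread_mutex_lock(&sum_lock);
17   for (i = 0; i < NR_THREADS; i++)
18     if (counterp[i])
19       sum += *counterp[i];
20   pthread_mutex_unlock(&sum_lock);
21   return sum;
22 }
\end{verbatim}
\end{minipage}
\vspace{5pt}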
Given mechanisms such as the so-called ``dirty reads'' that are
prevalent in production database systems, it is not surprising
that extra-transactional accesses have received serious attention
from the proponents of TM, with the concepts of weak and strong
atomicity~\cite{Blundell2006TMdeadlock} being but one case in point.
Here are some extra-transactional options available to TM:
\begin{enumerate}
\item Conflicts due to extra-transactional accesses always abort
transactions.
This is strong atomicity.
\item Conflicts due to extra-transactional accesses are ignored,
so only conflicts among transactions can abort transactions.
This is weak atomicity.
\item Transactions are permitted to carry out non-transactional
operations in special cases, such as when allocating memory or
interacting with lock-based critical sections.
\item Produce hardware extensions that permit some operations
(for example, addition) to be carried out concurrently on a
single variable by multiple transactions.
\item Introduce weak semantics to transactional memory.
One approach is the combination with RCU described in
Section~\ref{sec:future:RCU}, while Gramoli and Guerraoui
survey a number of other weak-transaction
approaches~\cite{Gramoli:2014:DTP:2541883.2541900}, for example,
restricted partitioning of large
``elastic'' transactions into smaller transactions, thus
reducing conflict probabilities (albeit with tepid performance
and scalability).
Perhaps further experience will show that some uses of
extra-transactional accesses can be replaced by weak
transactions.
\end{enumerate}
It appears that transactions were conceived as standing alone, with no
interaction required with any other synchronization mechanism.
If so, it is no surprise that much confusion and complexity arises when
combining transactions with non-transactional accesses.
But unless transactions are to be confined to small updates to isolated
data structures, or alternatively to new programs
that do not interact with the huge body of existing parallel code,
transactions absolutely must be so combined if they are to have
large-scale practical impact in the near term.
% @@@ Huge transactions. Or perhaps conflict handling.
\subsection{Discussion}
\label{sec:future:Discussion}
The obstacles to universal TM adoption lead to the following
conclusions:
\begin{enumerate}
\item One interesting property of TM is the fact that transactions are
subject to rollback and retry.
This property underlies TM's difficulties with irreversible
operations, including unbuffered I/O, RPCs, memory-mapping
operations, time delays, and the \co{exec()} system call.
This property also has the unfortunate consequence of introducing
all the complexities inherent in the possibility of
failure into synchronization primitives, often in a developer-visible
manner.
\item Another interesting property of TM, noted by
Shpeisman et al.~\cite{TatianaShpeisman2009CppTM}, is that TM
intertwines the synchronization with the data it protects.
This property underlies TM's issues with I/O, memory-mapping
operations, extra-transactional accesses, and debugging
breakpoints.
In contrast, conventional synchronization primitives, including
locking and RCU, maintain a clear separation between the
synchronization primitives and the data that they protect.
\item One of the stated goals of many workers in the TM area is to
ease parallelization of large sequential programs.
As such, individual transactions are commonly expected to
execute serially, which might do much to explain TM's issues
with multithreaded transactions.
\end{enumerate}
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, concentrating on situations
where hardware assist potentially provides substantial advantages over
other synchronization primitives.
This is in fact the approach Sun took with its Rock research
CPU~\cite{DaveDice2009ASPLOSRockHTM}.
Some TM researchers seem to agree with this approach, while others have
much higher hopes for TM.
Of course, it is quite possible that TM will be able to take on larger
problems, and this section lists a few of the issues that
must be resolved if TM is to achieve this lofty goal.
Of course, everyone involved should treat this as a learning experience.
It would seem that TM researchers have a great deal to learn from
practitioners who have successfully built large software systems using
traditional synchronization primitives.
And vice versa.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-vision}}
\caption{The STM Vision}
\ContributedBy{Figure}{fig:future:The STM Vision}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-conflict}}
\caption{The STM Reality: Conflicts}
\ContributedBy{Figure}{fig:future:The STM Reality: Conflicts}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-nonidempotent}}
\caption{The STM Reality: Irrevocable Operations}
\ContributedBy{Figure}{fig:future:The STM Reality: Irrevocable Operations}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-realtime}}
\caption{The STM Reality: Realtime Response}
\ContributedBy{Figure}{fig:future:The STM Reality: Realtime Response}{Melissa Broussard}
\end{figure}
But for the moment, the current state of STM
can best be summarized with a series of cartoons.
First,
Figure~\ref{fig:future:The STM Vision}
shows the STM vision.
As always, the reality is a bit more nuanced, as fancifully depicted by
Figures~\ref{fig:future:The STM Reality: Conflicts},
\ref{fig:future:The STM Reality: Irrevocable Operations},
and~\ref{fig:future:The STM Reality: Realtime Response}.
Less fanciful STM retrospectives are also
available~\cite{JoeDuffy2010RetroTM,JoeDuffy2010RetroTM2}.
Recent advances in commercially available hardware have opened the door
for variants of HTM, which are addressed in the following section.
% @@@ Don Porter's user-level TM work for Linux.