% future/tm.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Transactional Memory}
\label{sec:future:Transactional Memory}
\epigraph{Everything should be as simple as it can be, but not simpler.}
{Albert Einstein, by way of Louis Zukofsky}
The idea of using transactions outside of databases goes back many
decades~\cite{DBLomet1977SIGSOFT,Knight:1986:AMF:319838.319854,Herlihy93a},
with the key difference between
database and non-database transactions being that non-database transactions
drop the ``D'' in the ``ACID''\footnote{
Atomicity, consistency, isolation, and durability.}
properties defining database transactions.
The idea of supporting memory-based transactions, or ``transactional memory''
(TM), in hardware
is more recent~\cite{Herlihy93a}, but unfortunately, support for such
transactions in commodity hardware was not immediately forthcoming,
despite other somewhat similar proposals being put forward~\cite{JMStone93}.
Not long after, Shavit and Touitou proposed a software-only implementation
of transactional memory (STM) that was capable of running on commodity
hardware, give or take memory-ordering issues~\cite{Shavit95}.
This proposal languished for many years, perhaps due to the fact that
the research community's attention was absorbed by \IXacrl{nbs}
(see \cref{sec:advsync:Non-Blocking Synchronization}).
But by the turn of the century, TM started receiving
more attention~\cite{Martinez01a,Rajwar01a}, and by the middle of the
decade, the level of interest can only be termed
``incandescent''~\cite{MauriceHerlihy2005-TM-manifesto.pldi,
DanGrossman2007TMGCAnalogy}, with only a few voices of
caution~\cite{Blundell2005DebunkTM,McKenney2007PLOSTM}.
The basic idea behind TM is to execute a section of
code atomically, so that other threads see no intermediate state.
As such, the semantics of TM could be implemented
by simply replacing each transaction with a recursively acquirable
global lock acquisition and release, albeit with abysmal performance
and scalability.
Much of the complexity inherent in TM implementations, whether hardware
or software, is efficiently detecting when concurrent transactions can safely
run in parallel.
Because this detection is done dynamically, conflicting transactions
can be aborted or ``rolled back'', and in some implementations, this
failure mode is visible to the programmer.
Because transaction roll-back is increasingly unlikely as transaction
size decreases, TM might become quite attractive for small memory-based
operations,
such as linked-list manipulations used for stacks, queues, hash tables,
and search trees.
However, it is currently much more difficult to make the case for large
transactions, particularly those containing non-memory operations such
as I/O and process creation.
The following sections look at current challenges to the grand vision of
``Transactional Memory Everywhere''~\cite{PaulEMcKenney2009TMeverywhere}.
\Cref{sec:future:Outside World} examines the challenges faced
interacting with the outside world,
\cref{sec:future:Process Modification} looks at interactions
with process modification primitives,
\cref{sec:future:Synchronization} explores interactions with
other synchronization primitives, and finally
\cref{sec:future:Discussion} closes with some discussion.
\subsection{Outside World}
\label{sec:future:Outside World}
In the wise words of \ppl{Donald}{Knuth}:
\begin{quote}
Many computer users feel that input and output are not actually part
of ``real programming,'' they are merely things that (unfortunately)
must be done in order to get information in and out of the machine.
\end{quote}
Whether or not we believe that input and output are ``real programming'',
the fact is that software absolutely must deal with the outside world.
This section therefore critiques transactional memory's outside-world
capabilities, focusing on I/O operations, time delays, and persistent
storage.
And these interactions with the outside world are the rock upon which
the claims of unconditional TM composability are shattered.
Yes, you can compose transactions, but only as long as there are no
intervening TM-unfriendly operations.
Just as you can compose lock-based critical sections, but only as long
as doing so does not introduce deadlocks.
In the end, it is unclear whether TM really provides better composability
than does locking in real-world code.
\subsubsection{I/O Operations}
\label{sec:future:I/O Operations}
One can execute I/O operations within a lock-based critical section,
while holding a \IX{hazard pointer}, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
What happens when you attempt to execute an I/O operation from within
a transaction?
The underlying problem is that transactions may be rolled back, for
example, due to conflicts.
Roughly speaking, this requires that all operations within any given
transaction be revocable, so that executing the operation twice has
the same effect as executing it once.
Unfortunately, I/O is in general the prototypical irrevocable
operation, making it difficult to include general I/O operations in
transactions.
In fact, general I/O is irrevocable:
Once you have pushed the proverbial button launching the nuclear warheads,
there is no turning back.
Here are some options for handling of I/O within transactions:
\begin{enumerate}
\item Restrict I/O within transactions to buffered I/O with in-memory
buffers.
These buffers may then be included in the transaction in the
same way that any other memory location might be included.
This seems to be the mechanism of choice, and it does work
well in common situations such as stream I/O and
mass-storage I/O\@.
However, special handling is required in cases where multiple
record-oriented output streams are merged onto a single file
from multiple processes, as might be done using the ``a+''
option to \co{fopen()} or the \co{O_APPEND} flag to \co{open()}.
In addition, as will be seen in the next section, common
networking operations cannot be handled via buffering.
\item Prohibit I/O within transactions, so that any attempt to execute
an I/O operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
This approach seems to be the conventional TM approach for
unbuffered I/O, but requires that TM interoperate with other
synchronization primitives tolerating I/O.
\item Prohibit I/O within transactions, but enlist the compiler's aid
in enforcing this prohibition.
\item Permit only one special
\emph{irrevocable} transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed
at any given time, thus allowing irrevocable transactions to
contain I/O operations.\footnote{
In earlier literature, irrevocable transactions are
termed \emph{inevitable} transactions.}
This works in general, but severely limits the scalability and
performance of I/O operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Worse yet, use of irrevocability to tolerate I/O operations
seems to greatly restrict use of manual transaction-abort
operations.\footnote{
This difficulty was pointed out by Michael Factor.
To see the problem, think through what TM should do
in response to an attempt to abort a transaction after
it has executed an irrevocable operation.}
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Create new hardware and protocols such that I/O operations can
be pulled into the transactional substrate.
In the case of input operations, the hardware would need to
correctly predict the result of the operation, and to abort the
transaction if the prediction failed.
\end{enumerate}
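The first of these options can be sketched in C\@.
The \co{txn_begin()}, \co{txn_write()}, and \co{txn_commit()} names
are hypothetical, and the transaction itself is emulated by the
single global lock described at the beginning of this section, so
this is a sketch of the buffering idea rather than a real TM
implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical txn_begin()/txn_write()/txn_commit() API.  The
 * "transaction" is emulated by a single global lock, and output is
 * staged in an in-memory buffer that reaches the file only at commit
 * time, so that an abort could simply discard the buffer. */
static pthread_mutex_t txn_lock = PTHREAD_MUTEX_INITIALIZER;
static char txn_buf[4096];
static size_t txn_len;

static void txn_begin(void)
{
	pthread_mutex_lock(&txn_lock);
	txn_len = 0;		/* Start with an empty staging buffer. */
}

static void txn_write(const char *s)
{
	size_t n = strlen(s);

	if (txn_len + n <= sizeof(txn_buf)) {
		memcpy(txn_buf + txn_len, s, n);  /* Stage, do not emit. */
		txn_len += n;
	}
}

static size_t txn_commit(FILE *fp)
{
	size_t n = fwrite(txn_buf, 1, txn_len, fp);  /* Emit at commit. */

	pthread_mutex_unlock(&txn_lock);
	return n;
}
```

Note that the \co{O_APPEND}-style merging problem called out above
is exactly what this sketch does not handle:
two such buffers flushed at commit time may interleave differently
than the corresponding unbuffered writes would have.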
I/O operations are a well-known weakness of TM, and it is not clear
that the problem of supporting I/O in transactions has a reasonable
general solution, at least if ``reasonable'' is to include usable
performance and scalability.
Nevertheless, continued time and attention to this problem will likely
produce additional progress.
\subsubsection{RPC Operations}
\label{sec:future:RPC Operations}
One can execute RPCs within a lock-based critical section, while holding
a hazard pointer, within a sequence-locking read-side critical section,
and from within a userspace-RCU read-side critical section, and even
all at the same time, if need be.
What happens when you attempt to execute an RPC from within a transaction?
If both the RPC request and its response are to be contained within the
transaction, and if some part of the transaction depends on the result
returned by the response, then it is not possible to use the memory-buffer
tricks that can be used in the case of buffered I/O\@.
Any attempt to
take this buffering approach would deadlock the transaction, as the
request could not be transmitted until the transaction was guaranteed
to succeed, but the transaction's success might not be knowable until
after the response is received, as is the case in the following example:
\begin{VerbatimN}[samepage=true]
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
\end{VerbatimN}
The transaction's memory footprint cannot be determined until after the
RPC response is received, and until the transaction's memory footprint
can be determined, it is impossible to determine whether the transaction
can be allowed to commit.
The only action consistent with transactional semantics is therefore to
unconditionally abort the transaction, which is, to say the least,
unhelpful.
Here are some options available to TM:
\begin{enumerate}
\item Prohibit RPC within transactions, so that any attempt to execute
an RPC operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
Alternatively, enlist the compiler to enforce RPC-free
transactions.
This approach does work, but will require TM to
interact with other synchronization primitives.
\item Permit only one special
irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed at any given time, thus allowing irrevocable
transactions to contain RPC operations.
This works in general, but severely limits the scalability and
performance of RPC operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Furthermore, use of irrevocable transactions to permit RPC
operations restricts manual transaction-abort operations
once the RPC operation has started.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item must have blocking semantics.
\item Identify special cases where the success of the transaction may
be determined before the RPC response is received, and
automatically convert these to irrevocable transactions immediately
before sending the RPC request.
Of course, if several concurrent transactions attempt RPC calls
in this manner, it might be necessary to roll all but one of them
back, with consequent degradation of performance and scalability.
This approach nevertheless might be valuable given long-running
transactions ending with an RPC\@.
This approach must still restrict manual transaction-abort
operations.
\item Identify special cases where the RPC response may be moved out
of the transaction, and then proceed using techniques similar
to those used for buffered I/O.
\item Extend the transactional substrate to include the RPC server as
well as its client.
This is in theory possible, as has been demonstrated by
distributed databases.
However, it is unclear whether the requisite performance and
scalability requirements can be met by distributed-database
techniques, given that memory-based TM has no slow disk drives
behind which to hide such latencies.
Of course, given the advent of solid-state disks, it is also quite
possible that databases will need to redesign their approach to
latency hiding.
\end{enumerate}
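The single-irrevocable-transaction approach underlying the second and
third options can be sketched as a global token:
the \co{txn_go_irrevocable()} and \co{txn_end_irrevocable()} names are
hypothetical, but the idea is that at most one transaction at a time
may hold the token, and once a transaction holds it, it can no longer
be rolled back, so that it may safely transmit its RPC request:

```c
#include <pthread.h>

/* Hypothetical irrevocability token: at most one transaction may
 * hold it, and a holder can no longer be rolled back.  A real TM
 * implementation would also block conflicting transactions. */
static pthread_mutex_t irrevocable_token = PTHREAD_MUTEX_INITIALIZER;
static __thread int have_token;

static void txn_go_irrevocable(void)
{
	if (!have_token) {
		pthread_mutex_lock(&irrevocable_token);
		have_token = 1;	/* Roll-back is now prohibited. */
	}
}

static int txn_end_irrevocable(void)
{
	if (!have_token)
		return 0;	/* This transaction never held the token. */
	have_token = 0;
	pthread_mutex_unlock(&irrevocable_token);
	return 1;
}
```

The global token is of course also the reason for the scalability
limitations noted above:
all irrevocable transactions serialize on this one lock.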
As noted in the prior section, I/O is a known weakness of TM, and RPC
is simply an especially problematic case of I/O.
\subsubsection{Time Delays}
\label{sec:future:Time Delays}
An important special case of interaction with extra-transactional accesses
involves explicit time delays within a transaction.
Of course, the idea of a time delay within a transaction flies in the
face of TM's atomicity property, but this sort of thing is arguably what
weak atomicity is all about.
Furthermore, correct interaction with memory-mapped I/O sometimes requires
carefully controlled timing, and applications often use time delays
for varied purposes.
Finally, one can execute time delays within a lock-based critical section,
while holding a hazard pointer, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
Doing so might not be wise from a contention or scalability viewpoint,
but then again, doing so does not raise any fundamental conceptual issues.
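For concreteness, a time delay within a lock-based critical section
looks like the following, which is legal and well defined, if not
necessarily wise.
A TM implementation that publishes stores only at commit time could
not give the analogous transaction its intended meaning:

```c
#include <pthread.h>
#include <time.h>

/* A deliberate delay while holding a lock: perfectly legal for
 * locking, but problematic for commit-time-publication TM. */
static pthread_mutex_t delay_lock = PTHREAD_MUTEX_INITIALIZER;

static int delay_in_critical_section(long ns)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = ns };
	int ret;

	pthread_mutex_lock(&delay_lock);
	ret = nanosleep(&ts, NULL);	/* Delay while holding the lock. */
	pthread_mutex_unlock(&delay_lock);
	return ret;
}
```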
So, what can TM do about time delays within transactions?
\begin{enumerate}
\item Ignore time delays within transactions.
This has an appearance of elegance, but like too many other
``elegant'' solutions, fails to survive first contact with
legacy code.
Such code, which might well have important time delays in critical
sections, would fail upon being transactionalized.
\item Abort transactions upon encountering a time-delay operation.
This is attractive, but it is unfortunately not always possible
to automatically detect a time-delay operation.
Is that tight loop carrying out a critical computation, or is it
simply waiting for time to elapse?
\item Enlist the compiler to prohibit time delays within transactions.
\item Let the time delays execute normally.
Unfortunately, some TM implementations publish modifications only
at commit time, which could defeat the purpose of the time delay.
\end{enumerate}
It is not clear that there is a single correct answer.
TM implementations featuring weak atomicity that publish changes
immediately within the transaction (rolling these changes back upon abort)
might be reasonably well served by the last alternative.
Even in this case, the code (or possibly even hardware) at the other
end of the transaction may require a substantial redesign to tolerate
aborted transactions.
This need for redesign would make it more difficult to apply transactional
memory to legacy code.
\subsubsection{Persistence}
\label{sec:future:Persistence}
There are many different types of locking primitives.
One interesting distinction is persistence, in other words, whether the
lock can exist independently of the address space of the process using
the lock.
Non-persistent locks include \co{pthread_mutex_lock()},
\co{pthread_rwlock_rdlock()}, and most kernel-level locking primitives.
If the memory locations instantiating a non-persistent lock's data
structures disappear, so does the lock.
For typical use of \co{pthread_mutex_lock()}, this means that when the
process exits, all of its locks vanish.
This property can be exploited in order to trivialize lock cleanup
at program shutdown time, but makes it more difficult for unrelated
applications to share locks, as such sharing requires the applications
to share memory.
\QuickQuiz{
But suppose that an application exits while holding a
\co{pthread_mutex_lock()} that happens to be located in a
file-mapped region of memory?
}\QuickQuizAnswer{
Indeed, in this case the lock would persist, much to the
consternation of other processes attempting to acquire this
lock that is held by a process that no longer exists.
Which is why great care is required when using \co{pthread_mutex}
objects located in file-mapped memory regions.
}\QuickQuizEnd
Persistent locks help avoid the need to share memory among unrelated
applications.
Persistent locking APIs include the flock family, \co{lockf()}, System
V semaphores, and the \co{O_CREAT} flag to \co{open()}.
These persistent APIs can be used to protect large-scale operations
spanning runs of multiple applications, and, in the case of
\co{O_CREAT}, even surviving operating-system reboot.
If need be, locks can even span multiple computer systems via distributed
lock managers and distributed filesystems---and persist across reboots
of any or all of those computer systems.
Persistent locks can be used by any application, including applications
written using multiple languages and software environments.
In fact, a persistent lock might well be acquired by an application written
in C and released by an application written in Python.
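The \co{O_CREAT}-based persistent lock mentioned above can be
sketched as follows.
The lock is the very existence of the file, so it outlives the
locking process and can be released by any process, written in any
language, simply by removing the file.
The path used here is purely illustrative:

```c
#include <fcntl.h>
#include <unistd.h>

/* Persistent lock: existence of the lock file means "held". */
static int persistent_trylock(const char *path)
{
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);

	if (fd < 0)
		return -1;	/* Lock file already exists: lock is held. */
	close(fd);
	return 0;		/* Lock acquired. */
}

static int persistent_unlock(const char *path)
{
	return unlink(path);	/* Any process may release the lock. */
}
```

Of course, this idiom has the failure mode called out in the earlier
Quick Quiz:
if the holder exits without unlinking the file, the lock persists,
for better or worse.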
How could a similar persistent functionality be provided for TM?
\begin{enumerate}
\item Restrict persistent transactions to special-purpose environments
designed to support them, for example, SQL\@.
This clearly works, given the decades-long history of database
systems, but does not provide the same degree of flexibility
provided by persistent locks.
\item Use snapshot facilities provided by some storage devices and/or
filesystems.
Unfortunately, this does not handle network communication,
nor does it handle I/O to devices that do not provide snapshot
capabilities, for example, memory sticks.
\item Build a time machine.
\item Avoid the problem entirely by using existing persistent facilities,
presumably avoiding such use within transactions.
\end{enumerate}
Of course, the fact that it is called transactional \emph{memory}
should give us pause, as the name itself conflicts with the concept of
a persistent transaction.
It is nevertheless worthwhile to consider this possibility as an important
test case probing the inherent limitations of transactional memory.
\subsection{Process Modification}
\label{sec:future:Process Modification}
Processes are not eternal:
They are created and destroyed, their memory mappings are modified,
they are linked to dynamic libraries, and they are debugged.
These sections look at how transactional memory can handle an
ever-changing execution environment.
\subsubsection{Multithreaded Transactions}
\label{sec:future:Multithreaded Transactions}
It is perfectly legal to create processes and threads while holding
a lock or, for that matter, while holding a hazard pointer, within
a sequence-locking read-side critical section, and from within a
userspace-RCU read-side critical section, and even all at the same time,
if need be.
Not only is it legal, but it is quite simple, as can be seen from the
following code fragment:
\begin{VerbatimN}
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
  pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
  pthread_join(tid[i], ...);
pthread_mutex_unlock(...);
\end{VerbatimN}
This pseudo-code fragment uses \co{pthread_create()} to spawn one thread
per CPU, then uses \co{pthread_join()} to wait for each to complete,
all under the protection of \co{pthread_mutex_lock()}.
The effect is to execute a lock-based critical section in parallel,
and one could obtain a similar effect using \co{fork()} and \co{wait()}.
Of course, the critical section would need to be quite large to justify
the thread-spawning overhead, but there are many examples of large
critical sections in production software.
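The \co{fork()}/\co{wait()} analog of the fragment above can be
sketched as follows.
The \co{run_parallel_section()} name is illustrative, and the lock
itself is omitted because a \co{pthread_mutex_t} would need to live
in a process-shared mapping in order to be meaningful across the
\co{fork()}ed children:

```c
#include <sys/wait.h>
#include <unistd.h>

/* Spawn nchildren processes, then wait for each to complete,
 * returning the number of spawn/reap failures. */
static int run_parallel_section(int nchildren)
{
	int i, failures = 0;

	for (i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* Child: per-CPU work would go here. */
		if (pid < 0)
			failures++;	/* Could not spawn this child. */
	}
	for (i = 0; i < nchildren; i++)
		if (wait(NULL) < 0)
			failures++;
	return failures;
}
```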
What might TM do about thread spawning within a transaction?
\begin{enumerate}
\item Declare \co{pthread_create()} to be illegal within transactions,
preferably by aborting the transaction.
Alternatively, enlist the compiler to enforce
\co{pthread_create()}-free transactions.
\item Permit \co{pthread_create()} to be executed within a
transaction, but only the parent thread will be considered to
be part of the transaction.
This approach seems to be reasonably compatible with existing and
posited TM implementations, but seems to be a trap for the unwary.
This approach raises further questions, such as how to handle
conflicting child-thread accesses.
\item Convert the \co{pthread_create()}s to function calls.
This approach is also an attractive nuisance, as it does not
handle the not-uncommon cases where the child threads communicate
with one another.
In addition, it does not permit concurrent execution of the body
of the transaction.
\item Extend the transaction to cover the parent and all child threads.
This approach raises interesting questions about the nature of
conflicting accesses, given that the parent and children are
presumably permitted to conflict with each other, but not with
other threads.
It also raises interesting questions as to what should happen
if the parent thread does not wait for its children before
committing the transaction.
Even more interesting, what happens if the parent conditionally
executes \co{pthread_join()} based on the values of variables
participating in the transaction?
The answers to these questions are reasonably straightforward
in the case of locking.
The answers for TM are left as an exercise for the reader.
\end{enumerate}
Given that parallel execution of transactions is commonplace in the
database world, it is perhaps surprising that current TM proposals do
not provide for it.
On the other hand, the example above is a fairly sophisticated use
of locking that is not normally found in simple textbook examples,
so perhaps its omission is to be expected.
That said, some researchers are using transactions to autoparallelize
code~\cite{ArunRaman2010MultithreadedTransactions},
and there are rumors that other TM researchers are investigating
fork/join parallelism within transactions, so perhaps this topic will
soon be addressed more thoroughly.
\subsubsection{The \tco{exec()} System Call}
\label{sec:future:The exec System Call}
One can execute an \co{exec()} system call within a lock-based critical
section, while holding a hazard pointer, within a sequence-locking
read-side critical section, and from within a userspace-RCU read-side
critical section, and even all at the same time, if need be.
The exact semantics depends on the type of primitive.
In the case of non-persistent primitives (including
\co{pthread_mutex_lock()}, \co{pthread_rwlock_rdlock()}, and userspace RCU),
if the \co{exec()} succeeds, the whole address space vanishes, along
with any locks being held.
Of course, if the \co{exec()} fails, the address space still lives,
so any associated locks would also still live.
A bit strange perhaps, but well defined.
On the other hand, persistent primitives (including the flock family,
\co{lockf()}, System V semaphores, and the \co{O_CREAT} flag to
\co{open()}) would survive regardless of whether the \co{exec()}
succeeded or failed, so that the \co{exec()}ed program might well
release them.
\QuickQuiz{
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
}\QuickQuizAnswer{
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
}\QuickQuizEnd
What happens when you attempt to execute an \co{exec()} system call
from within a transaction?
\begin{enumerate}
\item Disallow \co{exec()} within transactions, so that the enclosing
transactions abort upon encountering the \co{exec()}.
This is well defined, but clearly requires non-TM synchronization
primitives for use in conjunction with \co{exec()}.
\item Disallow \co{exec()} within transactions, with the compiler
enforcing this prohibition.
There is a draft specification for TM in C++ that takes
this approach, allowing functions to be decorated with
the \co{transaction_safe} and \co{transaction_unsafe}
attributes.\footnote{
Thanks to Mark Moir for pointing me at this spec, and
to Michael Wong for having pointed me at an earlier
revision some time back.}
This approach has some advantages over aborting the transaction
at runtime, but again requires non-TM synchronization primitives
for use in conjunction with \co{exec()}.
One disadvantage is the need to decorate a great many library
functions with \co{transaction_safe} and \co{transaction_unsafe}
attributes.
\item Treat the transaction in a manner similar to non-persistent
locking primitives, so that the transaction survives if \co{exec()}
fails, and silently commits if the \co{exec()} succeeds.
The case where only some of the variables affected by the
transaction reside in \co{mmap()}ed memory (and thus could
survive a successful \co{exec()} system call) is left as an
exercise for the reader.
\item Abort the transaction (and the \co{exec()} system call) if the
\co{exec()} system call would have succeeded, but allow the
transaction to continue if the \co{exec()} system call would
fail.
This is in some sense the ``correct'' approach, but it would
require considerable work for a rather unsatisfying result.
\end{enumerate}
The \co{exec()} system call is perhaps the strangest example of an
obstacle to universal TM applicability, as it is not completely clear
what approach makes sense, and some might argue that this is merely a
reflection of the perils of real-life interaction with \co{exec()}.
That said, the two options prohibiting \co{exec()} within transactions
are perhaps the most logical of the group.
Similar issues surround the \co{exit()} and \co{kill()} system calls,
as well as a \co{longjmp()} or an exception that would exit the transaction.
(Where did the \co{longjmp()} or exception come from?)
\subsubsection{Dynamic Linking and Loading}
\label{sec:future:Dynamic Linking and Loading}
Lock-based critical sections, code holding a hazard pointer,
sequence-locking read-side critical sections, and userspace-RCU read-side
critical sections can (separately or in combination) legitimately contain
code that invokes dynamically linked and loaded functions, including C/C++
shared libraries and Java class libraries.
Of course, the code contained in these libraries is by definition
unknowable at compile time.
So, what happens if a dynamically loaded function is invoked within
a transaction?
This question has two parts:
\begin{enumerate*}[(a)]
\item How do you dynamically link and load a function within a transaction
and
\item What do you do about the unknowable nature of the code within
this function?
\end{enumerate*}
To be fair, item (b) poses some challenges for locking and userspace-RCU
as well, at least in theory.
For example, the dynamically linked function might introduce a \IX{deadlock}
for locking or might (erroneously) introduce a \IX{quiescent state} into a
userspace-RCU read-side critical section.
The difference is that while the class of operations permitted in locking
and userspace-RCU critical sections is well-understood, there appears
to still be considerable uncertainty in the case of TM\@.
In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
Options for part (a), the actual loading of the code, include the following:
\begin{enumerate}
\item Treat the dynamic linking and loading in a manner similar to a
page fault, so that the function is loaded and linked, possibly
aborting the transaction in the process.
If the transaction is aborted, the retry will find the function
already present, and the transaction can thus be expected to
proceed normally.
\item Disallow dynamic linking and loading of functions from within
transactions.
\end{enumerate}
For part (b), the inability to detect TM-unfriendly operations
in a not-yet-loaded function, possibilities include the following:
\begin{enumerate}
\item Just execute the code:
If there are any TM-unfriendly operations in the function,
simply abort the transaction.
Unfortunately, this approach makes it impossible for the compiler
to determine whether a given group of transactions may be safely
composed.
One way to permit composability regardless is irrevocable
transactions, however, current implementations permit only a
single irrevocable transaction to proceed at any given time,
which can severely limit performance and scalability.
Irrevocable transactions also seem to restrict use of manual
transaction-abort operations.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Decorate the function declarations indicating which functions
are TM-friendly.
These decorations can then be enforced by the compiler's type system.
Of course, for many languages, this requires language extensions
to be proposed, standardized, and implemented, with the
corresponding time delays, and also with the corresponding
decoration of a great many otherwise uninvolved library functions.
That said, the standardization effort is already in
progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM,Ali-Reza-Adl-Tabatabai2012CppTM}.
\item As above, disallow dynamic linking and loading of functions from
within transactions.
\end{enumerate}
I/O operations are of course a known weakness of TM, and dynamic linking
and loading can be thought of as yet another special case of I/O\@.
Nevertheless, the proponents of TM must either solve this problem, or
resign themselves to a world where TM is but one tool of several in the
parallel programmer's toolbox.
(To be fair, a number of TM proponents have long since resigned themselves
to a world containing more than just TM.)
\subsubsection{Memory-Mapping Operations}
\label{sec:future:Memory-Mapping Operations}
It is perfectly legal to execute memory-mapping operations (including
\co{mmap()}, \co{shmat()}, and \co{munmap()}~\cite{TheOpenGroup1997SUS})
within a lock-based critical section, while holding a hazard pointer,
within a sequence-locking read-side critical section, and from within a
userspace-RCU read-side critical section, and even all at the same time,
if need be.
What happens when you attempt to execute such an operation from within
a transaction?
More to the point, what happens if the memory region being remapped
contains some variables participating in the current thread's transaction?
And what if this memory region contains variables participating in some
other thread's transaction?
It should not be necessary to consider cases where the TM system's
metadata is remapped, given that most locking primitives do not define
the outcome of remapping their lock variables.
Here are some TM memory-mapping options:
\begin{enumerate}
\item Memory remapping is illegal within a transaction, and will result
in all enclosing transactions being aborted.
This does simplify things somewhat, but also requires that TM
interoperate with synchronization primitives that do tolerate
remapping from within their critical sections.
\item Memory remapping is illegal within a transaction, and the
compiler is enlisted to enforce this prohibition.
\item Memory mapping is legal within a transaction, but aborts all
other transactions having variables in the region mapped over.
\item Memory mapping is legal within a transaction, but the mapping
operation will fail if the region being mapped overlaps with
the current transaction's footprint.
\item All memory-mapping operations, whether within or outside a
transaction, check the region being mapped against the memory
footprint of all transactions in the system.
If there is overlap, then the memory-mapping operation fails.
\item The effect of memory-mapping operations that overlap the memory
footprint of any transaction in the system is determined by the
TM conflict manager, which might dynamically determine whether
to fail the memory-mapping operation or abort any conflicting
transactions.
\end{enumerate}
It is interesting to note that \co{munmap()} leaves the relevant region
of memory unmapped, which could have additional interesting
implications.\footnote{
This difference between mapping and unmapping was noted by
Josh Triplett.}
\subsubsection{Debugging}
\label{sec:future:Debugging}
The usual debugging operations such as breakpoints work normally within
lock-based critical sections and within userspace-RCU read-side critical
sections.
However, in initial transactional-memory hardware
implementations~\cite{DaveDice2009ASPLOSRockHTM} an exception within
a transaction will abort that transaction, which in turn means that
breakpoints abort all enclosing transactions.
So how can transactions be debugged?
\begin{enumerate}
\item Use software emulation techniques within transactions containing
breakpoints.
Of course, it might be necessary to emulate all transactions
any time a breakpoint is set within the scope of any transaction.
If the runtime system is unable to determine whether or not a
given breakpoint is within the scope of a transaction, then it
might be necessary to emulate all transactions just to be on
the safe side.
However, this approach might impose significant overhead, which
might in turn obscure the bug being pursued.
\item Use only hardware TM implementations that are capable of
handling breakpoint exceptions.
Unfortunately, as of this writing (March 2021), all such
implementations are research prototypes.
\item Use only software TM implementations, which are
(very roughly speaking) more tolerant of exceptions than are
the simpler of the hardware TM implementations.
Of course, software TM tends to have higher overhead than hardware
TM, so this approach may not be acceptable in all situations.
\item Program more carefully, so as to avoid having bugs in the
transactions in the first place.
As soon as you figure out how to do this, please do let everyone
know the secret!
\end{enumerate}
There is some reason to believe that transactional memory will deliver
\IX{productivity} improvements compared to other synchronization mechanisms,
but it does seem quite possible that these improvements could easily
be lost if traditional debugging techniques cannot be applied to
transactions.
This seems especially true if transactional memory is to be used by
novices on large transactions.
In contrast, macho ``top-gun'' programmers might be able to dispense with
such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity
promises to novice programmers, the debugging problem does need to
be solved.
\subsection{Synchronization}
\label{sec:future:Synchronization}
If transactional memory someday proves that it can be everything to everyone,
it will not need to interact with any other synchronization mechanism.
Until then, it will need to work with synchronization mechanisms that
can do what it cannot, or that work more naturally in a given situation.
The following sections outline the current challenges in this area.
\subsubsection{Locking}
\label{sec:future:Locking}
It is commonplace to acquire locks while holding other locks, which works
quite well, at least as long as the usual well-known software-engineering
techniques are employed to avoid deadlock.
It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives
cannot participate in lock-based deadlock cycles.
It is also possible to acquire locks while holding hazard pointers and
within sequence-lock read-side critical sections.
But what happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial:
Simply manipulate the data structure representing the lock as part of
the transaction, and everything works out perfectly.
In practice, a number of non-obvious complications~\cite{Volos2008TRANSACT}
can arise, depending on implementation details of the TM system.
These complications can be resolved, but at the cost of a 45\pct\ increase in
overhead for locks acquired outside of transactions and a 300\pct\ increase
in overhead for locks acquired within transactions.
Although these overheads might be acceptable for transactional
programs containing small amounts of locking, they are often completely
unacceptable for production-quality lock-based programs wishing to use
the occasional transaction.
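The trivial-in-theory approach above can be illustrated with a hedged C sketch, in which the lock word is manipulated as ordinary transactional data. The \co{txn_load()} and \co{txn_store()} comments mark hypothetical transactional accesses; real TM systems hide these behind the implementation, and it is in that hiding that the cited complications arise.

```c
#include <assert.h>

/* Hypothetical sketch only: the lock word is read and written as
 * ordinary transactional data.  The txn_load()/txn_store() comments
 * mark where a real TM system would intercept the accesses. */
struct spinlock {
	int locked;
};

static int txn_trylock(struct spinlock *lp)
{
	if (lp->locked)		/* txn_load(&lp->locked) */
		return 0;	/* conflict: a real TM would abort/retry */
	lp->locked = 1;		/* txn_store(&lp->locked, 1) */
	return 1;
}

static void txn_unlock(struct spinlock *lp)
{
	lp->locked = 0;		/* txn_store(&lp->locked, 0) */
}
```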
Here are some options available to TM:
\begin{enumerate}
\item Use only locking-friendly TM implementations~\cite{DaveDice2006DISC}.
Unfortunately, the locking-unfriendly implementations have some
attractive properties, including low overhead for successful
transactions and the ability to accommodate extremely large
transactions.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby accommodating the limitations of
locking-friendly TM implementations.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that all the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group and
by a great many transactional lock elision
projects~\cite{MartinPohlack2011HTM2TLE,Kleen:2014:SEL:2566590.2576793,PascalFelber2016rwlockElision,SeongJaePark2020HTMRCUlock}.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
\item Strive to reduce the overhead imposed on locking primitives.
\end{enumerate}
The fact that there could possibly be a problem interfacing TM and locking
came as a surprise to many, which underscores the need to try out new
mechanisms and primitives in real-world production software.
Fortunately, the advent of open source means that a huge quantity of
such software is now freely available to everyone, including researchers.
\subsubsection{Reader-Writer Locking}
\label{sec:future:Reader-Writer Locking}
It is commonplace to read-acquire reader-writer locks while holding
other locks, which just works, at least as long as the usual well-known
software-engineering techniques are employed to avoid deadlock.
Read-acquiring reader-writer locks from within RCU read-side critical
sections also works, and doing so eases deadlock concerns because RCU
read-side primitives cannot participate in lock-based deadlock cycles.
It is also possible to acquire locks while holding hazard pointers and
within sequence-lock read-side critical sections.
But what happens when you attempt to read-acquire a reader-writer lock
from within a transaction?
Unfortunately, the straightforward approach to read-acquiring the
traditional counter-based reader-writer lock within a transaction defeats
the purpose of the reader-writer lock.
To see this, consider a pair of transactions concurrently attempting to
read-acquire the same reader-writer lock.
Because read-acquisition involves modifying the reader-writer lock's
data structures, a conflict will result, which will roll back one of
the two transactions.
This behavior is completely inconsistent with the reader-writer lock's
goal of allowing concurrent readers.
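The conflict described above can be made concrete with a hedged sketch of a counter-based reader-writer lock. This is not a real TM or kernel API; the point is only that the store to \co{->readers} lands in the write set of any enclosing transaction, so two transactions concurrently read-acquiring the same lock conflict and one of them must roll back.

```c
#include <assert.h>

/* Minimal counter-based reader-writer lock sketch, showing that
 * even read-acquisition writes shared memory. */
struct rwlock {
	int readers;	/* count of active readers */
	int writer;	/* nonzero if write-held */
};

static int read_lock(struct rwlock *lp)
{
	if (lp->writer)
		return 0;	/* real code would spin or abort */
	lp->readers++;		/* shared write: the source of conflict */
	return 1;
}

static void read_unlock(struct rwlock *lp)
{
	lp->readers--;		/* another shared write */
}
```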
Here are some options available to TM:
\begin{enumerate}
\item Use per-CPU or per-thread reader-writer
locking~\cite{WilsonCHsieh92a}, which allows a
given CPU (or thread, respectively) to manipulate only local
data when read-acquiring the lock.
This would avoid the conflict between the two transactions
concurrently read-acquiring the lock, permitting both to proceed,
as intended.
Unfortunately, (1)~the write-acquisition overhead of
per-CPU/thread locking can be extremely high, (2)~the memory
overhead of per-CPU/thread locking can be prohibitive, and
(3)~this transformation is available only when you have access to
the source code in question.
Other more-recent scalable
reader-writer locks~\cite{YossiLev2009SNZIrwlock}
might avoid some or all of these problems.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby avoiding read-acquiring reader-writer locks
from within transactions.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that \emph{all} the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group,
and as has been done by more recent work using TM to elide
reader-writer locks~\cite{PascalFelber2016rwlockElision}.
This approach seems sound, at least on \Power{8}
CPUs~\cite{Le:2015:TMS:3266491.3266500}, but leaves the locking
design constraints (such as the need to avoid deadlock) firmly
in place.
\end{enumerate}
Of course, there might well be other non-obvious issues surrounding
combining TM with reader-writer locking, as there in fact were with
exclusive locking.
\subsubsection{Deferred Reclamation}
\label{sec:future:Deferred Reclamation}
This section focuses mainly on RCU\@.
Similar issues and possible resolutions arise when combining TM with
other deferred-reclamation mechanisms such as
\IXalt{reference counters}{reference count} and
hazard pointers.
In the text below, known differences are specifically called out.
Reference counting, hazard pointers, and RCU are all heavily used, as noted in
\cref{sec:defer:RCU Related Work,sec:defer:Which to Choose? (Production Use)}.
This means that any TM implementation that chooses not to surmount each
and every challenge called out in this section needs to interoperate
cleanly and efficiently with all of these synchronization mechanisms.
The TxLinux group from the University of Texas at Austin appears to
have been the first group to take on the challenge of RCU/TM
interoperability~\cite{ChistopherJRossbach2007a}.
Because they applied TM to the Linux 2.6 kernel, which uses RCU, they
had no choice but to integrate TM and RCU, with TM taking the place of
locking for RCU updates.
Unfortunately, although the paper does state that the RCU implementation's
locks (e.g., \co{rcu_ctrlblk.lock}) were converted to transactions,
it is silent about what was done with those locks used by RCU-based updates
(for example, \co{dcache_lock}).
More recently, \ppl{Dimitrios}{Siakavaras} et al.~have applied
HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees},
\ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color
graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring},
and
\ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention
locking on \IXacr{numa} systems~\cite{SeongJaePark2020HTMRCUlock}.
It is important to note that RCU permits readers and updaters to run
concurrently, further permitting RCU readers to access data that is in
the act of being updated.
Of course, this property of RCU, whatever its performance, scalability,
and real-time-response benefits might be, flies in the face of the
underlying atomicity properties of TM, although the \Power{8} CPU family's
suspended-transaction facility~\cite{Le:2015:TMS:3266491.3266500} makes
it an exception to this rule.
So how should TM-based updates interact with concurrent RCU readers?
Some possibilities are as follows:
\begin{enumerate}
\item RCU readers abort concurrent conflicting TM updates.
This is in fact the approach taken by the TxLinux project.
This approach does preserve RCU semantics, and also preserves
RCU's read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of
unnecessarily aborting conflicting updates.
In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result
in system hangs.
In addition, not all TM implementations offer the strong atomicity
required to implement this approach, and for good reasons.
\item RCU readers that run concurrently with conflicting TM updates
get old (pre-transaction) values from any conflicting RCU loads.
This preserves RCU semantics and performance, and also prevents
RCU-update \IX{starvation}.
However, not all TM implementations can provide timely access
to old values of variables that have been tentatively updated
by an in-flight transaction.
In particular, log-based TM implementations that maintain
old values in the log (thus providing excellent TM commit
performance) are not likely to be happy with this approach.
Perhaps the \co{rcu_dereference()} primitive can be leveraged
to permit RCU to access the old values within a greater range
of TM implementations, though performance might still be an issue.
Nevertheless, there are popular TM implementations that have
been integrated with RCU in this
manner~\cite{DonaldEPorter2007TRANSACT,PhilHoward2011RCUTMRBTree,
PhilipWHoward2013RCUrbtree}.
\item If an RCU reader executes an access that conflicts with an
in-flight transaction, then that RCU access is delayed until
the conflicting transaction either commits or aborts.
This approach preserves RCU semantics, but not RCU's performance
or real-time response, particularly in the presence of long-running
transactions.
In addition, not all TM implementations are capable of delaying
conflicting accesses.
Nevertheless, this approach seems eminently reasonable for hardware
TM implementations that support only small transactions.
\item RCU readers are converted to transactions.
This approach pretty much guarantees that RCU is compatible with
any TM implementation, but it also imposes TM's rollbacks on RCU
read-side critical sections, destroying RCU's real-time response
guarantees, and also degrading RCU's read-side performance.
Furthermore, this approach is infeasible in cases where any of
the RCU read-side critical sections contains operations that
the TM implementation in question is incapable of handling.
This approach is more difficult to apply to hazard pointers and
reference counters, which do not have a sharply defined notion
of a reader as a section of code.
\item Many update-side uses of RCU modify a single pointer to publish
a new data structure.
In some of these cases, RCU can safely be permitted to see a
transactional pointer update that is subsequently rolled back,
as long as the transaction respects memory ordering and as long
as the roll-back process uses \co{call_rcu()} to free up the
corresponding structure.
Unfortunately, not all TM implementations respect memory barriers
within a transaction.
Apparently, the thought is that because transactions are supposed
to be atomic, the ordering of the accesses within the transaction
is not supposed to matter.
\item Prohibit use of TM in RCU updates.
This is guaranteed to work, but restricts use of TM.
\end{enumerate}
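The single-pointer-publish case in the fifth option above can be sketched in C11, with \co{rcu_publish()} and the \co{deferred} queue as hypothetical stand-ins for \co{rcu_assign_pointer()} and \co{call_rcu()}; real code would of course use the actual RCU API, and this single-threaded sketch shows only the publish-then-roll-back sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
	int data;
};

static _Atomic(struct node *) gp;	/* RCU-protected pointer */
static struct node *deferred;		/* stand-in for the call_rcu() queue */

static void rcu_publish(struct node *newp)
{
	/* Release store orders initialization before publication,
	 * as rcu_assign_pointer() does. */
	atomic_store_explicit(&gp, newp, memory_order_release);
}

/* Roll back a publication: restore the old pointer, then defer
 * freeing the abandoned structure until readers that might have
 * seen it are done (here, merely queued). */
static void rollback(struct node *oldp, struct node *newp)
{
	atomic_store_explicit(&gp, oldp, memory_order_release);
	deferred = newp;	/* real code: call_rcu(&newp->rh, free_cb) */
}
```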
It seems likely that additional approaches will be uncovered, especially
given the advent of user-level RCU and hazard-pointer
implementations.\footnote{
Kudos to the TxLinux group, Maged Michael, and Josh Triplett
for coming up with a number of the above alternatives.}
It is interesting to note that many of the better performing and
scaling STM implementations make use of RCU-like techniques
internally~\cite{UCAM-CL-TR-579,KeirFraser2007withoutLocks,Gu:2019:PSE:3358807.3358885,Kim:2019:MSR:3297858.3304040}.
\QuickQuiz{
MV-RLU looks pretty good!
Doesn't it beat RCU hands down?
}\QuickQuizAnswer{
One might get that impression from a quick read of the abstract,
but more careful readers will notice the ``for a wide range of
workloads'' phrase in the last sentence.
It turns out that this phrase is quite important:
\begin{enumerate}
\item Their RCU evaluation uses synchronous grace periods, which
needlessly throttle updates, as noted in their
Section~6.2.1.
See \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
\cpageref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
of this book to see that the venerable asynchronous
\co{call_rcu()} primitive enables RCU to perform and
scale quite well with large numbers of updaters.
Furthermore, in Section~3.7 of their paper, the authors
admit that asynchronous grace periods are important to
MV-RLU scalability.
A fair comparison would also allow RCU the benefits of
asynchrony.
\item They use a poorly tuned 1,000-bucket hash table containing
10,000~elements.
In addition, their 448~hardware threads need considerably
more than 1,000~buckets to avoid the lock contention
that they correctly state limits RCU performance in
their benchmarks.
A useful comparison would feature a properly tuned
hash table.
\item Their RCU hash table used per-bucket locks, which they
call out as a bottleneck, which is not a surprise given
the long hash chains and small ratio of buckets to threads.
A number of their competing mechanisms instead use
lockfree techniques, thus avoiding the per-bucket-lock
bottleneck, which cynics might claim sheds some light
on the authors' otherwise inexplicable choice of poorly
tuned hash tables.
The first graph in the middle row of the authors'
Figure~4 shows what RCU can achieve if not hobbled by
artificial bottlenecks, as does the first portion of
the second graph in that same row.
\item Their linked-list operation permits RLU to do concurrent
modifications of different elements in the list, while
RCU is forced to serialize updates.
Again, RCU has always worked just fine in conjunction
with lockless updaters, a fact that has been set forth
in academic literature that the authors
cited~\cite{MathieuDesnoyers2012URCU}.
A fair comparison would use the same style of update
for RCU as it does for MV-RLU.
\item The authors fail to consider combining RCU and sequence
locking, which is used in the Linux kernel to give
readers coherent views of multi-pointer updates.
\item The authors fail to consider RCU-based solutions to the
Issaquah Challenge~\cite{PaulEMcKenney2016IssaquahCPPCON},
which also gives readers a coherent view of multi-pointer
updates, albeit with a weaker view of ``coherent''.
\end{enumerate}
It is surprising that the anonymous reviewers of this paper did
not demand an apples-to-apples comparison of MV-RLU and RCU\@.
Nevertheless, the authors should be congratulated on producing
an academic paper that presents an all-too-rare example of good
scalability combined with strong read-side coherence.
They are also to be congratulated on overcoming the traditional
academic prejudice against asynchronous grace periods,
which greatly aided their scalability.
Interestingly enough, RLU and RCU take different approaches to avoid
the inherent limitations of STM noted by \ppl{Hagit}{Attiya} et
al.~\cite{Attiya:2009:STMReadOnlyLimits}.
RCU avoids providing strict serializability and RLU avoids providing
invisible read-only transactions, both thus avoiding the
limitations.
}\QuickQuizEnd
\subsubsection{Extra-Transactional Accesses}
\label{sec:future:Extra-Transactional Accesses}
Within a lock-based critical section, it is perfectly legal to manipulate
variables that are concurrently accessed or even modified outside that
lock's critical section, with one common example being statistical
counters.
The same thing is possible within RCU read-side critical
sections, and is in fact the common case.
Given mechanisms such as the so-called ``dirty reads'' that are
prevalent in production database systems, it is not surprising
that extra-transactional accesses have received serious attention
from the proponents of TM, with the concept of weak
atomicity~\cite{Blundell2006TMdeadlock} being but one case in point.
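The statistical-counter example above can be sketched as follows, with per-thread counters incremented locklessly and summed approximately; \co{NTHREADS} and the array layout are illustrative only. Under strong atomicity, such an extra-transactional increment would conflict with any transaction reading the sum; under weak atomicity it would not.

```c
#include <assert.h>

/* Illustrative statistical counter: each thread increments only its
 * own counter, and readers sum all counters without any lock,
 * tolerating a slightly stale total. */
#define NTHREADS 4

static unsigned long counter[NTHREADS];

static void inc_count(int tid)	/* lockless, extra-transactional */
{
	counter[tid]++;
}

static unsigned long read_count(void)	/* approximate sum */
{
	unsigned long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += counter[i];
	return sum;
}
```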
Here are some extra-transactional options:
\begin{enumerate}
\item Conflicts due to extra-transactional accesses always abort
transactions.
This is strong atomicity.
\item Conflicts due to extra-transactional accesses are ignored,
so only conflicts among transactions can abort transactions.
This is weak atomicity.
\item Transactions are permitted to carry out non-transactional
operations in special cases, such as when allocating memory or
interacting with lock-based critical sections.
\item Produce hardware extensions that permit some operations
(for example, addition) to be carried out concurrently on a
single variable by multiple transactions.
\item Introduce weak semantics to transactional memory.
One approach is the combination with RCU described in
\cref{sec:future:Deferred Reclamation},
while Gramoli and Guerraoui
survey a number of other weak-transaction
approaches~\cite{Gramoli:2014:DTP:2541883.2541900}, for example,
restricted partitioning of large
``elastic'' transactions into smaller transactions, thus
reducing conflict probabilities (albeit with tepid performance
and scalability).
Perhaps further experience will show that some uses of
extra-transactional accesses can be replaced by weak
transactions.
\end{enumerate}
It appears that transactions were conceived in a vacuum, with no
interaction required with any other synchronization mechanism.
If so, it is no surprise that much confusion and complexity arises when
combining transactions with non-transactional accesses.
But unless transactions are to be confined to small updates to isolated
data structures, or alternatively to be confined to new programs
that do not interact with the huge body of existing parallel code,
then transactions absolutely must be so combined if they are to have
large-scale practical impact in the near term.
\subsection{Other Transactions}
\label{sec:future:Other Transactions}
% If we ever get epigraphs down at this level, perhaps the old
% "Hell is other people" quote by Jean-Paul Sartre. His clarification
% also applies, which can be stated as "Heaven is also other people".
Because transactions are to appear atomic with respect to each other,
something must be done when multiple transactions attempt to access the
same variable at the same time.
If all the transactions are reading that variable and none of them are
updating it, then that something can be ``nothing'', but otherwise
the transactions must be at least partially serialized.
This serialization is often achieved by aborting and rolling back some
subset of those transactions.
How to choose which to abort and roll back?
There has been much work on this question.
The answer is ``use a contention manager'', but that simply shifts this
question to ``what should a contention manager do?''
A full exposition on contention managers is beyond the scope of this section.
This section will therefore limit itself to a few seminal papers
in this area.
Herlihy et al.~\cite{HerlihyLMS03}
have writing transactions unconditionally abort reading transactions,
which has the virtue of simplicity, but which can result in reader
starvation.
This paper also introduces the concept of early release, which can
allow a long traversal to proceed concurrently with small updates.
From this viewpoint, one might characterize RCU readers as ``instant
release'', but this is a property that causes RCU readers to behave in
a decidedly non-STM manner.
Hammond et al.~\cite{LanceHammond2004a}
advise using small transactions, which does reduce contention-manager
complexity, but which also fails to deliver on the full STM promise
of automated concurrency.
This paper also introduced annotations to identify different types of
transactions.
Scherer and Scott~\cite{WilliamNSchererIII2005}
describe a number of STM contention-management strategies.
This paper defines ``visible'' and ``invisible'' readers for
extra-transactional accesses, and discusses a number of contention-management
strategies, including:
\begin{enumerate}
\item Exponential backoff, which they conclude required manual
tuning.
\item Preferentially abort transactions that have done the least work,
where the work of an aborted prior attempt to complete a
transaction is counted for the next retry of that transaction.
This has nice fairness properties, but can cause huge transactions
to hog the system for an extended time.
\item Preferentially abort transactions that have done the least work,
but also credit a given transaction with the work done by any
other transactions blocked waiting for that transaction to complete.
This adds a form of priority inheritance to the mix.
\item Track how many times a given transaction has been aborted by
another transaction as another way to promote some notion of
fairness.
\item Use transaction-start timestamps in order to preferentially abort
transactions that started most recently.
\item Use transaction-start timestamps, but also add an indication
of when a given transaction last ran to prevent preempted
transactions from aborting other transactions running in
higher-priority processes.
\end{enumerate}
This 2005 paper thus gave an early indication of the complexity inherent
in contention management.
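The first strategy listed above, exponential backoff, can be sketched as a retry loop in which each failed attempt doubles the delay up to a cap. Here \co{try_txn()} and the busy-wait loop are hypothetical; real contention managers would also randomize the delay, and the cap embodies exactly the manual tuning that Scherer and Scott note.

```c
#include <assert.h>

static int attempts;	/* instrumentation for this sketch */

static int try_txn(void)	/* stub: fails twice, then succeeds */
{
	return ++attempts >= 3;
}

/* Retry with exponential backoff, capping the delay exponent at
 * max_shift.  Returns the number of backoffs that were needed. */
static int run_with_backoff(int max_shift)
{
	for (int shift = 0; ; shift++) {
		if (try_txn())
			return shift;
		if (shift > max_shift)
			shift = max_shift;	/* cap the delay */
		for (volatile long i = 0; i < (1L << shift); i++)
			continue;	/* busy-wait backoff */
	}
}
```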
Porter and Witchel~\cite{DonaldEPorter2007TRANSACT,Ramadan:2008:DTM:1521747.1521799}
considered loosening STM consistency requirements in order to simplify
contention management.
They noted RCU as a case in which an update will cause concurrent reads
to access old versions of the data.
Contention management is also an issue in HTM, and will be discussed
in \cref{sec:future:Conflict Handling,sec:future:Aborts and Rollbacks}.
\QuickQuiz{
Why not get with the times and apply machine learning to
contention management?
}\QuickQuizAnswer{
Many transactions have sub-microsecond overheads, so it would
be all too easy to burn more CPU on machine learning than was
saved by more efficiently executing transactions.
Machine learning might nevertheless have a role to play in
tuning contention managers to specific workloads.
It would of course be even better to not need the tuning, but
sometimes ``better'' is unattainable.
}\QuickQuizEnd
\subsection{Case Study:
Sequence Locking}
\label{sec:future:Case Study: Sequence Locking}
Sequence locking, described in
\cref{sec:defer:Sequence Locks},
is sometimes thought of as a trivial form of STM\@.
The sequence-locking read-side critical section is normally restricted
to doing loads, and it is retried whenever there is a conflict with a
sequence-locking updater.
Crucially, sequence-locking read-side critical sections do not conflict
with each other.
It is therefore instructive to consider how sequence locking handles
the STM challenges called out in the preceding sections.
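The retry behavior described above can be sketched in C11 atomics, in the style of the Linux kernel's \co{read_seqbegin()}/\co{read_seqretry()}: readers retry whenever the sequence number changed (or was odd, indicating an in-flight writer), and never write shared state, so readers cannot conflict with each other. This single-writer sketch is illustrative only, not the real kernel API.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint seq;
static int protected_data;

static unsigned read_seqbegin_sketch(void)
{
	unsigned s;

	/* An odd sequence number indicates a writer in progress. */
	while ((s = atomic_load_explicit(&seq, memory_order_acquire)) & 1)
		continue;
	return s;
}

static int read_seqretry_sketch(unsigned s)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&seq, memory_order_relaxed) != s;
}

static void write_data(int v)	/* single writer assumed */
{
	atomic_fetch_add_explicit(&seq, 1, memory_order_acq_rel); /* odd */
	protected_data = v;
	atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* even */
}
```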
Sequence locking handles interactions with the outside world
(\cref{sec:future:Outside World}) by simply executing the
TM-unfriendly operation on each pass through its read-side
critical section.
For example, if the critical section doing an I/O write
is retried three times, it will execute that I/O write four
times, once for the original pass through the critical
section and once each for the three retries.
Doing I/O operations within a sequence-locking read-side
critical section might seem a bit unconventional.
It nevertheless makes a good deal of sense to do
an I/O write (\cref{sec:future:I/O Operations})
to a log for historical, debugging, or auditing purposes.
An RPC operation (\cref{sec:future:RPC Operations}) might gather needed data.
A time delay (\cref{sec:future:Time Delays}) might serve valuable
debugging purposes such as exercising the retry code path.
Even persistent operations (\cref{sec:future:Persistence}) are potentially
useful, for example, to carry out cross-application mutual exclusion.
Alternatively, just as with STM, users of sequence locking may choose
to buffer I/O and execute it just after the end of the read-side critical
section.
Sequence locking handles process modifications
(\cref{sec:future:Process Modification})
in the same way, by simply executing the process-modification operation
on each pass through its read-side critical section.
Creating processes and threads within a critical section
(\cref{sec:future:Multithreaded Transactions})
operates normally, though care should be taken in order to prevent
an often-retried critical section from becoming a ``fork bomb''.
The \co{exec()} system call (\cref{sec:future:The exec System Call})
has perfectly reasonable (if rather drastic) semantics.
Calls to functions within dynamic libraries
(\cref{sec:future:Dynamic Linking and Loading}),
such as Linux-kernel loadable modules, work just as well as do calls
to built-in functions.
Of course, if such a call were to cause a module to load, this might result
in a high retry probability on the first pass through the critical
section.
Remapping memory (\cref{sec:future:Memory-Mapping Operations})
behaves intuitively, at least assuming that neither the code making
up the critical section nor the sequence lock itself is remapped.
Note that these same restrictions also apply to normal locks.
Debuggers (\cref{sec:future:Debugging})
work normally within sequence-locking read-side critical sections,
just as they do within the critical sections for normal locks.
Synchronization primitives (\cref{sec:future:Synchronization})
can be used freely within sequence-locking read-side critical sections.
Locking (\cref{sec:future:Locking}),
reader-writer locking (\cref{sec:future:Reader-Writer Locking}),
deferred reclamation (\cref{sec:future:Deferred Reclamation}),
and extra-transactional accesses
(\cref{sec:future:Extra-Transactional Accesses})
may all be used within sequence-locking read-side critical sections,
and all give the expected results.
Sequence locking uses a trivial contention manager
(\cref{sec:future:Other Transactions}).
Sequence-locking readers are always rolled back in case of a concurrent
updater (as in Herlihy et al.~\cite{HerlihyLMS03}), sequence-locking
updaters typically delegate their contention management to some form of
mutual exclusion, and sequence-locking readers never conflict, and thus
never need their non-existent contention to be managed.
This rough-and-ready approach means that sequence-locking updaters
can starve readers, and users of sequence locking must therefore take
care to avoid frequent updates.
In short, sequence locking handles these STM challenges quite well.
This might be one reason that sequence locking is heavily used in
production.
However, a key enabler of sequence locking is the pushing of a number
of these challenges back onto the user.
This approach is reasonable because sequence locking has no ambitions
to be the one synchronization mechanism to rule them all.
So perhaps the greatest impediment to widespread STM adoption is its
proponents' overweening desire for full generality.
\subsection{Discussion}
\label{sec:future:Discussion}
The obstacles to universal TM adoption lead to the following
conclusions:
\begin{enumerate}
\item One interesting property of TM is the fact that transactions are
subject to rollback and retry.
This property underlies TM's difficulties with irreversible
operations, including unbuffered I/O, RPCs, memory-mapping
operations, time delays, and the \co{exec()} system call.
This property also has the unfortunate consequence of introducing
all the complexities inherent in the possibility of failure,
often in a developer-visible manner.
\item Another interesting property of TM, noted by
Shpeisman et al.~\cite{TatianaShpeisman2009CppTM}, is that TM
intertwines the synchronization with the data it protects.
This property underlies TM's issues with I/O, memory-mapping
operations, extra-transactional accesses, and debugging
breakpoints.
In contrast, conventional synchronization primitives, including
locking and RCU, maintain a clear separation between the
synchronization primitives and the data that they protect.
\item One of the stated goals of many workers in the TM area is to
ease parallelization of large sequential programs.
As such, individual transactions are commonly expected to
execute serially, which might do much to explain TM's issues
with multithreaded transactions.
\end{enumerate}
\QuickQuiz{
Given things like \co{spin_trylock()}, how does it make any
sense at all to claim that TM introduces the concept of failure???
}\QuickQuizAnswer{
	When using locking, \co{spin_trylock()} is a choice, with the
	corresponding failure-free choice being \co{spin_lock()},
	which is used in the common case:
	The v5.11 Linux kernel contains more than 100 times as many
	calls to \co{spin_lock()} as to \co{spin_trylock()}.
When using TM, the only failure-free choice is the irrevocable
transaction, which is not used in the common case.
In fact, the irrevocable transaction is not even available
in all TM implementations.
}\QuickQuizEnd
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, concentrating on small
transactions where hardware assist potentially provides substantial
advantages over other synchronization primitives, and on small programs
where there is some evidence of increased productivity from a combined
TM-locking approach~\cite{VPankratius2011TMvsLockingProductivity}.
Sun took the small-transaction approach with its Rock research
CPU~\cite{DaveDice2009ASPLOSRockHTM}.
Some TM researchers seem to agree with these two small-is-beautiful
approaches~\cite{JMStone93}, others have much higher hopes for TM, and yet others
hint that high TM aspirations might be TM's worst
enemy~\cite[Section 6]{Attiya:2010:ICT:1835698.1835699}.
It is nonetheless quite possible that TM will be able to take on larger
problems, and this section has listed a few of the issues that must be
resolved if TM is to achieve this lofty goal.
Of course, everyone involved should treat this as a learning experience.
It would seem that TM researchers have a great deal to learn from
practitioners who have successfully built large software systems using
traditional synchronization primitives.
And vice versa.
\QuickQuiz{
	What is there to learn?
Why not just use TM for memory-based data structures and locking
for those rare cases featuring the many silly corner cases listed
in this silly section???
}\QuickQuizAnswer{
The year 2005 just called, and it wants its incandescent TM
marketing hype back.
In the year 2023, TM still has significant proving to do,
even with the advent of HTM, which is covered in the
upcoming
\cref{sec:future:Hardware Transactional Memory}.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-vision}}
\caption{The STM Vision}
\ContributedBy{fig:future:The STM Vision}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-conflict}}
\caption{The STM Reality:
Conflicts}
\ContributedBy{fig:future:The STM Reality: Conflicts}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-nonidempotent}}
\caption{The STM Reality:
Irrevocable Operations}
\ContributedBy{fig:future:The STM Reality: Irrevocable Operations}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-realtime}}
\caption{The STM Reality:
Realtime Response}
\ContributedBy{fig:future:The STM Reality: Realtime Response}{Melissa Broussard}
\end{figure}
But for the moment, the current state of STM
can best be summarized with a series of cartoons.
First,
\cref{fig:future:The STM Vision}
shows the STM vision.
As always, the reality is a bit more nuanced, as fancifully depicted by
\cref{fig:future:The STM Reality: Conflicts,%
fig:future:The STM Reality: Irrevocable Operations,%
fig:future:The STM Reality: Realtime Response}.\footnote{
Recent academic work-in-progress has investigated lock-based STM
systems for real-time use~\cite{JimAnderson2019STMRT,CatherineNemitz2018LockSTMrealtime},
albeit without any performance results, and with some indications
that real-time hybrid STM/HTM systems must choose between fast
common-case performance and worst-case forward-progress
guarantees~\cite{DBLP:journals/corr/AlistarhKKRS14,MartinSchoeberl2010realtimeTM}.}
Less fanciful STM retrospectives are also
available~\cite{JoeDuffy2010RetroTM,JoeDuffy2010RetroTM2}.
Alternatively, one might argue that sequence locking constitutes a
restricted form of STM that is heavily used in practice, as discussed in
\cref{sec:future:Case Study: Sequence Locking}.
Whether or not sequence locking is considered to be the standard bearer
for STM in actual practice, some commercially available hardware supports
restricted variants of HTM, which are addressed in the following section.