% future/tm.tex
\section{Transactional Memory}
\label{sec:future:Transactional Memory}
The idea of using transactions outside of databases goes back many
decades~\cite{DBLomet1977SIGSOFT}, with the key difference between
database and non-database transactions being that non-database transactions
drop the ``D'' in the ``ACID'' properties defining database transactions.
The idea of supporting memory-based transactions, or ``transactional memory''
(TM), in hardware
is more recent~\cite{Herlihy93a}, but unfortunately, support for such
transactions in commodity hardware was not immediately forthcoming,
despite other somewhat similar proposals being put forward~\cite{JMStone93}.
Not long after, Shavit and Touitou proposed a software-only implementation
of transactional memory (STM) that was capable of running on commodity
hardware, give or take memory-ordering issues.
This proposal languished for many years, perhaps due to the fact that
the research community's attention was absorbed by non-blocking
synchronization (see Section~\ref{sec:advsync:Non-Blocking Synchronization}).
But by the turn of the century, TM started receiving
more attention~\cite{Martinez01a,Rajwar01a}, and by the middle of the
decade, the level of interest could only be termed
``incandescent''~\cite{MauriceHerlihy2005-TM-manifesto.pldi,
DanGrossman2007TMGCAnalogy}, despite a few voices of
caution~\cite{Blundell2005DebunkTM,McKenney2007PLOSTM}.
The basic idea behind TM is to execute a section of
code atomically, so that other threads see no intermediate state.
As such, the semantics of TM could be implemented
by simply replacing each transaction with the acquisition and release
of a single recursively acquirable global lock, albeit with abysmal
performance and scalability.
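For example, here is a minimal sketch of this global-lock implementation
of transactional semantics, assuming a hypothetical \co{trans_init()},
\co{begin_trans()}, and \co{end_trans()} API (the \co{begin_trans()} and
\co{end_trans()} names reappear in an example later in this section):

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 pthread_mutex_t trans_lock;
 2
 3 void trans_init(void)  /* run once */
 4 {
 5   pthread_mutexattr_t attr;
 6
 7   pthread_mutexattr_init(&attr);
 8   pthread_mutexattr_settype(&attr,
 9       PTHREAD_MUTEX_RECURSIVE);
10   pthread_mutex_init(&trans_lock, &attr);
11 }
12
13 void begin_trans(void)
14 {
15   pthread_mutex_lock(&trans_lock);
16 }
17
18 void end_trans(void)
19 {
20   pthread_mutex_unlock(&trans_lock);
21 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because only one ``transaction'' can be in flight at a time, other
threads see no intermediate state, which trivially provides atomicity,
but at the cost of fully serialized execution.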
Much of the complexity inherent in TM implementations, whether hardware
or software, lies in efficiently detecting when concurrent transactions
can safely run in parallel.
Because this detection is done dynamically, conflicting transactions
can be aborted or ``rolled back'', and in some implementations, this
failure mode is visible to the programmer.
Because transaction roll-back is increasingly unlikely as transaction
size decreases, TM might become quite attractive for small memory-based
operations,
such as linked-list manipulations used for stacks, queues, hash tables,
and search trees.
However, it is currently much more difficult to make the case for large
transactions, particularly those containing non-memory operations such
as I/O and process creation.
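For example, a sketch of one such small memory-only transaction, written
using GCC's \co{-fgnu-tm} extension (the \co{__transaction_atomic}
keyword), might push onto and pop from a linked stack as follows; the
\co{node} structure and \co{top} pointer are illustrative assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct node {
 2   struct node *next;
 3   int val;
 4 };
 5 struct node *top;
 6
 7 void push(struct node *p)
 8 {
 9   __transaction_atomic {
10     p->next = top;
11     top = p;
12   }
13 }
14
15 struct node *pop(void)
16 {
17   struct node *p;
18
19   __transaction_atomic {
20     p = top;
21     if (p)
22       top = p->next;
23   }
24   return p;
25 }
\end{verbatim}
\end{minipage}
\vspace{5pt}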
The following sections look at current challenges to the grand vision of
``Transactional Memory Everywhere''~\cite{PaulEMcKenney2009TMeverywhere}.
Section~\ref{sec:future:Outside World} examines the challenges faced
when interacting with the outside world,
Section~\ref{sec:future:Process Modification} looks at interactions
with process modification primitives,
Section~\ref{sec:future:Synchronization} explores interactions with
other synchronization primitives, and finally
Section~\ref{sec:future:Discussion} closes with some discussion.
\subsection{Outside World}
\label{sec:future:Outside World}
In the words of Donald Knuth:
\begin{quote}
Many computer users feel that input and output are not actually part
of ``real programming,'' they are merely things that (unfortunately)
must be done in order to get information in and out of the machine.
\end{quote}
Whether we believe that input and output are ``real programming'' or
not, the fact is that for most computer systems, interaction with the
outside world is a first-class requirement.
This section therefore critiques transactional memory's ability to
so interact, whether via I/O operations, time delays, or persistent
storage.
\subsubsection{I/O Operations}
\label{sec:future:I/O Operations}
One can execute I/O operations within a lock-based critical section,
and, at least in principle, from within an RCU read-side critical section.
What happens when you attempt to execute an I/O operation from within
a transaction?
The underlying problem is that transactions may be rolled back, for
example, due to conflicts.
Roughly speaking, this requires that all operations within any given
transaction be revocable, and, because an aborted transaction is
normally retried, that executing an operation twice have the same
effect as executing it once.
Unfortunately, I/O is in general the prototypical irrevocable
operation, making it difficult to include general I/O operations in
transactions:
Once you have pushed the button launching the nuclear warheads, there
is no turning back.
Here are some options for handling of I/O within transactions:
\begin{enumerate}
\item Restrict I/O within transactions to buffered I/O with in-memory
buffers.
These buffers may then be included in the transaction in the
same way that any other memory location might be included.
This seems to be the mechanism of choice, and it does work
well in many common situations, such as stream I/O and
mass-storage I/O (a sketch of this approach follows this list).
However, special handling is required in cases where multiple
record-oriented output streams are merged onto a single file
from multiple processes, as might be done using the ``a+''
option to \co{fopen()} or the \co{O_APPEND} flag to \co{open()}.
In addition, as will be seen in the next section, common
networking operations cannot be handled via buffering.
\item Prohibit I/O within transactions, so that any attempt to execute
an I/O operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
This approach seems to be the conventional TM approach for
unbuffered I/O, but requires that TM interoperate with other
synchronization primitives that do tolerate I/O.
\item Prohibit I/O within transactions, but enlist the compiler's aid
in enforcing this prohibition.
\item Permit only one special
\emph{irrevocable} transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed
at any given time, thus allowing irrevocable transactions to
contain I/O operations.\footnote{
In earlier literature, irrevocable transactions are
termed \emph{inevitable} transactions.}
This works in general, but severely limits the scalability and
performance of I/O operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Worse yet, use of irrevocability to tolerate I/O operations
seems to prohibit use of manual transaction-abort operations.\footnote{
This difficulty was pointed out by Michael Factor.}
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Create new hardware and protocols such that I/O operations can
be pulled into the transactional substrate.
In the case of input operations, the hardware would need to
correctly predict the result of the operation, and to abort the
transaction if the prediction failed.
\end{enumerate}
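As a rough sketch of the first of these options, output might be
accumulated in an in-memory buffer that participates in the transaction,
with the irrevocable \co{write()} deferred until after a successful
commit.
The \co{txbuf} structure is an illustrative assumption, and the
\co{begin_trans()}/\co{end_trans()} notation matches that used elsewhere
in this section:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct txbuf {
 2   size_t len;
 3   char data[4096];
 4 };
 5
 6 void tx_write(struct txbuf *b,
 7               const void *p, size_t n)
 8 {
 9   /* Transactional: memory only.  */
10   /* Assumes n fits in the buffer. */
11   memcpy(&b->data[b->len], p, n);
12   b->len += n;
13 }
14
15 void example(struct txbuf *b, int fd)
16 {
17   begin_trans();
18   tx_write(b, "hello\n", 6);
19   end_trans();
20   /* Committed: irrevocable I/O is safe. */
21   write(fd, b->data, b->len);
22   b->len = 0;
23 }
\end{verbatim}
\end{minipage}
\vspace{5pt}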
I/O operations are a well-known weakness of TM, and it is not clear
that the problem of supporting I/O in transactions has a reasonable
general solution, at least if ``reasonable'' is to include usable
performance and scalability.
Nevertheless, continued time and attention to this problem will likely
produce additional progress.
\subsubsection{RPC Operations}
\label{sec:future:RPC Operations}
One can execute RPCs within a lock-based critical section, as well as
from within an RCU read-side critical section. What happens when you
attempt to execute an RPC from within a transaction?
If both the RPC request and its response are to be contained within the
transaction, and if some part of the transaction depends on the result
returned by the response, then it is not possible to use the memory-buffer
tricks that can be used in the case of buffered I/O.
Any attempt to
take this buffering approach would deadlock the transaction, as the
request could not be transmitted until the transaction was guaranteed
to succeed, but the transaction's success might not be knowable until
after the response is received, as is the case in the following example:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 begin_trans();
2 rpc_request();
3 i = rpc_response();
4 a[i]++;
5 end_trans();
\end{verbatim}
\end{minipage}
\vspace{5pt}
The transaction's memory footprint cannot be determined until after the
RPC response is received, and until the transaction's memory footprint
can be determined, it is impossible to determine whether the transaction
can be allowed to commit.
The only action consistent with transactional semantics is therefore to
unconditionally abort the transaction, which is, to say the least,
unhelpful.
Here are some options available to TM:
\begin{enumerate}
\item Prohibit RPC within transactions, so that any attempt to execute
an RPC operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
Alternatively, enlist the compiler to enforce RPC-free
transactions.
This approach does work, but will require TM to
interact with other synchronization primitives.
\item Permit only one special
irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed at any given time, thus allowing irrevocable
transactions to contain RPC operations.
This works in general, but severely limits the scalability and
performance of RPC operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Furthermore, use of irrevocable transactions to permit RPC
operations rules out manual transaction-abort operations
once the RPC operation has started.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Identify special cases where the success of the transaction may
be determined before the RPC response is received, and
automatically convert these to irrevocable transactions immediately
before sending the RPC request.
Of course, if several concurrent transactions attempt RPC calls
in this manner, it might be necessary to roll all but one of them
back, with consequent degradation of performance and scalability.
This approach nevertheless might be valuable given long-running
transactions ending with an RPC.
This approach still has problems with manual transaction-abort
operations.
\item Identify special cases where the RPC response may be moved out
of the transaction, and then proceed using techniques similar
to those used for buffered I/O.
\item Extend the transactional substrate to include the RPC server as
well as its client.
This is in theory possible, as has been demonstrated by
distributed databases.
However, it is unclear whether the requisite performance and
scalability requirements can be met by distributed-database
techniques, given that memory-based TM cannot hide such latencies
behind those of slow disk drives.
Of course, given the advent of solid-state disks, it is also unclear
how much longer databases will be permitted to hide their latencies
behind those of disk drives.
\end{enumerate}
As noted in the prior section, I/O is a known weakness of TM, and RPC
is simply an especially problematic case of I/O.
\subsubsection{Time Delays}
\label{sec:future:Time Delays}
An important special case of interaction with extra-transactional accesses
involves explicit time delays within a transaction.
Of course, the idea of a time delay within a transaction flies in the
face of TM's atomicity property, but one can argue that this sort of
thing is what weak atomicity is all about.
Furthermore, correct interaction with memory-mapped I/O sometimes requires
carefully controlled timing, and applications often use time delays
for varied purposes.
So, what can TM do about time delays within transactions?
\begin{enumerate}
\item Ignore time delays within transactions.
This has an appearance of elegance, but like too many other
``elegant'' solutions, fails to survive first contact with
legacy code.
Such code, which might well have important time delays in critical
sections, would fail upon being transactionalized.
\item Abort transactions upon encountering a time-delay operation.
This is attractive, but it is unfortunately not always possible
to automatically detect a time-delay operation.
Is that tight loop computing something important, or is it
instead waiting for time to elapse?
\item Enlist the compiler to prohibit time delays within transactions.
\item Let the time delays execute normally.
Unfortunately, some TM implementations publish modifications only
at commit time, which would in many cases defeat the purpose of
the time delay.
\end{enumerate}
It is not clear that there is a single correct answer.
TM implementations featuring weak atomicity that publish changes
immediately within the transaction (rolling these changes back upon abort)
might be reasonably well served by the last alternative.
Even in this case, the code (or possibly even hardware) at the other
end of the transaction may require a substantial redesign to tolerate
aborted transactions.
This need for redesign would make it more difficult to apply transactional
memory to legacy code.
\subsubsection{Persistence}
\label{sec:future:Persistence}
There are many different types of locking primitives.
One interesting distinction is persistence, in other words, whether the
lock can exist independently of the address space of the process using
the lock.
Non-persistent locks include \co{pthread_mutex_lock()},
\co{pthread_rwlock_rdlock()}, and most kernel-level locking primitives.
If the memory locations instantiating a non-persistent lock's data
structures disappear, so does the lock.
For typical use of \co{pthread_mutex_lock()}, this means that when the
process exits, all of its locks vanish.
This property can be exploited in order to trivialize lock cleanup
at program shutdown time, but makes it more difficult for unrelated
applications to share locks, as such sharing requires the applications
to share memory.
Persistent locks help avoid the need to share memory among unrelated
applications.
Persistent locking APIs include the flock family, \co{lockf()}, System
V semaphores, and the \co{O_CREAT} flag to \co{open()}.
These persistent APIs can be used to protect large-scale operations
spanning runs of multiple applications, and, in the case of
\co{O_CREAT}, even surviving operating-system reboot.
If need be, locks can even span multiple computer systems via distributed
lock managers and distributed filesystems---and persist across reboots
of any or all of these computer systems.
Persistent locks can be used by any application, including applications
written using multiple languages and software environments.
In fact, a persistent lock might well be acquired by an application written
in C and released by an application written in Python.
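For reference, here is a minimal sketch of an \co{O_CREAT}-based
persistent lock; the lock-file path is an illustrative assumption, and
any process, written in any language, may remove the file to release
the lock:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 int persistent_trylock(const char *path)
 2 {
 3   int fd;
 4
 5   /* Fails if the file already exists. */
 6   fd = open(path,
 7             O_CREAT | O_EXCL | O_RDWR,
 8             0644);
 9   if (fd < 0)
10     return -1; /* lock already held */
11   close(fd);   /* lock file persists */
12   return 0;
13 }
14
15 void persistent_unlock(const char *path)
16 {
17   unlink(path);
18 }
\end{verbatim}
\end{minipage}
\vspace{5pt}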
How could a similar persistent functionality be provided for TM?
\begin{enumerate}
\item Restrict persistent transactions to special-purpose environments
designed to support them, for example, SQL.
This clearly works, given the decades-long history of database
systems, but does not provide the same degree of flexibility
provided by persistent locks.
\item Use snapshot facilities provided by some storage devices and/or
filesystems.
Unfortunately, this does not handle network communication,
nor does it handle I/O to devices that do not provide snapshot
capabilities, for example, memory sticks.
\item Build a time machine.
\end{enumerate}
Of course, the fact that it is called transactional \emph{memory}
should give us pause, as the name itself conflicts with the concept of
a persistent transaction.
It is nevertheless worthwhile to consider this possibility as an important
test case probing the inherent limitations of transactional memory.
\subsection{Process Modification}
\label{sec:future:Process Modification}
Processes are not eternal:
They are created and destroyed, their memory mappings are modified,
they are linked to dynamic libraries, and they are debugged.
These sections look at how transactional memory can handle an
ever-changing execution environment.
\subsubsection{Multithreaded Transactions}
\label{sec:future:Multithreaded Transactions}
It is perfectly legal to create processes and threads while holding
a lock or, for that matter, from within an RCU read-side critical
section.
Not only is it legal, but it is quite simple, as can be seen from the
following code fragment:
\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
1 pthread_mutex_lock(...);
2 for (i = 0; i < ncpus; i++)
3 pthread_create(&tid[i], ...);
4 for (i = 0; i < ncpus; i++)
5 pthread_join(tid[i], ...);
6 pthread_mutex_unlock(...);
\end{verbatim}
\end{minipage}
\vspace{5pt}
This pseudo-code fragment uses \co{pthread_create()} to spawn one thread
per CPU, then uses \co{pthread_join()} to wait for each to complete,
all under the protection of \co{pthread_mutex_lock()}.
The effect is to execute a lock-based critical section in parallel,
and one could obtain a similar effect using \co{fork()} and \co{wait()}.
Of course, the critical section would need to be quite large to justify
the thread-spawning overhead, but there are many examples of large
critical sections in production software.
What might TM do about thread spawning within a transaction?
\begin{enumerate}
\item Declare \co{pthread_create()} to be illegal within transactions,
resulting in transaction abort (preferred) or undefined
behavior. Alternatively, enlist the compiler to enforce
\co{pthread_create()}-free transactions.
\item Permit \co{pthread_create()} to be executed within a
transaction, but only the parent thread will be considered to
be part of the transaction.
This approach seems to be reasonably compatible with existing and
posited TM implementations, but seems to be a trap for the unwary.
This approach raises further questions, such as how to handle
conflicting child-thread accesses.
\item Convert the \co{pthread_create()}s to function calls.
This approach is also an attractive nuisance, as it does not
handle the not-uncommon cases where the child threads communicate
with one another.
In addition, it does not permit parallel execution of the body
of the transaction.
\item Extend the transaction to cover the parent and all child threads.
This approach raises interesting questions about the nature of
conflicting accesses, given that the parent and children are
presumably permitted to conflict with each other, but not with
other threads.
It also raises interesting questions as to what should happen
if the parent thread does not wait for its children before
committing the transaction.
Even more interesting, what happens if the parent conditionally
executes \co{pthread_join()} based on the values of variables
participating in the transaction?
The answers to these questions are reasonably straightforward
in the case of locking.
The answers for TM are left as an exercise for the reader.
\end{enumerate}
Given that parallel execution of transactions is commonplace in the
database world, it is perhaps surprising that current TM proposals do
not provide for it.
On the other hand, the example above is a fairly sophisticated use
of locking that is not normally found in simple textbook examples,
so perhaps its omission is to be expected.
That said, there are rumors that some TM researchers are investigating
fork/join parallelism within transactions, so perhaps this topic will
soon be addressed more thoroughly.
\subsubsection{The \tco{exec()} System Call}
\label{sec:future:The exec System Call}
One can execute an \co{exec()} system call while holding a lock, and
also from within an RCU read-side critical section.
The exact semantics depends on the type of primitive.
In the case of non-persistent primitives (including
\co{pthread_mutex_lock()}, \co{pthread_rwlock_rdlock()}, and RCU),
if the \co{exec()} succeeds, the whole address space vanishes, along
with any locks being held.
Of course, if the \co{exec()} fails, the address space still lives,
so any associated locks would also still live.
A bit strange perhaps, but reasonably well defined.
On the other hand, persistent primitives (including the flock family,
\co{lockf()}, System V semaphores, and the \co{O_CREAT} flag to
\co{open()}) would survive regardless of whether the \co{exec()}
succeeded or failed, so that the \co{exec()}ed program might well
release them.
\QuickQuiz{}
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
\QuickQuizAnswer{
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
} \QuickQuizEnd
What happens when you attempt to execute an \co{exec()} system call
from within a transaction?
\begin{enumerate}
\item Disallow \co{exec()} within transactions, so that the enclosing
transactions abort upon encountering the \co{exec()}.
This is well defined, but clearly requires non-TM synchronization
primitives for use in conjunction with \co{exec()}.
\item Disallow \co{exec()} within transactions, with the compiler
enforcing this prohibition.
There is a draft specification for TM in C++ that takes
this approach, allowing functions to be decorated with
the \co{transaction_safe} and \co{transaction_unsafe}
attributes.\footnote{
Thanks to Mark Moir for pointing me at this spec, and
to Michael Wong for having pointed me at an earlier
revision some time back.}
This approach has some advantages over aborting the transaction
at runtime, but again requires non-TM synchronization primitives
for use in conjunction with \co{exec()}.
\item Treat the transaction in a manner similar to non-persistent
locking primitives, so that the transaction survives if \co{exec()}
fails, and silently commits if the \co{exec()} succeeds.
The case where some of the variables affected by the transaction
reside in \co{mmap()}ed memory (and thus could survive a successful
\co{exec()} system call) is left as an exercise for the reader.
\item Abort the transaction (and the \co{exec()} system call) if the
\co{exec()} system call would have succeeded, but allow the
transaction to continue if the \co{exec()} system call would
fail.
This is in some sense the ``correct'' approach, but it would
require considerable work for a rather unsatisfying result.
\end{enumerate}
The \co{exec()} system call is perhaps the strangest example of an
obstacle to universal TM applicability, as it is not completely clear
what approach makes sense, and some might argue that this is merely a
reflection of the perils of interacting with execs in real life.
That said, the two options prohibiting \co{exec()} within transactions
are perhaps the most logical of the group.
Similar issues surround the \co{exit()} and \co{kill()} system calls.
\subsubsection{Dynamic Linking and Loading}
\label{sec:future:Dynamic Linking and Loading}
Both lock-based critical sections and RCU read-side critical sections
can legitimately contain code that invokes dynamically linked and loaded
functions, including C/C++ shared libraries and Java class libraries.
Of course, the code contained in these libraries is by definition
unknowable at compile time.
So, what happens if a dynamically loaded function is invoked within
a transaction?
This question has two parts: (a)~how do you dynamically link and load a
function within a transaction and (b)~what do you do about the unknowable
nature of the code within this function?
To be fair, item (b) poses some challenges for locking and RCU as well,
at least in theory.
For example, the dynamically linked function might introduce a deadlock
for locking or might (erroneously) introduce a quiescent state into an
RCU read-side critical section.
The difference is that while the class of operations permitted in locking
and RCU critical sections is well-understood, there appears to still be
considerable uncertainty in the case of TM.
In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
Options for part (a), the actual loading of the code, include the following:
\begin{enumerate}
\item Treat the dynamic linking and loading in a manner similar to a
page fault, so that the function is loaded and linked, possibly
aborting the transaction in the process.
If the transaction is aborted, the retry will find the function
already present, and the transaction can thus be expected to
proceed normally.
\item Disallow dynamic linking and loading of functions from within
transactions.
\end{enumerate}
For part (b), the inability to detect TM-unfriendly operations
in a not-yet-loaded function, possibilities include the following:
\begin{enumerate}
\item Just execute the code: if there are any TM-unfriendly operations
in the function, simply abort the transaction.
Unfortunately, this approach makes it impossible for the compiler
to determine whether a given group of transactions may be safely
composed.
One way to permit composability regardless is irrevocable
transactions; however, current implementations permit only a
single irrevocable transaction to proceed at any given time,
which can severely limit performance and scalability.
Irrevocable transactions also seem to rule out use of manual
transaction-abort operations.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Decorate the function declarations indicating which functions
are TM-friendly.
These decorations can then be enforced by the compiler's type
system (a sketch of this approach follows this list).
Of course, for many languages, this requires language extensions
to be proposed, standardized, and implemented, with the
corresponding time delays.
That said, the standardization effort is already in
progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM}.
\item As above, disallow dynamic linking and loading of functions from
within transactions.
\end{enumerate}
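As a sketch of the decoration-based option, GCC's \co{-fgnu-tm}
extension allows C function declarations to be annotated so that the
compiler can reject TM-unfriendly calls at build time; the attribute
names are those of GCC's TM implementation, and the \co{do_math()} and
\co{do_io()} functions are illustrative assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 int do_math(int x)
 2   __attribute__((transaction_safe));
 3 void do_io(const char *s)
 4   __attribute__((transaction_unsafe));
 5
 6 int example(int x)
 7 {
 8   int y;
 9
10   __transaction_atomic {
11     y = do_math(x); /* declared safe */
12     /* Calling do_io() here would be
13        rejected by the compiler. */
14   }
15   return y;
16 }
\end{verbatim}
\end{minipage}
\vspace{5pt}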
I/O operations are of course a known weakness of TM, and dynamic linking
and loading can be thought of as yet another special case of I/O.
Nevertheless, the proponents of TM must either solve this problem, or
resign themselves to a world where TM is but one tool of several in the
parallel programmer's toolbox.
(To be fair, a number of TM proponents have long since resigned themselves
to a world containing more than just TM.)
\subsubsection{Memory-Mapping Operations}
\label{sec:future:Memory-Mapping Operations}
It is perfectly legal to execute memory-mapping operations (including
\co{mmap()}, \co{shmat()}, and \co{munmap()}~\cite{TheOpenGroup1997SUS})
within a lock-based critical section,
and, at least in principle, from within an RCU read-side critical section.
What happens when you attempt to execute such an operation from within
a transaction?
More to the point, what happens if the memory region being remapped
contains some variables participating in the current thread's transaction?
And what if this memory region contains variables participating in some
other thread's transaction?
It should not be necessary to consider cases where the TM system's
metadata is remapped, given that most locking primitives do not define
the outcome of remapping their lock variables.
Here are some memory-mapping options available to TM:
\begin{enumerate}
\item Memory remapping is illegal within a transaction, and will result
in all enclosing transactions being aborted.
This does simplify things somewhat, but also requires that TM
interoperate with synchronization primitives that do tolerate
remapping from within their critical sections.
\item Memory remapping is illegal within a transaction, and the
compiler is enlisted to enforce this prohibition.
\item Memory mapping is legal within a transaction, but aborts all
other transactions having variables in the region mapped over.
\item Memory mapping is legal within a transaction, but the mapping
operation will fail if the region being mapped overlaps with
the current transaction's footprint.
\item All memory-mapping operations, whether within or outside a
transaction, check the region being mapped against the memory
footprint of all transactions in the system.
If there is overlap, then the memory-mapping operation fails.
\item The effect of memory-mapping operations that overlap the memory
footprint of any transaction in the system is determined by the
TM conflict manager, which might dynamically determine whether
to fail the memory-mapping operation or abort any conflicting
transactions.
\end{enumerate}
It is interesting to note that \co{munmap()} leaves the relevant region
of memory unmapped, which could have additional interesting
implications.\footnote{
This difference between mapping and unmapping was noted by
Josh Triplett.}
\subsubsection{Debugging}
\label{sec:future:Debugging}
The usual debugging operations such as breakpoints work normally within
lock-based critical sections and from RCU read-side critical sections.
However, in initial transactional-memory hardware
implementations~\cite{DaveDice2009ASPLOSRockHTM}, an exception within
a transaction will abort that transaction, which in turn means that
breakpoints abort all enclosing transactions.
So how can transactions be debugged?
\begin{enumerate}
\item Use software emulation techniques within transactions containing
breakpoints.
Of course, it might be necessary to emulate all transactions
any time a breakpoint is set within the scope of any transaction,
and if the runtime system is unable to determine whether a
given breakpoint is within such a scope, it might be necessary
to emulate all transactions just to be on the safe side.
However, this approach might impose significant overhead, which
might in turn obscure the bug being pursued.
\item Use only hardware TM implementations that are capable of
handling breakpoint exceptions.
Unfortunately, as of this writing (September 2008), all such
implementations are strictly research prototypes.
\item Use only software TM implementations, which are
(very roughly speaking) more tolerant of exceptions than are
the simpler of the hardware TM implementations.
Of course, software TM tends to have higher overhead than hardware
TM, so this approach may not be acceptable in all situations.
\item Program more carefully, so as to avoid having bugs in the
transactions in the first place.
As soon as you figure out how to do this, please do let everyone
know the secret!
\end{enumerate}
There is some reason to believe that transactional memory will deliver
productivity improvements compared to other synchronization mechanisms,
but it does seem quite possible that these improvements could easily
be lost if traditional debugging techniques cannot be applied to
transactions.
This seems especially true if transactional memory is to be used by
novices on large transactions.
In contrast, macho ``top-gun'' programmers might be able to dispense with
such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity
promises to novice programmers, the debugging problem does need to
be solved.
\subsection{Synchronization}
\label{sec:future:Synchronization}
If transactional memory someday proves that it can be everything to everyone,
it will not need to interact with any other synchronization mechanism.
Until then, it will need to work with synchronization mechanisms that
can do what it cannot, or that work more naturally in a given situation.
The following sections outline the current challenges in this area.
\subsubsection{Locking}
\label{sec:future:Locking}
It is commonplace to acquire locks while holding other locks, which works
quite well, at least as long as the usual well-known software-engineering
techniques are employed to avoid deadlock.
It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives
cannot participate in lock-based deadlock cycles.
But what happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial: simply manipulate the data structure
representing the lock as part of the transaction, and everything works
out perfectly.
In practice, a number of non-obvious complications~\cite{Volos2008TRANSACT}
can arise, depending on implementation details of the TM system.
These complications can be resolved, but at the cost of a 45\% increase in
overhead for locks acquired outside of transactions and a 300\% increase
in overhead for locks acquired within transactions.
Although these overheads might be acceptable for transactional
programs containing small amounts of locking, they are often completely
unacceptable for production-quality lock-based programs wishing to use
the occasional transaction.
Here are some options available to TM:
\begin{enumerate}
\item Use only locking-friendly TM implementations.
Unfortunately, the locking-unfriendly implementations have some
attractive properties, including low overhead for successful
transactions and the ability to accommodate extremely large
transactions.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby accommodating the limitations of
locking-friendly TM implementations.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that all the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
\item Strive to reduce the overhead imposed on locking primitives.
\end{enumerate}
The fact that there could possibly be a problem interfacing TM and locking
came as a surprise to many, which underscores the need to try out new
mechanisms and primitives in real-world production software.
Fortunately, the advent of open source means that a huge quantity of
such software is now freely available to everyone, including researchers.
\subsubsection{Reader-Writer Locking}
\label{sec:future:Reader-Writer Locking}
It is commonplace to read-acquire reader-writer locks while holding
other locks, which just works, at least as long as the usual well-known
software-engineering techniques are employed to avoid deadlock.
Read-acquiring reader-writer locks from within RCU read-side critical
sections also works, and doing so eases deadlock concerns because RCU
read-side primitives cannot participate in lock-based deadlock cycles.
But what happens when you attempt to read-acquire a reader-writer lock
from within a transaction?
Unfortunately, the straightforward approach to read-acquiring the
traditional counter-based reader-writer lock within a transaction defeats
the purpose of the reader-writer lock.
To see this, consider a pair of transactions concurrently attempting to
read-acquire the same reader-writer lock.
Because read-acquisition involves modifying the reader-writer lock's
data structures, a conflict will result, which will roll back one of
the two transactions.
This behavior is completely inconsistent with the reader-writer lock's
goal of allowing concurrent readers.
Here are some options available to TM:
\begin{enumerate}
\item Use per-CPU or per-thread reader-writer
locking~\cite{WilsonCHsieh92a}, which allows a
given CPU (or thread, respectively) to manipulate only local
data when read-acquiring the lock.
This would avoid the conflict between the two transactions
concurrently read-acquiring the lock, permitting both to proceed
as intended (a sketch of this approach follows this list).
Unfortunately, (1)~the write-acquisition overhead of
per-CPU/thread locking can be extremely high, (2)~the memory
overhead of per-CPU/thread locking can be prohibitive, and
(3)~this transformation is available only when you have access to
the source code in question.
Other more-recent scalable
reader-writer locks~\cite{YossiLev2009SNZIrwlock}
might avoid some or all of these problems.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby avoiding read-acquiring reader-writer locks
from within transactions.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that \emph{all} the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
Furthermore, this approach can result in unnecessary transaction
rollbacks when multiple transactions attempt to read-acquire
the same lock.
\end{enumerate}
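Here is a sketch of the per-thread approach from the first of these
options; \co{NR_THREADS} and the caller-supplied \co{tid} are
illustrative assumptions, and initialization of the lock array is
omitted:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 pthread_mutex_t brlock[NR_THREADS];
 2
 3 void br_read_lock(int tid)
 4 {
 5   /* Touches only this thread's lock. */
 6   pthread_mutex_lock(&brlock[tid]);
 7 }
 8
 9 void br_read_unlock(int tid)
10 {
11   pthread_mutex_unlock(&brlock[tid]);
12 }
13
14 void br_write_lock(void)
15 {
16   int i;
17
18   /* Expensive: acquire every lock. */
19   for (i = 0; i < NR_THREADS; i++)
20     pthread_mutex_lock(&brlock[i]);
21 }
22
23 void br_write_unlock(void)
24 {
25   int i;
26
27   for (i = 0; i < NR_THREADS; i++)
28     pthread_mutex_unlock(&brlock[i]);
29 }
\end{verbatim}
\end{minipage}
\vspace{5pt}

Because \co{br_read_lock()} writes only a per-thread location, two
transactions read-acquiring this lock need not conflict with each
other.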
Of course, there might well be other non-obvious issues surrounding
combining TM with reader-writer locking, as there in fact were with
exclusive locking.
\subsubsection{RCU}
\label{sec:future:RCU}
Because read-copy update (RCU) finds its main use in the Linux kernel,
one might be forgiven for assuming that there had been no academic work
on combining RCU and TM.\footnote{
However, the in-kernel excuse is wearing thin with the advent
of user-space RCU~\cite{MathieuDesnoyers2009URCU,MathieuDesnoyers2012URCU}.}
However, the TxLinux group from the University of Texas at Austin had
no choice~\cite{ChistopherJRossbach2007a}.
The fact that they applied TM to the Linux 2.6 kernel, which uses RCU,
forced them to integrate TM and RCU, with TM taking the place of locking
for RCU updates.
Unfortunately, although the paper does state that the RCU implementation's
locks (e.g., \co{rcu_ctrlblk.lock}) were converted to transactions,
it is silent about what happened to locks used in RCU-based updates
(e.g., \co{dcache_lock}).
It is important to note that RCU permits readers and updaters to run
concurrently, further permitting RCU readers to access data that is in
the act of being updated.
Of course, this property of RCU, whatever its performance, scalability,
and real-time-response benefits might be, flies in the face of the
underlying atomicity properties of TM.
So how should TM-based updates interact with concurrent RCU readers?
Some possibilities are as follows:
\begin{enumerate}
\item RCU readers abort concurrent conflicting TM updates.
This is in fact the approach taken by the TxLinux project.
This approach does preserve RCU semantics, and also preserves
RCU's read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of
unnecessarily aborting conflicting updates.
In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result
in system hangs.
In addition, not all TM implementations offer the strong atomicity
required to implement this approach.
\item RCU readers that run concurrently with conflicting TM updates
get old (pre-transaction) values from any conflicting RCU loads.
This preserves RCU semantics and performance, and also prevents
RCU-update starvation.
However, not all TM implementations can provide timely access
to old values of variables that have been tentatively updated
by an in-flight transaction.
In particular, log-based TM implementations that maintain
old values in the log (thus making for excellent TM commit
performance) are not likely to be happy with this approach.
Perhaps the \co{rcu_dereference()} primitive can be leveraged
to permit RCU to access the old values within a greater range
of TM implementations, though performance might still be an issue.
Nevertheless, there are popular TM implementations that can
be easily and efficiently integrated with RCU in this
manner~\cite{DonaldEPorter2007TRANSACT,PhilHoward2011RCUTMRBTree,
PhilipWHoward2013RCUrbtree}.
\item If an RCU reader executes an access that conflicts with an
in-flight transaction, then that RCU access is delayed until
the conflicting transaction either commits or aborts.
This approach preserves RCU semantics, but not RCU's performance
or real-time response, particularly in presence of long-running
transactions.
In addition, not all TM implementations are capable of delaying
conflicting accesses.
That said, this approach seems eminently reasonable for hardware
TM implementations that support only small transactions.
\item RCU readers are converted to transactions.
This approach pretty much guarantees that RCU is compatible with
any TM implementation, but it also imposes TM's rollbacks on RCU
read-side critical sections, destroying RCU's real-time response
guarantees, and also degrading RCU's read-side performance.
Furthermore, this approach is infeasible in cases where any of
the RCU read-side critical sections contains operations that
the TM implementation in question is incapable of handling.
\item Many update-side uses of RCU modify a single pointer to publish
a new data structure.
In some of these cases, RCU can safely be permitted to see a
transactional pointer update that is subsequently rolled back,
as long as the transaction respects memory ordering and as long
as the roll-back process uses \co{call_rcu()} to free up the
corresponding structure (a sketch of this pattern follows this list).
Unfortunately, not all TM implementations respect memory barriers
within a transaction.
Apparently, the thought is that because transactions are supposed
to be atomic, the ordering of the accesses within the transaction
is not supposed to matter.
\item Prohibit use of TM in RCU updates.
This is guaranteed to work, but seems a bit restrictive.
\end{enumerate}
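For concreteness, here is a sketch of the single-pointer publication
pattern from the \co{call_rcu()} option above, written against a
userspace-RCU-style API; \co{struct foo} and its fields are
illustrative assumptions, and \co{gp} is assumed to be
non-\co{NULL}:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 struct foo {
 2   struct rcu_head rh;
 3   int data;
 4 };
 5 struct foo *gp; /* RCU-protected */
 6
 7 void free_foo_cb(struct rcu_head *rhp)
 8 {
 9   free(caa_container_of(rhp,
10                         struct foo, rh));
11 }
12
13 int reader(void)
14 {
15   int d;
16
17   rcu_read_lock();
18   d = rcu_dereference(gp)->data;
19   rcu_read_unlock();
20   return d;
21 }
22
23 void updater(struct foo *newp)
24 {
25   struct foo *oldp = gp;
26
27   rcu_assign_pointer(gp, newp);
28   /* Defer the free until pre-existing
29      readers complete; a rolled-back
30      update would do likewise. */
31   call_rcu(&oldp->rh, free_foo_cb);
32 }
\end{verbatim}
\end{minipage}
\vspace{5pt}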
It seems likely that additional approaches will be uncovered, especially
given the advent of user-level RCU implementations.\footnote{
Kudos to the TxLinux group, Maged Michael, and Josh Triplett
for coming up with a number of the above alternatives.}
\subsubsection{Extra-Transactional Accesses}
\label{sec:future:Extra-Transactional Accesses}
Within a lock-based critical section, it is perfectly legal to manipulate
variables that are concurrently accessed or even modified outside that
lock's critical section, with one common example being statistical
counters.
The same thing is possible within RCU read-side critical
sections, and is in fact the common case.
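For example, a statistical counter might be incremented with no lock
held at all, with a lock-based critical section merely summing a
possibly stale snapshot.
In this sketch, \co{NR_THREADS}, the \co{counterp[]} registry (filled
in at thread-creation time), and \co{__thread} storage are illustrative
assumptions:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\small
\begin{verbatim}
 1 __thread unsigned long counter;
 2 unsigned long *counterp[NR_THREADS];
 3 pthread_mutex_t sum_lock =
 4   PTHREAD_MUTEX_INITIALIZER;
 5
 6 void inc_count(void)
 7 {
 8   counter++; /* no lock needed */
 9 }
10
11 unsigned long read_count(void)
12 {
13   int i;
14   unsigned long sum = 0;
15
16   pthread_mutex_lock(&sum_lock);
17   for (i = 0; i < NR_THREADS; i++)
18     if (counterp[i])
19       sum += *counterp[i];
20   pthread_mutex_unlock(&sum_lock);
21   return sum;
22 }
\end{verbatim}
\end{minipage}
\vspace{5pt}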
Given mechanisms such as the so-called ``dirty reads'' that are
prevalent in production database systems, it is not surprising
that extra-transactional accesses have received serious attention
from the proponents of TM, with the concepts of weak and strong
atomicity~\cite{Blundell2006TMdeadlock} being but one case in point.
Here are some extra-transactional options available to TM:
\begin{enumerate}
\item Conflicts due to extra-transactional accesses always abort
transactions.
This is strong atomicity.
\item Conflicts due to extra-transactional accesses are ignored,
so only conflicts among transactions can abort transactions.
This is weak atomicity.
\item Transactions are permitted to carry out non-transactional
operations in special cases, such as when allocating memory or
interacting with lock-based critical sections.
\item Produce hardware extensions that permit some operations
(for example, addition) to be carried out concurrently on a
single variable by multiple transactions.
\item Introduce weak semantics to transactional memory.
One approach is the combination with RCU described in
Section~\ref{sec:future:RCU}, while Gramoli and Guerraoui
survey a number of other weak-transaction
approaches~\cite{Gramoli:2014:DTP:2541883.2541900}, for example,
restricted partitioning of large
``elastic'' transactions into smaller transactions, thus
reducing conflict probabilities (albeit with tepid performance
and scalability).
Perhaps further experience will show that some uses of
extra-transactional accesses can be replaced by weak
transactions.
\end{enumerate}
It appears that transactions were conceived as standing alone, with no
interaction required with any other synchronization mechanism.
If so, it is no surprise that much confusion and complexity arises when
combining transactions with non-transactional accesses.
But unless transactions are to be confined to small updates to isolated
data structures, or alternatively to new programs
that do not interact with the huge body of existing parallel code,
transactions absolutely must be so combined if they are to have
large-scale practical impact in the near term.
% @@@ Huge transactions. Or perhaps conflict handling.
\subsection{Discussion}
\label{sec:future:Discussion}
The obstacles to universal TM adoption lead to the following
conclusions:
\begin{enumerate}
\item One interesting property of TM is the fact that transactions are
subject to rollback and retry.
This property underlies TM's difficulties with irreversible
operations, including unbuffered I/O, RPCs, memory-mapping
operations, time delays, and the \co{exec()} system call.
This property also has the unfortunate consequence of introducing
all the complexities inherent in the possibility of
failure into synchronization primitives, often in a developer-visible
manner.
\item Another interesting property of TM, noted by
Shpeisman et al.~\cite{TatianaShpeisman2009CppTM}, is that TM
intertwines the synchronization with the data it protects.
This property underlies TM's issues with I/O, memory-mapping
operations, extra-transactional accesses, and debugging
breakpoints.
In contrast, conventional synchronization primitives, including
locking and RCU, maintain a clear separation between the
synchronization primitives and the data that they protect.
\item One of the stated goals of many workers in the TM area is to
ease parallelization of large sequential programs.
As such, individual transactions are commonly expected to
execute serially, which might do much to explain TM's issues
with multithreaded transactions.
\end{enumerate}
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, concentrating on situations
where hardware assist potentially provides substantial advantages over
other synchronization primitives.
This is in fact the approach Sun took with its Rock research
CPU~\cite{DaveDice2009ASPLOSRockHTM}.
Some TM researchers seem to agree with this approach, while others have
much higher hopes for TM.
Of course, it is quite possible that TM will be able to take on larger
problems, and this section lists a few of the issues that
must be resolved if TM is to achieve this lofty goal.
Of course, everyone involved should treat this as a learning experience.
It would seem that TM researchers have a great deal to learn from
practitioners who have successfully built large software systems using
traditional synchronization primitives.
And vice versa.
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-vision}}
\caption{The STM Vision}
\ContributedBy{Figure}{fig:future:The STM Vision}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-conflict}}
\caption{The STM Reality: Conflicts}
\ContributedBy{Figure}{fig:future:The STM Reality: Conflicts}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-nonidempotent}}
\caption{The STM Reality: Irrevocable Operations}
\ContributedBy{Figure}{fig:future:The STM Reality: Irrevocable Operations}{Melissa Broussard}
\end{figure}
\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-realtime}}
\caption{The STM Reality: Realtime Response}
\ContributedBy{Figure}{fig:future:The STM Reality: Realtime Response}{Melissa Broussard}
\end{figure}
But for the moment, the current state of STM
can best be summarized with a series of cartoons.
First,
Figure~\ref{fig:future:The STM Vision}
shows the STM vision.
As always, the reality is a bit more nuanced, as fancifully depicted by
Figures~\ref{fig:future:The STM Reality: Conflicts},
\ref{fig:future:The STM Reality: Irrevocable Operations},
and~\ref{fig:future:The STM Reality: Realtime Response}.
Less fanciful STM retrospectives are also
available~\cite{JoeDuffy2010RetroTM,JoeDuffy2010RetroTM2}.
Recent advances in commercially available hardware have opened the door
for variants of HTM, which are addressed in the following section.
% @@@ Don Porter's user-level TM work for Linux.