% future/tm.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Transactional Memory}
\label{sec:future:Transactional Memory}
\epigraph{Everything should be as simple as it can be, but not simpler.}
{Albert Einstein, by way of Louis Zukofsky}
The idea of using transactions outside of databases goes back many
decades~\cite{DBLomet1977SIGSOFT,Knight:1986:AMF:319838.319854,Herlihy93a},
with the key difference between
database and non-database transactions being that non-database transactions
drop the ``D'' in the ``ACID''\footnote{
Atomicity, consistency, isolation, and durability.}
properties defining database transactions.
The idea of supporting memory-based transactions, or ``transactional memory''
(TM), in hardware
is more recent~\cite{Herlihy93a}, but unfortunately, support for such
transactions in commodity hardware was not immediately forthcoming,
despite other somewhat similar proposals being put forward~\cite{JMStone93}.
Not long after, Shavit and Touitou proposed a software-only implementation
of transactional memory (STM) that was capable of running on commodity
hardware, give or take memory-ordering issues~\cite{Shavit95}.
This proposal languished for many years, perhaps due to the fact that
the research community's attention was absorbed by \IXacrl{nbs}
(see \cref{sec:advsync:Non-Blocking Synchronization}).
But by the turn of the century, TM started receiving
more attention~\cite{Martinez01a,Rajwar01a}, and by the middle of the
decade, the level of interest can only be termed
``incandescent''~\cite{MauriceHerlihy2005-TM-manifesto.pldi,
DanGrossman2007TMGCAnalogy}, with only a few voices of
caution~\cite{Blundell2005DebunkTM,McKenney2007PLOSTM}.
The basic idea behind TM is to execute a section of
code atomically, so that other threads see no intermediate state.
As such, the semantics of TM could be implemented
by simply replacing each transaction with a recursively acquirable
global lock acquisition and release, albeit with abysmal performance
and scalability.
Much of the complexity inherent in TM implementations, whether hardware
or software, is efficiently detecting when concurrent transactions can safely
run in parallel.
Because this detection is done dynamically, conflicting transactions
can be aborted or ``rolled back'', and in some implementations, this
failure mode is visible to the programmer.
Because transaction roll-back is increasingly unlikely as transaction
size decreases, TM might become quite attractive for small memory-based
operations,
such as linked-list manipulations used for stacks, queues, hash tables,
and search trees.
However, it is currently much more difficult to make the case for large
transactions, particularly those containing non-memory operations such
as I/O and process creation.
The following sections look at current challenges to the grand vision of
``Transactional Memory Everywhere''~\cite{PaulEMcKenney2009TMeverywhere}.
\Cref{sec:future:Outside World} examines the challenges faced
interacting with the outside world,
\cref{sec:future:Process Modification} looks at interactions
with process modification primitives,
\cref{sec:future:Synchronization} explores interactions with
other synchronization primitives, and finally
\cref{sec:future:Discussion} closes with some discussion.
\subsection{Outside World}
\label{sec:future:Outside World}
In the wise words of \ppl{Donald}{Knuth}:
\begin{quote}
Many computer users feel that input and output are not actually part
of ``real programming,'' they are merely things that (unfortunately)
must be done in order to get information in and out of the machine.
\end{quote}
Whether or not we believe that input and output are ``real programming'',
the fact is that software absolutely must deal with the outside world.
This section therefore critiques transactional memory's outside-world
capabilities, focusing on I/O operations, time delays, and persistent
storage.
And these interactions with the outside world are the rock upon which
the claims of unconditional TM composability are shattered.
Yes, you can compose transactions, but only as long as there are no
intervening TM-unfriendly operations.
Just as you can compose lock-based critical sections, but only as long
as doing so does not introduce deadlocks.
In the end, it is unclear whether TM really provides better composability
than does locking in real-world code.
\subsubsection{I/O Operations}
\label{sec:future:I/O Operations}
One can execute I/O operations within a lock-based critical section,
while holding a \IX{hazard pointer}, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
What happens when you attempt to execute an I/O operation from within
a transaction?
The underlying problem is that transactions may be rolled back, for
example, due to conflicts.
Roughly speaking, this requires that all operations within any given
transaction be revocable, so that executing the operation twice has
the same effect as executing it once.
Unfortunately, I/O is in general the prototypical irrevocable
operation, making it difficult to include general I/O operations in
transactions.
In fact, general I/O is irrevocable:
Once you have pushed the proverbial button launching the nuclear warheads,
there is no turning back.
Here are some options for handling of I/O within transactions:
\begin{enumerate}
\item Restrict I/O within transactions to buffered I/O with in-memory
buffers.
These buffers may then be included in the transaction in the
same way that any other memory location might be included.
This seems to be the mechanism of choice, and it does work
well in common situations such as stream I/O and
mass-storage I/O\@.
However, special handling is required in cases where multiple
record-oriented output streams are merged onto a single file
from multiple processes, as might be done using the ``a+''
option to \co{fopen()} or the \co{O_APPEND} flag to \co{open()}.
In addition, as will be seen in the next section, common
networking operations cannot be handled via buffering.
\item Prohibit I/O within transactions, so that any attempt to execute
an I/O operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
This approach seems to be the conventional TM approach for
unbuffered I/O, but requires that TM interoperate with other
synchronization primitives tolerating I/O.
\item Prohibit I/O within transactions, but enlist the compiler's aid
in enforcing this prohibition.
\item Permit only one special
\emph{irrevocable} transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed
at any given time, thus allowing irrevocable transactions to
contain I/O operations.\footnote{
In earlier literature, irrevocable transactions are
termed \emph{inevitable} transactions.}
This works in general, but severely limits the scalability and
performance of I/O operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Worse yet, use of irrevocability to tolerate I/O operations
seems to greatly restrict use of manual transaction-abort
operations.\footnote{
This difficulty was pointed out by Michael Factor.
To see the problem, think through what TM should do
in response to an attempt to abort a transaction after
it has executed an irrevocable operation.}
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Create new hardware and protocols such that I/O operations can
be pulled into the transactional substrate.
In the case of input operations, the hardware would need to
correctly predict the result of the operation, and to abort the
transaction if the prediction failed.
\end{enumerate}
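The first of these options can be sketched in C\@.
The \co{txn_begin()}, \co{txn_write()}, and \co{txn_commit()} names
are hypothetical, and the transaction itself is emulated by the
single global lock described at the beginning of this section, so
this is a sketch of the buffering idea rather than a real TM
implementation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical txn_begin()/txn_write()/txn_commit() API.  The
 * "transaction" is emulated by a single global lock, and output is
 * staged in an in-memory buffer that reaches the file only at commit
 * time, so that an abort could simply discard the buffer. */
static pthread_mutex_t txn_lock = PTHREAD_MUTEX_INITIALIZER;
static char txn_buf[4096];
static size_t txn_len;

static void txn_begin(void)
{
	pthread_mutex_lock(&txn_lock);
	txn_len = 0;		/* Start with an empty staging buffer. */
}

static void txn_write(const char *s)
{
	size_t n = strlen(s);

	if (txn_len + n <= sizeof(txn_buf)) {
		memcpy(txn_buf + txn_len, s, n);  /* Stage, do not emit. */
		txn_len += n;
	}
}

static size_t txn_commit(FILE *fp)
{
	size_t n = fwrite(txn_buf, 1, txn_len, fp);  /* Emit at commit. */

	pthread_mutex_unlock(&txn_lock);
	return n;
}
```

Note that the \co{O_APPEND}-style merging problem called out above
is exactly what this sketch does not handle:
two such buffers flushed at commit time may interleave differently
than the corresponding unbuffered writes would have.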
I/O operations are a well-known weakness of TM, and it is not clear
that the problem of supporting I/O in transactions has a reasonable
general solution, at least if ``reasonable'' is to include usable
performance and scalability.
Nevertheless, continued time and attention to this problem will likely
produce additional progress.
\subsubsection{RPC Operations}
\label{sec:future:RPC Operations}
One can execute RPCs within a lock-based critical section, while holding
a hazard pointer, within a sequence-locking read-side critical section,
and from within a userspace-RCU read-side critical section, and even
all at the same time, if need be.
What happens when you attempt to execute an RPC from within a transaction?
If both the RPC request and its response are to be contained within the
transaction, and if some part of the transaction depends on the result
returned by the response, then it is not possible to use the memory-buffer
tricks that can be used in the case of buffered I/O\@.
Any attempt to
take this buffering approach would deadlock the transaction, as the
request could not be transmitted until the transaction was guaranteed
to succeed, but the transaction's success might not be knowable until
after the response is received, as is the case in the following example:
\begin{VerbatimN}[samepage=true]
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
\end{VerbatimN}
The transaction's memory footprint cannot be determined until after the
RPC response is received, and until the transaction's memory footprint
can be determined, it is impossible to determine whether the transaction
can be allowed to commit.
The only action consistent with transactional semantics is therefore to
unconditionally abort the transaction, which is, to say the least,
unhelpful.
Here are some options available to TM:
\begin{enumerate}
\item Prohibit RPC within transactions, so that any attempt to execute
an RPC operation aborts the enclosing transaction (and perhaps
multiple nested transactions).
Alternatively, enlist the compiler to enforce RPC-free
transactions.
This approach does work, but will require TM to
interact with other synchronization primitives.
\item Permit only one special
irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM}
to proceed at any given time, thus allowing irrevocable
transactions to contain RPC operations.
This works in general, but severely limits the scalability and
performance of RPC operations.
Given that scalability and performance are first-class goals of
parallelism, this approach's generality seems a bit self-limiting.
Furthermore, use of irrevocable transactions to permit RPC
operations restricts manual transaction-abort operations
once the RPC operation has started.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item must have blocking semantics.
\item Identify special cases where the success of the transaction may
be determined before the RPC response is received, and
automatically convert these to irrevocable transactions immediately
before sending the RPC request.
Of course, if several concurrent transactions attempt RPC calls
in this manner, it might be necessary to roll all but one of them
back, with consequent degradation of performance and scalability.
This approach nevertheless might be valuable given long-running
transactions ending with an RPC\@.
This approach must still restrict manual transaction-abort
operations.
\item Identify special cases where the RPC response may be moved out
of the transaction, and then proceed using techniques similar
to those used for buffered I/O.
\item Extend the transactional substrate to include the RPC server as
well as its client.
This is in theory possible, as has been demonstrated by
distributed databases.
However, it is unclear whether the requisite performance and
scalability requirements can be met by distributed-database
techniques, given that memory-based TM has no slow disk drives
behind which to hide such latencies.
Of course, given the advent of solid-state disks, it is also quite
possible that databases will need to redesign their approach to
latency hiding.
\end{enumerate}
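The single-irrevocable-transaction approach underlying the second and
third options can be sketched as a global token:
the \co{txn_go_irrevocable()} and \co{txn_end_irrevocable()} names are
hypothetical, but the idea is that at most one transaction at a time
may hold the token, and once a transaction holds it, it can no longer
be rolled back, so that it may safely transmit its RPC request:

```c
#include <pthread.h>

/* Hypothetical irrevocability token: at most one transaction may
 * hold it, and a holder can no longer be rolled back.  A real TM
 * implementation would also block conflicting transactions. */
static pthread_mutex_t irrevocable_token = PTHREAD_MUTEX_INITIALIZER;
static __thread int have_token;

static void txn_go_irrevocable(void)
{
	if (!have_token) {
		pthread_mutex_lock(&irrevocable_token);
		have_token = 1;	/* Roll-back is now prohibited. */
	}
}

static int txn_end_irrevocable(void)
{
	if (!have_token)
		return 0;	/* This transaction never held the token. */
	have_token = 0;
	pthread_mutex_unlock(&irrevocable_token);
	return 1;
}
```

The global token is of course also the reason for the scalability
limitations noted above:
all irrevocable transactions serialize on this one lock.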
As noted in the prior section, I/O is a known weakness of TM, and RPC
is simply an especially problematic case of I/O.
\subsubsection{Time Delays}
\label{sec:future:Time Delays}
An important special case of interaction with extra-transactional accesses
involves explicit time delays within a transaction.
Of course, the idea of a time delay within a transaction flies in the
face of TM's atomicity property, but this sort of thing is arguably what
weak atomicity is all about.
Furthermore, correct interaction with memory-mapped I/O sometimes requires
carefully controlled timing, and applications often use time delays
for varied purposes.
Finally, one can execute time delays within a lock-based critical section,
while holding a hazard pointer, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
Doing so might not be wise from a contention or scalability viewpoint,
but then again, doing so does not raise any fundamental conceptual issues.
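For concreteness, a time delay within a lock-based critical section
looks like the following, which is legal and well defined, if not
necessarily wise.
A TM implementation that publishes stores only at commit time could
not give the analogous transaction its intended meaning:

```c
#include <pthread.h>
#include <time.h>

/* A deliberate delay while holding a lock: perfectly legal for
 * locking, but problematic for commit-time-publication TM. */
static pthread_mutex_t delay_lock = PTHREAD_MUTEX_INITIALIZER;

static int delay_in_critical_section(long ns)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = ns };
	int ret;

	pthread_mutex_lock(&delay_lock);
	ret = nanosleep(&ts, NULL);	/* Delay while holding the lock. */
	pthread_mutex_unlock(&delay_lock);
	return ret;
}
```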
So, what can TM do about time delays within transactions?
\begin{enumerate}
\item Ignore time delays within transactions.
This has an appearance of elegance, but like too many other
``elegant'' solutions, fails to survive first contact with
legacy code.
Such code, which might well have important time delays in critical
sections, would fail upon being transactionalized.
\item Abort transactions upon encountering a time-delay operation.
This is attractive, but it is unfortunately not always possible
to automatically detect a time-delay operation.
Is that tight loop carrying out a critical computation, or is it
simply waiting for time to elapse?
\item Enlist the compiler to prohibit time delays within transactions.
\item Let the time delays execute normally.
Unfortunately, some TM implementations publish modifications only
at commit time, which could defeat the purpose of the time delay.
\end{enumerate}
It is not clear that there is a single correct answer.
TM implementations featuring weak atomicity that publish changes
immediately within the transaction (rolling these changes back upon abort)
might be reasonably well served by the last alternative.
Even in this case, the code (or possibly even hardware) at the other
end of the transaction may require a substantial redesign to tolerate
aborted transactions.
This need for redesign would make it more difficult to apply transactional
memory to legacy code.
\subsubsection{Persistence}
\label{sec:future:Persistence}
There are many different types of locking primitives.
One interesting distinction is persistence, in other words, whether the
lock can exist independently of the address space of the process using
the lock.
Non-persistent locks include \co{pthread_mutex_lock()},
\co{pthread_rwlock_rdlock()}, and most kernel-level locking primitives.
If the memory locations instantiating a non-persistent lock's data
structures disappear, so does the lock.
For typical use of \co{pthread_mutex_lock()}, this means that when the
process exits, all of its locks vanish.
This property can be exploited in order to trivialize lock cleanup
at program shutdown time, but makes it more difficult for unrelated
applications to share locks, as such sharing requires the applications
to share memory.
\QuickQuiz{
But suppose that an application exits while holding a
\co{pthread_mutex_lock()} that happens to be located in a
file-mapped region of memory?
}\QuickQuizAnswer{
Indeed, in this case the lock would persist, much to the
consternation of other processes attempting to acquire this
lock that is held by a process that no longer exists.
Which is why great care is required when using \co{pthread_mutex}
objects located in file-mapped memory regions.
}\QuickQuizEnd
Persistent locks help avoid the need to share memory among unrelated
applications.
Persistent locking APIs include the flock family, \co{lockf()}, System
V semaphores, and the \co{O_CREAT} flag to \co{open()}.
These persistent APIs can be used to protect large-scale operations
spanning runs of multiple applications, and, in the case of
\co{O_CREAT}, even surviving operating-system reboot.
If need be, locks can even span multiple computer systems via distributed
lock managers and distributed filesystems---and persist across reboots
of any or all of those computer systems.
Persistent locks can be used by any application, including applications
written using multiple languages and software environments.
In fact, a persistent lock might well be acquired by an application written
in C and released by an application written in Python.
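The \co{O_CREAT}-based persistent lock mentioned above can be
sketched as follows.
The lock is the very existence of the file, so it outlives the
locking process and can be released by any process, written in any
language, simply by removing the file.
The path used here is purely illustrative:

```c
#include <fcntl.h>
#include <unistd.h>

/* Persistent lock: existence of the lock file means "held". */
static int persistent_trylock(const char *path)
{
	int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0644);

	if (fd < 0)
		return -1;	/* Lock file already exists: lock is held. */
	close(fd);
	return 0;		/* Lock acquired. */
}

static int persistent_unlock(const char *path)
{
	return unlink(path);	/* Any process may release the lock. */
}
```

Of course, this idiom has the failure mode called out in the earlier
Quick Quiz:
if the holder exits without unlinking the file, the lock persists,
for better or worse.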
How could a similar persistent functionality be provided for TM?
\begin{enumerate}
\item Restrict persistent transactions to special-purpose environments
designed to support them, for example, SQL\@.
This clearly works, given the decades-long history of database
systems, but does not provide the same degree of flexibility
provided by persistent locks.
\item Use snapshot facilities provided by some storage devices and/or
filesystems.
Unfortunately, this does not handle network communication,
nor does it handle I/O to devices that do not provide snapshot
capabilities, for example, memory sticks.
\item Build a time machine.
\item Avoid the problem entirely by using existing persistent facilities,
presumably avoiding such use within transactions.
\end{enumerate}
Of course, the fact that it is called transactional \emph{memory}
should give us pause, as the name itself conflicts with the concept of
a persistent transaction.
It is nevertheless worthwhile to consider this possibility as an important
test case probing the inherent limitations of transactional memory.
\subsection{Process Modification}
\label{sec:future:Process Modification}
Processes are not eternal:
They are created and destroyed, their memory mappings are modified,
they are linked to dynamic libraries, and they are debugged.
These sections look at how transactional memory can handle an
ever-changing execution environment.
\subsubsection{Multithreaded Transactions}
\label{sec:future:Multithreaded Transactions}
It is perfectly legal to create processes and threads while holding
a lock or, for that matter, while holding a hazard pointer, within
a sequence-locking read-side critical section, and from within a
userspace-RCU read-side critical section, and even all at the same time,
if need be.
Not only is it legal, but it is quite simple, as can be seen from the
following code fragment:
\begin{VerbatimN}
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
  pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
  pthread_join(tid[i], ...);
pthread_mutex_unlock(...);
\end{VerbatimN}
This pseudo-code fragment uses \co{pthread_create()} to spawn one thread
per CPU, then uses \co{pthread_join()} to wait for each to complete,
all under the protection of \co{pthread_mutex_lock()}.
The effect is to execute a lock-based critical section in parallel,
and one could obtain a similar effect using \co{fork()} and \co{wait()}.
Of course, the critical section would need to be quite large to justify
the thread-spawning overhead, but there are many examples of large
critical sections in production software.
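The \co{fork()}/\co{wait()} analog of the fragment above can be
sketched as follows.
The \co{run_parallel_section()} name is illustrative, and the lock
itself is omitted because a \co{pthread_mutex_t} would need to live
in a process-shared mapping in order to be meaningful across the
\co{fork()}ed children:

```c
#include <sys/wait.h>
#include <unistd.h>

/* Spawn nchildren processes, then wait for each to complete,
 * returning the number of spawn/reap failures. */
static int run_parallel_section(int nchildren)
{
	int i, failures = 0;

	for (i = 0; i < nchildren; i++) {
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);	/* Child: per-CPU work would go here. */
		if (pid < 0)
			failures++;	/* Could not spawn this child. */
	}
	for (i = 0; i < nchildren; i++)
		if (wait(NULL) < 0)
			failures++;
	return failures;
}
```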
What might TM do about thread spawning within a transaction?
\begin{enumerate}
\item Declare \co{pthread_create()} to be illegal within transactions,
preferably by aborting the transaction.
Alternatively, enlist the compiler to enforce
\co{pthread_create()}-free transactions.
\item Permit \co{pthread_create()} to be executed within a
transaction, but only the parent thread will be considered to
be part of the transaction.
This approach seems to be reasonably compatible with existing and
posited TM implementations, but seems to be a trap for the unwary.
This approach raises further questions, such as how to handle
conflicting child-thread accesses.
\item Convert the \co{pthread_create()}s to function calls.
This approach is also an attractive nuisance, as it does not
handle the not-uncommon cases where the child threads communicate
with one another.
In addition, it does not permit concurrent execution of the body
of the transaction.
\item Extend the transaction to cover the parent and all child threads.
This approach raises interesting questions about the nature of
conflicting accesses, given that the parent and children are
presumably permitted to conflict with each other, but not with
other threads.
It also raises interesting questions as to what should happen
if the parent thread does not wait for its children before
committing the transaction.
Even more interesting, what happens if the parent conditionally
executes \co{pthread_join()} based on the values of variables
participating in the transaction?
The answers to these questions are reasonably straightforward
in the case of locking.
The answers for TM are left as an exercise for the reader.
\end{enumerate}
Given that parallel execution of transactions is commonplace in the
database world, it is perhaps surprising that current TM proposals do
not provide for it.
On the other hand, the example above is a fairly sophisticated use
of locking that is not normally found in simple textbook examples,
so perhaps its omission is to be expected.
That said, some researchers are using transactions to autoparallelize
code~\cite{ArunRaman2010MultithreadedTransactions},
and there are rumors that other TM researchers are investigating
fork/join parallelism within transactions, so perhaps this topic will
soon be addressed more thoroughly.
\subsubsection{The \tco{exec()} System Call}
\label{sec:future:The exec System Call}
One can execute an \co{exec()} system call within a lock-based critical
section, while holding a hazard pointer, within a sequence-locking
read-side critical section, and from within a userspace-RCU read-side
critical section, and even all at the same time, if need be.
The exact semantics depends on the type of primitive.
In the case of non-persistent primitives (including
\co{pthread_mutex_lock()}, \co{pthread_rwlock_rdlock()}, and userspace RCU),
if the \co{exec()} succeeds, the whole address space vanishes, along
with any locks being held.
Of course, if the \co{exec()} fails, the address space still lives,
so any associated locks would also still live.
A bit strange perhaps, but well defined.
On the other hand, persistent primitives (including the flock family,
\co{lockf()}, System V semaphores, and the \co{O_CREAT} flag to
\co{open()}) would survive regardless of whether the \co{exec()}
succeeded or failed, so that the \co{exec()}ed program might well
release them.
\QuickQuiz{
What about non-persistent primitives represented by data
structures in \co{mmap()} regions of memory?
What happens when there is an \co{exec()} within a critical
section of such a primitive?
}\QuickQuizAnswer{
If the \co{exec()}ed program maps those same regions of
memory, then this program could in principle simply release
the lock.
The question as to whether this approach is sound from a
software-engineering viewpoint is left as an exercise for
the reader.
}\QuickQuizEnd
What happens when you attempt to execute an \co{exec()} system call
from within a transaction?
\begin{enumerate}
\item Disallow \co{exec()} within transactions, so that the enclosing
transactions abort upon encountering the \co{exec()}.
This is well defined, but clearly requires non-TM synchronization
primitives for use in conjunction with \co{exec()}.
\item Disallow \co{exec()} within transactions, with the compiler
enforcing this prohibition.
There is a draft specification for TM in C++ that takes
this approach, allowing functions to be decorated with
the \co{transaction_safe} and \co{transaction_unsafe}
attributes.\footnote{
Thanks to Mark Moir for pointing me at this spec, and
to Michael Wong for having pointed me at an earlier
revision some time back.}
This approach has some advantages over aborting the transaction
at runtime, but again requires non-TM synchronization primitives
for use in conjunction with \co{exec()}.
One disadvantage is the need to decorate a great many library
functions with \co{transaction_safe} and \co{transaction_unsafe}
attributes.
\item Treat the transaction in a manner similar to non-persistent
locking primitives, so that the transaction survives if \co{exec()}
fails, and silently commits if the \co{exec()} succeeds.
The case where only some of the variables affected by the
transaction reside in \co{mmap()}ed memory (and thus could
survive a successful \co{exec()} system call) is left as an
exercise for the reader.
\item Abort the transaction (and the \co{exec()} system call) if the
\co{exec()} system call would have succeeded, but allow the
transaction to continue if the \co{exec()} system call would
fail.
This is in some sense the ``correct'' approach, but it would
require considerable work for a rather unsatisfying result.
\end{enumerate}
The \co{exec()} system call is perhaps the strangest example of an
obstacle to universal TM applicability, as it is not completely clear
what approach makes sense, and some might argue that this is merely a
reflection of the perils of real-life interaction with \co{exec()}.
That said, the two options prohibiting \co{exec()} within transactions
are perhaps the most logical of the group.
Similar issues surround the \co{exit()} and \co{kill()} system calls,
as well as a \co{longjmp()} or an exception that would exit the transaction.
(Where did the \co{longjmp()} or exception come from?)
\subsubsection{Dynamic Linking and Loading}
\label{sec:future:Dynamic Linking and Loading}
Lock-based critical sections, code holding a hazard pointer,
sequence-locking read-side critical sections, and userspace-RCU read-side
critical sections can (separately or in combination) legitimately contain
code that invokes dynamically linked and loaded functions, including C/C++
shared libraries and Java class libraries.
Of course, the code contained in these libraries is by definition
unknowable at compile time.
So, what happens if a dynamically loaded function is invoked within
a transaction?
This question has two parts:
\begin{enumerate*}[(a)]
\item How do you dynamically link and load a function within a transaction
and
\item What do you do about the unknowable nature of the code within
this function?
\end{enumerate*}
To be fair, item (b) poses some challenges for locking and userspace-RCU
as well, at least in theory.
For example, the dynamically linked function might introduce a \IX{deadlock}
for locking or might (erroneously) introduce a \IX{quiescent state} into a
userspace-RCU read-side critical section.
The difference is that while the class of operations permitted in locking
and userspace-RCU critical sections is well-understood, there appears
to still be considerable uncertainty in the case of TM\@.
In fact, different implementations of TM seem to have different restrictions.
So what can TM do about dynamically linked and loaded library functions?
Options for part (a), the actual loading of the code, include the following:
\begin{enumerate}
\item Treat the dynamic linking and loading in a manner similar to a
page fault, so that the function is loaded and linked, possibly
aborting the transaction in the process.
If the transaction is aborted, the retry will find the function
already present, and the transaction can thus be expected to
proceed normally.
\item Disallow dynamic linking and loading of functions from within
transactions.
\end{enumerate}
For part (b), the inability to detect TM-unfriendly operations
in a not-yet-loaded function, possibilities include the following:
\begin{enumerate}
\item Just execute the code:
If there are any TM-unfriendly operations in the function,
simply abort the transaction.
Unfortunately, this approach makes it impossible for the compiler
to determine whether a given group of transactions may be safely
composed.
One way to permit composability regardless is irrevocable
transactions, however, current implementations permit only a
single irrevocable transaction to proceed at any given time,
which can severely limit performance and scalability.
Irrevocable transactions also seem to restrict use of manual
transaction-abort operations.
Finally, if there is an irrevocable transaction manipulating
a given data item, any other transaction manipulating that
same data item cannot have non-blocking semantics.
\item Decorate the function declarations indicating which functions
are TM-friendly.
These decorations can then be enforced by the compiler's type system.
Of course, for many languages, this requires language extensions
to be proposed, standardized, and implemented, with the
corresponding time delays, and also with the corresponding
decoration of a great many otherwise uninvolved library functions.
That said, the standardization effort is already in
progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM,Ali-Reza-Adl-Tabatabai2012CppTM}.
\item As above, disallow dynamic linking and loading of functions from
within transactions.
\end{enumerate}
I/O operations are of course a known weakness of TM, and dynamic linking
and loading can be thought of as yet another special case of I/O\@.
Nevertheless, the proponents of TM must either solve this problem, or
resign themselves to a world where TM is but one tool of several in the
parallel programmer's toolbox.
(To be fair, a number of TM proponents have long since resigned themselves
to a world containing more than just TM.)
\subsubsection{Memory-Mapping Operations}
\label{sec:future:Memory-Mapping Operations}
It is perfectly legal to execute memory-mapping operations (including
\co{mmap()}, \co{shmat()}, and \co{munmap()}~\cite{TheOpenGroup1997SUS})
within a lock-based critical section, while holding a hazard pointer,
within a sequence-locking read-side critical section, and from within a
userspace-RCU read-side critical section, and even all at the same time,
if need be.
What happens when you attempt to execute such an operation from within
a transaction?
More to the point, what happens if the memory region being remapped
contains some variables participating in the current thread's transaction?
And what if this memory region contains variables participating in some
other thread's transaction?
It should not be necessary to consider cases where the TM system's
metadata is remapped, given that most locking primitives do not define
the outcome of remapping their lock variables.
Here are some TM memory-mapping options:
\begin{enumerate}
\item Memory remapping is illegal within a transaction, and will result
in all enclosing transactions being aborted.
This does simplify things somewhat, but also requires that TM
interoperate with synchronization primitives that do tolerate
remapping from within their critical sections.
\item Memory remapping is illegal within a transaction, and the
compiler is enlisted to enforce this prohibition.
\item Memory mapping is legal within a transaction, but aborts all
other transactions having variables in the region mapped over.
\item Memory mapping is legal within a transaction, but the mapping
operation will fail if the region being mapped overlaps with
the current transaction's footprint.
\item All memory-mapping operations, whether within or outside a
transaction, check the region being mapped against the memory
footprint of all transactions in the system.
If there is overlap, then the memory-mapping operation fails.
\item The effect of memory-mapping operations that overlap the memory
footprint of any transaction in the system is determined by the
TM conflict manager, which might dynamically determine whether
to fail the memory-mapping operation or abort any conflicting
transactions.
\end{enumerate}
It is interesting to note that \co{munmap()} leaves the relevant region
of memory unmapped, which could have additional interesting
implications.\footnote{
This difference between mapping and unmapping was noted by
Josh Triplett.}
\subsubsection{Debugging}
\label{sec:future:Debugging}
The usual debugging operations such as breakpoints work normally within
lock-based critical sections and within userspace-RCU read-side critical
sections.
However, in initial transactional-memory hardware
implementations~\cite{DaveDice2009ASPLOSRockHTM} an exception within
a transaction will abort that transaction, which in turn means that
breakpoints abort all enclosing transactions.
So how can transactions be debugged?
\begin{enumerate}
\item Use software emulation techniques within transactions containing
breakpoints.
Of course, it might be necessary to emulate all transactions
any time a breakpoint is set within the scope of any transaction.
If the runtime system is unable to determine whether or not a
given breakpoint is within the scope of a transaction, then it
might be necessary to emulate all transactions just to be on
the safe side.
However, this approach might impose significant overhead, which
might in turn obscure the bug being pursued.
\item Use only hardware TM implementations that are capable of
handling breakpoint exceptions.
Unfortunately, as of this writing (March 2021), all such
implementations are research prototypes.
\item Use only software TM implementations, which are
(very roughly speaking) more tolerant of exceptions than are
the simpler of the hardware TM implementations.
Of course, software TM tends to have higher overhead than hardware
TM, so this approach may not be acceptable in all situations.
\item Program more carefully, so as to avoid having bugs in the
transactions in the first place.
As soon as you figure out how to do this, please do let everyone
know the secret!
\end{enumerate}
There is some reason to believe that transactional memory will deliver
\IX{productivity} improvements compared to other synchronization mechanisms,
but it does seem quite possible that these improvements could easily
be lost if traditional debugging techniques cannot be applied to
transactions.
This seems especially true if transactional memory is to be used by
novices on large transactions.
In contrast, macho ``top-gun'' programmers might be able to dispense with
such debugging aids, especially for small transactions.
Therefore, if transactional memory is to deliver on its productivity
promises to novice programmers, the debugging problem does need to
be solved.
\subsection{Synchronization}
\label{sec:future:Synchronization}
If transactional memory someday proves that it can be everything to everyone,
it will not need to interact with any other synchronization mechanism.
Until then, it will need to work with synchronization mechanisms that
can do what it cannot, or that work more naturally in a given situation.
The following sections outline the current challenges in this area.
\subsubsection{Locking}
\label{sec:future:Locking}
It is commonplace to acquire locks while holding other locks, which works
quite well, at least as long as the usual well-known software-engineering
techniques are employed to avoid deadlock.
It is not unusual to acquire locks from within RCU read-side critical
sections, which eases deadlock concerns because RCU read-side primitives
cannot participate in lock-based deadlock cycles.
It is also possible to acquire locks while holding hazard pointers and
within sequence-lock read-side critical sections.
But what happens when you attempt to acquire a lock from within a transaction?
In theory, the answer is trivial:
Simply manipulate the data structure representing the lock as part of
the transaction, and everything works out perfectly.
In practice, a number of non-obvious complications~\cite{Volos2008TRANSACT}
can arise, depending on implementation details of the TM system.
These complications can be resolved, but at the cost of a 45\pct\ increase in
overhead for locks acquired outside of transactions and a 300\pct\ increase
in overhead for locks acquired within transactions.
Although these overheads might be acceptable for transactional
programs containing small amounts of locking, they are often completely
unacceptable for production-quality lock-based programs wishing to use
the occasional transaction.
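The trivial-in-theory approach above can be illustrated with a hedged C sketch, in which the lock word is manipulated as ordinary transactional data. The \co{txn_load()} and \co{txn_store()} comments mark hypothetical transactional accesses; real TM systems hide these behind the implementation, and it is in that hiding that the cited complications arise.

```c
#include <assert.h>

/* Hypothetical sketch only: the lock word is read and written as
 * ordinary transactional data.  The txn_load()/txn_store() comments
 * mark where a real TM system would intercept the accesses. */
struct spinlock {
	int locked;
};

static int txn_trylock(struct spinlock *lp)
{
	if (lp->locked)		/* txn_load(&lp->locked) */
		return 0;	/* conflict: a real TM would abort/retry */
	lp->locked = 1;		/* txn_store(&lp->locked, 1) */
	return 1;
}

static void txn_unlock(struct spinlock *lp)
{
	lp->locked = 0;		/* txn_store(&lp->locked, 0) */
}
```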
Here are some options available to TM:
\begin{enumerate}
\item Use only locking-friendly TM implementations~\cite{DaveDice2006DISC}.
Unfortunately, the locking-unfriendly implementations have some
attractive properties, including low overhead for successful
transactions and the ability to accommodate extremely large
transactions.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby accommodating the limitations of
locking-friendly TM implementations.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that all the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group and
by a great many transactional lock elision
projects~\cite{MartinPohlack2011HTM2TLE,Kleen:2014:SEL:2566590.2576793,PascalFelber2016rwlockElision,SeongJaePark2020HTMRCUlock}.
This approach seems sound, but leaves the locking design
constraints (such as the need to avoid deadlock) firmly in place.
\item Strive to reduce the overhead imposed on locking primitives.
\end{enumerate}
The fact that there could possibly be a problem interfacing TM and locking
came as a surprise to many, which underscores the need to try out new
mechanisms and primitives in real-world production software.
Fortunately, the advent of open source means that a huge quantity of
such software is now freely available to everyone, including researchers.
\subsubsection{Reader-Writer Locking}
\label{sec:future:Reader-Writer Locking}
It is commonplace to read-acquire reader-writer locks while holding
other locks, which just works, at least as long as the usual well-known
software-engineering techniques are employed to avoid deadlock.
Read-acquiring reader-writer locks from within RCU read-side critical
sections also works, and doing so eases deadlock concerns because RCU
read-side primitives cannot participate in lock-based deadlock cycles.
It is also possible to acquire locks while holding hazard pointers and
within sequence-lock read-side critical sections.
But what happens when you attempt to read-acquire a reader-writer lock
from within a transaction?
Unfortunately, the straightforward approach to read-acquiring the
traditional counter-based reader-writer lock within a transaction defeats
the purpose of the reader-writer lock.
To see this, consider a pair of transactions concurrently attempting to
read-acquire the same reader-writer lock.
Because read-acquisition involves modifying the reader-writer lock's
data structures, a conflict will result, which will roll back one of
the two transactions.
This behavior is completely inconsistent with the reader-writer lock's
goal of allowing concurrent readers.
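The conflict described above can be made concrete with a hedged sketch of a counter-based reader-writer lock. This is not a real TM or kernel API; the point is only that the store to \co{->readers} lands in the write set of any enclosing transaction, so two transactions concurrently read-acquiring the same lock conflict and one of them must roll back.

```c
#include <assert.h>

/* Minimal counter-based reader-writer lock sketch, showing that
 * even read-acquisition writes shared memory. */
struct rwlock {
	int readers;	/* count of active readers */
	int writer;	/* nonzero if write-held */
};

static int read_lock(struct rwlock *lp)
{
	if (lp->writer)
		return 0;	/* real code would spin or abort */
	lp->readers++;		/* shared write: the source of conflict */
	return 1;
}

static void read_unlock(struct rwlock *lp)
{
	lp->readers--;		/* another shared write */
}
```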
Here are some options available to TM:
\begin{enumerate}
\item Use per-CPU or per-thread reader-writer
locking~\cite{WilsonCHsieh92a}, which allows a
given CPU (or thread, respectively) to manipulate only local
data when read-acquiring the lock.
This would avoid the conflict between the two transactions
concurrently read-acquiring the lock, permitting both to proceed,
as intended.
Unfortunately, (1)~the write-acquisition overhead of
per-CPU/thread locking can be extremely high, (2)~the memory
overhead of per-CPU/thread locking can be prohibitive, and
(3)~this transformation is available only when you have access to
the source code in question.
Other more-recent scalable
reader-writer locks~\cite{YossiLev2009SNZIrwlock}
might avoid some or all of these problems.
\item Use TM only ``in the small'' when introducing TM to lock-based
programs, thereby avoiding read-acquiring reader-writer locks
from within transactions.
\item Set aside locking-based legacy systems entirely, re-implementing
everything in terms of transactions.
This approach has no shortage of advocates, but this requires
that \emph{all} the issues described in this series be resolved.
During the time it takes to resolve these issues, competing
synchronization mechanisms will of course also have the
opportunity to improve.
\item Use TM strictly as an optimization in lock-based systems, as was
done by the TxLinux~\cite{ChistopherJRossbach2007a} group,
and as has been done by more recent work using TM to elide
reader-writer locks~\cite{PascalFelber2016rwlockElision}.
This approach seems sound, at least on \Power{8}
CPUs~\cite{Le:2015:TMS:3266491.3266500}, but leaves the locking
design constraints (such as the need to avoid deadlock) firmly
in place.
\end{enumerate}
Of course, there might well be other non-obvious issues surrounding
combining TM with reader-writer locking, as there in fact were with
exclusive locking.
\subsubsection{Deferred Reclamation}
\label{sec:future:Deferred Reclamation}
This section focuses mainly on RCU\@.
Similar issues and possible resolutions arise when combining TM with
other deferred-reclamation mechanisms such as
\IXalt{reference counters}{reference count} and
hazard pointers.
In the text below, known differences are specifically called out.
Reference counting, hazard pointers, and RCU are all heavily used, as noted in
\cref{sec:defer:RCU Related Work,sec:defer:Which to Choose? (Production Use)}.
This means that any TM implementation that chooses not to surmount each
and every challenge called out in this section needs to interoperate
cleanly and efficiently with all of these synchronization mechanisms.
The TxLinux group from the University of Texas at Austin appears to
have been the first group to take on the challenge of RCU/TM
interoperability~\cite{ChistopherJRossbach2007a}.
Because they applied TM to the Linux 2.6 kernel, which uses RCU, they
had no choice but to integrate TM and RCU, with TM taking the place of
locking for RCU updates.
Unfortunately, although the paper does state that the RCU implementation's
locks (e.g., \co{rcu_ctrlblk.lock}) were converted to transactions,
it is silent about what was done with those locks used by RCU-based updates
(for example, \co{dcache_lock}).
More recently, \ppl{Dimitrios}{Siakavaras} et al.~have applied
HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees},
\ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color
graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring},
and
\ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention
locking on \IXacr{numa} systems~\cite{SeongJaePark2020HTMRCUlock}.
It is important to note that RCU permits readers and updaters to run
concurrently, further permitting RCU readers to access data that is in
the act of being updated.
Of course, this property of RCU, whatever its performance, scalability,
and real-time-response benefits might be, flies in the face of the
underlying atomicity properties of TM, although the \Power{8} CPU family's
suspended-transaction facility~\cite{Le:2015:TMS:3266491.3266500} makes
it an exception to this rule.
So how should TM-based updates interact with concurrent RCU readers?
Some possibilities are as follows:
\begin{enumerate}
\item RCU readers abort concurrent conflicting TM updates.
This is in fact the approach taken by the TxLinux project.
This approach does preserve RCU semantics, and also preserves
RCU's read-side performance, scalability, and real-time-response
properties, but it does have the unfortunate side-effect of
unnecessarily aborting conflicting updates.
In the worst case, a long sequence of RCU readers could
potentially starve all updaters, which could in theory result
in system hangs.
In addition, not all TM implementations offer the strong atomicity
required to implement this approach, and for good reasons.
\item RCU readers that run concurrently with conflicting TM updates
get old (pre-transaction) values from any conflicting RCU loads.
This preserves RCU semantics and performance, and also prevents
RCU-update \IX{starvation}.
However, not all TM implementations can provide timely access
to old values of variables that have been tentatively updated
by an in-flight transaction.
In particular, log-based TM implementations that maintain
old values in the log (thus providing excellent TM commit
performance) are not likely to be happy with this approach.
Perhaps the \co{rcu_dereference()} primitive can be leveraged
to permit RCU to access the old values within a greater range
of TM implementations, though performance might still be an issue.
Nevertheless, there are popular TM implementations that have
been integrated with RCU in this
manner~\cite{DonaldEPorter2007TRANSACT,PhilHoward2011RCUTMRBTree,
PhilipWHoward2013RCUrbtree}.
\item If an RCU reader executes an access that conflicts with an
in-flight transaction, then that RCU access is delayed until
the conflicting transaction either commits or aborts.
This approach preserves RCU semantics, but not RCU's performance
or real-time response, particularly in the presence of long-running
transactions.
In addition, not all TM implementations are capable of delaying
conflicting accesses.
Nevertheless, this approach seems eminently reasonable for hardware
TM implementations that support only small transactions.
\item RCU readers are converted to transactions.
This approach pretty much guarantees that RCU is compatible with
any TM implementation, but it also imposes TM's rollbacks on RCU
read-side critical sections, destroying RCU's real-time response
guarantees, and also degrading RCU's read-side performance.
Furthermore, this approach is infeasible in cases where any of
the RCU read-side critical sections contains operations that
the TM implementation in question is incapable of handling.
This approach is more difficult to apply to hazard pointers and
reference counters, which do not have a sharply defined notion
of a reader as a section of code.
\item Many update-side uses of RCU modify a single pointer to publish
a new data structure.
In some of these cases, RCU can safely be permitted to see a
transactional pointer update that is subsequently rolled back,
as long as the transaction respects memory ordering and as long
as the roll-back process uses \co{call_rcu()} to free up the
corresponding structure.
Unfortunately, not all TM implementations respect memory barriers
within a transaction.
Apparently, the thought is that because transactions are supposed
to be atomic, the ordering of the accesses within the transaction
is not supposed to matter.
\item Prohibit use of TM in RCU updates.
This is guaranteed to work, but restricts use of TM.
\end{enumerate}
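The single-pointer-publish case in the fifth option above can be sketched in C11, with \co{rcu_publish()} and the \co{deferred} queue as hypothetical stand-ins for \co{rcu_assign_pointer()} and \co{call_rcu()}; real code would of course use the actual RCU API, and this single-threaded sketch shows only the publish-then-roll-back sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
	int data;
};

static _Atomic(struct node *) gp;	/* RCU-protected pointer */
static struct node *deferred;		/* stand-in for the call_rcu() queue */

static void rcu_publish(struct node *newp)
{
	/* Release store orders initialization before publication,
	 * as rcu_assign_pointer() does. */
	atomic_store_explicit(&gp, newp, memory_order_release);
}

/* Roll back a publication: restore the old pointer, then defer
 * freeing the abandoned structure until readers that might have
 * seen it are done (here, merely queued). */
static void rollback(struct node *oldp, struct node *newp)
{
	atomic_store_explicit(&gp, oldp, memory_order_release);
	deferred = newp;	/* real code: call_rcu(&newp->rh, free_cb) */
}
```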
It seems likely that additional approaches will be uncovered, especially
given the advent of user-level RCU and hazard-pointer
implementations.\footnote{
Kudos to the TxLinux group, Maged Michael, and Josh Triplett
for coming up with a number of the above alternatives.}
It is interesting to note that many of the better performing and
scaling STM implementations make use of RCU-like techniques
internally~\cite{UCAM-CL-TR-579,KeirFraser2007withoutLocks,Gu:2019:PSE:3358807.3358885,Kim:2019:MSR:3297858.3304040}.
\QuickQuiz{
MV-RLU looks pretty good!
Doesn't it beat RCU hands down?
}\QuickQuizAnswer{
One might get that impression from a quick read of the abstract,
but more careful readers will notice the ``for a wide range of
workloads'' phrase in the last sentence.
It turns out that this phrase is quite important:
\begin{enumerate}
\item Their RCU evaluation uses synchronous grace periods, which
needlessly throttle updates, as noted in their
Section~6.2.1.
See \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
\cpageref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
of this book to see that the venerable asynchronous
\co{call_rcu()} primitive enables RCU to perform and
scale quite well with large numbers of updaters.
Furthermore, in Section~3.7 of their paper, the authors
admit that asynchronous grace periods are important to
MV-RLU scalability.
A fair comparison would also allow RCU the benefits of
asynchrony.
\item They use a poorly tuned 1,000-bucket hash table containing
10,000~elements.
In addition, their 448~hardware threads need considerably
more than 1,000~buckets to avoid the lock contention
that they correctly state limits RCU performance in
their benchmarks.
A useful comparison would feature a properly tuned
hash table.
\item Their RCU hash table used per-bucket locks, which they
call out as a bottleneck, which is not a surprise given
the long hash chains and small ratio of buckets to threads.
A number of their competing mechanisms instead use
lockfree techniques, thus avoiding the per-bucket-lock
bottleneck, which cynics might claim sheds some light
on the authors' otherwise inexplicable choice of poorly
tuned hash tables.
The first graph in the middle row of the authors'
Figure~4 shows what RCU can achieve if not hobbled by
artificial bottlenecks, as does the first portion of
the second graph in that same row.
\item Their linked-list operation permits RLU to do concurrent
modifications of different elements in the list, while
RCU is forced to serialize updates.
Again, RCU has always worked just fine in conjunction
with lockless updaters, a fact that has been set forth
in academic literature that the authors
cited~\cite{MathieuDesnoyers2012URCU}.
A fair comparison would use the same style of update
for RCU as it does for MV-RLU.
\item The authors fail to consider combining RCU and sequence
locking, which is used in the Linux kernel to give
readers coherent views of multi-pointer updates.
\item The authors fail to consider RCU-based solutions to the
Issaquah Challenge~\cite{PaulEMcKenney2016IssaquahCPPCON},
which also gives readers a coherent view of multi-pointer
updates, albeit with a weaker view of ``coherent''.
\end{enumerate}
It is surprising that the anonymous reviewers of this paper did
not demand an apples-to-apples comparison of MV-RLU and RCU\@.
Nevertheless, the authors should be congratulated on producing
an academic paper that presents an all-too-rare example of good
scalability combined with strong read-side coherence.
They are also to be congratulated on overcoming the traditional
academic prejudice against asynchronous grace periods,
which greatly aided their scalability.
Interestingly enough, RLU and RCU take different approaches to avoid
the inherent limitations of STM noted by \ppl{Hagit}{Attiya} et
al.~\cite{Attiya:2009:STMReadOnlyLimits}.
RCU avoids providing strict serializability and RLU avoids providing
invisible read-only transactions, both thus avoiding the
limitations.
}\QuickQuizEnd
\subsubsection{Extra-Transactional Accesses}
\label{sec:future:Extra-Transactional Accesses}
Within a lock-based critical section, it is perfectly legal to manipulate
variables that are concurrently accessed or even modified outside that
lock's critical section, with one common example being statistical
counters.
The same thing is possible within RCU read-side critical
sections, and is in fact the common case.
Given mechanisms such as the so-called ``dirty reads'' that are
prevalent in production database systems, it is not surprising
that extra-transactional accesses have received serious attention
from the proponents of TM, with the concept of weak
atomicity~\cite{Blundell2006TMdeadlock} being but one case in point.
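The statistical-counter example above can be sketched as follows, with per-thread counters incremented locklessly and summed approximately; \co{NTHREADS} and the array layout are illustrative only. Under strong atomicity, such an extra-transactional increment would conflict with any transaction reading the sum; under weak atomicity it would not.

```c
#include <assert.h>

/* Illustrative statistical counter: each thread increments only its
 * own counter, and readers sum all counters without any lock,
 * tolerating a slightly stale total. */
#define NTHREADS 4

static unsigned long counter[NTHREADS];

static void inc_count(int tid)	/* lockless, extra-transactional */
{
	counter[tid]++;
}

static unsigned long read_count(void)	/* approximate sum */
{
	unsigned long sum = 0;

	for (int i = 0; i < NTHREADS; i++)
		sum += counter[i];
	return sum;
}
```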
Here are some extra-transactional options:
\begin{enumerate}
\item Conflicts due to extra-transactional accesses always abort
transactions.
This is strong atomicity.
\item Conflicts due to extra-transactional accesses are ignored,
so only conflicts among transactions can abort transactions.
This is weak atomicity.
\item Transactions are permitted to carry out non-transactional
operations in special cases, such as when allocating memory or
interacting with lock-based critical sections.
\item Produce hardware extensions that permit some operations
(for example, addition) to be carried out concurrently on a
single variable by multiple transactions.
\item Introduce weak semantics to transactional memory.
One approach is the combination with RCU described in
\cref{sec:future:Deferred Reclamation},
while Gramoli and Guerraoui
survey a number of other weak-transaction
approaches~\cite{Gramoli:2014:DTP:2541883.2541900}, for example,
restricted partitioning of large
``elastic'' transactions into smaller transactions, thus
reducing conflict probabilities (albeit with tepid performance
and scalability).
Perhaps further experience will show that some uses of
extra-transactional accesses can be replaced by weak
transactions.
\end{enumerate}
It appears that transactions were conceived in a vacuum, with no
interaction required with any other synchronization mechanism.
If so, it is no surprise that much confusion and complexity arises when
combining transactions with non-transactional accesses.
But unless transactions are to be confined to small updates to isolated
data structures, or alternatively to be confined to new programs
that do not interact with the huge body of existing parallel code,
then transactions absolutely must be so combined if they are to have
large-scale practical impact in the near term.
\subsection{Other Transactions}
\label{sec:future:Other Transactions}
% If we ever get epigraphs down at this level, perhaps the old
% "Hell is other people" quote by Jean-Paul Sartre. His clarification
% also applies, which can be stated as "Heaven is also other people".
Because transactions are to appear atomic with respect to each other,
something must be done when multiple transactions attempt to access the
same variable at the same time.
If all the transactions are reading that variable and none of them are
updating it, then that something can be ``nothing'', but otherwise
the transactions must be at least partially serialized.
This serialization is often achieved by aborting and rolling back some
subset of those transactions.
How to choose which to abort and roll back?
There has been much work on this question.
The answer is ``use a contention manager'', but that simply shifts this
question to ``what should a contention manager do?''
A full exposition on contention managers is beyond the scope of this section.
This section will therefore limit itself to a few seminal papers
in this area.
Herlihy et al.~\cite{HerlihyLMS03}
have writing transactions unconditionally abort reading transactions,
which has the virtue of simplicity, but which can result in reader
starvation.
This paper also introduces the concept of early release, which can
allow a long traversal to proceed concurrently with small updates.
From this viewpoint, one might characterize RCU readers as ``instant
release'', but this is a property that causes RCU readers to behave in
a decidedly non-STM manner.
Hammond et al.~\cite{LanceHammond2004a}
advise using small transactions, which does reduce contention-manager
complexity, but which also fails to deliver on the full STM promise
of automated concurrency.
This paper also introduced annotations to identify different types of
transactions.
Scherer and Scott~\cite{WilliamNSchererIII2005}
describe a number of STM contention-management strategies.
This paper defines ``visible'' and ``invisible'' readers for
extra-transactional accesses, and discusses a number of contention-management
strategies, including:
\begin{enumerate}
\item Exponential backoff, which they conclude required manual
tuning.
\item Preferentially abort transactions that have done the least work,
where the work of an aborted prior attempt to complete a
transaction is counted for the next retry of that transaction.
This has nice fairness properties, but can cause huge transactions
to hog the system for an extended time.
\item Preferentially abort transactions that have done the least work,
but also credit a given transaction with the work done by any
other transactions blocked waiting for that transaction to complete.
This adds a form of priority inheritance to the mix.
\item Track how many times a given transaction has been aborted by
another transaction as another way to promote some notion of
fairness.
\item Use transaction-start timestamps in order to preferentially abort
transactions that started most recently.
\item Use transaction-start timestamps, but also add an indication
of when a given transaction last ran to prevent preempted
transactions from aborting other transactions running in
higher-priority processes.
\end{enumerate}
This 2005 paper thus gave an early indication of the complexity inherent
in contention management.
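The first strategy listed above, exponential backoff, can be sketched as a retry loop in which each failed attempt doubles the delay up to a cap. Here \co{try_txn()} and the busy-wait loop are hypothetical; real contention managers would also randomize the delay, and the cap embodies exactly the manual tuning that Scherer and Scott note.

```c
#include <assert.h>

static int attempts;	/* instrumentation for this sketch */

static int try_txn(void)	/* stub: fails twice, then succeeds */
{
	return ++attempts >= 3;
}

/* Retry with exponential backoff, capping the delay exponent at
 * max_shift.  Returns the number of backoffs that were needed. */
static int run_with_backoff(int max_shift)
{
	for (int shift = 0; ; shift++) {
		if (try_txn())
			return shift;
		if (shift > max_shift)
			shift = max_shift;	/* cap the delay */
		for (volatile long i = 0; i < (1L << shift); i++)
			continue;	/* busy-wait backoff */
	}
}
```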
Porter and Witchel~\cite{DonaldEPorter2007TRANSACT,Ramadan:2008:DTM:1521747.1521799}
considered loosening STM consistency requirements in order to simplify
contention management.
They noted RCU as a case in which an update will cause concurrent reads
to access old versions of the data.
Contention management is also an issue in HTM, and will be discussed
in \cref{sec:future:Conflict Handling,sec:future:Aborts and Rollbacks}.
\QuickQuiz{
Why not get with the times and apply machine learning to
contention management?
}\QuickQuizAnswer{
Many transactions have sub-microsecond overheads, so it would
be all too easy to burn more CPU on machine learning than was
saved by more efficiently executing transactions.
Machine learning might nevertheless have a role to play in
tuning contention managers to specific workloads.
It would of course be even better to not need the tuning, but
sometimes ``better'' is unattainable.
}\QuickQuizEnd
\subsection{Case Study:
Sequence Locking}
\label{sec:future:Case Study: Sequence Locking}
Sequence locking, described in
\cref{sec:defer:Sequence Locks},
is sometimes thought of as a trivial form of STM\@.
The sequence-locking read-side critical section is normally restricted
to doing loads, and it is retried whenever there is a conflict with a
sequence-locking updater.
Crucially, sequence-locking read-side critical sections do not conflict
with each other.
It is therefore instructive to consider how sequence locking handles
the STM challenges called out in the preceding sections.
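The retry behavior described above can be sketched in C11 atomics, in the style of the Linux kernel's \co{read_seqbegin()}/\co{read_seqretry()}: readers retry whenever the sequence number changed (or was odd, indicating an in-flight writer), and never write shared state, so readers cannot conflict with each other. This single-writer sketch is illustrative only, not the real kernel API.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_uint seq;
static int protected_data;

static unsigned read_seqbegin_sketch(void)
{
	unsigned s;

	/* An odd sequence number indicates a writer in progress. */
	while ((s = atomic_load_explicit(&seq, memory_order_acquire)) & 1)
		continue;
	return s;
}

static int read_seqretry_sketch(unsigned s)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&seq, memory_order_relaxed) != s;
}

static void write_data(int v)	/* single writer assumed */
{
	atomic_fetch_add_explicit(&seq, 1, memory_order_acq_rel); /* odd */
	protected_data = v;
	atomic_fetch_add_explicit(&seq, 1, memory_order_release); /* even */
}
```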
Sequence locking handles interactions with the outside world
(\cref{sec:future:Outside World}) by simply executing the
TM-unfriendly operation on each pass through its read-side
critical section.
For example, if the critical section doing an I/O write
is retried three times, it will execute that I/O write four
times, once for the original pass through the critical
section and once each for the three retries.
Doing I/O operations within a sequence-locking read-side
critical section might seem a bit unconventional.
It nevertheless makes a good deal of sense to do
an I/O write (\cref{sec:future:I/O Operations})
to a log for historical, debugging, or auditing purposes.
An RPC operation (\cref{sec:future:RPC Operations}) might gather needed data.
A time delay (\cref{sec:future:Time Delays}) might serve valuable
debugging purposes such as exercising the retry code path.
Even persistent operations (\cref{sec:future:Persistence}) are potentially
useful, for example, to carry out cross-application mutual exclusion.
Alternatively, just as with STM, users of sequence locking may choose
to buffer I/O and execute it just after the end of the read-side critical
section.
Sequence locking handles process modifications
(\cref{sec:future:Process Modification})
in the same way, by simply executing the process-modification operation
on each pass through its read-side critical section.
Creating processes and threads within a critical section
(\cref{sec:future:Multithreaded Transactions})
operates normally, though care should be taken in order to prevent
an often-retried critical section from becoming a ``fork bomb''.
The \co{exec()} system call (\cref{sec:future:The exec System Call})
has perfectly reasonable (if rather drastic) semantics.
Calls to functions within dynamic libraries
(\cref{sec:future:Dynamic Linking and Loading}),
such as Linux-kernel loadable modules, work just as well as do calls
to built-in functions.
Of course, if such a call were to cause a module to load, this might result
in a high retry probability on the first pass through the critical
section.
Remapping memory (\cref{sec:future:Memory-Mapping Operations})
behaves intuitively, at least assuming that neither the code making
up the critical section nor the sequence lock itself is remapped.
Note that these same restrictions also apply to normal locks.
Debuggers (\cref{sec:future:Debugging})
work normally within sequence-locking read-side critical sections,
just as they do within the critical sections for normal locks.
Synchronization primitives (\cref{sec:future:Synchronization})
can be used freely within sequence-locking read-side critical sections.
Locking (\cref{sec:future:Locking}),
reader-writer locking (\cref{sec:future:Reader-Writer Locking}),
deferred reclamation (\cref{sec:future:Deferred Reclamation}),
and extra-transactional accesses
(\cref{sec:future:Extra-Transactional Accesses})
may all be used within sequence-locking read-side critical sections,
and all give the expected results.
Sequence locking uses a trivial contention manager
(\cref{sec:future:Other Transactions}).
Sequence-locking readers are always rolled back in case of a concurrent
updater (as in Herlihy et al.~\cite{HerlihyLMS03}), sequence-locking
updaters typically delegate their contention management to some form of
mutual exclusion, and sequence-locking readers never conflict, and thus
never need their non-existent contention to be managed.
This rough-and-ready approach means that sequence-locking updaters
can starve readers, and users of sequence locking must therefore take
care to avoid frequent updates.
In short, sequence locking handles these STM challenges quite well.
This might be one reason that sequence locking is heavily used in
production.
However, a key enabler of sequence locking is the pushing of a number
of these challenges back onto the user.
This approach is reasonable because sequence locking has no ambitions
to be the one synchronization mechanism to rule them all.
So perhaps the greatest impediment to widespread STM adoption is its
proponents' overweening desire for full generality.
\subsection{Discussion}
\label{sec:future:Discussion}
The obstacles to universal TM adoption lead to the following
conclusions:
\begin{enumerate}
\item One interesting property of TM is the fact that transactions are
subject to rollback and retry.
This property underlies TM's difficulties with irreversible
operations, including unbuffered I/O, RPCs, memory-mapping
operations, time delays, and the \co{exec()} system call.
This property also has the unfortunate consequence of introducing
all the complexities inherent in the possibility of failure,
often in a developer-visible manner.
\item Another interesting property of TM, noted by
Shpeisman et al.~\cite{TatianaShpeisman2009CppTM}, is that TM
intertwines the synchronization with the data it protects.
This property underlies TM's issues with I/O, memory-mapping
operations, extra-transactional accesses, and debugging
breakpoints.
In contrast, conventional synchronization primitives, including
locking and RCU, maintain a clear separation between the
synchronization primitives and the data that they protect.
\item One of the stated goals of many workers in the TM area is to
ease parallelization of large sequential programs.
As such, individual transactions are commonly expected to
execute serially, which might do much to explain TM's issues
with multithreaded transactions.
\end{enumerate}
\QuickQuiz{
Given things like \co{spin_trylock()}, how does it make any
sense at all to claim that TM introduces the concept of failure???
}\QuickQuizAnswer{
	When using locking, \co{spin_trylock()} is a choice, with the
	corresponding failure-free choice being \co{spin_lock()},
	which is used in the common case:
	The v5.11 Linux kernel contains more than 100 times as many
	calls to \co{spin_lock()} as to \co{spin_trylock()}.
When using TM, the only failure-free choice is the irrevocable
transaction, which is not used in the common case.
In fact, the irrevocable transaction is not even available
in all TM implementations.
}\QuickQuizEnd
What should TM researchers and developers do about all of this?
One approach is to focus on TM in the small, concentrating on small
transactions where hardware assist potentially provides substantial
advantages over other synchronization primitives, and on small programs
where there is some evidence of increased productivity from a combined
TM-locking approach~\cite{VPankratius2011TMvsLockingProductivity}.
Sun took the small-transaction approach with its Rock research
CPU~\cite{DaveDice2009ASPLOSRockHTM}.
Some TM researchers seem to agree with these two small-is-beautiful
approaches~\cite{JMStone93}, others have much higher hopes for TM, and yet others
hint that high TM aspirations might be TM's worst
enemy~\cite[Section 6]{Attiya:2010:ICT:1835698.1835699}.
It is nonetheless quite possible that TM will be able to take on larger
problems, and this section has listed a few of the issues that must be
resolved if TM is to achieve this lofty goal.
Of course, everyone involved should treat this as a learning experience.
It would seem that TM researchers have a great deal to learn from
practitioners who have successfully built large software systems using
traditional synchronization primitives.
And vice versa.
\QuickQuiz{
	What is there to learn?
Why not just use TM for memory-based data structures and locking
for those rare cases featuring the many silly corner cases listed
in this silly section???
}\QuickQuizAnswer{
The year 2005 just called, and it wants its incandescent TM
marketing hype back.
In the year 2023, TM still has significant proving to do,
even with the advent of HTM, which is covered in the
upcoming
\cref{sec:future:Hardware Transactional Memory}.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-vision}}
\caption{The STM Vision}
\ContributedBy{fig:future:The STM Vision}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-conflict}}
\caption{The STM Reality:
Conflicts}
\ContributedBy{fig:future:The STM Reality: Conflicts}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-nonidempotent}}
\caption{The STM Reality:
Irrevocable Operations}
\ContributedBy{fig:future:The STM Reality: Irrevocable Operations}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-realtime}}
\caption{The STM Reality:
Realtime Response}
\ContributedBy{fig:future:The STM Reality: Realtime Response}{Melissa Broussard}
\end{figure}
But for the moment, the current state of STM
can best be summarized with a series of cartoons.
First,
\cref{fig:future:The STM Vision}
shows the STM vision.
As always, the reality is a bit more nuanced, as fancifully depicted by
\cref{fig:future:The STM Reality: Conflicts,%
fig:future:The STM Reality: Irrevocable Operations,%
fig:future:The STM Reality: Realtime Response}.\footnote{
Recent academic work-in-progress has investigated lock-based STM
systems for real-time use~\cite{JimAnderson2019STMRT,CatherineNemitz2018LockSTMrealtime},
albeit without any performance results, and with some indications
that real-time hybrid STM/HTM systems must choose between fast
common-case performance and worst-case forward-progress
guarantees~\cite{DBLP:journals/corr/AlistarhKKRS14,MartinSchoeberl2010realtimeTM}.}
Less fanciful STM retrospectives are also
available~\cite{JoeDuffy2010RetroTM,JoeDuffy2010RetroTM2}.
Alternatively, one might argue that sequence locking constitutes a
restricted form of STM that is heavily used in practice, as discussed in
\cref{sec:future:Case Study: Sequence Locking}.
Whether or not sequence locking is considered to be the standard bearer
for STM in actual practice, some commercially available hardware supports
restricted variants of HTM, which are addressed in the following section.