% future/tm.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0

\section{Transactional Memory}
\label{sec:future:Transactional Memory}
\epigraph{Everything should be as simple as it can be, but not simpler.}
{Albert Einstein, by way of Louis Zukofsky}

The idea of using transactions outside of databases goes back many
decades~\cite{DBLomet1977SIGSOFT,Knight:1986:AMF:319838.319854,Herlihy93a},
with the key difference between
database and non-database transactions being that non-database transactions
drop the ``D'' in the ``ACID''\footnote{
	Atomicity, consistency, isolation, and durability.}
properties defining database transactions.
The idea of supporting memory-based transactions, or ``transactional memory''
(TM), in hardware
is more recent~\cite{Herlihy93a}, but unfortunately, support for such
transactions in commodity hardware was not immediately forthcoming,
despite other somewhat similar proposals being put forward~\cite{JMStone93}.
Not long after, Shavit and Touitou proposed a software-only implementation
of transactional memory (STM) that was capable of running on commodity
hardware, give or take memory-ordering issues~\cite{Shavit95}.
This proposal languished for many years, perhaps due to the fact that
the research community's attention was absorbed by \IXacrl{nbs}
(see \cref{sec:advsync:Non-Blocking Synchronization}).

But by the turn of the century, TM started receiving
more attention~\cite{Martinez01a,Rajwar01a}, and by the middle of the
decade, the level of interest can only be termed
``incandescent''~\cite{MauriceHerlihy2005-TM-manifesto.pldi,
DanGrossman2007TMGCAnalogy}, with only a few voices of
caution~\cite{Blundell2005DebunkTM,McKenney2007PLOSTM}.

The basic idea behind TM is to execute a section of
code atomically, so that other threads see no intermediate state.
As such, the semantics of TM could be implemented
by simply replacing each transaction with a recursively acquirable
global lock acquisition and release, albeit with abysmal performance
and scalability.
Much of the complexity inherent in TM implementations, whether hardware
or software, lies in efficiently detecting when concurrent transactions
can safely run in parallel.
Because this detection is done dynamically, conflicting transactions
can be aborted or ``rolled back'', and in some implementations, this
failure mode is visible to the programmer.

Because transaction roll-back is increasingly unlikely as transaction
size decreases, TM might become quite attractive for small memory-based
operations,
such as linked-list manipulations used for stacks, queues, hash tables,
and search trees.
However, it is currently much more difficult to make the case for large
transactions, particularly those containing non-memory operations such
as I/O and process creation.
The following sections look at current challenges to the grand vision of
``Transactional Memory Everywhere''~\cite{PaulEMcKenney2009TMeverywhere}.
\Cref{sec:future:Outside World} examines the challenges faced
interacting with the outside world,
\cref{sec:future:Process Modification} looks at interactions
with process modification primitives,
\cref{sec:future:Synchronization} explores interactions with
other synchronization primitives, and finally
\cref{sec:future:Discussion} closes with some discussion.

\subsection{Outside World}
\label{sec:future:Outside World}

In the wise words of \ppl{Donald}{Knuth}:

\begin{quote}
	Many computer users feel that input and output are not actually part
	of ``real programming,'' they are merely things that (unfortunately)
	must be done in order to get information in and out of the machine.
\end{quote}

Whether or not we believe that input and output are ``real programming'',
the fact is that software absolutely must deal with the outside world.
This section therefore critiques transactional memory's outside-world
capabilities, focusing on I/O operations, time delays, and persistent
storage.

And these interactions with the outside world are the rock upon which
the claims of unconditional TM composability are shattered.
Yes, you can compose transactions, but only as long as there are no
intervening TM-unfriendly operations.
Just as you can compose lock-based critical sections, but only as long
as doing so does not introduce deadlocks.

In the end, it is unclear whether TM really provides better composability
than does locking in real-world code.

\subsubsection{I/O Operations}
\label{sec:future:I/O Operations}

One can execute I/O operations within a lock-based critical section,
while holding a \IX{hazard pointer}, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
What happens when you attempt to execute an I/O operation from within
a transaction?

The underlying problem is that transactions may be rolled back, for
example, due to conflicts.
Roughly speaking, this requires that all operations within any given
transaction be revocable, so that executing the operation twice has
the same effect as executing it once.
Unfortunately, I/O is in general the prototypical irrevocable
operation, making it difficult to include general I/O operations in
transactions.
In fact, general I/O is irrevocable:
Once you have pushed the proverbial button launching the nuclear warheads,
there is no turning back.

Here are some options for handling I/O within transactions:

\begin{enumerate}
\item	Restrict I/O within transactions to buffered I/O with in-memory
	buffers.
	These buffers may then be included in the transaction in the
	same way that any other memory location might be included.
	This seems to be the mechanism of choice, and it does work
	well in many common situations, such as stream I/O and
	mass-storage I/O\@.
	However, special handling is required in cases where multiple
	record-oriented output streams are merged onto a single file
	from multiple processes, as might be done using the ``a+''
	option to \co{fopen()} or the \co{O_APPEND} flag to \co{open()}.
	In addition, as will be seen in the next section, common
	networking operations cannot be handled via buffering.
\item	Prohibit I/O within transactions, so that any attempt to execute
	an I/O operation aborts the enclosing transaction (and perhaps
	multiple nested transactions).
	This approach seems to be the conventional TM approach for
	unbuffered I/O, but requires that TM interoperate with other
	synchronization primitives tolerating I/O.
\item	Prohibit I/O within transactions, but enlist the compiler's aid
	in enforcing this prohibition.
\item	Permit only one special
	\emph{irrevocable} transaction~\cite{SpearMichaelScott2008InevitableSTM}
	to proceed
	at any given time, thus allowing irrevocable transactions to
	contain I/O operations.\footnote{
		In earlier literature, irrevocable transactions are
		termed \emph{inevitable} transactions.}
	This works in general, but severely limits the scalability and
	performance of I/O operations.
	Given that scalability and performance are first-class goals of
	parallelism, this approach's generality seems a bit self-limiting.
	Worse yet, use of irrevocability to tolerate I/O operations
	seems to greatly restrict use of manual transaction-abort
	operations.\footnote{
		This difficulty was pointed out by Michael Factor.
		To see the problem, think through what TM should do
		in response to an attempt to abort a transaction after
		it has executed an irrevocable operation.}
	Finally, if there is an irrevocable transaction manipulating
	a given data item, any other transaction manipulating that
	same data item cannot have non-blocking semantics.
\item	Create new hardware and protocols such that I/O operations can
	be pulled into the transactional substrate.
	In the case of input operations, the hardware would need to
	correctly predict the result of the operation, and to abort the
	transaction if the prediction failed.
\end{enumerate}

I/O operations are a well-known weakness of TM, and it is not clear
that the problem of supporting I/O in transactions has a reasonable
general solution, at least if ``reasonable'' is to include usable
performance and scalability.
Nevertheless, continued time and attention to this problem will likely
produce additional progress.

\subsubsection{RPC Operations}
\label{sec:future:RPC Operations}

One can execute RPCs within a lock-based critical section, while holding
a hazard pointer, within a sequence-locking read-side critical section,
and from within a userspace-RCU read-side critical section, and even
all at the same time, if need be.
What happens when you attempt to execute an RPC from within a transaction?

If both the RPC request and its response are to be contained within the
transaction, and if some part of the transaction depends on the result
returned by the response, then it is not possible to use the memory-buffer
tricks that can be used in the case of buffered I/O\@.
Any attempt to
take this buffering approach would deadlock the transaction, as the
request could not be transmitted until the transaction was guaranteed
to succeed, but the transaction's success might not be knowable until
after the response is received, as is the case in the following example:

\begin{VerbatimN}[samepage=true]
begin_trans();
rpc_request();
i = rpc_response();
a[i]++;
end_trans();
\end{VerbatimN}

The transaction's memory footprint cannot be determined until after the
RPC response is received, and until the transaction's memory footprint
can be determined, it is impossible to determine whether the transaction
can be allowed to commit.
The only action consistent with transactional semantics is therefore to
unconditionally abort the transaction, which is, to say the least,
unhelpful.

Here are some options available to TM:

\begin{enumerate}
\item	Prohibit RPC within transactions, so that any attempt to execute
	an RPC operation aborts the enclosing transaction (and perhaps
	multiple nested transactions).
	Alternatively, enlist the compiler to enforce RPC-free
	transactions.
	This approach does work, but will require TM to
	interact with other synchronization primitives.
\item	Permit only one special
	irrevocable transaction~\cite{SpearMichaelScott2008InevitableSTM}
	to proceed at any given time, thus allowing irrevocable
	transactions to contain RPC operations.
	This works in general, but severely limits the scalability and
	performance of RPC operations.
	Given that scalability and performance are first-class goals of
	parallelism, this approach's generality seems a bit self-limiting.
	Furthermore, use of irrevocable transactions to permit RPC
	operations restricts manual transaction-abort operations
	once the RPC operation has started.
	Finally, if there is an irrevocable transaction manipulating
	a given data item, any other transaction manipulating that
	same data item must have blocking semantics.
\item	Identify special cases where the success of the transaction may
	be determined before the RPC response is received, and
	automatically convert these to irrevocable transactions immediately
	before sending the RPC request.
	Of course, if several concurrent transactions attempt RPC calls
	in this manner, it might be necessary to roll all but one of them
	back, with consequent degradation of performance and scalability.
	This approach nevertheless might be valuable given long-running
	transactions ending with an RPC\@.
	This approach must still restrict manual transaction-abort
	operations.
\item	Identify special cases where the RPC response may be moved out
	of the transaction, and then proceed using techniques similar
	to those used for buffered I/O.
\item	Extend the transactional substrate to include the RPC server as
	well as its client.
	This is in theory possible, as has been demonstrated by
	distributed databases.
	However, it is unclear whether the requisite performance and
	scalability requirements can be met by distributed-database
	techniques, given that memory-based TM has no slow disk drives
	behind which to hide such latencies.
	Of course, given the advent of solid-state disks, it is also quite
	possible that databases will need to redesign their approach to
	latency hiding.
\end{enumerate}

As noted in the prior section, I/O is a known weakness of TM, and RPC
is simply an especially problematic case of I/O.

\subsubsection{Time Delays}
\label{sec:future:Time Delays}

An important special case of interaction with extra-transactional accesses
involves explicit time delays within a transaction.
Of course, the idea of a time delay within a transaction flies in the
face of TM's atomicity property, but this sort of thing is arguably what
weak atomicity is all about.
Furthermore, correct interaction with memory-mapped I/O sometimes requires
carefully controlled timing, and applications often use time delays
for varied purposes.
Finally, one can execute time delays within a lock-based critical section,
while holding a hazard pointer, within a sequence-locking read-side
critical section, and from within a userspace-RCU read-side critical
section, and even all at the same time, if need be.
Doing so might not be wise from a contention or scalability viewpoint,
but then again, doing so does not raise any fundamental conceptual issues.

So, what can TM do about time delays within transactions?

\begin{enumerate}
\item	Ignore time delays within transactions.
	This has an appearance of elegance, but like too many other
	``elegant'' solutions, fails to survive first contact with
	legacy code.
	Such code, which might well have important time delays in critical
	sections, would fail upon being transactionalized.
\item	Abort transactions upon encountering a time-delay operation.
	This is attractive, but it is unfortunately not always possible
	to automatically detect a time-delay operation.
	Is that tight loop carrying out a critical computation, or is it
	simply waiting for time to elapse?
\item	Enlist the compiler to prohibit time delays within transactions.
\item	Let the time delays execute normally.
	Unfortunately, some TM implementations publish modifications only
	at commit time, which could defeat the purpose of the time delay.
\end{enumerate}

It is not clear that there is a single correct answer.
TM implementations featuring weak atomicity that publish changes
immediately within the transaction (rolling these changes back upon abort)
might be reasonably well served by the last alternative.
Even in this case, the code (or possibly even hardware) at the other
end of the transaction may require a substantial redesign to tolerate
aborted transactions.
This need for redesign would make it more difficult to apply transactional
memory to legacy code.

\subsubsection{Persistence}
\label{sec:future:Persistence}

There are many different types of locking primitives.
One interesting distinction is persistence, in other words, whether the
lock can exist independently of the address space of the process using
the lock.

Non-persistent locks include \co{pthread_mutex_lock()},
\co{pthread_rwlock_rdlock()}, and most kernel-level locking primitives.
If the memory locations instantiating a non-persistent lock's data
structures disappear, so does the lock.
For typical use of \co{pthread_mutex_lock()}, this means that when the
process exits, all of its locks vanish.
This property can be exploited in order to trivialize lock cleanup
at program shutdown time, but makes it more difficult for unrelated
applications to share locks, as such sharing requires the applications
to share memory.

\QuickQuiz{
	But suppose that an application exits while holding a
	\co{pthread_mutex_lock()} that happens to be located in a
	file-mapped region of memory?
}\QuickQuizAnswer{
	Indeed, in this case the lock would persist, much to the
	consternation of other processes attempting to acquire this
	lock that is held by a process that no longer exists.
	Which is why great care is required when using \co{pthread_mutex}
	objects located in file-mapped memory regions.
}\QuickQuizEnd

Persistent locks help avoid the need to share memory among unrelated
applications.
Persistent locking APIs include the flock family, \co{lockf()}, System
V semaphores, and the \co{O_CREAT} flag to \co{open()}.
These persistent APIs can be used to protect large-scale operations
spanning runs of multiple applications, and, in the case of \co{O_CREAT},
even surviving operating-system reboot.
If need be, locks can even span multiple computer systems via distributed
lock managers and distributed filesystems---and persist across reboots
of any or all of those computer systems.

Persistent locks can be used by any application, including applications
written using multiple languages and software environments.
In fact, a persistent lock might well be acquired by an application written
in C and released by an application written in Python.

How could a similar persistent functionality be provided for TM?

\begin{enumerate}
\item	Restrict persistent transactions to special-purpose environments
	designed to support them, for example, SQL\@.
	This clearly works, given the decades-long history of database
	systems, but does not provide the same degree of flexibility
	provided by persistent locks.
\item	Use snapshot facilities provided by some storage devices and/or
	filesystems.
	Unfortunately, this does not handle network communication,
	nor does it handle I/O to devices that do not provide snapshot
	capabilities, for example, memory sticks.
\item	Build a time machine.
\item	Avoid the problem entirely by using existing persistent facilities,
	presumably avoiding such use within transactions.
\end{enumerate}

Of course, the fact that it is called transactional \emph{memory}
should give us pause, as the name itself conflicts with the concept of
a persistent transaction.
It is nevertheless worthwhile to consider this possibility as an important
test case probing the inherent limitations of transactional memory.

\subsection{Process Modification}
\label{sec:future:Process Modification}

Processes are not eternal:
They are created and destroyed, their memory mappings are modified,
they are linked to dynamic libraries, and they are debugged.
These sections look at how transactional memory can handle an
ever-changing execution environment.

\subsubsection{Multithreaded Transactions}
\label{sec:future:Multithreaded Transactions}

It is perfectly legal to create processes and threads while holding
a lock or, for that matter, while holding a hazard pointer, within
a sequence-locking read-side critical section, and from within a
userspace-RCU read-side critical section, and even all at the same time,
if need be.
Not only is it legal, but it is quite simple, as can be seen from the
following code fragment:

\begin{VerbatimN}
pthread_mutex_lock(...);
for (i = 0; i < ncpus; i++)
  pthread_create(&tid[i], ...);
for (i = 0; i < ncpus; i++)
  pthread_join(tid[i], ...);
pthread_mutex_unlock(...);
\end{VerbatimN}

This pseudo-code fragment uses \co{pthread_create()} to spawn one thread
per CPU, then uses \co{pthread_join()} to wait for each to complete,
all under the protection of \co{pthread_mutex_lock()}.
The effect is to execute a lock-based critical section in parallel,
and one could obtain a similar effect using \co{fork()} and \co{wait()}.
Of course, the critical section would need to be quite large to justify
the thread-spawning overhead, but there are many examples of large
critical sections in production software.

What might TM do about thread spawning within a transaction?

\begin{enumerate}
\item	Declare \co{pthread_create()} to be illegal within transactions,
	preferably by aborting the transaction.
	Alternatively, enlist the compiler to enforce
	\co{pthread_create()}-free transactions.
\item	Permit \co{pthread_create()} to be executed within a
	transaction, but only the parent thread will be considered to
	be part of the transaction.
	This approach seems to be reasonably compatible with existing and
	posited TM implementations, but seems to be a trap for the unwary.
	This approach raises further questions, such as how to handle
	conflicting child-thread accesses.
\item	Convert the \co{pthread_create()}s to function calls.
	This approach is also an attractive nuisance, as it does not
	handle the not-uncommon cases where the child threads communicate
	with one another.
	In addition, it does not permit concurrent execution of the body
	of the transaction.
\item	Extend the transaction to cover the parent and all child threads.
	This approach raises interesting questions about the nature of
	conflicting accesses, given that the parent and children are
	presumably permitted to conflict with each other, but not with
	other threads.
	It also raises interesting questions as to what should happen
	if the parent thread does not wait for its children before
	committing the transaction.
	Even more interesting, what happens if the parent conditionally
	executes \co{pthread_join()} based on the values of variables
	participating in the transaction?
	The answers to these questions are reasonably straightforward
	in the case of locking.
	The answers for TM are left as an exercise for the reader.
\end{enumerate}

Given that parallel execution of transactions is commonplace in the
database world, it is perhaps surprising that current TM proposals do
not provide for it.
On the other hand, the example above is a fairly sophisticated use
of locking that is not normally found in simple textbook examples,
so perhaps its omission is to be expected.
That said, some researchers are using transactions to autoparallelize
code~\cite{ArunRaman2010MultithreadedTransactions},
and there are rumors that other TM researchers are investigating
fork/join parallelism within transactions, so perhaps this topic will
soon be addressed more thoroughly.

\subsubsection{The \tco{exec()} System Call}
\label{sec:future:The exec System Call}

One can execute an \co{exec()} system call within a lock-based critical
section, while holding a hazard pointer, within a sequence-locking
read-side critical section, and from within a userspace-RCU read-side
critical section, and even all at the same time, if need be.
The exact semantics depends on the type of primitive.

In the case of non-persistent primitives (including
\co{pthread_mutex_lock()}, \co{pthread_rwlock_rdlock()}, and userspace RCU),
if the \co{exec()} succeeds, the whole address space vanishes, along
with any locks being held.
Of course, if the \co{exec()} fails, the address space still lives,
so any associated locks would also still live.
A bit strange perhaps, but well defined.

On the other hand, persistent primitives (including the flock family,
\co{lockf()}, System V semaphores, and the \co{O_CREAT} flag to
\co{open()}) would survive regardless of whether the \co{exec()}
succeeded or failed, so that the \co{exec()}ed program might well
release them.

\QuickQuiz{
	What about non-persistent primitives represented by data
	structures in \co{mmap()} regions of memory?
	What happens when there is an \co{exec()} within a critical
	section of such a primitive?
}\QuickQuizAnswer{
	If the \co{exec()}ed program maps those same regions of
	memory, then this program could in principle simply release
	the lock.
	The question as to whether this approach is sound from a
	software-engineering viewpoint is left as an exercise for
	the reader.
}\QuickQuizEnd

What happens when you attempt to execute an \co{exec()} system call
from within a transaction?

\begin{enumerate}
\item	Disallow \co{exec()} within transactions, so that the enclosing
	transactions abort upon encountering the \co{exec()}.
	This is well defined, but clearly requires non-TM synchronization
	primitives for use in conjunction with \co{exec()}.
\item	Disallow \co{exec()} within transactions, with the compiler
	enforcing this prohibition.
	There is a draft specification for TM in C++ that takes
	this approach, allowing functions to be decorated with
	the \co{transaction_safe} and \co{transaction_unsafe}
	attributes.\footnote{
		Thanks to Mark Moir for pointing me at this spec, and
		to Michael Wong for having pointed me at an earlier
		revision some time back.}
	This approach has some advantages over aborting the transaction
	at runtime, but again requires non-TM synchronization primitives
	for use in conjunction with \co{exec()}.
	One disadvantage is the need to decorate a great many library
	functions with \co{transaction_safe} and \co{transaction_unsafe}
	attributes.
\item	Treat the transaction in a manner similar to non-persistent
	locking primitives, so that the transaction survives if \co{exec()}
	fails, and silently commits if the \co{exec()} succeeds.
	The case where only some of the variables affected by the
	transaction reside in \co{mmap()}ed memory (and thus could
	survive a successful \co{exec()} system call) is left as an
	exercise for the reader.
\item	Abort the transaction (and the \co{exec()} system call) if the
	\co{exec()} system call would have succeeded, but allow the
	transaction to continue if the \co{exec()} system call would
	fail.
	This is in some sense the ``correct'' approach, but it would
	require considerable work for a rather unsatisfying result.
\end{enumerate}

The \co{exec()} system call is perhaps the strangest example of an
obstacle to universal TM applicability, as it is not completely clear
what approach makes sense, and some might argue that this is merely a
reflection of the perils of real-life interaction with \co{exec()}.
That said, the two options prohibiting \co{exec()} within transactions
are perhaps the most logical of the group.

Similar issues surround the \co{exit()} and \co{kill()} system calls,
as well as a \co{longjmp()} or an exception that would exit the transaction.
(Where did the \co{longjmp()} or exception come from?)

\subsubsection{Dynamic Linking and Loading}
\label{sec:future:Dynamic Linking and Loading}

Lock-based critical sections, code holding a hazard pointer,
sequence-locking read-side critical sections, and userspace-RCU read-side
critical sections can (separately or in combination) legitimately contain
code that invokes dynamically linked and loaded functions, including C/C++
shared libraries and Java class libraries.
Of course, the code contained in these libraries is by definition
unknowable at compile time.
So, what happens if a dynamically loaded function is invoked within
a transaction?

This question has two parts:
\begin{enumerate*}[(a)]
\item How do you dynamically link and load a function within a transaction
and
\item What do you do about the unknowable nature of the code within
this function?
\end{enumerate*}
To be fair, item (b) poses some challenges for locking and userspace-RCU
as well, at least in theory.
For example, the dynamically linked function might introduce a \IX{deadlock}
for locking or might (erroneously) introduce a \IX{quiescent state} into a
userspace-RCU read-side critical section.
The difference is that while the class of operations permitted in locking
and userspace-RCU critical sections is well-understood, there appears
to still be considerable uncertainty in the case of TM\@.
In fact, different implementations of TM seem to have different restrictions.

So what can TM do about dynamically linked and loaded library functions?
Options for part (a), the actual loading of the code, include the following:

\begin{enumerate}
\item	Treat the dynamic linking and loading in a manner similar to a
	page fault, so that the function is loaded and linked, possibly
	aborting the transaction in the process.
	If the transaction is aborted, the retry will find the function
	already present, and the transaction can thus be expected to
	proceed normally.
\item	Disallow dynamic linking and loading of functions from within
	transactions.
\end{enumerate}

Options for part (b), the inability to detect TM-unfriendly operations
in a not-yet-loaded function, include the following:

\begin{enumerate}
\item	Just execute the code:
	If there are any TM-unfriendly operations in the function,
	simply abort the transaction.
	Unfortunately, this approach makes it impossible for the compiler
	to determine whether a given group of transactions may be safely
	composed.
	One way to permit composability regardless is irrevocable
	transactions; however, current implementations permit only a
	single irrevocable transaction to proceed at any given time,
	which can severely limit performance and scalability.
	Irrevocable transactions also restrict use of manual
	transaction-abort operations.
	Finally, if there is an irrevocable transaction manipulating
	a given data item, any other transaction manipulating that
	same data item cannot have non-blocking semantics.
\item	Decorate the function declarations indicating which functions
	are TM-friendly.
	These decorations can then be enforced by the compiler's type system.
	Of course, for many languages, this requires language extensions
	to be proposed, standardized, and implemented, with the
	corresponding time delays, and also with the corresponding
	decoration of a great many otherwise uninvolved library functions.
	That said, the standardization effort is already in
	progress~\cite{Ali-Reza-Adl-Tabatabai2009CppTM,Ali-Reza-Adl-Tabatabai2012CppTM}.
\item	As above, disallow dynamic linking and loading of functions from
	within transactions.
\end{enumerate}

I/O operations are of course a known weakness of TM, and dynamic linking
and loading can be thought of as yet another special case of I/O\@.
Nevertheless, the proponents of TM must either solve this problem, or
resign themselves to a world where TM is but one tool of several in the
parallel programmer's toolbox.
(To be fair, a number of TM proponents have long since resigned themselves
to a world containing more than just TM.)

| \subsubsection{Memory-Mapping Operations} |
| \label{sec:future:Memory-Mapping Operations} |
| |
| It is perfectly legal to execute memory-mapping operations (including |
| \co{mmap()}, \co{shmat()}, and \co{munmap()}~\cite{TheOpenGroup1997SUS}) |
| within a lock-based critical section, while holding a hazard pointer, |
| within a sequence-locking read-side critical section, and from within a |
| userspace-RCU read-side critical section, and even all at the same time, |
| if need be. |
| What happens when you attempt to execute such an operation from within |
| a transaction? |
| More to the point, what happens if the memory region being remapped |
| contains some variables participating in the current thread's transaction? |
| And what if this memory region contains variables participating in some |
| other thread's transaction? |
| |
| It should not be necessary to consider cases where the TM system's |
| metadata is remapped, given that most locking primitives do not define |
| the outcome of remapping their lock variables. |
| |
| Here are some TM memory-mapping options: |
| |
| \begin{enumerate} |
| \item Memory remapping is illegal within a transaction, and will result |
| in all enclosing transactions being aborted. |
| This does simplify things somewhat, but also requires that TM |
| interoperate with synchronization primitives that do tolerate |
| remapping from within their critical sections. |
| \item Memory remapping is illegal within a transaction, and the |
| compiler is enlisted to enforce this prohibition. |
| \item Memory mapping is legal within a transaction, but aborts all |
| other transactions having variables in the region mapped over. |
| \item Memory mapping is legal within a transaction, but the mapping |
| operation will fail if the region being mapped overlaps with |
| the current transaction's footprint. |
| \item All memory-mapping operations, whether within or outside a |
| transaction, check the region being mapped against the memory |
| footprint of all transactions in the system. |
| If there is overlap, then the memory-mapping operation fails. |
| \item The effect of memory-mapping operations that overlap the memory |
| footprint of any transaction in the system is determined by the |
| TM conflict manager, which might dynamically determine whether |
| to fail the memory-mapping operation or abort any conflicting |
| transactions. |
| \end{enumerate} |
| |
| It is interesting to note that \co{munmap()} leaves the relevant region |
| of memory unmapped, which could have additional interesting |
| implications.\footnote{ |
| This difference between mapping and unmapping was noted by |
| Josh Triplett.} |
| |
| \subsubsection{Debugging} |
| \label{sec:future:Debugging} |
| |
The usual debugging operations such as breakpoints work normally within
lock-based critical sections and from userspace-RCU read-side critical
sections.
| However, in initial transactional-memory hardware |
| implementations~\cite{DaveDice2009ASPLOSRockHTM} an exception within |
| a transaction will abort that transaction, which in turn means that |
| breakpoints abort all enclosing transactions. |
| |
| So how can transactions be debugged? |
| |
| \begin{enumerate} |
| \item Use software emulation techniques within transactions containing |
| breakpoints. |
| Of course, it might be necessary to emulate all transactions |
| any time a breakpoint is set within the scope of any transaction. |
| If the runtime system is unable to determine whether or not a |
| given breakpoint is within the scope of a transaction, then it |
| might be necessary to emulate all transactions just to be on |
| the safe side. |
| However, this approach might impose significant overhead, which |
| might in turn obscure the bug being pursued. |
| \item Use only hardware TM implementations that are capable of |
| handling breakpoint exceptions. |
| Unfortunately, as of this writing (March 2021), all such |
| implementations are research prototypes. |
| \item Use only software TM implementations, which are |
| (very roughly speaking) more tolerant of exceptions than are |
| the simpler of the hardware TM implementations. |
| Of course, software TM tends to have higher overhead than hardware |
| TM, so this approach may not be acceptable in all situations. |
| \item Program more carefully, so as to avoid having bugs in the |
| transactions in the first place. |
| As soon as you figure out how to do this, please do let everyone |
| know the secret! |
| \end{enumerate} |
| |
| There is some reason to believe that transactional memory will deliver |
| \IX{productivity} improvements compared to other synchronization mechanisms, |
| but it does seem quite possible that these improvements could easily |
| be lost if traditional debugging techniques cannot be applied to |
| transactions. |
| This seems especially true if transactional memory is to be used by |
| novices on large transactions. |
| In contrast, macho ``top-gun'' programmers might be able to dispense with |
| such debugging aids, especially for small transactions. |
| |
| Therefore, if transactional memory is to deliver on its productivity |
| promises to novice programmers, the debugging problem does need to |
| be solved. |
| |
| \subsection{Synchronization} |
| \label{sec:future:Synchronization} |
| |
| If transactional memory someday proves that it can be everything to everyone, |
| it will not need to interact with any other synchronization mechanism. |
| Until then, it will need to work with synchronization mechanisms that |
| can do what it cannot, or that work more naturally in a given situation. |
| The following sections outline the current challenges in this area. |
| |
| \subsubsection{Locking} |
| \label{sec:future:Locking} |
| |
| It is commonplace to acquire locks while holding other locks, which works |
| quite well, at least as long as the usual well-known software-engineering |
| techniques are employed to avoid deadlock. |
| It is not unusual to acquire locks from within RCU read-side critical |
| sections, which eases deadlock concerns because RCU read-side primitives |
| cannot participate in lock-based deadlock cycles. |
| It is also possible to acquire locks while holding hazard pointers and |
| within sequence-lock read-side critical sections. |
| But what happens when you attempt to acquire a lock from within a transaction? |
| |
| In theory, the answer is trivial: |
| Simply manipulate the data structure representing the lock as part of |
| the transaction, and everything works out perfectly. |
| In practice, a number of non-obvious complications~\cite{Volos2008TRANSACT} |
| can arise, depending on implementation details of the TM system. |
| These complications can be resolved, but at the cost of a 45\pct\ increase in |
| overhead for locks acquired outside of transactions and a 300\pct\ increase |
| in overhead for locks acquired within transactions. |
| Although these overheads might be acceptable for transactional |
| programs containing small amounts of locking, they are often completely |
| unacceptable for production-quality lock-based programs wishing to use |
| the occasional transaction. |
| |
| Here are some options available to TM: |
| |
| \begin{enumerate} |
| \item Use only locking-friendly TM implementations~\cite{DaveDice2006DISC}. |
| Unfortunately, the locking-unfriendly implementations have some |
| attractive properties, including low overhead for successful |
| transactions and the ability to accommodate extremely large |
| transactions. |
| \item Use TM only ``in the small'' when introducing TM to lock-based |
| programs, thereby accommodating the limitations of |
| locking-friendly TM implementations. |
| \item Set aside locking-based legacy systems entirely, re-implementing |
| everything in terms of transactions. |
| This approach has no shortage of advocates, but this requires |
| that all the issues described in this series be resolved. |
| During the time it takes to resolve these issues, competing |
| synchronization mechanisms will of course also have the |
| opportunity to improve. |
| \item Use TM strictly as an optimization in lock-based systems, as was |
| done by the TxLinux~\cite{ChistopherJRossbach2007a} group and |
| by a great many transactional lock elision |
| projects~\cite{MartinPohlack2011HTM2TLE,Kleen:2014:SEL:2566590.2576793,PascalFelber2016rwlockElision,SeongJaePark2020HTMRCUlock}. |
| This approach seems sound, but leaves the locking design |
| constraints (such as the need to avoid deadlock) firmly in place. |
| \item Strive to reduce the overhead imposed on locking primitives. |
| \end{enumerate} |
| |
| The fact that there could possibly be a problem interfacing TM and locking |
| came as a surprise to many, which underscores the need to try out new |
| mechanisms and primitives in real-world production software. |
| Fortunately, the advent of open source means that a huge quantity of |
| such software is now freely available to everyone, including researchers. |
| |
| \subsubsection{Reader-Writer Locking} |
| \label{sec:future:Reader-Writer Locking} |
| |
| It is commonplace to read-acquire reader-writer locks while holding |
| other locks, which just works, at least as long as the usual well-known |
| software-engineering techniques are employed to avoid deadlock. |
| Read-acquiring reader-writer locks from within RCU read-side critical |
| sections also works, and doing so eases deadlock concerns because RCU |
| read-side primitives cannot participate in lock-based deadlock cycles. |
| It is also possible to acquire locks while holding hazard pointers and |
| within sequence-lock read-side critical sections. |
| But what happens when you attempt to read-acquire a reader-writer lock |
| from within a transaction? |
| |
| Unfortunately, the straightforward approach to read-acquiring the |
| traditional counter-based reader-writer lock within a transaction defeats |
| the purpose of the reader-writer lock. |
| To see this, consider a pair of transactions concurrently attempting to |
| read-acquire the same reader-writer lock. |
| Because read-acquisition involves modifying the reader-writer lock's |
| data structures, a conflict will result, which will roll back one of |
| the two transactions. |
| This behavior is completely inconsistent with the reader-writer lock's |
| goal of allowing concurrent readers. |
| |
| Here are some options available to TM: |
| |
| \begin{enumerate} |
| \item Use per-CPU or per-thread reader-writer |
| locking~\cite{WilsonCHsieh92a}, which allows a |
| given CPU (or thread, respectively) to manipulate only local |
| data when read-acquiring the lock. |
| This would avoid the conflict between the two transactions |
| concurrently read-acquiring the lock, permitting both to proceed, |
| as intended. |
| Unfortunately, (1)~the write-acquisition overhead of |
| per-CPU/thread locking can be extremely high, (2)~the memory |
| overhead of per-CPU/thread locking can be prohibitive, and |
| (3)~this transformation is available only when you have access to |
| the source code in question. |
| Other more-recent scalable |
| reader-writer locks~\cite{YossiLev2009SNZIrwlock} |
| might avoid some or all of these problems. |
| \item Use TM only ``in the small'' when introducing TM to lock-based |
| programs, thereby avoiding read-acquiring reader-writer locks |
| from within transactions. |
| \item Set aside locking-based legacy systems entirely, re-implementing |
| everything in terms of transactions. |
| This approach has no shortage of advocates, but this requires |
| that \emph{all} the issues described in this series be resolved. |
| During the time it takes to resolve these issues, competing |
| synchronization mechanisms will of course also have the |
| opportunity to improve. |
| \item Use TM strictly as an optimization in lock-based systems, as was |
| done by the TxLinux~\cite{ChistopherJRossbach2007a} group, |
| and as has been done by more recent work using TM to elide |
| reader-writer locks~\cite{PascalFelber2016rwlockElision}. |
| This approach seems sound, at least on \Power{8} |
| CPUs~\cite{Le:2015:TMS:3266491.3266500}, but leaves the locking |
| design constraints (such as the need to avoid deadlock) firmly |
| in place. |
| \end{enumerate} |
| |
| Of course, there might well be other non-obvious issues surrounding |
| combining TM with reader-writer locking, as there in fact were with |
| exclusive locking. |
| |
| \subsubsection{Deferred Reclamation} |
| \label{sec:future:Deferred Reclamation} |
| |
| This section focuses mainly on RCU\@. |
| Similar issues and possible resolutions arise when combining TM with |
| other deferred-reclamation mechanisms such as |
| \IXalt{reference counters}{reference count} and |
| hazard pointers. |
| In the text below, known differences are specifically called out. |
| |
| Reference counting, hazard pointers, and RCU are all heavily used, as noted in |
| \cref{sec:defer:RCU Related Work,sec:defer:Which to Choose? (Production Use)}. |
| This means that any TM implementation that chooses not to surmount each |
| and every challenge called out in this section needs to interoperate |
| cleanly and efficiently with all of these synchronization mechanisms. |
| |
The TxLinux group from the University of Texas at Austin appears to have
been the first group to take on the challenge of RCU/TM
interoperability~\cite{ChistopherJRossbach2007a}.
| Because they applied TM to the Linux 2.6 kernel, which uses RCU, they |
| had no choice but to integrate TM and RCU, with TM taking the place of |
| locking for RCU updates. |
| Unfortunately, although the paper does state that the RCU implementation's |
| locks (e.g., \co{rcu_ctrlblk.lock}) were converted to transactions, |
| it is silent about what was done with those locks used by RCU-based updates |
| (for example, \co{dcache_lock}). |
| |
| More recently, \ppl{Dimitrios}{Siakavaras} et al.~have applied |
| HTM and RCU to search trees~\cite{Siakavaras2017CombiningHA,DimitriosSiakavaras2020RCU-HTM-B+Trees}, |
| \ppl{Christina}{Giannoula} et al.~have used HTM and RCU to color |
| graphs~\cite{ChristinaGiannoula2018HTM-RCU-graphcoloring}, |
| and |
| \ppl{SeongJae}{Park} et al.~have used HTM and RCU to optimize high-contention |
| locking on \IXacr{numa} systems~\cite{SeongJaePark2020HTMRCUlock}. |
| |
| It is important to note that RCU permits readers and updaters to run |
| concurrently, further permitting RCU readers to access data that is in |
| the act of being updated. |
| Of course, this property of RCU, whatever its performance, scalability, |
| and real-time-response benefits might be, flies in the face of the |
| underlying atomicity properties of TM, although the \Power{8} CPU family's |
| suspended-transaction facility~\cite{Le:2015:TMS:3266491.3266500} makes |
| it an exception to this rule. |
| |
| So how should TM-based updates interact with concurrent RCU readers? |
| Some possibilities are as follows: |
| |
| \begin{enumerate} |
| \item RCU readers abort concurrent conflicting TM updates. |
| This is in fact the approach taken by the TxLinux project. |
| This approach does preserve RCU semantics, and also preserves |
| RCU's read-side performance, scalability, and real-time-response |
| properties, but it does have the unfortunate side-effect of |
| unnecessarily aborting conflicting updates. |
| In the worst case, a long sequence of RCU readers could |
| potentially starve all updaters, which could in theory result |
| in system hangs. |
| In addition, not all TM implementations offer the strong atomicity |
| required to implement this approach, and for good reasons. |
| \item RCU readers that run concurrently with conflicting TM updates |
| get old (pre-transaction) values from any conflicting RCU loads. |
| This preserves RCU semantics and performance, and also prevents |
| RCU-update \IX{starvation}. |
| However, not all TM implementations can provide timely access |
| to old values of variables that have been tentatively updated |
| by an in-flight transaction. |
| In particular, log-based TM implementations that maintain |
| old values in the log (thus providing excellent TM commit |
| performance) are not likely to be happy with this approach. |
| Perhaps the \co{rcu_dereference()} primitive can be leveraged |
| to permit RCU to access the old values within a greater range |
| of TM implementations, though performance might still be an issue. |
| Nevertheless, there are popular TM implementations that have |
| been integrated with RCU in this |
| manner~\cite{DonaldEPorter2007TRANSACT,PhilHoward2011RCUTMRBTree, |
| PhilipWHoward2013RCUrbtree}. |
| \item If an RCU reader executes an access that conflicts with an |
| in-flight transaction, then that RCU access is delayed until |
| the conflicting transaction either commits or aborts. |
	This approach preserves RCU semantics, but not RCU's performance
	or real-time response, particularly in the presence of long-running
	transactions.
| In addition, not all TM implementations are capable of delaying |
| conflicting accesses. |
| Nevertheless, this approach seems eminently reasonable for hardware |
| TM implementations that support only small transactions. |
| \item RCU readers are converted to transactions. |
| This approach pretty much guarantees that RCU is compatible with |
| any TM implementation, but it also imposes TM's rollbacks on RCU |
| read-side critical sections, destroying RCU's real-time response |
| guarantees, and also degrading RCU's read-side performance. |
| Furthermore, this approach is infeasible in cases where any of |
| the RCU read-side critical sections contains operations that |
| the TM implementation in question is incapable of handling. |
| This approach is more difficult to apply to hazard pointers and |
| reference counters, which do not have a sharply defined notion |
| of a reader as a section of code. |
| \item Many update-side uses of RCU modify a single pointer to publish |
| a new data structure. |
| In some of these cases, RCU can safely be permitted to see a |
| transactional pointer update that is subsequently rolled back, |
| as long as the transaction respects memory ordering and as long |
| as the roll-back process uses \co{call_rcu()} to free up the |
| corresponding structure. |
| Unfortunately, not all TM implementations respect memory barriers |
| within a transaction. |
| Apparently, the thought is that because transactions are supposed |
| to be atomic, the ordering of the accesses within the transaction |
| is not supposed to matter. |
| \item Prohibit use of TM in RCU updates. |
| This is guaranteed to work, but restricts use of TM. |
| \end{enumerate} |
| |
| It seems likely that additional approaches will be uncovered, especially |
| given the advent of user-level RCU and hazard-pointer |
| implementations.\footnote{ |
| Kudos to the TxLinux group, Maged Michael, and Josh Triplett |
| for coming up with a number of the above alternatives.} |
| It is interesting to note that many of the better performing and |
| scaling STM implementations make use of RCU-like techniques |
| internally~\cite{UCAM-CL-TR-579,KeirFraser2007withoutLocks,Gu:2019:PSE:3358807.3358885,Kim:2019:MSR:3297858.3304040}. |
| |
| \QuickQuiz{ |
| MV-RLU looks pretty good! |
| Doesn't it beat RCU hands down? |
| }\QuickQuizAnswer{ |
| One might get that impression from a quick read of the abstract, |
| but more careful readers will notice the ``for a wide range of |
| workloads'' phrase in the last sentence. |
| It turns out that this phrase is quite important: |
| |
| \begin{enumerate} |
| \item Their RCU evaluation uses synchronous grace periods, which |
| needlessly throttle updates, as noted in their |
| Section~6.2.1. |
| See \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| \cpageref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| of this book to see that the venerable asynchronous |
| \co{call_rcu()} primitive enables RCU to perform and |
| scale quite well with large numbers of updaters. |
| Furthermore, in Section~3.7 of their paper, the authors |
| admit that asynchronous grace periods are important to |
| MV-RLU scalability. |
| A fair comparison would also allow RCU the benefits of |
| asynchrony. |
| \item They use a poorly tuned 1,000-bucket hash table containing |
| 10,000~elements. |
| In addition, their 448~hardware threads need considerably |
| more than 1,000~buckets to avoid the lock contention |
| that they correctly state limits RCU performance in |
| their benchmarks. |
| A useful comparison would feature a properly tuned |
| hash table. |
| \item Their RCU hash table used per-bucket locks, which they |
| call out as a bottleneck, which is not a surprise given |
| the long hash chains and small ratio of buckets to threads. |
| A number of their competing mechanisms instead use |
| lockfree techniques, thus avoiding the per-bucket-lock |
| bottleneck, which cynics might claim sheds some light |
| on the authors' otherwise inexplicable choice of poorly |
| tuned hash tables. |
		The first graph in the middle row of the authors'
		Figure~4 shows what RCU can achieve if not hobbled by
		artificial bottlenecks, as does the first portion of
		the second graph in that same row.
| \item Their linked-list operation permits RLU to do concurrent |
| modifications of different elements in the list, while |
| RCU is forced to serialize updates. |
| Again, RCU has always worked just fine in conjunction |
| with lockless updaters, a fact that has been set forth |
| in academic literature that the authors |
| cited~\cite{MathieuDesnoyers2012URCU}. |
| A fair comparison would use the same style of update |
| for RCU as it does for MV-RLU. |
| \item The authors fail to consider combining RCU and sequence |
| locking, which is used in the Linux kernel to give |
| readers coherent views of multi-pointer updates. |
| \item The authors fail to consider RCU-based solutions to the |
| Issaquah Challenge~\cite{PaulEMcKenney2016IssaquahCPPCON}, |
| which also gives readers a coherent view of multi-pointer |
| updates, albeit with a weaker view of ``coherent''. |
| \end{enumerate} |
| |
| It is surprising that the anonymous reviewers of this paper did |
| not demand an apples-to-apples comparison of MV-RLU and RCU\@. |
| Nevertheless, the authors should be congratulated on producing |
| an academic paper that presents an all-too-rare example of good |
| scalability combined with strong read-side coherence. |
| They are also to be congratulated on overcoming the traditional |
| academic prejudice against asynchronous grace periods, |
| which greatly aided their scalability. |
| |
| Interestingly enough, RLU and RCU take different approaches to avoid |
| the inherent limitations of STM noted by \ppl{Hagit}{Attiya} et |
| al.~\cite{Attiya:2009:STMReadOnlyLimits}. |
| RCU avoids providing strict serializability and RLU avoids providing |
| invisible read-only transactions, both thus avoiding the |
| limitations. |
| }\QuickQuizEnd |
| |
| \subsubsection{Extra-Transactional Accesses} |
| \label{sec:future:Extra-Transactional Accesses} |
| |
| Within a lock-based critical section, it is perfectly legal to manipulate |
| variables that are concurrently accessed or even modified outside that |
| lock's critical section, with one common example being statistical |
| counters. |
| The same thing is possible within RCU read-side critical |
| sections, and is in fact the common case. |
| |
| Given mechanisms such as the so-called ``dirty reads'' that are |
| prevalent in production database systems, it is not surprising |
| that extra-transactional accesses have received serious attention |
| from the proponents of TM, with the concept of weak |
| atomicity~\cite{Blundell2006TMdeadlock} being but one case in point. |
| |
| Here are some extra-transactional options: |
| |
| \begin{enumerate} |
| \item Conflicts due to extra-transactional accesses always abort |
| transactions. |
| This is strong atomicity. |
| \item Conflicts due to extra-transactional accesses are ignored, |
| so only conflicts among transactions can abort transactions. |
| This is weak atomicity. |
| \item Transactions are permitted to carry out non-transactional |
| operations in special cases, such as when allocating memory or |
| interacting with lock-based critical sections. |
| \item Produce hardware extensions that permit some operations |
| (for example, addition) to be carried out concurrently on a |
| single variable by multiple transactions. |
| \item Introduce weak semantics to transactional memory. |
| One approach is the combination with RCU described in |
| \cref{sec:future:Deferred Reclamation}, |
| while Gramoli and Guerraoui |
| survey a number of other weak-transaction |
| approaches~\cite{Gramoli:2014:DTP:2541883.2541900}, for example, |
| restricted partitioning of large |
| ``elastic'' transactions into smaller transactions, thus |
| reducing conflict probabilities (albeit with tepid performance |
| and scalability). |
| Perhaps further experience will show that some uses of |
| extra-transactional accesses can be replaced by weak |
| transactions. |
| \end{enumerate} |
| |
| It appears that transactions were conceived in a vacuum, with no |
| interaction required with any other synchronization mechanism. |
| If so, it is no surprise that much confusion and complexity arises when |
| combining transactions with non-transactional accesses. |
| But unless transactions are to be confined to small updates to isolated |
| data structures, or alternatively to be confined to new programs |
| that do not interact with the huge body of existing parallel code, |
| then transactions absolutely must be so combined if they are to have |
| large-scale practical impact in the near term. |
| |
| \subsection{Other Transactions} |
| \label{sec:future:Other Transactions} |
| % If we ever get epigraphs down at this level, perhaps the old |
| % "Hell is other people" quote by Jean-Paul Sartre. His clarification |
| % also applies, which can be stated as "Heaven is also other people". |
| |
| Because transactions are to appear atomic with respect to each other, |
| something must be done when multiple transactions attempt to access the |
| same variable at the same time. |
| If all the transactions are reading that variable and none of them are |
| updating it, then that something can be ``nothing'', but otherwise |
| the transactions must be at least partially serialized. |
| This serialization is often achieved by aborting and rolling back some |
| subset of those transactions. |
| |
| How to choose which to abort and roll back? |
| |
| There has been much work on this question. |
| The answer is ``use a contention manager'', but that simply shifts this |
| question to ``what should a contention manager do?'' |
| A full exposition on contention managers is beyond the scope of this section. |
| This section will therefore limit itself to a few seminal papers |
| in this area. |
| |
| Herlihy et al.~\cite{HerlihyLMS03} |
| have writing transactions unconditionally abort reading transactions, |
| which has the virtue of simplicity, but which can result in reader |
| starvation. |
| This paper also introduces the concept of early release, which can |
| allow a long traversal to proceed concurrently with small updates. |
| From this viewpoint, one might characterize RCU readers as ``instant |
| release'', but this is a property that causes RCU readers to behave in |
| a decidedly non-STM manner. |
| |
| Hammond et al.~\cite{LanceHammond2004a} |
| advise using small transactions, which does reduce contention-manager |
| complexity, but which also fails to deliver on the full STM promise |
| of automated concurrency. |
| This paper also introduced annotations to identify different types of |
| transactions. |
| |
| Scherer and Scott~\cite{WilliamNSchererIII2005} |
| describe a number of STM contention-management strategies. |
This paper defines ``visible'' and ``invisible'' readers for
extra-transactional accesses, and discusses a number of contention-management
strategies, including:
| |
| \begin{enumerate} |
| \item Exponential backoff, which they conclude required manual |
| tuning. |
\item Preferentially abort transactions that have done the least work,
	where the work of an aborted prior attempt to complete a
	transaction is counted for the next retry of that transaction.
| This has nice fairness properties, but can cause huge transactions |
| to hog the system for an extended time. |
| \item Preferentially abort transactions that have done the least work, |
| but also credit a given transaction with the work done by any |
| other transactions blocked waiting for that transaction to complete. |
| This adds a form of priority inheritance to the mix. |
| \item Track how many times a given transaction has been aborted by |
| another transaction as another way to promote some notion of |
| fairness. |
| \item Use transaction-start timestamps in order to preferentially abort |
| transactions that started most recently. |
| \item Use transaction-start timestamps, but also add an indication |
| of when a given transaction last ran to prevent preempted |
| transactions from aborting other transactions running in |
| higher-priority processes. |
| \end{enumerate} |
| |
| This 2005 paper thus gave an early indication of the complexity inherent |
| in contention management. |
| |
| Porter and Witchel~\cite{DonaldEPorter2007TRANSACT,Ramadan:2008:DTM:1521747.1521799} |
| considered loosening STM consistency requirements in order to simplify |
| contention management. |
| They noted RCU as a case in which an update will cause concurrent reads |
| to access old versions of the data. |
| |
| Contention management is also an issue in HTM, and will be discussed |
| in \cref{sec:future:Conflict Handling,sec:future:Aborts and Rollbacks}. |
| |
| \QuickQuiz{ |
| Why not get with the times and apply machine learning to |
| contention management? |
| }\QuickQuizAnswer{ |
| Many transactions have sub-microsecond overheads, so it would |
| be all too easy to burn more CPU on machine learning than was |
| saved by more efficiently executing transactions. |
| Machine learning might nevertheless have a role to play in |
| tuning contention managers to specific workloads. |
| It would of course be even better to not need the tuning, but |
| sometimes ``better'' is unattainable. |
| }\QuickQuizEnd |
| |
| \subsection{Case Study: |
| Sequence Locking} |
| \label{sec:future:Case Study: Sequence Locking} |
| |
| Sequence locking, described in |
| \cref{sec:defer:Sequence Locks}, |
| is sometimes thought of as a trivial form of STM\@. |
| The sequence-locking read-side critical section is normally restricted |
| to doing loads, and it is retried whenever there is a conflict with a |
| sequence-locking updater. |
Crucially, sequence-locking read-side critical sections do not conflict
with each other.
| It is therefore instructive to consider how sequence locking handles |
| the STM challenges called out in the preceding sections. |
| |
| Sequence locking handles interactions with the outside world |
| (\cref{sec:future:Outside World}) by simply executing the |
| TM-unfriendly operation on each pass through its read-side |
| critical section. |
| For example, if the critical section doing an I/O write |
| is retried three times, it will execute that I/O write four |
| times, once for the original pass through the critical |
| section and once each for the three retries. |
| |
| Doing I/O operations within a sequence-locking read-side |
| critical section might seem a bit unconventional. |
| It nevertheless makes a good deal of sense to do |
| an I/O write (\cref{sec:future:I/O Operations}) |
| to a log for historical, debugging, or auditing purposes. |
| An RPC operation (\cref{sec:future:RPC Operations}) might gather needed data. |
| A time delay (\cref{sec:future:Time Delays}) might serve valuable |
| debugging purposes such as exercising the retry code path. |
| Even persistent operations (\cref{sec:future:Persistence}) are potentially |
| useful, for example, to carry out cross-application mutual exclusion. |
| Alternatively, just as with STM, users of sequence locking may choose |
| to buffer I/O and execute it just after the end of the read-side critical |
| section. |
| |
| Sequence locking handles process modifications |
| (\cref{sec:future:Process Modification}) |
| in the same way, by simply executing the process-modification operation |
| on each pass through its read-side critical section. |
| |
| Creating processes and threads within a critical section |
| (\cref{sec:future:Multithreaded Transactions}) |
operates normally, though care should be taken to prevent
an often-retried critical section from becoming a ``fork bomb''.
| The \co{exec()} system call (\cref{sec:future:The exec System Call}) |
| has perfectly reasonable (if rather drastic) semantics. |
| Calls to functions within dynamic libraries |
| (\cref{sec:future:Dynamic Linking and Loading}), |
| such as Linux-kernel loadable modules, work just as well as do calls |
| to built-in functions. |
Of course, if such a call were to cause a module to load, this might
result in a high retry probability on the first pass through the
critical section.
Remapping memory (\cref{sec:future:Memory-Mapping Operations})
behaves intuitively, at least assuming that neither the code making
up the critical section nor the sequence lock itself is remapped.
Note that these same restrictions also apply to normal locks.
| Debuggers (\cref{sec:future:Debugging}) |
| work normally within sequence-locking read-side critical sections, |
| just as they do within the critical sections for normal locks. |
| |
| Synchronization primitives (\cref{sec:future:Synchronization}) |
| can be used freely within sequence-locking read-side critical sections. |
| Locking (\cref{sec:future:Locking}), |
| reader-writer locking (\cref{sec:future:Reader-Writer Locking}), |
| deferred reclamation (\cref{sec:future:Deferred Reclamation}), |
| and extra-transactional accesses |
| (\cref{sec:future:Extra-Transactional Accesses}) |
| may all be used within sequence-locking read-side critical sections, |
| and all give the expected results. |
| |
| Sequence locking uses a trivial contention manager |
| (\cref{sec:future:Other Transactions}). |
Sequence-locking readers are always rolled back in case of a concurrent
updater (as are the readers of Herlihy et al.~\cite{HerlihyLMS03}), and
sequence-locking updaters typically delegate their contention management
to some form of mutual exclusion.
Because sequence-locking readers never conflict with each other, they
never need their non-existent contention to be managed.
| This rough-and-ready approach means that sequence-locking updaters |
| can starve readers, and users of sequence locking must therefore take |
| care to avoid frequent updates. |
| |
| In short, sequence locking handles these STM challenges quite well. |
| This might be one reason that sequence locking is heavily used in |
| production. |
| However, a key enabler of sequence locking is the pushing of a number |
| of these challenges back onto the user. |
| This approach is reasonable because sequence locking has no ambitions |
| to be the one synchronization mechanism to rule them all. |
| So perhaps the greatest impediment to widespread STM adoption is its |
| proponents' overweening desire for full generality. |
| |
| \subsection{Discussion} |
| \label{sec:future:Discussion} |
| |
| The obstacles to universal TM adoption lead to the following |
| conclusions: |
| |
| \begin{enumerate} |
| \item One interesting property of TM is the fact that transactions are |
| subject to rollback and retry. |
| This property underlies TM's difficulties with irreversible |
| operations, including unbuffered I/O, RPCs, memory-mapping |
| operations, time delays, and the \co{exec()} system call. |
| This property also has the unfortunate consequence of introducing |
| all the complexities inherent in the possibility of failure, |
| often in a developer-visible manner. |
| \item Another interesting property of TM, noted by |
| Shpeisman et al.~\cite{TatianaShpeisman2009CppTM}, is that TM |
| intertwines the synchronization with the data it protects. |
| This property underlies TM's issues with I/O, memory-mapping |
| operations, extra-transactional accesses, and debugging |
| breakpoints. |
| In contrast, conventional synchronization primitives, including |
| locking and RCU, maintain a clear separation between the |
| synchronization primitives and the data that they protect. |
| \item One of the stated goals of many workers in the TM area is to |
| ease parallelization of large sequential programs. |
| As such, individual transactions are commonly expected to |
| execute serially, which might do much to explain TM's issues |
| with multithreaded transactions. |
| \end{enumerate} |
| |
| \QuickQuiz{ |
| Given things like \co{spin_trylock()}, how does it make any |
| sense at all to claim that TM introduces the concept of failure??? |
| }\QuickQuizAnswer{ |
When using locking, \co{spin_trylock()} is a choice, with the
corresponding failure-free choice being \co{spin_lock()},
which is used in the common case:
The v5.11 Linux kernel contains more than 100 times as many
calls to \co{spin_lock()} as to \co{spin_trylock()}.
| When using TM, the only failure-free choice is the irrevocable |
| transaction, which is not used in the common case. |
| In fact, the irrevocable transaction is not even available |
| in all TM implementations. |
| }\QuickQuizEnd |
| |
| What should TM researchers and developers do about all of this? |
| |
One approach is to focus on TM in the small: On small
transactions where hardware assist potentially provides substantial
advantages over other synchronization primitives, and on small programs
where there is some evidence of increased productivity for a combined
TM-locking approach~\cite{VPankratius2011TMvsLockingProductivity}.
| Sun took the small-transaction approach with its Rock research |
| CPU~\cite{DaveDice2009ASPLOSRockHTM}. |
| Some TM researchers seem to agree with these two small-is-beautiful |
| approaches~\cite{JMStone93}, others have much higher hopes for TM, and yet others |
| hint that high TM aspirations might be TM's worst |
| enemy~\cite[Section 6]{Attiya:2010:ICT:1835698.1835699}. |
| It is nonetheless quite possible that TM will be able to take on larger |
| problems, and this section has listed a few of the issues that must be |
| resolved if TM is to achieve this lofty goal. |
| |
| Of course, everyone involved should treat this as a learning experience. |
It would seem that TM researchers have a great deal to learn from
practitioners who have successfully built large software systems using
traditional synchronization primitives.
| |
| And vice versa. |
| |
| \QuickQuiz{ |
What is there to learn?
| Why not just use TM for memory-based data structures and locking |
| for those rare cases featuring the many silly corner cases listed |
| in this silly section??? |
| }\QuickQuizAnswer{ |
| The year 2005 just called, and it wants its incandescent TM |
| marketing hype back. |
| |
| In the year 2023, TM still has significant proving to do, |
| even with the advent of HTM, which is covered in the |
| upcoming |
| \cref{sec:future:Hardware Transactional Memory}. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/TM-the-vision}} |
| \caption{The STM Vision} |
| \ContributedBy{fig:future:The STM Vision}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-conflict}} |
| \caption{The STM Reality: |
| Conflicts} |
| \ContributedBy{fig:future:The STM Reality: Conflicts}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/TM-the-reality-nonidempotent}} |
| \caption{The STM Reality: |
| Irrevocable Operations} |
| \ContributedBy{fig:future:The STM Reality: Irrevocable Operations}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.7in}{!}{\includegraphics{cartoons/TM-the-reality-realtime}} |
| \caption{The STM Reality: |
| Realtime Response} |
| \ContributedBy{fig:future:The STM Reality: Realtime Response}{Melissa Broussard} |
| \end{figure} |
| |
| But for the moment, the current state of STM |
| can best be summarized with a series of cartoons. |
| First, |
| \cref{fig:future:The STM Vision} |
| shows the STM vision. |
| As always, the reality is a bit more nuanced, as fancifully depicted by |
| \cref{fig:future:The STM Reality: Conflicts,% |
| fig:future:The STM Reality: Irrevocable Operations,% |
| fig:future:The STM Reality: Realtime Response}.\footnote{ |
| Recent academic work-in-progress has investigated lock-based STM |
| systems for real-time use~\cite{JimAnderson2019STMRT,CatherineNemitz2018LockSTMrealtime}, |
| albeit without any performance results, and with some indications |
| that real-time hybrid STM/HTM systems must choose between fast |
| common-case performance and worst-case forward-progress |
| guarantees~\cite{DBLP:journals/corr/AlistarhKKRS14,MartinSchoeberl2010realtimeTM}.} |
| Less fanciful STM retrospectives are also |
| available~\cite{JoeDuffy2010RetroTM,JoeDuffy2010RetroTM2}. |
| |
| Alternatively, one might argue that sequence locking constitutes a |
| restricted form of STM that is heavily used in practice, as discussed in |
| \cref{sec:future:Case Study: Sequence Locking}. |
| |
| Whether or not sequence locking is considered to be the standard bearer |
| for STM in actual practice, some commercially available hardware supports |
| restricted variants of HTM, which are addressed in the following section. |