| % future/htm.tex |
| |
| \section{Hardware Transactional Memory} |
| \label{sec:future:Hardware Transactional Memory} |
| |
As of early 2012, hardware transactional memory (HTM) is starting to
appear in commercially available commodity computer systems.
| This section makes a first attempt to find its place in the parallel |
| programmer's toolbox. |
| |
| From a conceptual viewpoint, HTM uses processor caches and speculative |
| execution to make a designated group of statements (a ``transaction'') |
| take effect atomically |
| from the viewpoint of any other transactions running on other processors. |
| This transaction is initiated by a |
| begin-transaction machine instruction and completed by a commit-transaction |
| machine instruction. |
| There is typically also an abort-transaction machine instruction, which |
| squashes the speculation (as if the begin-transaction instruction and |
| all following instructions had not executed) and commences execution |
| at a failure handler. |
| The location of the failure handler is typically specified by the |
| begin-transaction instruction, either as an explicit failure-handler |
| address or via a condition code set by the instruction itself. |
| Each transaction executes atomically with respect to all other transactions. |
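
For concreteness, the following minimal sketch shows how this might
look using Intel's TSX RTM intrinsics, one commercially announced
instantiation of these machine instructions.
The function name is purely illustrative, and the fallback action is
left to the caller.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <immintrin.h> /* Compile with -mrtm. */

int update_pair(int *x, int *y)
{
  if (_xbegin() == _XBEGIN_STARTED) {
    *x = 1;  /* Takes effect atomically. */
    *y = 3;
    _xend(); /* Commit the transaction. */
    return 1;
  }
  /* Failure handler: _xbegin() instead
   * returned an abort status. */
  return 0;  /* Caller falls back, e.g., to locking. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}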
| |
HTM has a number of important benefits, including automatic
dynamic partitioning of data structures, reduced synchronization-primitive
cache misses, and support for a fair number of practical applications.
| |
| However, it always pays to read the fine print, and HTM is no exception. |
| A major point of this section is determining under what conditions HTM's |
| benefits outweigh the complications hidden in its fine print. |
| To this end, Section~\ref{sec:future:HTM Benefits WRT to Locking} |
| describes HTM's benefits and |
| Section~\ref{sec:future:HTM Weaknesses WRT Locking} describes its weaknesses. |
| This is the same approach used in earlier |
| papers~\cite{McKenney2007PLOSTM,PaulEMcKenney2010OSRGrassGreener}, |
| but focused on HTM rather than TM as a whole.\footnote{ |
| And I gratefully acknowledge many stimulating |
| discussions with the other authors, Maged Michael, Josh Triplett, |
| and Jonathan Walpole, as well as with Andi Kleen.} |
| |
| Section~\ref{sec:future:HTM Weaknesses WRT to Locking When Augmented} then describes |
| HTM's weaknesses with respect to the combination of synchronization |
| primitives used in the Linux kernel (and in some user-space applications). |
| Section~\ref{sec:future:Where Does HTM Best Fit In?} looks at where HTM |
| might best fit into the parallel programmer's toolbox, and |
| Section~\ref{sec:future:Potential Game Changers} lists some events that might |
| greatly increase HTM's scope and appeal. |
| Finally, Section~\ref{sec:future:Conclusions} |
| presents concluding remarks. |
| |
\subsection{HTM Benefits WRT Locking}
| \label{sec:future:HTM Benefits WRT to Locking} |
| |
| The primary benefits of HTM are |
| (1)~its avoidance of the cache misses that are often incurred by |
| other synchronization primitives, |
| (2)~its ability to dynamically partition |
| data structures, |
| and (3)~the fact that it has |
| a fair number of practical applications. |
| I break from TM tradition by not listing ease of use separately |
| for two reasons. |
| First, ease of use should stem from HTM's primary benefits, |
| which this section focuses on. |
| Second, there has been considerable controversy surrounding attempts to |
| test for raw programming |
| talent~\cite{RichardBornat2006SheepGoats,SaeedDehnadi2009SheepGoats} |
| and even around the use of small programming exercises in job |
| interviews~\cite{RegBraithwaite2007FizzBuzz}. |
| This indicates that we really do not have a grasp on what makes |
| programming easy or hard. |
| Therefore, this section focuses on the three benefits listed above, |
| each in one of the following sections. |
| |
| \subsubsection{Avoiding Synchronization Cache Misses} |
| \label{sec:future:Avoiding Synchronization Cache Misses} |
| |
| Most synchronization mechanisms are based on data structures that are |
| operated on by atomic instructions. |
| Because these atomic instructions normally operate by first causing |
| the relevant cache line to be owned by the CPU that they are running on, |
| a subsequent execution |
| of the same instance of that synchronization primitive on some other |
| CPU will result in a cache miss. |
| These communications cache misses severely degrade both the performance and |
| scalability of conventional synchronization |
| mechanisms~\cite[Section 4.2.3]{Anderson97}. |
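
To see where these misses come from, consider the following
test-and-set spinlock sketch, written using C11 atomics with
hypothetical names.
Each \co{atomic_exchange()} must obtain exclusive ownership of the
lock's cache line, so each handoff of the lock from one CPU to
another incurs at least one cache miss.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>

void ts_acquire(atomic_int *lp)
{
  /* Pulls lp's cache line to this CPU in
   * exclusive state on every iteration. */
  while (atomic_exchange(lp, 1))
    continue; /* Held by someone else. */
}

void ts_release(atomic_int *lp)
{
  atomic_store(lp, 0);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}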
| |
| In contrast, HTM synchronizes by using the CPU's cache, avoiding the need |
| for a synchronization data structure and resultant cache misses. |
HTM's advantage is greatest in cases where a lock data structure is
placed in a separate cache line, in which case converting a given
critical section to an HTM transaction can reduce that critical section's
overhead by a full cache miss.
| These savings can be quite significant for the common case of short |
| critical sections, at least for those situations where the elided lock |
| does not share a cache line with an oft-written variable protected by |
| that lock. |
| |
| \QuickQuiz{} |
	Why would it matter whether oft-written variables share a cache
	line with the lock variable?
| \QuickQuizAnswer{ |
| If the lock is in the same cacheline as some of the variables |
| that it is protecting, then writes to those variables by one CPU |
| will invalidate that cache line for all the other CPUs. |
| These invalidations will |
| generate large numbers of conflicts and retries, perhaps even |
| degrading performance and scalability compared to locking. |
| } \QuickQuizEnd |
| |
| \subsubsection{Dynamic Partitioning of Data Structures} |
| \label{sec:future:Dynamic Partitioning of Data Structures} |
| |
| A major obstacle to the use of some conventional synchronization mechanisms |
| is the need to statically partition data structures. |
| There are a number of data structures that are trivially |
| partitionable, with the most prominent example being hash tables, |
| where each hash chain constitutes a partition. |
| Allocating a lock for each hash chain then trivially parallelizes |
| the hash table for operations confined to a given chain.\footnote{ |
| And it is also easy to extend this scheme to operations accessing |
| multiple hash chains by having such operations acquire the |
| locks for all relevant chains in hash order.} |
| Partitioning is similarly trivial for arrays, radix trees, and a few |
| other data structures. |
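
Returning to the hash-table example, a minimal per-chain-locking
sketch might look as follows; the structure and function names are
illustrative only.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <pthread.h>

#define NBUCKETS 1024

struct element {
  struct element *next;
  /* ... key and payload ... */
};

struct bucket {
  pthread_mutex_t lock; /* One per chain,
                         * initialized at boot. */
  struct element *chain;
} table[NBUCKETS];

void chain_insert(struct element *ep,
                  unsigned long hash)
{
  struct bucket *bp = &table[hash % NBUCKETS];

  /* Other chains proceed in parallel. */
  pthread_mutex_lock(&bp->lock);
  ep->next = bp->chain;
  bp->chain = ep;
  pthread_mutex_unlock(&bp->lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}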
| |
| However, partitioning for many types of trees and graphs is quite |
| difficult, and the results are often quite complex~\cite{Ellis80}. |
Although it is possible to use two-phase locking and hashed arrays
| of locks to partition general data structures, other techniques |
| have proven preferable~\cite{DavidSMiller2006HashedLocking}, |
| as will be discussed in |
| Section~\ref{sec:future:HTM Weaknesses WRT to Locking When Augmented}. |
| Given its avoidance of synchronization cache misses, |
| HTM is therefore a very real possibility for large non-partitionable |
| data structures, at least assuming relatively small updates. |
| |
| \QuickQuiz{} |
| Why are relatively small updates important to HTM performance |
| and scalability? |
| \QuickQuizAnswer{ |
| The larger the updates, the greater the probability of conflict, |
| and thus the greater probability of retries, which degrade |
| performance. |
| } \QuickQuizEnd |
| |
| \subsubsection{Practical Value} |
| \label{sec:future:Practical Value} |
| |
Some evidence of HTM's practical value has been demonstrated on a number
of hardware platforms, including
| Sun Rock~\cite{DaveDice2009ASPLOSRockHTM} and |
| Azul Vega~\cite{CliffClick2009AzulHTM}. |
| It is reasonable to assume that practical benefits will flow from the |
| more recent IBM Blue Gene/Q, Intel Haswell TSX, and AMD ASF systems. |
| |
| Expected practical benefits include: |
| |
| \begin{enumerate} |
| \item Lock elision for in-memory data access and |
| update~\cite{Martinez01a,Rajwar02a}. |
| \item Concurrent access and small random updates to large non-partitionable |
| data structures. |
| \end{enumerate} |
| |
| However, HTM also has some very real shortcomings, which will be discussed |
| in the next section. |
| |
| \subsection{HTM Weaknesses WRT Locking} |
| \label{sec:future:HTM Weaknesses WRT Locking} |
| |
| The concept of HTM is quite simple: A group of accesses and updates to |
| memory occurs atomically. |
| However, as is the case with many simple ideas, complications arise |
| when you apply it to real systems in the real world. |
| These complications are as follows: |
| |
| \begin{enumerate} |
| \item Transaction-size limitations. |
| \item Conflict handling. |
| \item Aborts and rollbacks. |
| \item Lack of forward-progress guarantees. |
| \item Irrevocable operations. |
| \item Semantic differences. |
| \end{enumerate} |
| |
| Each of these complications is covered in the following sections, |
| followed by a summary. |
| |
| \subsubsection{Transaction-Size Limitations} |
| \label{sec:future:Transaction-Size Limitations} |
| |
| The transaction-size limitations of current HTM implementations |
| stem from the use of the processor caches to hold the data |
| affected by the transaction. |
| Although this allows a given CPU to make the transaction appear atomic to |
| other CPUs by executing the transaction within the confines of its cache, |
| it also means that any transaction that does not fit must be aborted. |
| Furthermore, events that change execution context, such as interrupts, |
| system calls, exceptions, traps, and context switches either must |
| abort any ongoing transaction on the CPU in question or must further |
| restrict transaction size due to the cache footprint of the other |
| execution context. |
| |
| Of course, modern CPUs tend to have large caches, and the data required |
| for many transactions would fit easily in a one-megabyte cache. |
| Unfortunately, with caches, sheer size is not all that matters. |
The problem is that most caches
can be thought of as hash tables implemented in hardware.
| However, hardware caches do not chain their buckets (which are normally |
| called \emph{sets}), but rather |
| provide a fixed number of cachelines per set. |
| The number of elements provided for each set in a given cache |
| is termed that cache's \emph{associativity}. |
| |
| Although cache associativity varies, the eight-way associativity of |
| the level-0 cache on the laptop I am typing this on is not unusual. |
What this means is that if a given transaction needs to touch
nine cache lines, and if all nine cache lines map to the same
set, then that transaction cannot possibly complete, no matter how
many megabytes of additional space might be available in that cache.
| Yes, given randomly selected data elements in a given data structure, |
| the probability of that transaction being able to commit is quite |
| high, but there can be no guarantee. |
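
To make this concrete, consider a hypothetical cache with 64-byte
lines, 4,096 sets, and eight ways, for two megabytes in all.
The set index might then be computed as follows, so that any nine
cache lines whose addresses are separated by exact multiples of
262,144 bytes cannot all reside in that cache at the same time, the
two-megabyte total capacity notwithstanding:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#define LINE_SIZE 64 /* Hypothetical geometry. */
#define NSETS 4096   /* x 8 ways = 2 MB. */

unsigned long cache_set(unsigned long addr)
{
  /* Nine lines with equal set index cannot
   * coexist in an eight-way set. */
  return (addr / LINE_SIZE) % NSETS;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}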
| |
| There has been some research work to alleviate this limitation. |
| Fully associative \emph{victim caches} would alleviate the associativity |
| constraints, but there are currently stringent performance and |
| energy-efficiency constraints on the sizes of victim caches. |
| That said, HTM victim caches for unmodified cache lines can be quite |
| small, as they need to retain only the address: |
| The data itself can be written to memory or shadowed by other caches, |
| while the address itself is sufficient to detect a conflicting |
| write~\cite{RaviRajwar2012TSX}. |
| |
| \emph{Unbounded transactional memory} (UTM) |
| schemes~\cite{CScottAnanian2006,KevinEMoore2006} |
| use DRAM as an extremely large victim cache, but integrating such schemes |
| into a production-quality cache-coherence mechanism is still an unsolved |
| problem. |
| In addition, use of DRAM as a victim cache may have unfortunate |
| performance and energy-efficiency consequences, particularly |
| if the victim cache is to be fully associative. |
| Finally, the ``unbounded'' aspect of UTM assumes that all of DRAM |
| could be used as a victim cache, while in reality |
| the large but still fixed amount of DRAM assigned to a given CPU |
| would limit the size of that CPU's transactions. |
| Other schemes use a combination of hardware and software transactional |
| memory~\cite{SanjeevKumar2006} and one could imagine using STM as a |
| fallback mechanism for HTM. |
| |
| However, to the best of my knowledge, currently available systems do |
| not implement any of these research ideas, and perhaps for good reason. |
| |
| \subsubsection{Conflict Handling} |
| \label{sec:future:Conflict Handling} |
| |
The first complication within conflict handling is the possibility of
\emph{conflicts} themselves.
| For example, suppose that transactions~A and~B are defined as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \begin{verbatim} |
| Transaction A Transaction B |
| |
| x = 1; y = 2; |
| y = 3; x = 4; |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Suppose that each transaction executes concurrently on its own processor. |
| If transaction~A stores to \co{x} at the same time that transaction~B |
| stores to \co{y}, neither transaction can progress. |
| To see this, suppose that transaction~A executes its store to \co{y}. |
| Then transaction~A will be interleaved within transaction~B, in violation |
| of the requirement that transactions execute atomically with respect to |
| each other. |
| Allowing transaction~B to execute its store to \co{x} similarly violates |
| the atomic-execution requirement. |
| This situation is termed a \emph{conflict}, which happens whenever two |
| concurrent transactions access the same variable where at least one of |
| the accesses is a store. |
| The system is therefore obligated to abort one or both of the transactions |
| in order to allow execution to progress. |
The choice of exactly which transaction to abort is an interesting topic
that will very likely retain the ability to generate Ph.D. dissertations
for some time to come; see for
example~\cite{EgeAkpinar2011HTM2TLE}.\footnote{
	Liu and Spear's paper entitled ``Toxic
| Transactions''~\cite{YujieLiu2011ToxicTransactions} is |
| particularly instructive in this regard.} |
| For the purposes of this section, we can assume that the system makes |
| a random choice. |
| |
| Another complication is conflict detection, which is comparatively |
| straightforward, at least in the simplest case. |
| When a processor is executing a transaction, it marks every cache line |
| touched by that transaction. |
| If the processor's cache receives a request involving a cache line that |
| has been marked as touched by the current transaction, a potential |
| conflict has occurred. |
More sophisticated systems might try to order the current processor's
transaction to precede that of the processor sending the request,
and optimization of this process will likely also retain the ability
to generate Ph.D. dissertations for quite some time.
This section nevertheless assumes a very simple conflict-detection strategy.
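
The following sketch models this simple strategy; it is a cartoon of
the detection predicate, not a description of any particular vendor's
coherence protocol.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct line_marks {
  int read_marked;  /* Transactional load.  */
  int write_marked; /* Transactional store. */
};

int is_conflict(struct line_marks *lm,
                int remote_is_write)
{
  if (lm->write_marked)
    return 1; /* Any remote access conflicts. */
  return lm->read_marked && remote_is_write;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}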
| |
For HTM to work effectively, however, the probability of conflict must
be suitably low, which in turn requires that the data structures be
organized so as to maintain that low probability.
| For example, a red-black tree with simple insertion, deletion, and search |
| operations fits this description, but a red-black |
| tree that maintains an accurate count of the number of elements in |
| the tree does not.\footnote{ |
| The need to update the count would result in additions to and |
| deletions from the tree conflicting with each other, resulting |
| in strong non-commutativity~\cite{HagitAttiya2011LawsOfOrder,Attiya:2011:LOE:1925844.1926442,PaulEMcKenney2011SNC}.} |
| For another example, a red-black tree that enumerates all elements in |
| the tree in a single transaction will have high conflict probabilities, |
| degrading performance and scalability. |
| As a result, many serial programs will require some restructuring before |
| HTM can work effectively. |
| In some cases, practitioners will prefer to take the extra steps |
| (in the red-black-tree case, perhaps switching to a partitionable |
| data structure such as a radix tree or a hash table), and just |
| use locking, particularly during the time before HTM is readily available |
| on all relevant |
| architectures~\cite{CliffClick2009AzulHTM}. |
| |
| \QuickQuiz{} |
| How could a red-black tree possibly efficiently enumerate all |
| elements of the tree regardless of choice of synchronization |
| mechanism??? |
| \QuickQuizAnswer{ |
| In many cases, the enumeration need not be exact. |
| In these cases, hazard pointers or RCU may be used to protect |
| readers with low probability of conflict with any given insertion |
| or deletion. |
| } \QuickQuizEnd |
| |
| Furthermore, the fact that conflicts can occur brings failure handling |
| into the picture, as discussed in the next section. |
| |
| \subsubsection{Aborts and Rollbacks} |
| \label{sec:future:Aborts and Rollbacks} |
| |
| Because any transaction might be aborted at any time, it is important |
| that transactions contain no statements that cannot be rolled back. |
| This means that transactions cannot do I/O, system calls, or debugging |
| breakpoints (no single stepping in the debugger for HTM transactions!!!). |
| Instead, transactions must confine themselves to accessing normal |
| cached memory. |
| Furthermore, on some systems, interrupts, exceptions, traps, |
| TLB misses, and other events will also abort transactions. |
| Given the number of bugs that have resulted from improper handling |
| of error conditions, it is fair to ask what impact aborts and rollbacks |
| have on ease of use. |
| |
| \QuickQuiz{} |
| But why can't a debugger emulate single stepping by setting |
| breakpoints at successive lines of the transaction, relying |
| on the retry to retrace the steps of the earlier instances |
| of the transaction? |
| \QuickQuizAnswer{ |
| This scheme might work with reasonably high probability, but it |
| can fail in ways that would be quite surprising to most users. |
| To see this, consider the following transaction: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \small |
| \begin{verbatim} |
| 1 begin_trans(); |
| 2 if (a) { |
| 3 do_one_thing(); |
| 4 do_another_thing(); |
| 5 } else { |
| 6 do_a_third_thing(); |
| 7 do_a_fourth_thing(); |
| 8 } |
| 9 end_trans(); |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Suppose that the user sets a breakpoint at line 3, which triggers, |
| aborting the transaction and entering the debugger. |
| Suppose that between the time that the breakpoint triggers |
| and the debugger gets around to stopping all the threads, some |
| other thread sets the value of \co{a} to zero. |
| When the poor user attempts to single-step the program, surprise! |
| The program is now in the else-clause instead of the then-clause. |
| |
| This is \emph{not} what I call an easy-to-use debugger. |
| } \QuickQuizEnd |
| |
| Of course, aborts and rollbacks raise the question of whether HTM can |
| be useful for hard real-time systems. |
| Do the performance benefits of HTM outweigh the costs of the aborts |
| and rollbacks, and if so under what conditions? |
| Can transactions use priority boosting? |
| Or should transactions for high-priority threads instead preferentially |
| abort those of low-priority threads? |
| If so, how is the hardware efficiently informed of priorities? |
| The literature on real-time use of HTM is quite sparse, perhaps because |
| researchers are finding more than enough problems in getting |
| transactions to work well in non-real-time environments. |
| |
| Because current HTM implementations might deterministically abort a |
| given transaction, software must provide fallback code. |
| This fallback code must use some other form of synchronization, for |
| example, locking. |
| If the fallback is used frequently, then all the limitations of locking, |
| including the possibility of deadlock, reappear. |
| One can of course hope that the fallback isn't used often, which might |
| allow simpler and less deadlock-prone locking designs to be used. |
| But this raises the question of how the system transitions from using |
| the lock-based fallbacks back to transactions.\footnote{ |
| The possibility of an application getting stuck in fallback |
| mode has been termed the ``lemming effect'', a term that |
| Dave Dice has been credited with coining.} |
| One approach is to use a test-and-test-and-set discipline~\cite{Martinez02a}, |
| so that everyone holds off until the lock is released, allowing the |
| system to start from a clean slate in transactional mode at that point. |
| However, this could result in quite a bit of spinning, which might not |
| be wise if the lock holder has blocked or been preempted. |
| Another approach is to allow transactions to proceed in parallel with |
| a thread holding a lock~\cite{Martinez02a}, but this raises difficulties |
| in maintaining atomicity, especially if the reason that the thread is |
| holding the lock is because the corresponding transaction would not fit |
| into cache. |
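
The following transactional-lock-elision sketch makes the fallback
discussion concrete, again using Intel's TSX RTM intrinsics, with all
other names hypothetical.
Because the transaction reads \co{lock_is_held}, that variable is in
the transaction's read set, so a thread acquiring the fallback lock
conflicts with all concurrent transactions and thus aborts them,
preserving mutual exclusion.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <immintrin.h>
#include <pthread.h>

pthread_mutex_t fb_lock =
  PTHREAD_MUTEX_INITIALIZER;
int lock_is_held; /* Written under fb_lock. */

void elided_cs(void (*cs)(void))
{
  if (_xbegin() == _XBEGIN_STARTED) {
    if (lock_is_held)
      _xabort(1); /* Fallback in progress. */
    cs();
    _xend();
    return;
  }
  /* Simple policy: one failure and we take
   * the lock.  Real code would first retry,
   * test-and-test-and-set style. */
  pthread_mutex_lock(&fb_lock);
  lock_is_held = 1;
  cs();
  lock_is_held = 0;
  pthread_mutex_unlock(&fb_lock);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}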
| |
| Finally, dealing with the possibility of aborts and rollbacks seems to |
| put an additional burden on the developer, who must correctly handle |
| all combinations of possible error conditions. |
| |
It is clear that users of HTM must put considerable validation effort
into testing both the fallback code paths and the transition from
fallback code back to transactional code.
| |
| \subsubsection{Lack of Forward-Progress Guarantees} |
| \label{sec:future:Lack of Forward-Progress Guarantees} |
| |
| Even though transaction size, conflicts, and aborts/rollbacks can all |
| cause transactions to abort, one might hope that sufficiently small and |
| short-duration transactions could be guaranteed to eventually succeed. |
| This would permit a transaction to be unconditionally retried, in the |
| same way that compare-and-swap (CAS) and load-linked/store-conditional |
| (LL/SC) operations are unconditionally retried in code that uses these |
| instructions to implement atomic operations. |
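
For example, the following C11 sketch of an atomic add retries its
compare-and-swap unconditionally, with no fallback path, relying on
the eventual success that code using these instructions
conventionally assumes:

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <stdatomic.h>

void add_to(atomic_int *ctr, int delta)
{
  int old = atomic_load(ctr);

  /* Unconditional retry, no fallback. */
  while (!atomic_compare_exchange_weak(
           ctr, &old, old + delta))
    continue; /* "old" was refreshed. */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}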
| |
Unfortunately, most currently available HTM implementations refuse
to make any
| sort of forward-progress guarantee, which means that HTM cannot be |
| used to avoid deadlock on those systems.\footnote{ |
| HTM might well be used to reduce the probability of deadlock, |
| but as long as there is some possibility of the fallback |
| code being executed, there is some possibility of deadlock.} |
| Hopefully future implementations of HTM will provide some sort of |
| forward-progress guarantees. |
| Until that time, HTM must be used with extreme caution in real-time |
| applications.\footnote{ |
| As of mid-2012, there has been surprisingly little work on |
| transactional memory's real-time characteristics.} |
| |
The one exception to this gloomy picture as of 2013 is upcoming versions
of the IBM mainframe, which provide a separate instruction that may be
used to start a special
| \emph{constrained transaction}~\cite{ChristianJacobi2012MainframeTM}. |
| As you might guess from the name, such transactions must live within |
| the following constraints: |
| |
| \begin{enumerate} |
| \item Each transaction's data footprint must be contained within |
| four 32-byte blocks of memory. |
| \item Each transaction is permitted to execute at most 32 assembler |
| instructions. |
| \item Transactions are not permitted to have backwards branches |
| (e.g., no loops). |
| \item Each transaction's code is limited to 256 bytes of memory. |
| \item If a portion of a given transaction's data footprint resides |
| within a given 4K page, then that 4K page is prohibited from |
| containing any of that transaction's instructions. |
| \end{enumerate} |
| |
| These constraints are severe, but they nevertheless permit a wide variety |
| of data-structure updates to be implemented, including stacks, queues, |
| hash tables, and so on. |
| These operations are guaranteed to eventually complete, and are free of |
| deadlock and livelock conditions. |
| |
| It will be interesting to see how hardware support of forward-progress |
| guarantees evolves over time. |
| |
| \subsubsection{Irrevocable Operations} |
| \label{sec:future:Irrevocable Operations} |
| |
| Another consequence of aborts and rollbacks is that HTM transactions |
| cannot accommodate irrevocable operations. |
| Current HTM implementations typically enforce this limitation by |
| requiring that all of the accesses in the transaction be to cacheable |
| memory (thus prohibiting MMIO accesses) and aborting transactions on |
| interrupts, traps, and exceptions (thus prohibiting system calls). |
| |
| Note that buffered I/O can be accommodated by HTM transactions as |
| long as the buffer fill/flush operations occur extra-transactionally. |
| The reason that this works is that adding data to and removing data |
| from the buffer is revocable: Only the actual buffer fill/flush |
| operations are irrevocable. |
Of course, this buffered-I/O approach has the effect of including the
buffer in the transaction's footprint, increasing the size of the
transaction and thus increasing the probability of failure.
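
A minimal sketch of this division of labor follows, with hypothetical
names: the transactional side fills only an in-memory buffer, while
the irrevocable \co{write()} runs after the transaction commits.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <string.h>
#include <unistd.h>

char out_buf[4096];
size_t out_len;

void txn_log(const char *msg, size_t len)
{
  /* Called within a transaction: touches
   * only cacheable memory, so an abort
   * simply discards the appended bytes. */
  memcpy(out_buf + out_len, msg, len);
  out_len += len;
}

void flush_log(int fd)
{
  /* Irrevocable: runs outside of any
   * transaction. */
  write(fd, out_buf, out_len);
  out_len = 0;
}
\end{verbatim}
\end{minipage}
\vspace{5pt}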
| |
| \subsubsection{Semantic Differences} |
| \label{sec:future:Semantic Differences} |
| |
| Although HTM can in many cases be used as a drop-in replacement for locking |
| (hence the name transactional lock |
| elision~\cite{DaveDice2008TransactLockElision}), |
| there are subtle differences in semantics. |
| A particularly nasty example involving coordinated lock-based critical |
| sections that results in deadlock or livelock when executed transactionally |
| was given by Blundell~\cite{Blundell2006TMdeadlock}, but a much simpler |
| example is the empty critical section. |
| |
| In a lock-based program, an empty critical section will guarantee |
| that all processes that had previously been holding that lock have |
| now released it. |
| This idiom was used by the 2.4 Linux kernel's networking stack to |
| coordinate changes in configuration. |
| But if this empty critical section is translated to a transaction, |
| the result is a no-op. |
| The guarantee that all prior critical sections have terminated is |
| lost. |
| In other words, transactional lock elision preserves the data-protection |
| semantics of locking, but loses locking's time-based messaging semantics. |
| |
| \QuickQuiz{} |
| But why would \emph{anyone} need an empty lock-based critical |
| section??? |
| \QuickQuizAnswer{ |
| See the answer to \QuickQuizARef{\QlockingQemptycriticalsection} in |
| Section~\ref{sec:locking:Exclusive Locks}. |
| |
| However, it is claimed that given a strongly atomic HTM |
| implementation without forward-progress guarantees, any |
| memory-based locking design based on empty critical sections |
| will operate correctly in the presence of transactional |
| lock elision. |
| Although I have not seen a proof of this statement, there |
| is a straightforward rationale for this claim. |
| The main idea is that in a strongly atomic HTM implementation, |
| the results of a given transaction are not visible until |
| after the transaction completes successfully. |
| Therefore, if you can see that a transaction has started, |
| it is guaranteed to have already completed, which means |
| that a subsequent empty lock-based critical section will |
| successfully ``wait'' on it---after all, there is no waiting |
| required. |
| |
	This line of reasoning does not apply to weakly atomic
	systems (including many STM implementations), and it also
| does not apply to lock-based programs that use means other |
| than memory to communicate. |
| One such means is the passage of time (for example, in |
| hard real-time systems) or flow of priority (for example, |
| in soft real-time systems). |
| |
| Locking designs that rely on priority boosting are of particular |
| interest. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Can't transactional lock elision trivially handle locking's |
| time-based messaging semantics |
| by simply choosing not to elide empty lock-based critical sections? |
| \QuickQuizAnswer{ |
| It could do so, but this would be both unnecessary and |
| insufficient. |
| |
| It would be unnecessary in cases where the empty critical section |
| was due to conditional compilation. |
| Here, it might well be that the only purpose of the lock was to |
| protect data, so eliding it completely would be the right thing |
| to do. |
| In fact, leaving the empty lock-based critical section would |
| degrade performance and scalability. |
| |
	On the other hand, it is possible for a non-empty lock-based
	critical section to be relying on both the data-protection
	and the time-based messaging semantics of locking.
| Using transactional lock elision in such a case would be |
| incorrect, and would result in bugs. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Given modern hardware~\cite{PeterOkech2009InherentRandomness}, |
| how can anyone possibly expect parallel software relying |
| on timing to work? |
| \QuickQuizAnswer{ |
| The short answer is that on commonplace commodity hardware, |
| synchronization designs based on any sort of fine-grained |
| timing are foolhardy and cannot be expected to operate correctly |
| under all conditions. |
| |
| That said, there are systems designed for hard real-time use |
| that are much more deterministic. |
| In the (very unlikely) event that you are using such a system, |
| here is a toy example showing how time-based synchronization can |
| work. |
| Again, do \emph{not} try this on commodity microprocessors, |
| as they have highly nondeterministic performance characteristics. |
| |
| This example uses multiple worker threads along with a control |
| thread. |
| Each worker thread corresponds to an outbound data feed, and |
| records the current time (for example, from the |
| \co{clock_gettime()} system call) in a per-thread |
| \co{my_timestamp} variable after executing each unit |
| of work. |
| The real-time nature of this example results in the following |
| set of constraints: |
| |
| \begin{enumerate} |
| \item It is a fatal error for a given worker thread to fail |
| to update its timestamp for a time period of more than |
| \co{MAX_LOOP_TIME}. |
| \item Locks are used sparingly to access and update global |
| state. |
	\item Locks are granted in strict FIFO order within
| a given thread priority. |
| \end{enumerate} |
| |
| When worker threads complete their feed, they must disentangle |
| themselves from the rest of the application and place a status |
| value in a per-thread \co{my_status} variable that is initialized |
| to -1. |
| Threads do not exit; they instead are placed on a thread pool |
| to accommodate later processing requirements. |
| The control thread assigns (and re-assigns) worker threads as |
| needed, and also maintains a histogram of thread statuses. |
| The control thread runs at a real-time priority no higher than |
| that of the worker threads. |
| |
| Worker threads' code is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 int my_status = -1; /* Thread local. */ |
| 2 |
| 3 while (continue_working()) { |
| 4 enqueue_any_new_work(); |
| 5 wp = dequeue_work(); |
| 6 do_work(wp); |
| 7 my_timestamp = clock_gettime(...); |
| 8 } |
| 9 |
| 10 acquire_lock(&departing_thread_lock); |
| 11 |
| 12 /* |
| 13 * Disentangle from application, might |
| 14 * acquire other locks, can take much longer |
| 15 * than MAX_LOOP_TIME, especially if many |
| 16 * threads exit concurrently. |
| 17 */ |
| 18 my_status = get_return_status(); |
| 19 release_lock(&departing_thread_lock); |
| 20 |
| 21 /* thread awaits repurposing. */ |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| The control thread's code is as follows: |
| |
| \vspace{5pt} |
| \begin{minipage}[t]{\columnwidth} |
| \scriptsize |
| \begin{verbatim} |
| 1 for (;;) { |
| 2 for_each_thread(t) { |
| 3 ct = clock_gettime(...); |
| 4 d = ct - per_thread(my_timestamp, t); |
| 5 if (d >= MAX_LOOP_TIME) { |
| 6 /* thread departing. */ |
| 7 acquire_lock(&departing_thread_lock); |
| 8 release_lock(&departing_thread_lock); |
| 9 i = per_thread(my_status, t); |
| 10 status_hist[i]++; /* Bug if TLE! */ |
| 11 } |
| 12 } |
| 13 /* Repurpose threads as needed. */ |
| 14 } |
| \end{verbatim} |
| \end{minipage} |
| \vspace{5pt} |
| |
| Line~5 uses the passage of time to deduce that the thread |
| has exited, executing lines~6-10 if so. |
| The empty lock-based critical section on lines~7 and~8 |
| guarantees that any thread in the process of exiting |
| completes (remember that locks are granted in FIFO order!). |
| |
| Once again, do not try this sort of thing on commodity |
| microprocessors. |
| After all, it is difficult enough to get right on systems |
| specifically designed for hard real-time use! |
| } \QuickQuizEnd |
| |
| One important semantic difference between locking and transactions |
| is the priority boosting that is used to avoid priority inversion |
| in lock-based real-time programs. |
| One way in which priority inversion can occur is when a |
| low-priority thread holding a lock |
| is preempted by a medium-priority CPU-bound thread. |
| If there is at least one such medium-priority thread per CPU, the |
| low-priority thread will never get a chance to run. |
| If a high-priority thread now attempts to acquire the lock, |
| it will block. |
| It cannot acquire the lock until the low-priority thread releases it, |
| the low-priority thread cannot release the lock until it gets a chance |
| to run, and it cannot get a chance to run until one of the medium-priority |
| threads gives up its CPU. |
Therefore, the medium-priority threads are in effect blocking the
high-priority thread, which is the rationale for the name ``priority
inversion.''
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 void boostee(void) |
| 2 { |
| 3 int i = 0; |
| 4 |
| 5 acquire_lock(&boost_lock[i]); |
| 6 for (;;) { |
| 7 acquire_lock(&boost_lock[!i]); |
| 8 release_lock(&boost_lock[i]); |
| 9 i = i ^ 1; |
| 10 do_something(); |
| 11 } |
| 12 } |
| 13 |
| 14 void booster(void) |
| 15 { |
| 16 int i = 0; |
| 17 |
| 18 for (;;) { |
| 19 usleep(1000); /* sleep 1 ms. */ |
| 20 acquire_lock(&boost_lock[i]); |
| 21 release_lock(&boost_lock[i]); |
| 22 i = i ^ 1; |
| 23 } |
| 24 } |
| \end{verbbox} |
| } |
| \begin{figure}[tbp] |
| \centering |
| \theverbbox |
| \caption{Exploiting Priority Boosting} |
| \label{fig:future:Exploiting Priority Boosting} |
| \end{figure} |
| |
One way to avoid priority inversion is \emph{priority inheritance},
in which a high-priority thread blocked on a lock temporarily donates
its priority to the lock's holder; this technique is also called
\emph{priority boosting}.
| However, priority boosting can be used for things other than avoiding |
| priority inversion, as shown in |
| Figure~\ref{fig:future:Exploiting Priority Boosting}. |
| Lines~1-12 of this figure show a low-priority process that must |
| nevertheless run every millisecond or so, while lines~14-24 of |
| this same figure show a high-priority process that uses priority |
| boosting to ensure that \co{boostee()} runs periodically as needed. |
| |
| The \co{boostee()} function arranges this by always holding one of |
| the two \co{boost_lock[]} locks, so that lines~20-21 of |
| \co{booster()} can boost priority as needed. |
| |
| \QuickQuiz{} |
| But the \co{boostee()} function in |
| Figure~\ref{fig:future:Exploiting Priority Boosting} |
	alternately acquires its locks in reverse order!
| Won't this result in deadlock? |
| \QuickQuizAnswer{ |
| No deadlock will result. |
| To arrive at deadlock, two different threads must each |
	acquire the two locks in opposite orders, which does not
| happen in this example. |
| However, deadlock detectors such as |
| lockdep~\cite{JonathanCorbet2006lockdep} |
| will flag this as a false positive. |
| } \QuickQuizEnd |
| |
| This arrangement requires that \co{boostee()} acquire its first |
| lock on line~5 before the system becomes busy, but this is easily |
| arranged, even on modern hardware. |
| |
Unfortunately, this arrangement can break down in the presence of
transactional lock elision.
| The \co{boostee()} function's overlapping critical sections become |
| one infinite transaction, which will sooner or later abort, |
| for example, on the first time that the thread running |
| the \co{boostee()} function is preempted. |
| At this point, \co{boostee()} will fall back to locking, but given |
| its low priority and that the quiet initialization period is now |
| complete (which after all is why \co{boostee()} was preempted), |
| this thread might never again get a chance to run. |
| |
| And if the \co{boostee()} thread is not holding the lock, then |
| the \co{booster()} thread's empty critical section on lines~20 and~21 of |
| Figure~\ref{fig:future:Exploiting Priority Boosting} |
| will become an empty transaction that has no effect, so that |
| \co{boostee()} never runs. |
| This example illustrates some of the subtle consequences of |
| transactional memory's rollback-and-retry semantics. |
| |
| Given that experience will likely uncover additional subtle semantic |
| differences, application of HTM-based lock elision to large programs |
| should be undertaken with caution. |
| That said, where it does apply, HTM-based lock elision can eliminate |
| the cache misses associated with the lock variable, which has resulted |
| in tens of percent performance increases in large real-world software |
| systems as of early 2015. |
| We can therefore expect to see substantial use of this technique on |
| hardware supporting it. |
| |
| \QuickQuiz{} |
| So a bunch of people set out to supplant locking, and they |
| mostly end up just optimizing locking??? |
| \QuickQuizAnswer{ |
| At least they accomplished something useful! |
| And perhaps there will be additional HTM progress over time. |
| } \QuickQuizEnd |
| |
| \subsubsection{Summary} |
| \label{sec:future:HTM Weaknesses WRT Locking: Summary} |
| |
| \input{future/HTMtable} |
| |
| Although it seems likely that HTM will have compelling use cases, |
| current implementations have serious transaction-size limitations, |
| conflict-handling complications, abort-and-rollback issues, and |
| semantic differences that will require careful handling. |
| HTM's current situation relative to locking is summarized in |
| Table~\ref{tab:future:Comparison of Locking and HTM}. |
| As can be seen, although the current state of HTM alleviates some |
| serious shortcomings of locking,\footnote{ |
| In fairness, it is important to emphasize that locking's shortcomings |
| do have well-known and heavily used engineering solutions, including |
| deadlock detectors~\cite{JonathanCorbet2006lockdep}, a wealth |
| of data structures that have been adapted to locking, and |
| a long history of augmentation, as discussed in |
| Section~\ref{sec:future:HTM Weaknesses WRT to Locking When Augmented}. |
| In addition, if locking really were as horrible as a quick skim |
| of many academic papers might reasonably lead one to believe, |
| where did all the large lock-based parallel programs (both |
| FOSS and proprietary) come from, anyway?} |
| it does so by introducing a significant |
| number of shortcomings of its own. |
| These shortcomings are acknowledged by leaders in the TM |
| community~\cite{AlexanderMatveev2012PessimisticTM}.\footnote{ |
| In addition, in early 2011, I was invited to deliver a critique of |
| some of the assumptions underlying transactional |
| memory~\cite{PaulEMcKenney2011Verico}. |
| The audience was surprisingly non-hostile, though perhaps they |
| were taking it easy on me due to the fact that I was heavily |
| jet-lagged while giving the presentation.} |
| |
| In addition, this is not the whole story. |
| Locking is not normally used by itself, but is instead typically |
| augmented by other synchronization mechanisms, |
| including reference counting, atomic operations, non-blocking data structures, |
| hazard pointers~\cite{MagedMichael04a,HerlihyLM02}, |
| and read-copy update (RCU)~\cite{McKenney98,McKenney01a,ThomasEHart2007a,PaulEMcKenney2012ELCbattery}. |
| The next section looks at how such augmentation changes the equation. |
| |
\subsection{HTM Weaknesses WRT Locking When Augmented}
| \label{sec:future:HTM Weaknesses WRT to Locking When Augmented} |
| |
| \input{future/HTMtableRCU} |
| |
| Practitioners have long used reference counting, atomic operations, |
| non-blocking data structures, hazard pointers, and RCU to avoid some |
| of the shortcomings of locking. |
| For example, deadlock can be avoided in many cases by using reference |
| counts, hazard pointers, or RCU to protect data structures, |
| particularly for read-only critical |
| sections~\cite{MagedMichael04a,HerlihyLM02,MathieuDesnoyers2012URCU,DinakarGuniguntala2008IBMSysJ,ThomasEHart2007a}. |
These approaches also reduce the need to partition data
structures, as was seen in Chapter~\ref{chp:Data Structures}.
| RCU further provides contention-free wait-free read-side |
| primitives~\cite{MathieuDesnoyers2012URCU}. |
| Adding these considerations to |
| Table~\ref{tab:future:Comparison of Locking and HTM} |
| results in the updated comparison between augmented locking and HTM |
| shown in |
| Table~\ref{tab:future:Comparison of Locking (Augmented by RCU or Hazard Pointers) and HTM}. |
| A summary of the differences between the two tables is as follows: |
| |
| \begin{enumerate} |
| \item Use of non-blocking read-side mechanisms alleviates deadlock issues. |
| \item Read-side mechanisms such as hazard pointers and RCU can operate |
| efficiently on non-partitionable data. |
| \item Hazard pointers and RCU do not contend with each other or with |
| updaters, allowing excellent performance and scalability for |
| read-mostly workloads. |
| \item Hazard pointers and RCU provide forward-progress guarantees |
| (lock freedom and wait-freedom, respectively). |
| \item Privatization operations for hazard pointers and RCU are |
| straightforward. |
| \end{enumerate} |
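
As a concrete illustration of the first three of these differences,
the following sketch pairs lock-based updates with RCU readers,
using the userspace RCU (liburcu) API; reader threads must also have
invoked \co{rcu_register_thread()}.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
#include <urcu.h>
#include <pthread.h>
#include <stdlib.h>

struct config { int a; int b; };
struct config *cur_config; /* RCU-protected. */
pthread_mutex_t cfg_lock =
  PTHREAD_MUTEX_INITIALIZER;

int read_a(void) /* No locks, no deadlock. */
{
  int ret;

  rcu_read_lock();
  ret = rcu_dereference(cur_config)->a;
  rcu_read_unlock();
  return ret;
}

void update_config(struct config *newp)
{
  struct config *oldp;

  pthread_mutex_lock(&cfg_lock);
  oldp = cur_config;
  rcu_assign_pointer(cur_config, newp);
  pthread_mutex_unlock(&cfg_lock);
  synchronize_rcu(); /* Wait for readers. */
  free(oldp);
}
\end{verbatim}
\end{minipage}
\vspace{5pt}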
| |
| Of course, it is also possible to augment HTM, |
| as discussed in the next section. |
| |
| \subsection{Where Does HTM Best Fit In?} |
| \label{sec:future:Where Does HTM Best Fit In?} |
| |
| Although it will likely be some time before HTM's area of applicability |
| can be as crisply delineated as that shown for RCU in |
| Figure~\ref{fig:defer:RCU Areas of Applicability} on |
| page~\pageref{fig:defer:RCU Areas of Applicability}, that is no reason not to |
| start moving in that direction. |
| |
| HTM seems best suited to update-heavy workloads involving relatively |
| small changes to disparate portions of relatively large in-memory |
| data structures running on large multiprocessors, |
| as this meets the size restrictions of current HTM implementations while |
| minimizing the probability of conflicts and attendant aborts and |
| rollbacks. |
| This scenario is also one that is relatively difficult to handle given |
| current synchronization primitives. |
| |
| Use of locking in conjunction with HTM seems likely to overcome HTM's |
| difficulties with irrevocable operations, while use of RCU or |
| hazard pointers might alleviate HTM's transaction-size limitations |
| for read-only operations that traverse large fractions of the data |
| structure. |
| Current HTM implementations unconditionally abort an update transaction |
| that conflicts with an RCU or hazard-pointer reader, but perhaps future |
| HTM implementations will interoperate more smoothly with these |
| synchronization mechanisms. |
| In the meantime, the probability of an update conflicting with a |
| large RCU or hazard-pointer read-side critical section should be |
| much smaller than the probability of conflicting with the equivalent |
| read-only transaction.\footnote{ |
| It is quite ironic that strictly transactional mechanisms are |
| appearing in shared-memory systems at just about the time |
| that NoSQL databases are relaxing the traditional |
| database-application reliance on strict transactions.} |
| Nevertheless, it is quite possible that a steady stream of RCU or |
| hazard-pointer readers might starve updaters due to a corresponding |
| steady stream of conflicts. |
| This vulnerability could be eliminated (perhaps at significant |
| hardware cost and complexity) by giving extra-transactional |
| reads the pre-transaction copy of the memory location being loaded. |
| |
| The fact that HTM transactions must have fallbacks might in some cases |
| force static partitionability of data structures back onto HTM. |
| This limitation might be alleviated if future HTM implementations |
| provide forward-progress guarantees, which might eliminate the need |
| for fallback code in some cases, which in turn might allow HTM to |
| be used efficiently in situations with higher conflict probabilities. |
| |
| In short, although HTM is likely to have important uses and applications, |
| it is another tool in the parallel programmer's toolbox, not a replacement |
| for the toolbox in its entirety. |
| |
| \subsection{Potential Game Changers} |
| \label{sec:future:Potential Game Changers} |
| |
| Game changers that could greatly increase the need for HTM include |
| the following: |
| |
| \begin{enumerate} |
| \item Forward-progress guarantees. |
| \item Transaction-size increases. |
| \item Improved debugging support. |
| \item Weak atomicity. |
| \end{enumerate} |
| |
| These are expanded upon in the following sections. |
| |
| \subsubsection{Forward-Progress Guarantees} |
| \label{sec:future:Forward-Progress Guarantees} |
| |
| As was discussed in |
| Section~\ref{sec:future:Lack of Forward-Progress Guarantees}, |
| current HTM implementations lack forward-progress guarantees, which requires |
| that fallback software be available to handle HTM failures. |
| Of course, it is easy to demand guarantees, but not always easy |
| to provide them. |
| In the case of HTM, obstacles to guarantees can include cache size and |
| associativity, TLB size and associativity, transaction duration and |
| interrupt frequency, and scheduler implementation. |
| |
| Cache size and associativity was discussed in |
| Section~\ref{sec:future:Transaction-Size Limitations}, |
| along with some research intended to work around current limitations. |
| However, HTM forward-progress guarantees would |
| come with size limits, large though these limits might one day be. |
| So why don't current HTM implementations provide forward-progress |
| guarantees for small transactions, for example, limited to the |
| associativity of the cache? |
| One potential reason might be the need to deal with hardware failure. |
| For example, a failing cache SRAM cell might be handled by deactivating |
| the failing cell, thus reducing the associativity of the cache and |
| therefore also the maximum size of transactions that can be guaranteed |
| forward progress. |
| Given that this would simply decrease the guaranteed transaction size, |
| it seems likely that other reasons are at work. |
Perhaps providing forward-progress guarantees on production-quality
hardware is more difficult than one might think, an entirely plausible
explanation given the difficulty of making forward-progress guarantees
in software.
| Moving a problem from software to hardware does not necessarily make |
| it easier to solve. |
| |
| Given a physically tagged and indexed cache, it is not enough for the |
| transaction to fit in the cache. |
| Its address translations must also fit in the TLB. |
| Any forward-progress guarantees must therefore also take TLB size |
| and associativity into account. |
| |
| Given that interrupts, traps, and exceptions abort transactions in current |
| HTM implementations, it is necessary that the execution duration of |
| a given transaction be shorter than the expected interval between |
| interrupts. |
| No matter how little data a given transaction touches, if it runs too |
| long, it will be aborted. |
| Therefore, any forward-progress guarantees must be conditioned not only |
| on transaction size, but also on transaction duration. |
| |
| Forward-progress guarantees depend critically on the ability to determine |
| which of several conflicting transactions should be aborted. |
| It is all too easy to imagine an endless series of transactions, each |
aborting an earlier transaction only to itself be aborted by a later
transaction, so that none of the transactions actually commit.
| The complexity of conflict handling is |
| evidenced by the large number of HTM conflict-resolution policies |
| that have been proposed~\cite{EgeAkpinar2011HTM2TLE,YujieLiu2011ToxicTransactions}. |
| Additional complications are introduced by extra-transactional accesses, |
| as noted by Blundell~\cite{Blundell2006TMdeadlock}. |
| It is easy to blame the extra-transactional accesses for all of these |
| problems, but the folly of this line of thinking is easily demonstrated |
| by placing each of the extra-transactional accesses into its own |
| single-access transaction. |
| It is the pattern of accesses that is the issue, not whether or not they |
| happen to be enclosed in a transaction. |
| |
| Finally, any forward-progress guarantees for transactions also |
| depend on the scheduler, which must let the thread executing the |
| transaction run long enough to successfully commit. |
| |
| So there are significant obstacles to HTM vendors offering forward-progress |
| guarantees. |
| However, the impact of any of them doing so would be enormous. |
| It would mean that HTM transactions would no longer need software |
| fallbacks, which would mean that HTM could finally deliver on the |
| TM promise of deadlock elimination. |
| |
And as of late 2012, IBM announced a mainframe HTM implementation
that includes \emph{constrained transactions}
in addition to the usual best-effort
mode~\cite{ChristianJacobi2012MainframeTM}.
| A constrained transaction starts with the \co{tbeginc} instruction |
| instead of the \co{tbegin} instruction that is used for best-effort |
| transactions. |
| Constrained transactions are guaranteed to always complete (eventually), |
| so if a transaction aborts, rather than branching to a fallback path |
| (as is done for best-effort transactions), the hardware instead restarts |
| the transaction at the \co{tbeginc} instruction. |
| |
| The Mainframe architects needed to take extreme measures to deliver on |
| this forward-progress guarantee. |
| If a given constrained transaction repeatedly fails, the CPU |
| might disable branch prediction, force in-order execution, and even |
| disable pipelining. |
| If the repeated failures are due to high contention, the CPU might |
| disable speculative fetches, introduce random delays, and even |
| serialize execution of the conflicting CPUs. |
| ``Interesting'' forward-progress scenarios involve as few as two CPUs |
| or as many as one hundred CPUs. |
| Perhaps these extreme measures provide some insight as to why other CPUs |
| have thus far refrained from offering constrained transactions. |
| |
| As the name implies, constrained transactions are in fact severely constrained: |
| |
| \begin{enumerate} |
| \item The maximum data footprint is four blocks of memory, |
| where each block can be no larger than 32 bytes. |
| \item The maximum code footprint is 256 bytes. |
| \item If a given 4K page contains a constrained transaction's code, |
| then that page may not contain that transaction's data. |
| \item The maximum number of assembly instructions that may be executed |
| is 32. |
| \item Backwards branches are forbidden. |
| \end{enumerate} |
| |
| Nevertheless, these constraints support a number of important data structures, |
| including linked lists, stacks, queues, and arrays. |
| Constrained HTM therefore seems likely to become an important tool in |
| the parallel programmer's toolbox. |
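
For example, a stack push fits comfortably within these constraints.
In the following sketch, \co{tbeginc()} and \co{tend()} are
hypothetical C intrinsics standing in for the corresponding mainframe
instructions; on abort, the hardware itself restarts execution at
\co{tbeginc()}, so that no software fallback path is required.

\vspace{5pt}
\begin{minipage}[t]{\columnwidth}
\scriptsize
\begin{verbatim}
struct node {
  struct node *next;
};
struct node *top; /* Top of stack. */

void push(struct node *np)
{
  tbeginc(); /* Hardware restarts here
              * on abort (hypothetical
              * intrinsic). */
  np->next = top; /* Two-block footprint, */
  top = np;       /* no backwards branch. */
  tend();    /* Commit (hypothetical). */
}
\end{verbatim}
\end{minipage}
\vspace{5pt}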
| |
| \subsubsection{Transaction-Size Increases} |
| \label{sec:future:Transaction-Size Increases} |
| |
| Forward-progress guarantees are important, but as we saw, they will |
| be conditional guarantees based on transaction size and duration. |
| It is important to note that even small-sized guarantees will be |
| quite useful. |
| For example, |
a guarantee of two cache lines is sufficient for a stack, queue, or deque.
| However, larger data structures require larger guarantees, for example, |
| traversing a tree in order requires a guarantee equal to the number |
| of nodes in the tree. |
| |
| Therefore, increasing the size of the guarantee also increases the |
| usefulness of HTM, thereby increasing the need for CPUs to either |
| provide it or provide good-and-sufficient workarounds. |
| |
| \subsubsection{Improved Debugging Support} |
| \label{sec:future:Improved Debugging Support} |
| |
| Another inhibitor to transaction size is the need to debug the transactions. |
| The problem with current mechanisms is that a single-step exception |
| aborts the enclosing transaction. |
| There are a number of workarounds for this issue, including emulating |
| the processor (slow!), substituting STM for HTM (slow and slightly |
| different semantics!), |
| playback techniques using repeated retries to emulate forward |
| progress (strange failure modes!), and |
| full support of debugging HTM transactions (complex!). |
| |
| Should one of the HTM vendors produce an HTM system that allows |
| straightforward use of classical debugging techniques within |
| transactions, including breakpoints, single stepping, and |
| print statements, this will make HTM much more compelling. |
| Some transactional-memory researchers are starting to recognize this |
| problem as of 2013, with at least one proposal involving hardware-assisted |
| debugging facilities~\cite{JustinGottschlich2013TMdebug}. |
| Of course, this proposal depends on readily available hardware gaining such |
| facilities. |
| |
| \subsubsection{Weak Atomicity} |
| \label{sec:future:Weak Atomicity} |
| |
| Given that HTM is likely to face some sort of size limitations for the |
| foreseeable future, it will be necessary for HTM to interoperate |
| smoothly with other mechanisms. |
| HTM's interoperability with read-mostly mechanisms such as hazard pointers |
| and RCU would be improved if extra-transactional reads did not |
| unconditionally abort transactions with conflicting writes---instead, |
| the read could simply be provided with the pre-transaction value. |
| In this way, hazard pointers and RCU could be used to allow HTM to handle |
| larger data structures and to reduce conflict probabilities. |
| |
| This is not necessarily simple, however. |
| The most straightforward way of implementing this requires an additional |
| state in each cache line and on the bus, which is a non-trivial added |
| expense. |
| The benefit that goes along with this expense is permitting |
| large-footprint readers without the risk of starving updaters due |
| to continual conflicts. |
| |
| \subsection{Conclusions} |
| \label{sec:future:Conclusions} |
| |
| Although current HTM implementations appear to be poised to deliver real |
| benefits, they also have significant shortcomings. |
| The most significant shortcomings appear to be |
| limited transaction sizes, |
| the need for conflict handling, the need for aborts and rollbacks, |
| the lack of forward-progress guarantees, |
| the inability to handle irrevocable operations, |
| and subtle semantic differences |
| from locking. |
| |
| Some of these shortcomings might be alleviated in future implementations, |
| but it appears that there will continue to be a strong need to make |
| HTM work well with the many other types of synchronization mechanisms, |
| as noted earlier~\cite{McKenney2007PLOSTM,PaulEMcKenney2010OSRGrassGreener}. |
| |
| In short, current HTM implementations appear to be welcome and useful |
| additions to the parallel programmer's toolbox, and much interesting |
| and challenging work is required to make use of them. |
| However, they cannot be |
| considered to be a magic wand with which to wave away all parallel-programming |
| problems. |