| % datastruct/datastruct.tex |
| % mainfile: ../perfbook.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \QuickQuizChapter{chp:Data Structures}{Data Structures}{qqzdatastruct} |
| % |
| \Epigraph{Bad programmers worry about the code. |
| Good programmers worry about data structures and their |
| relationships.} |
| {Linus Torvalds} |
| |
Serious discussions of algorithms include the time complexity of their
data structures~\cite{ThomasHCorman2001Algorithms}.
However, for parallel programs, the analysis of time complexity must also
include concurrency effects, because these effects can be overwhelmingly
large, as shown in
\cref{chp:Hardware and its Habits}.
| In other words, a good programmer's data-structure relationships |
| include those aspects related to concurrency. |
| |
| This chapter will expose a number of complications: |
| |
| \begin{enumerate} |
| \item Data structures designed in full accordance with the |
| good advice given in |
| \cref{chp:Partitioning and Synchronization Design} |
| can nonetheless abjectly fail to scale on some types of systems. |
| \label{sec:datastruct:NUMA-induced failure to scale} |
| \item Data structures designed in full accordance with the |
| good advice given in both |
| \cref{chp:Partitioning and Synchronization Design} |
| and |
| \cref{chp:Deferred Processing} |
| can \emph{still} abjectly fail to scale on some types of systems. |
| \label{sec:datastruct:Cache-size-induced failure to scale} |
| \item Even \emph{read-only synchronization-free} data-structure traversal |
| can fail to scale on some types of systems. |
| \label{sec:datastruct:Cache-size-induced failure to scale redux} |
| \item Data-structure traversals avoiding the aforementioned complications |
| can still be impeded by concurrent updates. |
| \label{sec:datastruct:Update-activity failure to scale} |
| \end{enumerate} |
| |
This chapter will investigate these complications and demonstrate some
ways of unraveling them.
| |
| \Cref{sec:datastruct:Motivating Application} |
| presents the motivating application for this chapter's data structures. |
| \Cref{chp:Partitioning and Synchronization Design} showed how |
| partitioning improves scalability, so |
| \cref{sec:datastruct:Partitionable Data Structures} |
| discusses partitionable data structures. |
| \Cref{chp:Deferred Processing} described how deferring some |
| actions can greatly improve both performance and scalability, |
| a topic taken up by |
| \cref{sec:datastruct:Read-Mostly Data Structures}. |
| \Cref{sec:datastruct:Non-Partitionable Data Structures} |
| looks at a non-partitionable data structure, splitting |
| it into read-mostly and partitionable portions, |
| which improves both performance and scalability. |
| Because this chapter cannot delve into the details of every concurrent |
| data structure, |
| \cref{sec:datastruct:Other Data Structures} |
| surveys a few of the important ones. |
| Finally, \cref{sec:datastruct:Summary} |
| presents a summary of this chapter. |
| |
| \section{Motivating Application} |
| \label{sec:datastruct:Motivating Application} |
| % |
| \epigraph{The art of doing mathematics consists in finding that special |
| case which contains all the germs of generality.} |
| {David Hilbert} |
| |
| We will use the Schr\"odinger's Zoo application to evaluate |
| performance~\cite{McKenney:2013:SDS:2483852.2483867}. |
| Schr\"odinger has a zoo containing a large number of animals, and |
| he would like to track them using an in-memory database with |
| each animal in the zoo represented by a data item in this database. |
| Each animal has a unique name that is used as a key, with a variety |
| of data tracked for each animal. |
| |
| Births, captures, and purchases result in insertions, while deaths, |
| releases, and sales result in deletions. |
| Because Schr\"odinger's zoo contains a large quantity of short-lived |
| animals, including mice and insects, the database must handle |
| high update rates. |
| Those interested in Schr\"odinger's animals can query them, and |
| Schr\"odinger has noted suspiciously query rates for his cat, so much |
| so that he suspects that his mice might be checking up on their nemesis. |
| Whatever their source, Schr\"odinger's application must handle high |
| query rates to a single data element. |
| |
| As we will see, this simple application can pose a challenge to |
| traditional concurrent data structures. |
| |
| \section{Partitionable Data Structures} |
| \label{sec:datastruct:Partitionable Data Structures} |
| % |
| \epigraph{Finding a way to live the simple life today is the most |
| complicated task.} |
| {Henry A. Courtney, updated} |
| |
There are a huge number of data structures in use today, so many
that multiple textbooks are needed to cover them.
| This section focuses on a single data structure, namely the hash table. |
| This focused approach allows a much deeper investigation of how concurrency |
| interacts with data structures, and also focuses on a data structure |
| that is heavily used in practice. |
| \Cref{sec:datastruct:Hash-Table Design} |
| overviews the design, and |
| \cref{sec:datastruct:Hash-Table Implementation} |
| presents the implementation. |
| Finally, |
| \cref{sec:datastruct:Hash-Table Performance} |
| discusses the resulting performance and scalability. |
| |
| \subsection{Hash-Table Design} |
| \label{sec:datastruct:Hash-Table Design} |
| |
| \Cref{chp:Partitioning and Synchronization Design} |
| emphasized the need to apply partitioning in order to attain |
| respectable performance and scalability, so partitionability |
| must be a first-class criterion when selecting data structures. |
| This criterion is well satisfied by that workhorse of parallelism, |
| the hash table. |
| Hash tables are conceptually simple, consisting of an array of |
| \emph{hash buckets}. |
| A \emph{hash function} maps from a given element's \emph{key} |
| to the hash bucket that this element will be stored in. |
| Each hash bucket therefore heads up a linked list of elements, |
| called a \emph{hash chain}. |
| When properly configured, these hash chains will be quite short, |
| permitting a hash table to access its elements extremely efficiently. |
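
For example, given $B$ buckets storing $E$ elements and a reasonably
uniform hash function, the expected hash-chain length is simply
\[
E / B ,
\]
so a table configured with at least as many buckets as elements averages
at most one element per chain.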
| |
| \QuickQuiz{ |
	But chained hash tables are but one type among many.
	Why the focus on chained hash tables?
| }\QuickQuizAnswer{ |
| Chained hash tables are completely partitionable, and thus |
| well-suited to concurrent use. |
	There are other completely-partitionable hash tables, for
	example, the split-ordered list~\cite{OriShalev2006SplitOrderListHash},
| but they are considerably more complex. |
| We therefore focus on chained hash tables. |
| }\QuickQuizEnd |
| |
| In addition, each bucket has its own lock, so that elements in different |
| buckets of the hash table may be added, deleted, and looked up completely |
| independently. |
A large hash table with a large number of buckets (and thus locks), with
each bucket containing a small number of elements, should therefore provide
excellent scalability.
| |
| \subsection{Hash-Table Implementation} |
| \label{sec:datastruct:Hash-Table Implementation} |
| |
| \begin{fcvref}[ln:datastruct:hash_bkt:struct] |
| \Cref{lst:datastruct:Hash-Table Data Structures} |
| (\path{hash_bkt.c}) |
| shows a set of data structures used in a simple fixed-sized hash |
| table using chaining and per-hash-bucket locking, and |
| \cref{fig:datastruct:Hash-Table Data-Structure Diagram} |
| diagrams how they fit together. |
| Note that the \co{cds_} functions and data structures may be found |
| in the userspace RCU |
| library~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUlist}. |
| The \co{hashtab} structure (\clnrefrange{tab:b}{tab:e} in |
| \cref{lst:datastruct:Hash-Table Data Structures}) |
| contains four \co{ht_bucket} structures |
| (\clnrefrange{bucket:b}{bucket:e} in |
| \cref{lst:datastruct:Hash-Table Data Structures}), |
| with the \co{->ht_nbuckets} field controlling the number of buckets |
| and the \co{->ht_cmp} field holding the pointer to key-comparison |
| function. |
| Each such bucket contains a list header \co{->htb_head} and |
| a lock \co{->htb_lock}. |
| The list headers chain \co{ht_elem} structures |
| (\clnrefrange{elem:b}{elem:e} in |
| \cref{lst:datastruct:Hash-Table Data Structures}) |
| through their |
| \co{->hte_next} fields, and each \co{ht_elem} structure also caches |
| the corresponding element's hash value in the \co{->hte_hash} field. |
| The \co{ht_elem} structure is included in a larger structure |
| which might contain a complex key. |
| \end{fcvref} |
| \Cref{fig:datastruct:Hash-Table Data-Structure Diagram} |
| shows bucket~0 containing two elements and bucket~2 containing one. |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt@struct.fcv} |
| \caption{Hash-Table Data Structures} |
| \label{lst:datastruct:Hash-Table Data Structures} |
| \end{listing} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashdiagram}} |
| \caption{Hash-Table Data-Structure Diagram} |
| \label{fig:datastruct:Hash-Table Data-Structure Diagram} |
| \end{figure} |
| |
| \begin{fcvref}[ln:datastruct:hash_bkt:map_lock:map] |
| \Cref{lst:datastruct:Hash-Table Mapping and Locking} |
| shows mapping and locking functions. |
| \Clnref{b,e} |
| show the macro \co{HASH2BKT()}, which maps from a hash value |
| to the corresponding \co{ht_bucket} structure. |
| This macro uses a simple modulus: |
| If more aggressive hashing is required, |
| the caller needs to implement it when mapping from key to hash value. |
| The remaining two functions acquire and release the \co{->htb_lock} |
| corresponding to the specified hash value. |
| \end{fcvref} |
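
For example, a caller whose keys are character strings might map from
key to hash value using something like the following sketch, which is
purely illustrative and is not part of \path{hash_bkt.c}:

\begin{verbatim}
/* Illustrative caller-side hash function for C-string keys
 * (the classic djb2 hash); not part of hash_bkt.c. */
static unsigned long hash_string(const char *key)
{
	unsigned long h = 5381;

	while (*key)
		h = h * 33 + (unsigned char)*key++;
	return h;
}
\end{verbatim}

The resulting value would then be passed as the hash value to the
locking functions just described, as well as to the lookup and update
functions presented below.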
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt@map_lock.fcv} |
| \caption{Hash-Table Mapping and Locking} |
| \label{lst:datastruct:Hash-Table Mapping and Locking} |
| \end{listing} |
| |
| \begin{fcvref}[ln:datastruct:hash_bkt:lookup] |
| \Cref{lst:datastruct:Hash-Table Lookup} |
| shows \co{hashtab_lookup()}, |
| which returns a pointer to the element with the specified hash and key, |
| or \co{NULL} if that element does not exist. |
This function takes both a hash value and a pointer to the key, which
allows its callers to use arbitrary keys and arbitrary hash functions.
| \Clnref{map} maps from the hash value to a pointer to the corresponding |
| hash bucket. |
| Each pass through the loop spanning |
| \clnrefrange{loop:b}{loop:e} examines one element |
| of the bucket's hash chain. |
| \Clnref{hashmatch} checks to see if the hash values match, and if not, |
| \clnref{next} |
| proceeds to the next element. |
| \Clnref{keymatch} checks to see if the actual key matches, and if so, |
| \clnref{return} returns a pointer to the matching element. |
| If no element matches, \clnref{ret_NULL} returns \co{NULL}. |
| \end{fcvref} |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt@lookup.fcv} |
| \caption{Hash-Table Lookup} |
| \label{lst:datastruct:Hash-Table Lookup} |
| \end{listing} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:datastruct:hash_bkt:lookup] |
| But isn't the double comparison on |
| \clnrefrange{hashmatch}{return} in |
| \cref{lst:datastruct:Hash-Table Lookup} inefficient |
| in the case where the key fits into an unsigned long? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| Indeed it is! |
| However, hash tables quite frequently store information with |
| keys such as character strings that do not necessarily fit |
| into an unsigned long. |
| Simplifying the hash-table implementation for the case where |
| keys always fit into unsigned longs is left as an exercise |
| for the reader. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt@add_del.fcv} |
| \caption{Hash-Table Modification} |
| \label{lst:datastruct:Hash-Table Modification} |
| \end{listing} |
| |
| \Cref{lst:datastruct:Hash-Table Modification} |
| shows the \co{hashtab_add()} and \co{hashtab_del()} functions |
| that add and delete elements from the hash table, respectively. |
| |
| \begin{fcvref}[ln:datastruct:hash_bkt:add_del:add] |
| The \co{hashtab_add()} function simply sets the element's hash |
| value on \clnref{set}, then adds it to the corresponding bucket on |
| \clnref{add:b,add:e}. |
| \end{fcvref} |
| The \co{hashtab_del()} function simply removes the specified element |
| from whatever hash chain it is on, courtesy of the doubly linked |
| nature of the hash-chain lists. |
| Before calling either of these two functions, the caller is required to |
| ensure that no other thread is accessing |
| or modifying this same bucket, for example, by invoking |
| \co{hashtab_lock()} beforehand. |
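
For example, an updater might combine these functions as in the
following sketch, in which the \co{zoo_he} structure, its \co{->zhe_e}
field, and the exact \co{hashtab_add()} argument list are assumptions
made for illustration:

\begin{verbatim}
/* Sketch: add one animal to the hash table.  The zoo_he
 * structure, its zhe_e field, and the exact signatures are
 * assumptions for illustration. */
void zoo_add(struct hashtab *htp, struct zoo_he *zhep,
             unsigned long hash)
{
	hashtab_lock(htp, hash);  /* Exclude other updaters. */
	hashtab_add(htp, hash, &zhep->zhe_e);
	hashtab_unlock(htp, hash);
}
\end{verbatim}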
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt@alloc_free.fcv} |
| \caption{Hash-Table Allocation and Free} |
| \label{lst:datastruct:Hash-Table Allocation and Free} |
| \end{listing} |
| |
| \Cref{lst:datastruct:Hash-Table Allocation and Free} |
| shows \co{hashtab_alloc()} and \co{hashtab_free()}, |
| which do hash-table allocation and freeing, respectively. |
| \begin{fcvref}[ln:datastruct:hash_bkt:alloc_free:alloc] |
| Allocation begins on |
| \clnrefrange{alloc:b}{alloc:e} with allocation of the underlying memory. |
| If \clnref{chk_NULL} detects that memory has been exhausted, |
| \clnref{ret_NULL} returns |
| \co{NULL} to the caller. |
Otherwise, \clnref{set_nbck,set_cmp} initialize
the number of buckets and the pointer to the key-comparison function,
| and the loop |
| spanning \clnrefrange{loop:b}{loop:e} initializes the buckets themselves, |
| including the chain list header on |
| \clnref{init_head} and the lock on \clnref{init_lock}. |
| Finally, \clnref{return} returns a pointer to the newly allocated hash table. |
| \end{fcvref} |
| \begin{fcvref}[ln:datastruct:hash_bkt:alloc_free:free] |
| The \co{hashtab_free()} function on |
| \clnrefrange{b}{e} is straightforward. |
| \end{fcvref} |
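
For example, a test program might create and destroy a 1,024-bucket
table as in the following sketch, in which \co{zoo_cmp()}, the
\co{zoo_he} layout, and the exact \co{hashtab_alloc()} argument list
are assumptions made for illustration:

\begin{verbatim}
/* Sketch: hypothetical key-comparison function for string
 * keys, using a Linux-kernel-style container_of() macro. */
static int zoo_cmp(struct ht_elem *htep, void *key)
{
	struct zoo_he *zhep;

	zhep = container_of(htep, struct zoo_he, zhe_e);
	return strcmp(zhep->name, (char *)key) == 0;
}

void zoo_init_and_exit(void)
{
	struct hashtab *htp = hashtab_alloc(1024, zoo_cmp);

	if (htp == NULL)
		abort();	/* Out of memory. */
	hashtab_free(htp);	/* Table assumed empty when freed. */
}
\end{verbatim}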
| |
| \subsection{Hash-Table Performance} |
| \label{sec:datastruct:Hash-Table Performance} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktlin8}} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo} |
| \end{figure} |
| |
| The performance results for a single 28-core socket of a 2.1\,GHz |
| Intel Xeon system using a bucket-locked hash table |
| with 262,144 buckets are shown in |
| \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}. |
The performance does scale nearly linearly, but it falls far short of
ideal, even at only 28~CPUs.
Part of this shortfall is due to the fact that lock acquisitions and
releases incur communications cache misses only when running on two or
more CPUs.
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktlin}} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, 448 CPUs} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs} |
| \end{figure} |
| |
| And things only get worse with more CPUs, as can be seen in |
| \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs}. |
| We do not need to show ideal performance: |
| The performance for 29~CPUs and beyond is abysmal. |
| This clearly underscores the dangers of extrapolating performance from a |
| modest number of CPUs. |
| |
| Of course, one possible reason for the collapse in performance might be |
| that more hash buckets are needed. |
| We can test this by increasing the number of hash buckets. |
| |
| \QuickQuiz{ |
| Instead of simply increasing the number of hash buckets, |
| wouldn't it be better to cache-align the existing hash buckets? |
| }\QuickQuizAnswer{ |
| The answer depends on a great many things. |
| If the hash table has a large number of elements per bucket, it |
| would clearly be better to increase the number of hash buckets. |
| On the other hand, if the hash table is lightly loaded, |
| the answer depends on the hardware, the effectiveness of the |
| hash function, and the workload. |
| Interested readers are encouraged to experiment. |
| }\QuickQuizEnd |
| |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktsizelin}} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, Varying Buckets} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} |
| \end{figure} |
| |
| However, as can be seen in |
| \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}, |
| changing the number of buckets has almost no effect: |
| Scalability is still abysmal. |
| In particular, we still see a sharp dropoff at 29~CPUs and beyond, |
| clearly demonstrating the complication put forward by |
| \cref{sec:datastruct:NUMA-induced failure to scale} |
| of the list of complications on |
| \cpageref{sec:datastruct:NUMA-induced failure to scale}. |
| Clearly, something else is going on. |
| |
| The problem is that this is a multi-socket system, with CPUs~0--27 |
| and~224--251 mapped to the first socket as shown in |
\cref{tab:datastruct:NUMA Topology of System Under Test}.
| Test runs confined to the first 28~CPUs therefore perform quite |
| well, but tests that involve socket~0's CPUs~0--27 as well as |
| socket~1's CPU~28 incur the overhead of passing data across |
| socket boundaries. |
| This can severely degrade performance, as was discussed in |
| \cref{sec:cpu:Hardware System Architecture}. |
| In short, large multi-socket systems require good locality of reference |
| in addition to full partitioning. |
| The remainder of this chapter will discuss ways of providing good |
| locality of reference within the hash table itself, but in the |
| meantime please note that one other way to provide good locality |
| of reference would be to place large data elements in the hash |
| table. |
| For example, Schr\"odinger might attain excellent cache locality by |
| placing photographs or even videos of his animals in each element of |
| the hash table. |
| But for those needing hash tables containing small data elements, |
| please read on! |
| |
| \QuickQuiz{ |
| Given the negative scalability of the Schr\"odinger's |
| Zoo application across sockets, why not just run multiple |
| copies of the application, with each copy having a subset |
| of the animals and confined to run on a single socket? |
| }\QuickQuizAnswer{ |
| You can do just that! |
| In fact, you can extend this idea to large clustered systems, |
| running one copy of the application on each node of the cluster. |
| This practice is called ``sharding'', and is heavily used in |
| practice by large web-based |
| retailers~\cite{DeCandia:2007:DAH:1323293.1294281}. |
| |
| However, if you are going to shard on a per-socket basis within |
| a multisocket system, why not buy separate smaller and cheaper |
| single-socket systems, and then run one shard of the database |
| on each of those systems? |
| }\QuickQuizEnd |
| |
One key property of these Schr\"odinger's-zoo experiments is that they
are all read-only.
| This makes the performance degradation due to lock-acquisition-induced |
| cache misses all the more painful. |
| Even though we are not updating the underlying hash table itself, we are |
| still paying the price for writing to memory. |
Of course, if the hash table were never going to be updated, we could
dispense entirely with mutual exclusion.
| This approach is quite straightforward and is left as an exercise for the |
| reader. |
| But even with the occasional update, avoiding writes avoids cache |
| misses, and allows the read-mostly data to be replicated across all |
| the caches, which in turn promotes locality of reference. |
| |
| The next section therefore examines optimizations that can be carried out in |
| read-mostly cases where updates are rare, but could happen at any time. |
| |
| \setlength\dashlinedash{1pt} |
| \setlength\dashlinegap{2pt} |
| |
| \begin{table} |
| \renewcommand*{\arraystretch}{1.2} |
| \centering |
| \begin{tabular}{r||c:c} |
| & \multicolumn{2}{c}{Hyperthread} \\ |
| \cline{2-3} |
| Socket & 0 & 1 \\ |
| \hline |
| \hline |
| 0 & 0--27 & 224--251 \\ |
| \cdashline{2-3} |
| 1 & 28--55 & 252--279 \\ |
| \cdashline{2-3} |
| 2 & 56--83 & 280--307 \\ |
| \cdashline{2-3} |
| 3 & 84--111 & 308--335 \\ |
| \cdashline{2-3} |
| 4 & 112--139 & 336--363 \\ |
| \cdashline{2-3} |
| 5 & 140--167 & 364--391 \\ |
| \cdashline{2-3} |
| 6 & 168--195 & 392--419 \\ |
| \cdashline{2-3} |
| 7 & 196--223 & 420--447 \\ |
| \end{tabular} |
| \caption{NUMA Topology of System Under Test} |
\label{tab:datastruct:NUMA Topology of System Under Test}
| \end{table} |
| |
| \section{Read-Mostly Data Structures} |
| \label{sec:datastruct:Read-Mostly Data Structures} |
| % |
| \epigraph{Adapt the remedy to the disease.}{Chinese proverb} |
| |
| Although partitioned data structures can offer excellent scalability, |
| \IXacr{numa} effects can result in severe degradations of both performance and |
| scalability. |
| In addition, |
| the need for read-side synchronization can degrade performance in |
| read-mostly situations. |
| However, we can achieve both performance and scalability by using |
| RCU, which was introduced in |
| \cref{sec:defer:Read-Copy Update (RCU)}. |
| Similar results can be achieved using \IXpl{hazard pointer} |
| (\path{hazptr.c})~\cite{MagedMichael04a}, which will be included in |
| the performance results shown in this |
| section~\cite{McKenney:2013:SDS:2483852.2483867}. |
| |
| \subsection{RCU-Protected Hash Table Implementation} |
| \label{sec:datastruct:RCU-Protected Hash Table Implementation} |
| |
| For an RCU-protected hash table with per-bucket locking, |
| updaters use locking as shown in |
| \cref{sec:datastruct:Partitionable Data Structures}, |
| but readers use RCU\@. |
| The data structures remain as shown in |
| \cref{lst:datastruct:Hash-Table Data Structures}, |
| and the \co{HASH2BKT()}, \co{hashtab_lock()}, and \co{hashtab_unlock()} |
| functions remain as shown in |
| \cref{lst:datastruct:Hash-Table Mapping and Locking}. |
| However, readers use the lighter-weight concurrency-control embodied |
| by \co{hashtab_lock_lookup()} and \co{hashtab_unlock_lookup()} |
| shown in |
| \cref{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}. |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt_rcu@lock_unlock.fcv} |
| \caption{RCU-Protected Hash-Table Read-Side Concurrency Control} |
| \label{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control} |
| \end{listing} |
| |
| \Cref{lst:datastruct:RCU-Protected Hash-Table Lookup} |
| shows \co{hashtab_lookup()} for the RCU-protected per-bucket-locked |
| hash table. |
| This is identical to that in |
| \cref{lst:datastruct:Hash-Table Lookup} |
| except that \apiur{cds_list_for_each_entry()} is replaced |
| by \apiur{cds_list_for_each_entry_rcu()}. |
| Both of these primitives traverse the hash chain referenced |
| by \co{htb->htb_head} but \co{cds_list_for_each_entry_rcu()} also |
| correctly enforces memory ordering in case of concurrent insertion. |
| This is an important difference between these two hash-table implementations: |
Unlike the pure per-bucket-locked implementation, the RCU-protected
| implementation allows lookups to run concurrently with insertions |
| and deletions, and RCU-aware primitives like |
| \co{cds_list_for_each_entry_rcu()} are required to correctly handle |
| this added concurrency. |
| Note also that \co{hashtab_lookup()}'s caller must be within an |
| RCU read-side critical section, for example, the caller must invoke |
| \co{hashtab_lock_lookup()} before invoking \co{hashtab_lookup()} |
| (and of course invoke \co{hashtab_unlock_lookup()} some time afterwards). |
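
For example, a reader checking whether a given animal is present might
proceed as in the sketch below, in which the exact argument lists are
assumptions based on the foregoing description:

\begin{verbatim}
/* Sketch: does an element with this hash and key exist?
 * Exact signatures are assumptions based on the text. */
int zoo_exists(struct hashtab *htp, unsigned long hash, void *key)
{
	struct ht_elem *htep;
	int ret;

	hashtab_lock_lookup(htp, hash);  /* Enter RCU reader. */
	htep = hashtab_lookup(htp, hash, key);
	ret = (htep != NULL);  /* Dereference htep only here. */
	hashtab_unlock_lookup(htp, hash);
	return ret;
}
\end{verbatim}

Note that any dereferencing of the returned pointer must precede the
call to \co{hashtab_unlock_lookup()}, because the element might be
freed once a grace period has elapsed after a concurrent deletion.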
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt_rcu@lookup.fcv} |
| \caption{RCU-Protected Hash-Table Lookup} |
| \label{lst:datastruct:RCU-Protected Hash-Table Lookup} |
| \end{listing} |
| |
| \QuickQuiz{ |
| But if elements in a hash table can be removed concurrently |
| with lookups, doesn't that mean that a lookup could return |
| a reference to a data element that was removed immediately |
| after it was looked up? |
| }\QuickQuizAnswer{ |
| Yes it can! |
| This is why \co{hashtab_lookup()} must be invoked within an |
| RCU read-side critical section, and it is why |
| \co{hashtab_add()} and \co{hashtab_del()} must also use |
| RCU-aware list-manipulation primitives. |
| Finally, this is why the caller of \co{hashtab_del()} must |
| wait for a grace period (e.g., by calling \co{synchronize_rcu()}) |
| before freeing the removed element. |
| This will ensure that all RCU readers that might reference |
| the newly removed element have completed before that element |
| is freed. |
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_bkt_rcu@add_del.fcv} |
| \caption{RCU-Protected Hash-Table Modification} |
| \label{lst:datastruct:RCU-Protected Hash-Table Modification} |
| \end{listing} |
| |
| \Cref{lst:datastruct:RCU-Protected Hash-Table Modification} |
| shows \co{hashtab_add()} and \co{hashtab_del()}, both of which |
| are quite similar to their counterparts in the non-RCU hash table |
| shown in |
| \cref{lst:datastruct:Hash-Table Modification}. |
| The \co{hashtab_add()} function uses \apiur{cds_list_add_rcu()} instead |
| of \apiur{cds_list_add()} in order to ensure proper ordering when |
| an element is added to the hash table at the same time that it is |
| being looked up. |
| The \co{hashtab_del()} function uses \apiur{cds_list_del_rcu()} instead |
| of \apiur{cds_list_del_init()} to allow for the case where an element is |
| looked up just before it is deleted. |
| Unlike \co{cds_list_del_init()}, \co{cds_list_del_rcu()} leaves the |
| forward pointer intact, so that \co{hashtab_lookup()} can traverse |
| to the newly deleted element's successor. |
| |
| Of course, after invoking \co{hashtab_del()}, the caller must wait for |
| an RCU \IX{grace period} (e.g., by invoking \co{synchronize_rcu()}) before |
| freeing or otherwise reusing the memory for the newly deleted element. |
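
Putting these rules together, an updater deleting and freeing an
element might proceed as in the following sketch, again with
hypothetical \co{zoo_he} naming and assumed argument lists:

\begin{verbatim}
/* Sketch: delete an element, then safely free it.  The zoo_he
 * naming and exact signatures are assumptions for illustration. */
void zoo_del_free(struct hashtab *htp, struct zoo_he *zhep,
                  unsigned long hash)
{
	hashtab_lock(htp, hash);    /* Exclude other updaters. */
	hashtab_del(&zhep->zhe_e);  /* Readers may still hold refs. */
	hashtab_unlock(htp, hash);
	synchronize_rcu();  /* Wait for pre-existing readers. */
	free(zhep);         /* Now safe to free. */
}
\end{verbatim}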
| |
| \subsection{RCU-Protected Hash Table Validation} |
| \label{sec:datastruct:RCU-Protected Hash Table Validation} |
| |
| Although the topic of validation is covered in detail in |
| \cref{chp:Validation}, the fact is that a hash table with lockless |
| RCU-protected lookups needs special attention to validation sooner rather |
| than later. |
| |
| The test suite (``\path{hashtorture.h}'') contains a \co{smoketest()} |
| function that verifies that a specific series of single-threaded |
| additions, deletions, and lookups give the expected results. |
| |
Concurrent test runs put each updater thread in control of its portion of
the elements, which allows assertions checking for the following issues
(a minimal sketch of such assertions appears after the list):
| |
| \begin{enumerate} |
\item	A just-now-to-be-added element already being in the table
	according to \co{hashtab_lookup()}.
| \item A just-now-to-be-added element being marked as being in the |
| table by its \co{->in_table} flag. |
\item	A just-now-to-be-deleted element not being in the table according
	to \co{hashtab_lookup()}.
| \item A just-now-to-be-deleted element being marked as not being in |
| the table by its \co{->in_table} flag. |
| \end{enumerate} |
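
A minimal sketch of the addition-side checks might look as follows,
with \co{assert()} standing in for the test harness's actual reporting
and with the \co{test_elem} wrapper and exact signatures assumed for
illustration:

\begin{verbatim}
/* Sketch: checked addition, invoked under hashtab_lock().
 * assert() stands in for the harness's actual checks; the
 * test_elem wrapper and signatures are assumed. */
static void checked_add(struct hashtab *htp, unsigned long hash,
                        struct test_elem *tep, void *key)
{
	assert(hashtab_lookup(htp, hash, key) == NULL); /* Issue 1. */
	assert(!tep->in_table);                         /* Issue 2. */
	hashtab_add(htp, hash, &tep->te_e);
	tep->in_table = 1;
}
\end{verbatim}

The deletion-side checks (issues~3 and~4) are symmetric.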
| |
| In addition, concurrent test runs run lookups concurrently with updates |
| in order to catch all manner of data-structure corruption problems. |
| Some runs also continually resize the hash table concurrently with both |
| lookups and updates to verify correct behavior, and also to verify that |
| resizes do not unduly delay either readers or updaters. |
| |
| Finally, the concurrent tests output statistics that can be used to |
track down performance and scalability issues, which provides the raw
| data used by |
| \cref{sec:datastruct:RCU-Protected Hash Table Performance}. |
| |
| \QuickQuiz{ |
| The \path{hashtorture.h} file contains more than 1,000 lines! |
| Is that a comprehensive test or what??? |
| }\QuickQuizAnswer{ |
| What. |
| |
| The \path{hashtorture.h} tests are a good start and suffice |
| for a textbook algorithm. |
| If this code was to be used in production, much more testing |
| would be required: |
| |
| \begin{enumerate} |
| \item Have some subset of elements that always reside in the |
| table, and verify that lookups always find these elements |
| regardless of the number and type of concurrent updates |
| in flight. |
| \item Pair an updater with one or more readers, verifying that |
| after an element is added, once a reader successfully |
| looks up that element, all later lookups succeed. |
| The definition of ``later'' will depend on the table's |
| consistency requirements. |
| \item Pair an updater with one or more readers, verifying that |
| after an element is deleted, once a reader's lookup |
| of that element fails, all later lookups also fail. |
| \end{enumerate} |
| |
	There are many more tests where those came from, the exact
	nature of which depends on the details of the requirements
	on your particular hash table.
| }\QuickQuizEnd |
| |
| All code requires significant validation effort, and high-performance |
| concurrent code requires more validation than most. |
| |
| \subsection{RCU-Protected Hash Table Performance} |
| \label{sec:datastruct:RCU-Protected Hash Table Performance} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpu}} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| \end{figure} |
| |
| \Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| shows the read-only performance of RCU-protected and hazard-pointer-protected |
| hash tables against the previous section's per-bucket-locked implementation. |
| As you can see, both RCU and hazard pointers perform and scale |
| much better than per-bucket locking because read-only |
| replication avoids NUMA effects. |
| The difference increases with larger numbers of threads. |
| Results from a globally locked implementation are also shown, and as expected |
| the results are even worse than those of the per-bucket-locked implementation. |
| RCU does slightly better than hazard pointers. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulin}} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo, Linear Scale} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo; Linear Scale} |
| \end{figure} |
| |
| \Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo; Linear Scale} |
| shows the same data on a linear scale. |
| This drops the global-locking trace into the x-axis, but allows the |
| non-ideal performance of RCU and hazard pointers to be more readily |
| discerned. |
| Both show a change in slope at 224~CPUs, and this is due to hardware |
| multithreading. |
| At 224 and fewer CPUs, each thread has a core to itself. |
| In this regime, RCU does better than does hazard pointers because the |
| latter's read-side \IXpl{memory barrier} result in dead time within the core. |
| In short, RCU is better able to utilize a core from a single hardware |
| thread than is hazard pointers. |
| |
| This situation changes above 224~CPUs. |
| Because RCU is using more than half of each core's resources from a |
| single hardware thread, RCU gains relatively little benefit from the |
| second hardware thread in each core. |
| The slope of the hazard-pointers trace also decreases at 224~CPUs, but |
| less dramatically, |
| because the second hardware thread is able to fill in the time |
| that the first hardware thread is stalled due to \IXh{memory-barrier}{latency}. |
| As we will see in later sections, this second-hardware-thread |
| advantage depends on the workload. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulinqsbr}} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo including QSBR, Linear Scale} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR; Linear Scale} |
| \end{figure} |
| |
| But why is RCU's performance a factor of five less than ideal? |
| One possibility is that the per-thread counters manipulated by |
| \co{rcu_read_lock()} and \co{rcu_read_unlock()} are slowing things down. |
| \Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR; Linear Scale} |
| therefore adds the results for the \IXacr{qsbr} variant of RCU, whose read-side |
| primitives do nothing. |
| And although QSBR does perform slightly better than does RCU, it is still |
| about a factor of five short of ideal. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulinqsbrunsync}} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo including QSBR and Unsynchronized, Linear Scale} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale} |
| \end{figure} |
| |
| \Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale} |
| adds completely unsynchronized results, which works because this |
| is a read-only benchmark with nothing to synchronize. |
| Even with no synchronization whatsoever, performance still falls far |
| short of ideal, thus demonstrating |
| \cref{sec:datastruct:Cache-size-induced failure to scale,sec:datastruct:Cache-size-induced failure to scale redux} |
| of the list of complications on |
| \cpageref{sec:datastruct:Cache-size-induced failure to scale redux}. |
| |
| The problem is that this system has sockets with 28 cores, which have |
| the modest cache sizes shown in |
| \cref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz} |
| on \cpageref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz}. |
| Each hash bucket (\co{struct ht_bucket}) occupies 56~bytes and each |
| element (\co{struct zoo_he}) occupies 72~bytes for the RCU and QSBR runs. |
| The benchmark generating |
| \cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale} |
| used 262,144 buckets and up to 262,144 elements, for a total of |
| 33,554,448~bytes, which not only overflows the 1,048,576-byte L2 caches |
| by more than a factor of thirty, but is also uncomfortably close to the |
| L3 cache size of 40,370,176~bytes, especially given that this cache has |
| only 11~ways. |
| This means that L2 cache collisions will be the rule and also that L3 |
| cache collisions will not be uncommon, so that the resulting cache misses |
| will degrade performance. |
| In this case, the bottleneck is not in the CPU, but rather in the |
| memory system. |
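
For reference, the arithmetic is as follows, on the assumption that
the extra 16~bytes are the \co{hashtab} structure itself:
\[
262{,}144 \times ( 56 + 72 ) + 16 = 33{,}554{,}448 \mbox{~bytes}.
\]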
| % This data was generated wtih 262,144 buckets and elements. |
| % RCU hash ht_elem structure is 40 bytes plus 32 more from zoo_he for 72. |
| % struct ht_bucket is 56 bytes. |
| % For "# B" RCU runs, hash buckets weigh in at 14,680,080 bytes and the hash |
| % elements up to 18,874,368 bytes. The total is 33,554,448 bytes, which |
| % in theory fits into the 40,370,176-byte L3 cache, but definitely not |
| % the 1,048,576-byte L2 cache, much less the 32,768-byte L0 and L1 caches. |
| % The L3 is 11-way set associative. |
| % So we have 8192 4K pages of data (buckets and elements) that must |
| % fit into 896 11-way set-associated cache sets. |
| |
| Additional evidence for this memory-system bottleneck may be found by |
| examining the unsynchronized code. |
| This code does not need locks, so each hash bucket occupies only 16~bytes |
| compared to the 56~bytes for RCU and QSBR\@. |
| Similarly, each hash-table element occupies only 56~bytes compared to the |
| 72~bytes for RCU and QSBR\@. |
So it is unsurprising that the single-CPU unsynchronized run performs up
to about half again faster than either the QSBR or the RCU run.
| |
| \QuickQuiz{ |
| How can we be so sure that the hash-table size is at fault here, |
| especially given that |
| \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} |
| on \cpageref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} |
| shows that varying hash-table size has almost |
| no effect? |
| Might the problem instead be something like \IX{false sharing}? |
| }\QuickQuizAnswer{ |
| Excellent question! |
| |
| False sharing requires writes, which are not featured in the |
| unsynchronized and RCU runs of this lookup-only |
| benchmark. |
| The problem is therefore not false sharing. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf-hashsize.2020.12.29a/zoohashsize}} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs, Varying Table Size} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size} |
| \end{figure} |
| |
| Still unconvinced? |
| Then look at the log-log plot in |
| \cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size}, |
| which shows performance for 448~CPUs as a function of the |
| hash-table size, that is, number of buckets and maximum number |
| of elements. |
| A hash-table of size 1,024 has 1,024~buckets and contains |
| at most 1,024~elements, with the average occupancy being |
| 512~elements. |
| Because this is a read-only benchmark, the actual occupancy is |
| always equal to the average occupancy. |
| |
| This figure shows near-ideal performance below about 8,000~elements, |
| that is, when the hash table comprises less than 1\,MB of data. |
| This near-ideal performance is consistent with that for the |
| pre-BSD routing table shown in |
| \cref{fig:defer:Pre-BSD Routing Table Protected by RCU} |
| on \cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU}, |
| even at 448~CPUs. |
| However, the performance drops significantly (this is a log-log |
| plot) at about 8,000~elements, which is where the 1,048,576-byte |
| L2 cache overflows. |
| Performance falls off a cliff (even on this log-log plot) at about |
| 300,000~elements, where the 40,370,176-byte L3 cache overflows. |
| This demonstrates that the memory-system bottleneck is profound, |
| degrading performance by well in excess of an order of magnitude |
| for the large hash tables. |
| This should not be a surprise, as the size-8,388,608 hash table |
| occupies about 1\,GB of memory, overflowing the L3 caches by |
| a factor of 25. |
| |
| The reason that |
| \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} |
| on \cpageref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets} |
| shows little effect is that its data was gathered from |
| bucket-locked hash tables, where locking overhead and contention |
| drowned out cache-capacity effects. |
| In contrast, both RCU and hazard-pointers readers avoid stores |
| to shared data, which means that the cache-capacity effects come |
| to the fore. |
| |
| Still not satisfied? |
| Find a multi-socket system and run this code, making use of |
| whatever performance-counter hardware is available. |
| This hardware should allow you to track down the precise cause |
| of any slowdowns exhibited on your particular system. |
| The experience gained by doing this exercise will be extremely |
| valuable, giving you a significant advantage over those whose |
| understanding of this issue is strictly theoretical.\footnote{ |
| Of course, a theoretical understanding beats no |
| understanding. |
| Most of the time, anyway.} |
| }\QuickQuizEnd |
| |
| What if the memory footprint is reduced still further? |
| \Cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head} |
| on \cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head} |
| shows that RCU attains very nearly ideal performance on the much smaller |
| data structure represented by the pre-BSD routing table. |
| |
| \QuickQuiz{ |
| The memory system is a serious bottleneck on this big system. |
| Why bother putting 448~CPUs on a system without giving them |
| enough memory bandwidth to do something useful??? |
| }\QuickQuizAnswer{ |
| It would indeed be a bad idea to use this large and expensive |
| system for a workload consisting solely of simple hash-table |
| lookups of small data elements. |
| However, this system is extremely useful for a great many |
| workloads that feature more processing and less memory accessing. |
| For example, some in-memory databases run extremely well on |
| this class of system, albeit when running much more complex |
| sets of queries than performed by the benchmarks in this chapter. |
| For example, such systems might be processing images or video |
| streams stored in each element, providing further performance |
| benefits due to the fact that the resulting sequential memory |
| accesses will make better use of the available memory bandwidth |
| than will a pure pointer-following workload. |
| |
| But let this be a lesson to you. |
| Modern computer systems come in a great many shapes and sizes, |
| and great care is frequently required to select one that suits |
| your application. |
| And perhaps even more frequently, significant care and work is |
| required to adjust your application to the specific computer |
| systems at hand. |
| }\QuickQuizEnd |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocatonly}} |
| \caption{Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 64 CPUs} |
| \label{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 64 CPUs} |
| \end{figure} |
| |
| As noted earlier, Schr\"odinger is surprised by the popularity of his |
| cat~\cite{ErwinSchroedinger1935Cat}, but recognizes the need to reflect |
| this popularity in his design. |
| \Cref{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 64 CPUs} |
| shows the results of 64-CPU runs, varying the number of CPUs that are |
| doing nothing but looking up the cat. |
| Both RCU and hazard pointers respond well to this challenge, but bucket |
| locking scales negatively, eventually performing as badly as global |
| locking. |
| This should not be a surprise because if all CPUs are doing nothing |
| but looking up the cat, the lock corresponding to the cat's bucket |
| is for all intents and purposes a global lock. |
| |
| This cat-only benchmark illustrates one potential problem with |
| fully partitioned sharding approaches. |
Only those CPUs associated with the cat's
partition are able to access the cat, limiting the cat-only
throughput.
| Of course, a great many applications have good load-spreading |
| properties, and for these applications sharding works |
| quite well. |
| However, sharding does not handle ``\IXpl{hot spot}'' very well, with |
| the hot spot exemplified by Schr\"odinger's cat being but one case |
| in point. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zooupdatelu}} |
| \caption{Read-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo in the Presence of Updates} |
| \label{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| \end{figure} |
| |
| If we were only ever going to read the data, we would not need any |
| concurrency control to begin with. |
| \Cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| therefore shows the effect of updates on readers. |
| At the extreme left-hand side of this graph, all but one of the CPUs |
| are doing lookups, while to the right all 448~CPUs are doing updates. |
| For all four implementations, the number of lookups per millisecond |
| decreases as the number of updating CPUs increases, of course reaching |
| zero lookups per millisecond when all 448~CPUs are updating. |
| Both hazard pointers and RCU do well compared to per-bucket locking |
| because their readers do not increase update-side lock contention. |
| RCU does well relative to hazard pointers as the number of updaters |
| increases due to the latter's read-side memory barriers, which incur |
| greater overhead, especially in the presence of updates, and particularly |
| when execution involves more than one socket. |
| It therefore seems likely that modern hardware heavily optimizes memory-barrier |
| execution, greatly reducing memory-barrier overhead in the read-only case. |
| % @@@ When there is a section covering trees, reference sharding with |
| % @@@ CPUs being permitted to directly look up popular entries. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zooupdate}} |
| \caption{Update-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo} |
| \label{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| \end{figure} |
| |
| Where |
| \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| showed the effect of increasing update rates on lookups, |
| \cref{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| shows the effect of increasing update rates on the updates themselves. |
| Again, at the left-hand side of the figure all but one of the CPUs are |
| doing lookups and at the right-hand side of the figure all 448~CPUs are |
| doing updates. |
| Hazard pointers and RCU start off with a significant advantage because, |
| unlike bucket locking, readers do not exclude updaters. |
| However, as the number of updating CPUs increases, the update-side |
| deferred-execution overhead starts to make its presence known, first |
| for RCU and then for hazard pointers. |
| Of course, all three of these implementations beat global locking, and |
| by more than an order of magnitude. |
| |
| It is quite possible that the differences in lookup performance observed in |
| \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates} |
| are affected by the differences in update rates. |
| One way to check this is to artificially throttle the update rates of |
| per-bucket locking and hazard pointers to match that of RCU\@. |
| Doing so does not significantly improve the lookup performance of |
| per-bucket locking, nor does it close the gap between hazard pointers |
| and RCU\@. |
| However, removing the read-side memory barriers from hazard pointers |
| (thus resulting in an unsafe implementation) does nearly close the gap |
| between hazard pointers and RCU\@. |
| Although this unsafe hazard-pointer implementation will |
| usually be reliable enough for benchmarking purposes, it is absolutely |
| not recommended for production use. |
| |
| \QuickQuiz{ |
	The dangers of extrapolating from 28~CPUs to 448~CPUs were
	made quite clear in
| \cref{sec:datastruct:Hash-Table Performance}. |
| Would extrapolating up from 448~CPUs be any safer? |
| }\QuickQuizAnswer{ |
| In theory, no, it isn't any safer, and a useful exercise would be |
| to run these programs on larger systems. |
| In practice, there are only a very few systems with more than |
| 448~CPUs, in contrast to the huge number having more than 28~CPUs. |
| This means that although it is dangerous to extrapolate beyond |
| 448~CPUs, there is very little need to do so. |
| |
| In addition, other testing has shown that RCU read-side primitives |
| offer consistent performance and scalability up to at least 1024~CPUs. |
| However, it is useful to review |
| \cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size} |
| and its associated commentary. |
| You see, unlike the 448-CPU system that provided this data, |
| the system enjoying linear scalability up to 1024~CPUs boasted |
| excellent memory bandwidth. |
| }\QuickQuizEnd |
| |
| And this situation demonstrates |
| \cref{sec:datastruct:Update-activity failure to scale} |
in the list of complications on
| \cpageref{sec:datastruct:Update-activity failure to scale}. |
| |
| % @@@ Testing strategy. Summarize hashtorture, add QQ for additional |
| % @@@ things that could be done. Note difference between textbook |
| % @@@ and production algorithms, also the need for clear understanding |
| % @@@ of what level of consistency is required. |
| |
| \subsection{RCU-Protected Hash Table Discussion} |
| \label{sec:datastruct:RCU-Protected Hash Table Discussion} |
| |
| One consequence of the RCU and hazard-pointer implementations is |
| that a pair of concurrent readers might disagree on the state of |
| the cat. |
| For example, one of the readers might have fetched the pointer to |
| the cat's data structure just before it was removed, while another |
| reader might have fetched this same pointer just afterwards. |
| The first reader would then believe that the cat was alive, while |
| the second reader would believe that the cat was dead. |
| |
| This situation is completely fitting for Schr\"odinger's |
| cat, but it turns out that it is quite reasonable for normal |
| non-quantum cats as well. |
| After all, it is impossible to determine exactly when an animal is born |
| or dies. |
| |
| To see this, let's suppose that we detect a cat's death by heartbeat. |
This raises the question of exactly how long we should wait after the
| last heartbeat before declaring death. |
| It is clearly ridiculous to wait only one millisecond, because then |
| a healthy living cat would have to be declared dead---and then |
| resurrected---more than once per second. |
| It is equally ridiculous to wait a full month, because by that time the |
poor cat's death would be unmistakably apparent to olfactory sensors.
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/2013-08-is-it-dead}} |
| \caption{Even Veterinarians Disagree!} |
| \ContributedBy{fig:datastruct:Even Veterinarians Disagree}{Melissa Broussard} |
| \end{figure} |
| |
| Because an animal's heart can stop for some seconds and then start up |
| again, there is a tradeoff between timely recognition of death and |
| probability of false alarms. |
| It is quite possible that a pair of veterinarians might disagree on |
| the time to wait between the last heartbeat and the declaration of |
| death. |
| For example, one veterinarian might declare death thirty seconds after |
| the last heartbeat, while another might insist on waiting a full |
| minute. |
| In this case, the two veterinarians would disagree on the state of the |
| cat for the second period of thirty seconds following the last heartbeat, |
| as fancifully depicted in |
| \cref{fig:datastruct:Even Veterinarians Disagree}. |
| |
\pplsur{Werner}{Heisenberg} taught us to live with this sort of
uncertainty~\cite{WeinerHeisenberg1927Uncertain}, which is a good
thing because computing hardware and software act similarly.
| For example, how do you know that a piece of computing hardware |
| has failed? |
| Often because it does not respond in a timely fashion. |
| Just like the cat's heartbeat, this results in a window of |
| uncertainty as to whether or not the hardware has really failed, |
| as opposed to just being slow. |
| |
| Furthermore, most computing systems are intended to interact with |
| the outside world. |
| Consistency with the outside world is therefore of paramount importance. |
| However, as we saw in |
| \cref{fig:defer:Response Time of RCU vs. Reader-Writer Locking} |
| on |
| \cpageref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}, |
| increased internal consistency can come at the expense of degraded |
| external consistency. |
| Techniques such as RCU and hazard pointers give up some degree of |
| internal consistency to attain improved external consistency. |
| |
| In short, internal consistency is not necessarily a natural part of all |
| problem domains, and often incurs great expense in terms of performance, |
| scalability, consistency with the outside |
| world~\cite{AndreasHaas2012FIFOisnt,AndreasHaas2013CFRelaxedQueues,10.5555/3241639.3241645}, |
| or all of the above. |
| |
| \section{Non-Partitionable Data Structures} |
| \label{sec:datastruct:Non-Partitionable Data Structures} |
| % |
| \epigraph{Don't be afraid to take a big step if one is indicated. |
| You can't cross a chasm in two small steps.} |
| {David Lloyd George} |
| |
| Fixed-size hash tables are perfectly partitionable, but resizable hash |
| tables pose partitioning challenges when growing or shrinking, as |
| fancifully depicted in |
| \cref{fig:datastruct:Partitioning Problems}. |
| However, it turns out that it is possible to construct high-performance |
| scalable RCU-protected hash tables, as described in the following sections. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{cartoons/2014_Hash-table-hydra}} |
| \caption{Partitioning Problems} |
| \ContributedBy{fig:datastruct:Partitioning Problems}{Melissa Broussard} |
| \end{figure} |
| |
| \subsection{Resizable Hash Table Design} |
| \label{sec:datastruct:Resizable Hash Table Design} |
| |
| In happy contrast to the situation in the early 2000s, there are now |
| several different types of scalable RCU-protected hash tables. |
| The first (and simplest) was developed for the Linux kernel by |
| \ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash}, and is described in the |
| following sections. |
| Two others are covered briefly in |
| \cref{sec:datastruct:Other Resizable Hash Tables}. |
| |
| The key insight behind the first hash-table implementation is that |
| each data element can have two sets of list pointers, with one set |
| currently being used by RCU readers (as well as by non-RCU updaters) |
| and the other being used to construct a new resized hash table. |
| This approach allows lookups, insertions, and deletions to all run |
| concurrently with a resize operation (as well as with each other). |
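
A data element with two sets of list pointers might therefore be
sketched as follows, with field names that are illustrative rather
than those of the actual implementation:

\begin{verbatim}
/* Sketch: hash-table element with two sets of list pointers,
 * using the userspace RCU library's cds_list_head.  Field
 * names are illustrative, not those of hash_resize.c. */
struct ht_elem {
	struct cds_list_head hte_next[2]; /* One set per table. */
	unsigned long hte_hash;
};
\end{verbatim}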
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-a}} |
| \caption{Growing a Two-List Hash Table, State~(a)} |
| \label{fig:datastruct:Growing a Two-List Hash Table; State (a)} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-b}} |
| \caption{Growing a Two-List Hash Table, State~(b)} |
| \label{fig:datastruct:Growing a Two-List Hash Table; State (b)} |
| \end{figure} |
| |
| The resize operation proceeds as shown in |
| \crefrange{fig:datastruct:Growing a Two-List Hash Table; State (a)} |
| {fig:datastruct:Growing a Two-List Hash Table; State (d)}, |
| with the initial two-bucket state shown in |
| \cref{fig:datastruct:Growing a Two-List Hash Table; State (a)} |
| and with time advancing from figure to figure. |
| The initial state uses the zero-index links to chain the elements into |
| hash buckets. |
| A four-bucket array is allocated, and the one-index links are used to |
| chain the elements into these four new hash buckets. |
| This results in state~(b) shown in |
| \cref{fig:datastruct:Growing a Two-List Hash Table; State (b)}, |
| with readers still using the original two-bucket array. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-c}} |
| \caption{Growing a Two-List Hash Table, State~(c)} |
| \label{fig:datastruct:Growing a Two-List Hash Table; State (c)} |
| \end{figure} |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-d}} |
| \caption{Growing a Two-List Hash Table, State~(d)} |
| \label{fig:datastruct:Growing a Two-List Hash Table; State (d)} |
| \end{figure} |
| |
| The new four-bucket array is exposed to readers, and then a grace-period
| operation waits for all pre-existing readers, resulting in state~(c), shown in
| \cref{fig:datastruct:Growing a Two-List Hash Table; State (c)}. |
| In this state, all readers are using the new four-bucket array, |
| which means that the old two-bucket array may now be freed, resulting |
| in state~(d), shown in |
| \cref{fig:datastruct:Growing a Two-List Hash Table; State (d)}. |
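| 
| The following hedged outline maps these states onto code, reusing the
| \co{elem_sketch} structure above and assuming a hypothetical \co{hash()}
| function along with the usual RCU primitives.
| Unlike the actual implementation, this outline ignores the locking
| required to handle concurrent updates:
| \begin{verbatim}
| /* Hedged outline of the grow operation, keyed to states
|  * (a)-(d).  All names are illustrative, and the locking
|  * needed to handle concurrent updates is omitted. */
| struct table_sketch {
|     int idx;                  /* which pointer set this table uses */
|     unsigned long nbuckets;
|     struct elem_sketch **bkt; /* array of bucket heads */
| };
| 
| void grow_sketch(struct table_sketch **cur,
|                  struct table_sketch *oldt,
|                  struct table_sketch *newt)
| {
|     struct elem_sketch *ep, **b;
|     unsigned long i;
| 
|     /* (a) -> (b): chain each element into the new table
|      * using the other pointer-set index. */
|     for (i = 0; i < oldt->nbuckets; i++) {
|         for (ep = oldt->bkt[i]; ep; ep = ep->next[oldt->idx]) {
|             b = &newt->bkt[hash(ep->key) % newt->nbuckets];
|             ep->next[newt->idx] = *b;
|             *b = ep;
|         }
|     }
| 
|     /* (b) -> (c): publish the new table, then wait for all
|      * pre-existing readers of the old one. */
|     rcu_assign_pointer(*cur, newt);
|     synchronize_rcu();
| 
|     /* (c) -> (d): no reader can now reference the old
|      * table, so it may be freed. */
|     free(oldt->bkt);
|     free(oldt);
| }
| \end{verbatim}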
| |
| This design leads to a relatively straightforward implementation, |
| which is the subject of the next section. |
| |
| \subsection{Resizable Hash Table Implementation} |
| \label{sec:datastruct:Resizable Hash Table Implementation} |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:data] |
| Resizing is accomplished by the classic approach of inserting a level |
| of indirection, in this case, the \co{ht} structure shown on |
| \clnrefrange{ht:b}{ht:e} of |
| \cref{lst:datastruct:Resizable Hash-Table Data Structures} |
| (\path{hash_resize.c}). |
| The \co{hashtab} structure shown on |
| \clnrefrange{hashtab:b}{hashtab:e} contains only a |
| pointer to the current \co{ht} structure along with a spinlock that |
| is used to serialize concurrent attempts to resize the hash table. |
| If we were to use a traditional lock- or atomic-operation-based |
| implementation, this \co{hashtab} structure could become a severe bottleneck |
| from both performance and scalability viewpoints. |
| However, because resize operations should be relatively infrequent, |
| we should be able to make good use of RCU\@. |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize@data.fcv} |
| \caption{Resizable Hash-Table Data Structures} |
| \label{lst:datastruct:Resizable Hash-Table Data Structures} |
| \end{listing} |
| |
| The \co{ht} structure represents a specific size of the hash table, |
| as specified by the \co{->ht_nbuckets} field on \clnref{ht:nbuckets}. |
| The size is stored in the same structure containing the array of |
| buckets (\co{->ht_bkt[]} on |
| \clnref{ht:bkt}) in order to avoid mismatches between |
| the size and the array. |
| The \co{->ht_resize_cur} field on |
| \clnref{ht:resize_cur} is equal to $-1$ unless a resize |
| operation |
| is in progress, in which case it indicates the index of the bucket whose |
| elements are being inserted into the new hash table, which is referenced |
| by the \co{->ht_new} field on \clnref{ht:new}. |
| If there is no resize operation in progress, \co{->ht_new} is \co{NULL}. |
| Thus, a resize operation proceeds by allocating a new \co{ht} structure |
| and referencing it via the \co{->ht_new} pointer, then advancing |
| \co{->ht_resize_cur} through the old table's buckets. |
| When all the elements have been added to the new table, the new |
| table is linked into the \co{hashtab} structure's \co{->ht_cur} field. |
| Once all old readers have completed, the old hash table's \co{ht} structure |
| may be freed. |
| |
| The \co{->ht_idx} field on |
| \clnref{ht:idx} indicates which of the two sets of |
| list pointers are being used by this instantiation of the hash table, |
| and is used to index the \co{->hte_next[]} array in the \co{ht_elem} |
| structure on \clnref{ht_elem:next}. |
| |
| The \co{->ht_cmp()}, \co{->ht_gethash()}, and \co{->ht_getkey()} fields on |
| \clnrefrange{ht:cmp}{ht:getkey} |
| collectively define the per-element key and the hash function. |
| The \co{->ht_cmp()} function compares a specified key with that of |
| the specified element, |
| the \co{->ht_gethash()} function calculates the specified key's hash,
| and the \co{->ht_getkey()} function extracts the key from the enclosing
| data element.
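| 
| As a hedged example (none of these names appear in
| \path{hash_resize.c}, and the actual function signatures might differ),
| a table of Schr\"odinger's animals keyed by name string might supply
| the following:
| \begin{verbatim}
| /* Hypothetical key functions for elements embedding a string
|  * key.  string_hash() is an assumed hash function, and
|  * container_of() is the usual Linux-kernel-style macro. */
| struct zoo_animal {
|     char name[32];
|     struct ht_elem za_hte;
| };
| 
| static int zoo_cmp(struct ht_elem *htep, void *key)
| {
|     struct zoo_animal *zap;
| 
|     zap = container_of(htep, struct zoo_animal, za_hte);
|     return strcmp(zap->name, (char *)key) == 0;
| }
| 
| static unsigned long zoo_gethash(void *key)
| {
|     return string_hash((char *)key);
| }
| 
| static void *zoo_getkey(struct ht_elem *htep)
| {
|     return container_of(htep, struct zoo_animal, za_hte)->name;
| }
| \end{verbatim}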
| |
| The \co{ht_lock_state} structure shown on \clnrefrange{hls:b}{hls:e}
| is used to communicate lock state from the new \co{hashtab_lock_mod()}
| to \co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}.
| This state prevents the algorithm from being redirected to the wrong |
| bucket during concurrent resize operations. |
| |
| The \co{ht_bucket} structure is the same as before, and the |
| \co{ht_elem} structure differs from that of previous implementations |
| only in providing a two-element array of list pointer sets in place of |
| the prior single set of list pointers. |
| |
| In a fixed-size hash table, bucket selection is quite straightforward:
| Simply transform the hash value to the corresponding bucket index. |
| In contrast, when resizing, updaters must also determine which of the |
| old and new sets of buckets to select from. |
| If the bucket that would be selected from the old table has already |
| been distributed into the new table, then the bucket should be selected |
| from the new table as well as from the old table. |
| Conversely, if the bucket that would be selected from the old table |
| has not yet been distributed, then the bucket should be selected from |
| the old table. |
| \end{fcvref} |
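| 
| This choice might be sketched as follows, using the \co{ht}-structure
| fields described above; the actual implementation instead embeds this
| logic in its update-side locking function:
| \begin{verbatim}
| /* Sketch only: decide whether an update mapping to old-table
|  * bucket index b must also operate on the new table.  Buckets
|  * with index at or below ->ht_resize_cur have already been
|  * distributed into the new table. */
| static struct ht *ht_also_new_sketch(struct ht *htp, long b)
| {
|     if (b <= READ_ONCE(htp->ht_resize_cur))
|         return rcu_dereference(htp->ht_new); /* new table too */
|     return NULL;                             /* old table only */
| }
| \end{verbatim}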
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize@get_bucket.fcv} |
| \caption{Resizable Hash-Table Bucket Selection} |
| \label{lst:datastruct:Resizable Hash-Table Bucket Selection} |
| \end{listing} |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:get_bucket] |
| Bucket selection is shown in |
| \cref{lst:datastruct:Resizable Hash-Table Bucket Selection}, |
| which shows \co{ht_get_bucket()} on |
| \clnrefrange{single:b}{single:e} and \co{ht_search_bucket()} on |
| \clnrefrange{hsb:b}{hsb:e}. |
| The \co{ht_get_bucket()} function returns a reference to the bucket |
| corresponding to the specified key in the specified hash table, without |
| making any allowances for resizing. |
| It also stores the bucket index corresponding to the key into the location
| referenced by parameter~\co{b} on
| \clnref{single:gethash}, and the key's hash value into the location
| referenced by parameter~\co{h} (if non-\co{NULL}) on \clnref{single:h}.
| \Clnref{single:return} then returns a reference to the corresponding bucket. |
| |
| The \co{ht_search_bucket()} function searches for the specified key |
| within the specified hash-table version. |
| \Clnref{hsb:get_curbkt} obtains a reference to the bucket corresponding |
| to the specified key. |
| The loop spanning \clnrefrange{hsb:loop:b}{hsb:loop:e} searches |
| that bucket, so that if \clnref{hsb:match} detects a match, |
| \clnref{hsb:ret_match} returns a pointer to the enclosing data element. |
| Otherwise, if there is no match, |
| \clnref{hsb:ret_NULL} returns \co{NULL} to indicate |
| failure. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| How does the code in |
| \cref{lst:datastruct:Resizable Hash-Table Bucket Selection} |
| protect against the resizing process progressing past the |
| selected bucket? |
| }\QuickQuizAnswer{ |
| It does not provide any such protection. |
| That is instead the job of the update-side concurrency-control |
| functions described next. |
| }\QuickQuizEnd |
| |
| This implementation of \co{ht_get_bucket()} and \co{ht_search_bucket()} |
| permits lookups and modifications to run concurrently with a resize |
| operation. |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize@lock_unlock_mod.fcv} |
| \caption{Resizable Hash-Table Update-Side Concurrency Control} |
| \label{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control} |
| \end{listing} |
| |
| Read-side concurrency control is provided by RCU as was shown in |
| \cref{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}, |
| but the update-side concurrency\-/control functions |
| \co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()} |
| must now deal with the possibility of a |
| concurrent resize operation as shown in |
| \cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}. |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod:l] |
| The \co{hashtab_lock_mod()} function spans
| \clnrefrange{b}{e} in the listing.
| \Clnref{rcu_lock} enters an RCU read-side critical section to prevent |
| the data structures from being freed during the traversal, |
| \clnref{refhashtbl} acquires a reference to the current hash table, and then |
| \clnref{refbucket} obtains a reference to the bucket in this hash table |
| corresponding to the key. |
| \Clnref{acq_bucket} acquires that bucket's lock, which will prevent any concurrent |
| resizing operation from distributing that bucket, though of course it |
| will have no effect if that bucket has already been distributed. |
| \Clnrefrange{lsp0b}{lsp0e} store the bucket pointer and |
| pointer-set index into their respective fields in the |
| \co{ht_lock_state} structure, which communicates the information to |
| \co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}. |
| \Clnref{ifresized} then checks to see if a concurrent resize |
| operation has already distributed this bucket across the new hash table, |
| and if not, \clnref{lsp1_1} indicates that there is no |
| already-resized hash bucket and |
| \clnref{fastret1} returns with the selected hash bucket's |
| lock held (thus preventing a concurrent resize operation from distributing |
| this bucket) and also within an RCU read-side critical section. |
| \IX{Deadlock} is avoided because the old table's locks are always acquired |
| before those of the new table, and because the use of RCU prevents more |
| than two versions from existing at a given time, thus preventing a |
| deadlock cycle. |
| |
| Otherwise, a concurrent resize operation has already distributed this |
| bucket, so \clnref{new_hashtbl} proceeds to the new hash table, |
| \clnref{get_newbkt} selects the bucket corresponding to the key, |
| and \clnref{acq_newbkt} acquires the bucket's lock. |
| \Clnrefrange{lsp1b}{lsp1e} store the new-table bucket pointer and |
| pointer-set index into their respective fields in the |
| \co{ht_lock_state} structure, which again communicates this information to |
| \co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}. |
| Because this bucket has already been resized and because |
| \co{hashtab_add()} and \co{hashtab_del()} affect both the old and the |
| new \co{ht_bucket} structures, two locks are held, one on each of the |
| two buckets. |
| Additionally, both elements of each array in the \co{ht_lock_state} structure
| are used, with the \co{[0]} element pertaining to the old \co{ht_bucket} |
| structure and the \co{[1]} element pertaining to the new structure. |
| Once again, \co{hashtab_lock_mod()} exits within an RCU read-side critical |
| section. |
| \end{fcvref} |
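| 
| Stripped to its essentials, and with the \co{->htb_lock} field name
| assumed, the logic of \co{hashtab_lock_mod()} amounts to roughly the
| following sketch:
| \begin{verbatim}
| /* Hedged paraphrase of hashtab_lock_mod(); see hash_resize.c
|  * for the real code.  Exits holding one or two bucket locks
|  * and within an RCU read-side critical section. */
| static void lock_mod_sketch(struct hashtab *htp_master, void *key,
|                             struct ht_lock_state *lsp)
| {
|     long b;
|     struct ht *htp;
|     struct ht_bucket *htbp;
| 
|     rcu_read_lock();                    /* prevent freeing */
|     htp = rcu_dereference(htp_master->ht_cur);
|     htbp = ht_get_bucket(htp, key, &b, NULL);
|     spin_lock(&htbp->htb_lock);         /* block distribution */
|     lsp->hbp[0] = htbp;
|     lsp->hls_idx[0] = htp->ht_idx;
|     if (b > READ_ONCE(htp->ht_resize_cur)) {
|         lsp->hbp[1] = NULL;             /* not yet distributed */
|         return;
|     }
|     htp = rcu_dereference(htp->ht_new); /* already distributed */
|     htbp = ht_get_bucket(htp, key, &b, NULL);
|     spin_lock(&htbp->htb_lock);         /* old lock, then new */
|     lsp->hbp[1] = htbp;
|     lsp->hls_idx[1] = htp->ht_idx;
| }
| \end{verbatim}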
| |
| \begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod:ul] |
| The \co{hashtab_unlock_mod()} function releases the lock(s) acquired by |
| \co{hashtab_lock_mod()}. |
| \Clnref{relbkt0} releases the lock on the old \co{ht_bucket} structure. |
| In the unlikely event that \clnref{ifbkt1} determines that a resize |
| operation is in progress, \clnref{relbkt1} releases the lock on the |
| new \co{ht_bucket} structure. |
| Either way, \clnref{rcu_unlock} exits the RCU read-side critical |
| section. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| Suppose that one thread is inserting an element into the |
| hash table during a resize operation. |
| What prevents this insertion from being lost due to a subsequent |
| resize operation completing before the insertion does? |
| }\QuickQuizAnswer{ |
| The second resize operation will not be able to move beyond |
| the bucket into which the insertion is taking place due to |
| the insertion holding the lock(s) on one or both of the hash |
| buckets in the hash tables. |
| Furthermore, the insertion operation takes place within an |
| RCU read-side critical section. |
| As we will see when we examine the \co{hashtab_resize()} |
| function, this allows each resize operation to use |
| \co{synchronize_rcu()} invocations to wait for the insertion's |
| read-side critical section to complete. |
| These \co{synchronize_rcu()} invocations prevent pre-existing |
| insertion operations from outliving the resize operation.
| }\QuickQuizEnd |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize@access.fcv} |
| \caption{Resizable Hash-Table Access Functions} |
| \label{lst:datastruct:Resizable Hash-Table Access Functions} |
| \end{listing} |
| |
| Now that we have bucket selection and concurrency control in place, |
| we are ready to search and update our resizable hash table. |
| The \co{hashtab_lookup()}, \co{hashtab_add()}, and \co{hashtab_del()} |
| functions are shown in |
| \cref{lst:datastruct:Resizable Hash-Table Access Functions}. |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:access:lkp] |
| The \co{hashtab_lookup()} function on |
| \clnrefrange{b}{e} of the listing does |
| hash lookups. |
| \Clnref{get_curtbl} fetches the current hash table and |
| \clnref{get_curbkt} searches the bucket corresponding to the |
| specified key. |
| \Clnref{ret} returns a pointer to the searched-for element |
| or \co{NULL} when the search fails. |
| The caller must be within an RCU read-side critical section. |
| \end{fcvref} |
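| 
| In sketch form, with the caller supplying the RCU read-side critical
| section:
| \begin{verbatim}
| /* Hedged paraphrase of hashtab_lookup(). */
| static struct ht_elem *lookup_sketch(struct hashtab *htp_master,
|                                      void *key)
| {
|     struct ht *htp = rcu_dereference(htp_master->ht_cur);
| 
|     return ht_search_bucket(htp, key);  /* NULL if not found */
| }
| \end{verbatim}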
| |
| \QuickQuiz{ |
| The \co{hashtab_lookup()} function in |
| \cref{lst:datastruct:Resizable Hash-Table Access Functions} |
| ignores concurrent resize operations. |
| Doesn't this mean that readers might miss an element that was |
| previously added during a resize operation? |
| }\QuickQuizAnswer{ |
| No. |
| As we will see soon, |
| the \co{hashtab_add()} and \co{hashtab_del()} functions |
| keep the old hash table up-to-date while a resize operation |
| is in progress. |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:access:add] |
| The \co{hashtab_add()} function on \clnrefrange{b}{e} of the listing adds |
| new data elements to the hash table. |
| \Clnref{htbp} picks up the current \co{ht_bucket} structure into which the |
| new element is to be added, and \clnref{i} picks up the index of |
| the pointer pair. |
| \Clnref{add} adds the new element to the current hash bucket. |
| If \clnref{ifnew} determines that this bucket has been distributed |
| to a new version of the hash table, then \clnref{addnew} also adds the |
| new element to the corresponding new bucket. |
| The caller is required to handle concurrency, for example, by invoking |
| \co{hashtab_lock_mod()} before the call to \co{hashtab_add()} and invoking |
| \co{hashtab_unlock_mod()} afterwards. |
| \end{fcvref} |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:access:del] |
| The \co{hashtab_del()} function on |
| \clnrefrange{b}{e} of the listing removes |
| an existing element from the hash table. |
| \Clnref{i} picks up the index of the pointer pair |
| and \clnref{del} removes the specified element from the current table. |
| If \clnref{ifnew} determines that this bucket has been distributed |
| to a new version of the hash table, then \clnref{delnew} also removes |
| the specified element from the corresponding new bucket. |
| As with \co{hashtab_add()}, the caller is responsible for concurrency |
| control and this concurrency control suffices for synchronizing with |
| a concurrent resize operation. |
| \end{fcvref} |
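| 
| Hedged paraphrases of both functions follow, assuming
| Linux-kernel-style RCU list primitives and an assumed \co{->htb_head}
| bucket-list field:
| \begin{verbatim}
| /* Hedged paraphrases of hashtab_add() and hashtab_del(): the
|  * caller already holds the lock(s) acquired by
|  * hashtab_lock_mod(), which recorded the affected bucket(s)
|  * and pointer-set indexes in the ht_lock_state structure. */
| static void add_sketch(struct ht_elem *htep,
|                        struct ht_lock_state *lsp)
| {
|     list_add_rcu(&htep->hte_next[lsp->hls_idx[0]],
|                  &lsp->hbp[0]->htb_head);
|     if (lsp->hbp[1])              /* bucket already distributed? */
|         list_add_rcu(&htep->hte_next[lsp->hls_idx[1]],
|                      &lsp->hbp[1]->htb_head);
| }
| 
| static void del_sketch(struct ht_elem *htep,
|                        struct ht_lock_state *lsp)
| {
|     list_del_rcu(&htep->hte_next[lsp->hls_idx[0]]);
|     if (lsp->hbp[1])              /* also remove from new bucket */
|         list_del_rcu(&htep->hte_next[lsp->hls_idx[1]]);
| }
| \end{verbatim}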
| |
| \QuickQuiz{ |
| The \co{hashtab_add()} and \co{hashtab_del()} functions in |
| \cref{lst:datastruct:Resizable Hash-Table Access Functions} |
| can update two hash buckets while a resize operation is progressing. |
| This might cause poor performance if resize operations are frequent.
| Isn't it possible to reduce the cost of updates in such cases? |
| }\QuickQuizAnswer{ |
| Yes, at least assuming that a slight increase in the cost of |
| \co{hashtab_lookup()} is acceptable. |
| One approach is shown in |
| \cref{lst:datastruct:Resizable Hash-Table Access Functions (Fewer Updates),% |
| lst:datastruct:Resizable Hash-Table Update-Side Locking Function (Fewer Updates)} |
| (\path{hash_resize_s.c}). |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize_s@access.fcv} |
| \caption{Resizable Hash-Table Access Functions (Fewer Updates)} |
| \label{lst:datastruct:Resizable Hash-Table Access Functions (Fewer Updates)} |
| \end{listing} |
| |
| \begin{listing} |
| \input{CodeSamples/datastruct/hash/hash_resize_s@lock_mod.fcv} |
| \caption{Resizable Hash-Table Update-Side Locking Function (Fewer Updates)} |
| \label{lst:datastruct:Resizable Hash-Table Update-Side Locking Function (Fewer Updates)} |
| \end{listing} |
| |
| This version of \co{hashtab_add()} adds an element to
| the old bucket if that bucket has not yet been distributed, or to
| the new bucket if it has been, and \co{hashtab_del()} removes the
| specified element from all buckets into which it has been inserted.
| The \co{hashtab_lookup()} function searches the new bucket
| if the search of the old bucket fails, which has the disadvantage
| of adding overhead to the lookup fastpath.
| The alternative \co{hashtab_lock_mod()} returns the locking
| state of the new bucket in \co{->hbp[0]} and \co{->hls_idx[0]}
| if a resize operation is in progress, instead of the perhaps
| more natural choice of \co{->hbp[1]} and \co{->hls_idx[1]}.
| However, this less-natural choice has the advantage of simplifying |
| \co{hashtab_add()}. |
| |
| Further analysis of the code is an exercise for the reader. |
| }\QuickQuizEnd |
| |
| \begin{listing*} |
| \input{CodeSamples/datastruct/hash/hash_resize@resize.fcv} |
| \caption{Resizable Hash-Table Resizing} |
| \label{lst:datastruct:Resizable Hash-Table Resizing} |
| \end{listing*} |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:resize] |
| The actual resizing itself is carried out by \co{hashtab_resize()}, shown in |
| \cref{lst:datastruct:Resizable Hash-Table Resizing} on |
| \cpageref{lst:datastruct:Resizable Hash-Table Resizing}. |
| \Clnref{trylock} conditionally acquires the top-level \co{->ht_lock}, and if |
| this acquisition fails, \clnref{ret_busy} returns \co{-EBUSY} to indicate that |
| a resize is already in progress. |
| Otherwise, \clnref{get_curtbl} picks up a reference to the current hash table, |
| and \clnrefrange{alloc:b}{alloc:e} allocate a new hash table of the desired size. |
| If a new set of hash/key functions has been specified, these functions
| are used for the new table; otherwise, those of the old table are
| preserved.
| If \clnref{chk_nomem} detects memory-allocation failure, |
| \clnref{rel_nomem} releases \co{->ht_lock} |
| and \clnref{ret_nomem} returns a failure indication. |
| |
| \Clnref{get_curidx} picks up the current table's index and |
| \clnref{put_curidx} stores its inverse to |
| the new hash table, thus ensuring that the two hash tables avoid overwriting |
| each other's linked lists. |
| \Clnref{set_newtbl} then starts the bucket-distribution process by |
| installing a reference to the new table into the \co{->ht_new} field of |
| the old table. |
| \Clnref{sync_rcu} ensures that all readers who are not aware of the |
| new table complete before the resize operation continues. |
| |
| Each pass through the loop spanning \clnrefrange{loop:b}{loop:e} distributes the contents |
| of one of the old hash table's buckets into the new hash table. |
| \Clnref{get_oldcur} picks up a reference to the old table's current bucket |
| and \clnref{acq_oldcur} acquires that bucket's spinlock. |
| \end{fcvref} |
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:datastruct:hash_resize:resize] |
| In the \co{hashtab_resize()} function in |
| \cref{lst:datastruct:Resizable Hash-Table Resizing}, |
| what guarantees that the update to \co{->ht_new} on \clnref{set_newtbl} |
| will be seen as happening before the update to \co{->ht_resize_cur} |
| on \clnref{update_resize} from the perspective of |
| \co{hashtab_add()} and \co{hashtab_del()}? |
| In other words, what prevents \co{hashtab_add()} |
| and \co{hashtab_del()} from dereferencing |
| a \co{NULL} pointer loaded from \co{->ht_new}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:datastruct:hash_resize:resize] |
| The \co{synchronize_rcu()} on \clnref{sync_rcu} of |
| \cref{lst:datastruct:Resizable Hash-Table Resizing} |
| ensures that all pre-existing RCU readers have completed between |
| the time that we install the new hash-table reference on |
| \clnref{set_newtbl} and the time that we update \co{->ht_resize_cur} on |
| \clnref{update_resize}. |
| This means that any reader that sees a non-negative value |
| of \co{->ht_resize_cur} cannot have started before the |
| assignment to \co{->ht_new}, and thus must be able to see |
| the reference to the new hash table. |
| |
| And this is why the update-side \co{hashtab_add()} and |
| \co{hashtab_del()} functions must be enclosed |
| in RCU read-side critical sections, courtesy of |
| \co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()} in |
| \cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}. |
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| \begin{fcvref}[ln:datastruct:hash_resize:resize] |
| Each pass through the loop spanning |
| \clnrefrange{loop_list:b}{loop_list:e} adds one data element |
| from the current old-table bucket to the corresponding new-table bucket, |
| holding the new-table bucket's lock during the add operation. |
| \Clnref{update_resize} updates |
| \co{->ht_resize_cur} to indicate that this bucket has been distributed. |
| Finally, \clnref{rel_oldcur} releases the old-table bucket lock. |
| |
| Execution reaches \clnref{rcu_assign} once all old-table buckets have been distributed |
| across the new table. |
| \Clnref{rcu_assign} installs the newly created table as the current one, and |
| \clnref{sync_rcu_2} waits for all old readers (who might still be referencing |
| the old table) to complete. |
| Then \clnref{rel_master} releases the resize-serialization lock, |
| \clnref{free} frees |
| the old hash table, and finally \clnref{ret_success} returns success. |
| \end{fcvref} |
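| 
| Putting these pieces together yields the following hedged skeleton,
| in which \co{ht_alloc()} and \co{distribute_bucket()} are hypothetical
| helpers and error handling is abbreviated:
| \begin{verbatim}
| /* Hedged skeleton of hashtab_resize(), with kernel-style
|  * error returns. */
| static int resize_sketch(struct hashtab *htp_master,
|                          unsigned long nbuckets)
| {
|     struct ht *htp, *htp_new;
|     unsigned long i;
| 
|     if (!spin_trylock(&htp_master->ht_lock))
|         return -EBUSY;              /* resize in progress */
|     htp = htp_master->ht_cur;
|     htp_new = ht_alloc(nbuckets);
|     if (!htp_new) {
|         spin_unlock(&htp_master->ht_lock);
|         return -ENOMEM;
|     }
|     htp_new->ht_idx = !htp->ht_idx; /* other pointer set */
|     rcu_assign_pointer(htp->ht_new, htp_new);
|     synchronize_rcu();              /* ->ht_new now visible */
| 
|     /* Each pass locks old bucket i, moves its elements, and
|      * updates ->ht_resize_cur. */
|     for (i = 0; i < htp->ht_nbuckets; i++)
|         distribute_bucket(htp, htp_new, i);
| 
|     rcu_assign_pointer(htp_master->ht_cur, htp_new);
|     synchronize_rcu();              /* wait for old readers */
|     spin_unlock(&htp_master->ht_lock);
|     free(htp);
|     return 0;
| }
| \end{verbatim}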
| |
| \QuickQuiz{ |
| \begin{fcvref}[ln:datastruct:hash_resize:resize] |
| Why is there a \co{WRITE_ONCE()} on \clnref{update_resize} |
| in \cref{lst:datastruct:Resizable Hash-Table Resizing}? |
| \end{fcvref} |
| }\QuickQuizAnswer{ |
| \begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod] |
| Together with the \co{READ_ONCE()} |
| on \clnref{l:ifresized} in \co{hashtab_lock_mod()} |
| of \cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}, |
| it tells the compiler that the non-initialization accesses
| to \co{->ht_resize_cur} must be preserved because reads
| from \co{->ht_resize_cur} really can race with writes,
| just not in a way that changes the outcome of the ``if'' conditions.
| \end{fcvref} |
| }\QuickQuizEnd |
| |
| \subsection{Resizable Hash Table Discussion} |
| \label{sec:datastruct:Resizable Hash Table Discussion} |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.7in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.resize.2020.09.05a/perftestresize}} |
| \caption{Overhead of Resizing Hash Tables Between 262,144 and 524,288 Buckets vs.\@ Total Number of Elements} |
| \label{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements} |
| \end{figure} |
| |
| \Cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements} |
| compares resizing hash tables to their fixed-size counterparts
| for both 262,144 and 2,097,152 elements in the hash table.
| The figure shows three traces for each element count, one |
| for a fixed-size 262,144-bucket hash table, another for a |
| fixed-size 524,288-bucket hash table, and a third for a resizable |
| hash table that shifts back and forth between 262,144 and 524,288 |
| buckets, with a one-millisecond pause between each resize operation. |
| |
| The uppermost three traces are for the 262,144-element hash table.\footnote{ |
| You see only two traces? |
| The dashed one is composed of two traces that differ |
| only slightly, hence the irregular-looking dash pattern.} |
| The dashed trace corresponds to the two fixed-size hash tables, |
| and the solid trace to the resizable hash table. |
| In this case, the short hash chains cause normal lookup overhead |
| to be so low that the overhead of resizing dominates over most |
| of the range. |
| In particular, the entire hash table fits into the L3 cache.
| % Nevertheless, the larger fixed-size hash table has a significant |
| % performance advantage, so that resizing can be quite beneficial, |
| % at least given sufficient time between resizing operations: One |
| % millisecond is clearly too short a time. |
| |
| The lower three traces are for the 2,097,152-element hash table. |
| The upper dashed trace corresponds to the 262,144-bucket fixed-size
| hash table, the solid trace (in the middle for low CPU counts and at
| the bottom for high CPU counts) corresponds to the resizable hash
| table, and the remaining dashed trace corresponds to the
| 524,288-bucket fixed-size hash table.
| The fact that there is now an average of eight elements per bucket
| can only be expected to produce a sharp decrease in performance,
| as the graph indeed shows.
| But worse yet, the hash-table elements occupy 128\,MB, which overflows |
| each socket's 39\,MB L3 cache, with performance consequences analogous |
| to those described in \cref{sec:cpu:Costs of Operations}. |
| The resulting cache overflow means that the memory system is involved |
| even for a read-only benchmark, and as you can see from the sublinear |
| portions of the lower three traces, the memory system can be a serious |
| bottleneck. |
| |
| \QuickQuiz{ |
| How much of the difference in performance between the large and |
| small hash tables shown in |
| \cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements} |
| was due to long hash chains and how much was due to |
| memory-system bottlenecks? |
| }\QuickQuizAnswer{ |
| The easy way to answer this question is to do another run with |
| 2,097,152 elements, but this time also with 2,097,152 buckets, |
| thus bringing the average number of elements per bucket back down |
| to unity. |
| |
| \begin{figure} |
| \centering |
| \resizebox{2.7in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.resize.2020.09.27a/perftestresizebig}} |
| \caption{Effect of Memory-System Bottlenecks on Hash Tables} |
| \label{fig:datastruct:Effect of Memory-System Bottlenecks on Hash Tables} |
| \end{figure} |
| |
| The results are shown by the triple-dashed new trace in |
| the middle of |
| \cref{fig:datastruct:Effect of Memory-System Bottlenecks on Hash Tables}. |
| The other six traces are identical to their counterparts in |
| \cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements} |
| on \cpageref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}. |
| The gap between this new trace and the lower set of three |
| traces is a rough measure of how much of the difference in |
| performance was due to hash-chain length, and the gap between |
| the new trace and the upper set of three traces is a rough measure |
| of how much of that difference was due to memory-system bottlenecks. |
| The new trace starts out slightly below its 262,144-element |
| counterpart at a single CPU, showing that cache capacity is |
| degrading performance slightly even on that single CPU\@.\footnote{ |
| Yes, as far as hardware architects are concerned, |
| caches are part of the memory system.} |
| This is to be expected, given that unlike its smaller counterpart, |
| the 2,097,152-bucket hash table does not fit into the L3 cache. |
| This new trace rises just past 28~CPUs, which is also to be |
| expected. |
| This rise is due to the fact that the 29\textsuperscript{th} |
| CPU is on another socket, which brings with it an additional |
| 39\,MB of cache as well as additional memory bandwidth. |
| |
| But the large hash table's advantage over the hash table
| with 524,288~buckets (but still 2,097,152 elements) decreases
| with additional CPUs, which is consistent with the bottleneck |
| residing in the memory system. |
| Above about 400~CPUs, the 2,097,152-bucket hash table is |
| actually outperformed slightly by the 524,288-bucket hash |
| table. |
| This should not be a surprise because the memory system is |
| the bottleneck and the larger number of buckets increases this |
| workload's memory footprint. |
| |
| The alert reader will have noted the word ``rough'' above |
| and might be interested in a more detailed analysis. |
| Such readers are invited to run similar benchmarks, using |
| whatever performance counters or hardware-analysis tools |
| they might have available. |
| This can be a long and complex journey, but those brave enough |
| to embark on it will be rewarded with detailed knowledge of |
| hardware performance and its effect on software. |
| }\QuickQuizEnd |
| |
| Referring to the last column of |
| \cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz}, |
| we recall that the first 28~CPUs are in the first socket, on a |
| one-CPU-per-core basis, which explains the sharp decrease in performance |
| of the resizable hash table beyond 28~CPUs. |
| Sharp though this decrease is, please recall that it is due to constant |
| resizing back and forth. |
| It would clearly be better to resize once to 524,288 buckets, |
| or, better yet, to do a single eight-fold resize to 2,097,152 buckets,
| thus dropping the average number of elements per bucket down to the |
| level enjoyed by the runs producing the upper three traces. |
| |
| The key point from this data is that the RCU-protected resizable hash |
| table performs and scales almost as well as does its fixed-size counterpart. |
| The performance during an actual resize operation of course suffers |
| somewhat due to the cache misses caused by the updates to each element's
| pointers, and this effect is most pronounced when the memory system |
| becomes a bottleneck. |
| This indicates that hash tables should be resized by substantial amounts, |
| and that hysteresis should be applied to prevent performance degradation |
| due to too-frequent resize operations. |
| In memory-rich environments, hash-table sizes should furthermore |
| be increased much more aggressively than they are decreased. |
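| 
| For example, a resize policy along the following lines (a sketch, with
| thresholds chosen purely for illustration, and with \co{resize()}
| standing in for \co{hashtab_resize()} and its remaining arguments)
| would satisfy all three of these recommendations:
| \begin{verbatim}
| /* Illustrative policy sketch: grow eagerly by a large
|  * factor, shrink reluctantly, and leave a wide dead band
|  * between the two thresholds to provide hysteresis. */
| static void maybe_resize_sketch(unsigned long nelems,
|                                 unsigned long nbuckets)
| {
|     if (nelems > 2 * nbuckets)
|         resize(8 * nbuckets);       /* aggressive growth */
|     else if (nelems < nbuckets / 8)
|         resize(nbuckets / 2);       /* cautious shrinkage */
| }
| \end{verbatim}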
| |
| Another key point is that although the \co{hashtab} structure is |
| non-partitionable, it is also read-mostly, which suggests the use |
| of RCU\@. |
| Given that the performance and scalability of this resizable hash table
| are very nearly those of RCU-protected fixed-size hash tables, we must
| conclude that this approach was quite successful. |
| |
| Finally, it is important to note that insertions, deletions, and |
| lookups can proceed concurrently with a resize operation. |
| This concurrency is |
| critically important when resizing large hash tables, especially |
| for applications that must meet severe response-time constraints. |
| |
| Of course, the \co{ht_elem} structure's |
| pair of pointer sets does impose some memory overhead, |
| which is taken up in the next section. |
| |
| \subsection{Other Resizable Hash Tables} |
| \label{sec:datastruct:Other Resizable Hash Tables} |
| |
| One shortcoming of the resizable hash table described earlier in this |
| section is memory consumption. |
| Each data element has two pairs of linked-list pointers rather than just |
| one. |
| Is it possible to create an RCU-protected resizable hash table that |
| makes do with just one pair? |
| |
| It turns out that the answer is ``yes''. |
| \ppl{Josh}{Triplett} et al.~\cite{Triplett:2011:RPHash}
| produced a \emph{relativistic hash table} that incrementally |
| splits and combines corresponding hash chains so that readers always |
| see valid hash chains at all points during the resizing operation. |
| This incremental splitting and combining relies on the fact that it |
| is harmless for a reader to see a data element that should be in some |
| other hash chain: |
| When this happens, the reader will simply ignore the extraneous data |
| element due to key mismatches. |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/zipperhashshrink}} |
| \caption{Shrinking a Relativistic Hash Table} |
| \label{fig:datastruct:Shrinking a Relativistic Hash Table} |
| \end{figure} |
| |
| The process of shrinking a relativistic hash table by a factor of two |
| is shown in |
| \cref{fig:datastruct:Shrinking a Relativistic Hash Table}, |
| in this case shrinking a two-bucket hash table into a one-bucket |
| hash table, otherwise known as a linear list. |
| This process works by coalescing pairs of buckets in the old larger hash |
| table into single buckets in the new smaller hash table. |
| For this process to work correctly, we clearly need to constrain the hash |
| functions for the two tables. |
| One such constraint is to use the same underlying hash function for
| both tables, but to throw out the low-order bit of the bucket index
| when shrinking from large to small.
| For example, the old two-bucket hash table would use the low-order
| bit of the hash value as its bucket index, while the new one-bucket
| hash table would simply ignore that bit.
| In this way, a given pair of adjacent even and odd buckets in the old |
| large hash table can be coalesced into a single bucket in the new small |
| hash table, while still having a single hash value cover all of the |
| elements in that single bucket. |
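| 
| A hedged example of this index computation, assuming power-of-two
| bucket counts and a hypothetical \co{hash()} function:
| \begin{verbatim}
| /* Sketch: both tables share hash(); the small table's bucket
|  * index is the large table's index with its low-order bit
|  * discarded, so old buckets 2k and 2k+1 coalesce into new
|  * bucket k. */
| static unsigned long small_idx_sketch(unsigned long key,
|                                       unsigned long nbkt_large)
| {
|     unsigned long idx_large = hash(key) & (nbkt_large - 1);
| 
|     return idx_large >> 1;
| }
| \end{verbatim}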
| |
| The initial state is shown at the top of the figure, with time advancing |
| from top to bottom, starting with initial state~(a). |
| The shrinking process begins by allocating the new smaller array of |
| buckets, and having each bucket of this new smaller array reference |
| the first element of one of the buckets of the corresponding pair in |
| the old large hash table, resulting in state~(b). |
| |
| Then the two hash chains are linked together, resulting in state~(c). |
| In this state, readers looking up an even-numbered element see no change, |
| and readers looking up elements~1 and~3 likewise see no change. |
| However, readers looking up some other odd number will also traverse |
| elements~0 and~2. |
| This is harmless because any odd number will compare not-equal to these |
| two elements. |
| There is some performance loss, but on the other hand, this is exactly |
| the same performance loss that will be experienced once the new small |
| hash table is fully in place. |
| |
| Next, the new small hash table is made accessible to readers, resulting |
| in state~(d). |
| Note that older readers might still be traversing the old large hash |
| table, so in this state both hash tables are in use. |
| |
| The next step is to wait for all pre-existing readers to complete, |
| resulting in state~(e). |
| In this state, all readers are using the new small hash table, so that |
| the old large hash table's buckets may be freed, resulting in the final |
| state~(f). |
| |
| \begin{figure} |
| \centering |
| \resizebox{3in}{!}{\includegraphics{datastruct/zipperhashgrow}} |
| \caption{Growing a Relativistic Hash Table} |
| \label{fig:datastruct:Growing a Relativistic Hash Table} |
| \end{figure} |
| |
| Growing a relativistic hash table reverses the shrinking process, |
| but requires more grace-period steps, as shown in |
| \cref{fig:datastruct:Growing a Relativistic Hash Table}. |
| The initial state~(a) is at the top of this figure, with time advancing |
| from top to bottom. |
| |
| We start by allocating the new large two-bucket hash table, resulting |
| in state~(b). |
| Note that each of these new buckets references the first element destined |
| for that bucket. |
| These new buckets are published to readers, resulting in state~(c). |
| After a grace-period operation, all readers are using the new large |
| hash table, resulting in state~(d). |
| In this state, only those readers traversing the even-values hash bucket |
| traverse element~0, which is therefore now colored white. |
| |
| At this point, the old small hash buckets may be freed, although many
| implementations instead use these old buckets to track the progress of
| ``unzipping'' the list of items into their respective new buckets.
| The last even-numbered element in the first consecutive run of such |
| elements now has its pointer-to-next updated to reference the following |
| even-numbered element. |
| After a subsequent grace-period operation, the result is state~(e). |
| The vertical arrow indicates the next element to be unzipped, and |
| element~1 is now colored black to indicate that only those readers traversing |
| the odd-values hash bucket may reach it. |
| |
| Next, the last odd-numbered element in the first consecutive run of such |
| elements now has its pointer-to-next updated to reference the following |
| odd-numbered element. |
| After a subsequent grace-period operation, the result is state~(f). |
| A final unzipping operation (including a grace-period operation) |
| results in the final state~(g). |
| |
| In short, the relativistic hash table reduces the number of per-element |
| list pointers at the expense of additional grace periods incurred during |
| resizing. |
| These additional grace periods are usually not a problem because |
| insertions, deletions, and lookups may proceed concurrently with |
| a resize operation. |
| |
| However, when a hash table is growing quickly, the large numbers of |
| grace periods might delay the beginning of the next resize operation. |
| One way to reduce the number of grace periods is to unzip all the |
| buckets concurrently, so that the number of grace periods will be |
| limited by the largest number of elements in a single bucket instead |
| of by the total number of elements in the hash table. |
| Another approach, also by Josh Triplett, is to maintain a separate |
| pointer to the element currently being moved from the old bucket to |
| the new bucket. |
| Readers then check the old bucket, the separate pointer, and the new |
| bucket in order. |
| This requires read-side memory ordering, which slows readers, but |
| it greatly reduces the required number of resize-time grace periods. |
| This approach is used within the Linux kernel. |
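| 
| A hedged reader-side sketch of this approach follows; all names are
| invented, \co{chain_search()} is a hypothetical helper, and the
| required read-side memory ordering is omitted:
| \begin{verbatim}
| /* Reader-side sketch for the moving-pointer approach: check
|  * the old bucket, then the element currently in transit,
|  * then the new bucket. */
| struct rt_elem {
|     struct rt_elem *next;
|     unsigned long key;
| };
| 
| static struct rt_elem *moving_lookup_sketch(
|         struct rt_elem *oldb, struct rt_elem **moving,
|         struct rt_elem *newb, unsigned long key)
| {
|     struct rt_elem *p;
| 
|     p = chain_search(oldb, key);    /* 1: old bucket */
|     if (p)
|         return p;
|     p = rcu_dereference(*moving);   /* 2: element being moved */
|     if (p && p->key == key)
|         return p;
|     return chain_search(newb, key); /* 3: new bucket */
| }
| \end{verbatim}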
| |
| It turns out that it is possible to reduce the per-element memory overhead |
| from a pair of pointers to a single pointer, while still retaining |
| $\O{1}$ deletions. |
| This is accomplished by augmenting the split-order
| list~\cite{OriShalev2006SplitOrderListHash}
| with RCU
| protection~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUhash}. |
| The data elements in the hash table are arranged into a single |
| sorted linked list, with each hash bucket referencing the first element |
| in that bucket. |
| Elements are deleted by setting low-order bits in their pointer-to-next |
| fields, and these elements are removed from the list by later traversals |
| that encounter them. |
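| 
| This deletion marking might be sketched as follows, with invented
| names:
| \begin{verbatim}
| /* Sketch of low-order-bit deletion marking: a set bit in an
|  * element's ->next pointer marks that element as logically
|  * deleted, and later traversals physically unlink it. */
| #define DEL_MARK 0x1UL
| 
| struct soelem {
|     struct soelem *next;
|     unsigned long key;
| };
| 
| static int is_deleted(struct soelem *p)
| {
|     return ((unsigned long)p->next & DEL_MARK) != 0;
| }
| 
| static struct soelem *next_elem(struct soelem *p)
| {
|     return (struct soelem *)((unsigned long)p->next & ~DEL_MARK);
| }
| \end{verbatim}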
| |
| This RCU-protected split-order list is complex, but offers lock-free |
| progress guarantees for all insertion, deletion, and lookup operations. |
| Such guarantees can be important in real-time applications. |
| However, one downside is that the keys must be bit-reversed during |
| lookup, which slows down readers. |
| But for those for whom this slowdown is not a problem, an |
| implementation is available from recent versions of the userspace RCU |
| library~\cite{MathieuDesnoyers2009URCU}. |
| |
| \section{Other Data Structures} |
| \label{sec:datastruct:Other Data Structures} |
| % |
| \epigraph{All life is an experiment. |
| The more experiments you make the better.} |
| {Ralph Waldo Emerson} |
| |
| The preceding sections have focused on data structures that enhance |
| concurrency due to partitionability |
| (\cref{sec:datastruct:Partitionable Data Structures}), |
| efficient handling of read-mostly access patterns |
| (\cref{sec:datastruct:Read-Mostly Data Structures}), |
| or application of read-mostly techniques to avoid |
| non-partitionability |
| (\cref{sec:datastruct:Non-Partitionable Data Structures}). |
| This section gives a brief review of other data structures. |
| |
| One of the hash table's greatest advantages for parallel use is that it |
| is fully partitionable, at least while not being resized. |
| One way of preserving the partitionability and the size independence is |
| to use a radix tree, which is also called a trie. |
| Tries partition the search key, using each successive key partition |
| to traverse the next level of the trie. |
| As such, a trie can be thought of as a set of nested hash tables, |
| thus providing the required partitionability. |
| One disadvantage of tries is that a sparse key space can result in |
| inefficient use of memory. |
| There are a number of compression techniques that may be used to |
| work around this disadvantage, including hashing the key value to |
| a smaller keyspace before the |
| traversal~\cite{RobertOlsson2007Trash}. |
| Radix trees are heavily used in practice, including in the Linux |
| kernel~\cite{NickPiggin2006radixtree}. |
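| 
| A hedged sketch of such a descent, using four-bit key slices and
| invented names:
| \begin{verbatim}
| /* Sketch of a trie descent: each four-bit slice of the key
|  * indexes one level, so each level behaves like a small,
|  * fully partitioned hash table. */
| #define RADIX_BITS   4
| #define RADIX_FANOUT (1 << RADIX_BITS)
| 
| struct radix_node {
|     void *slot[RADIX_FANOUT];  /* child node or leaf value */
| };
| 
| static void *radix_lookup_sketch(struct radix_node *root,
|                                  unsigned long key, int levels)
| {
|     void *node = root;
|     struct radix_node *np;
|     int shift = (levels - 1) * RADIX_BITS;
| 
|     while (node && shift >= 0) {
|         np = node;
|         node = np->slot[(key >> shift) & (RADIX_FANOUT - 1)];
|         shift -= RADIX_BITS;
|     }
|     return node;               /* leaf value, or NULL */
| }
| \end{verbatim}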
| |
| One important special case of both a hash table and a trie is what is |
| perhaps the oldest of data structures, the array and its multi-dimensional |
| counterpart, the matrix. |
| The fully partitionable nature of matrices is exploited heavily in |
| concurrent numerical algorithms. |
| |
| Self-balancing trees are heavily used in sequential code, with |
| AVL trees and red-black trees being perhaps the most well-known |
| examples~\cite{ThomasHCorman2001Algorithms}. |
| Early attempts to parallelize AVL trees were complex and not necessarily |
| all that efficient~\cite{Ellis80}.
| However, more recent work on red-black trees provides better
| performance and scalability by using RCU for readers and hashed arrays |
| of locks\footnote{ |
| In the guise of swissTM~\cite{AleksandarDragovejic2011STMnotToy}, |
| which is a variant of software transactional memory in which |
| the developer flags non-shared accesses.} |
| to protect reads and updates, |
| respectively~\cite{PhilHoward2011RCUTMRBTree,PhilipWHoward2013RCUrbtree}. |
| It turns out that red-black trees rebalance aggressively, which works |
| well for sequential programs, but not necessarily so well for parallel |
| use. |
| Recent work has therefore made use of RCU-protected ``bonsai trees'' |
| that rebalance less aggressively~\cite{AustinClements2012RCULinux:mmapsem}, |
| trading off optimal tree depth to gain more efficient concurrent updates. |
| |
| Concurrent skip lists lend themselves well to RCU readers, and in fact
| represent an early academic use of a technique resembling
| RCU~\cite{Pugh90}.
| % @@@ Reference skiplist chapter once added. |
| |
| Concurrent double-ended queues were discussed in |
| \cref{sec:SMPdesign:Double-Ended Queue}, |
| and concurrent stacks and queues have a long history~\cite{Treiber86}, |
| though they do not normally offer the most impressive performance or
| scalability.
| They are nevertheless a common feature of concurrent |
| libraries~\cite{PaulMcKenney2013LWNURCUqueuestack}. |
| Researchers have recently proposed relaxing the ordering constraints |
| of stacks and queues~\cite{Shavit:2011:DSM:1897852.1897873}, |
| with some work indicating that relaxed-ordered queues actually have |
| better ordering properties than do strict FIFO |
| queues~\cite{AndreasHaas2012FIFOisnt,ChristophMKirsch2012FIFOisntTR,AndreasHaas2013CFRelaxedQueues}. |
| |
| It seems likely that continued work with concurrent data structures will |
| produce novel algorithms with surprising properties. |
| |
| \section{Summary} |
| \label{sec:datastruct:Summary} |
| % |
| \epigraph{There's only one thing more painful than learning from |
| experience, and that is not learning from experience.} |
| {Archibald MacLeish} |
| |
| This chapter has focused primarily on hash tables, including resizable |
| hash tables, which are not fully partitionable. |
| \Cref{sec:datastruct:Other Data Structures} gave a quick |
| overview of a few non-hash-table data structures. |
| Nevertheless, this exposition of hash tables is an excellent introduction |
| to the many issues surrounding high-performance scalable data access, |
| including: |
| |
| \begin{enumerate} |
| \item Fully partitioned data structures work well on small systems, |
| for example, single-socket systems. |
| \item Larger systems require locality of reference as well as |
| full partitioning. |
| \item Read-mostly techniques, such as hazard pointers and RCU, |
| provide good locality of reference for read-mostly workloads, |
| and thus provide excellent performance and scalability even |
| on larger systems. |
| \item Read-mostly techniques also work well on some types of |
| non-partitionable data structures, such as resizable hash tables. |
| \item Large data structures can overflow CPU caches, reducing performance |
| and scalability. |
| \item Additional performance and scalability can be obtained by |
| specializing the data structure to a specific workload, |
| for example, by replacing a general key with a 32-bit integer. |
| \item Although requirements for portability and for extreme performance |
| often conflict, there are some data-structure-layout techniques |
| that can strike a good balance between these two sets of |
| requirements. |
| \end{enumerate} |
| |
| That said, performance and scalability are of little use without reliability, |
| so the next chapter covers validation. |
| |
| \QuickQuizAnswersChp{qqzdatastruct} |