| % datastruct/datastruct.tex |
| |
| \QuickQuizChapter{chp:Data Structures}{Data Structures} |
| |
| Efficient access to data is critically important, so that |
| discussions of algorithms include time complexity of the related |
| data structures~\cite{ThomasHCorman2001Algorithms}. |
| However, for parallel programs, measures of time complexity must also |
| include concurrency effects. |
| These effects can be overwhelmingly large, as shown in |
| Chapter~\ref{chp:Hardware and its Habits}, which means that |
| concurrent data structure designs must focus as much on |
| concurrency as they do on sequential time complexity. |
| |
| Section~\ref{sec:datastruct:Motivating Application} |
| presents a motivating application that will be used to evaluate |
| the data structures presented in this chapter. |
| |
| As discussed in Chapter~\ref{cha:Partitioning and Synchronization Design}, |
| an excellent way to achieve high scalability is partitioning. |
| This points the way to partitionable data structures, a topic taken up by |
| Section~\ref{sec:datastruct:Partitionable Data Structures}. |
| Chapter~\ref{chp:Deferred Processing} described how deferring some |
| actions can greatly improve both performance and scalability. |
| Section~\ref{sec:defer:Read-Copy Update (RCU)} in particular showed |
| how to tap the awesome power of procrastination in pursuit of |
| performance and scalability, a topic taken up by |
| Section~\ref{sec:datastruct:Read-Mostly Data Structures}. |
| |
| Not all data structures are partitionable. |
| Section~\ref{sec:datastruct:Non-Partitionable Data Structures} |
| looks at a mildly non-partitionable example data structure. |
| This section shows |
| how to split it into read-mostly and partitionable portions, |
| enabling a fast and scalable implementation. |
| |
| Because this chapter cannot delve into the details of every concurrent |
| data structure that has ever been used |
| Section~\ref{sec:datastruct:Other Data Structures} |
| provides a brief survey of the most common and important ones. |
| Although the best performance and scalability results design rather |
| than after-the-fact micro-optimization, it is nevertheless the case |
| that micro-optimization has an important place in achieving the |
| absolute best possible performance and scalability. |
| This topic is therefore taken up in |
| Section~\ref{sec:datastruct:Micro-Optimization}. |
| |
| Finally, Section~\ref{sec:datastruct:Summary} |
| presents a summary of this chapter. |
| |
| \section{Motivating Application} |
| \label{sec:datastruct:Motivating Application} |
| |
| We will use the Schr\"odinger's Zoo application to evaluate |
| performance~\cite{McKenney:2013:SDS:2483852.2483867}. |
| Schr\"odinger has a zoo containing a large number of animals, and |
| he would like to track them using an in-memory database with |
| each animal in the zoo represented by a data item in this database. |
| Each animal has a unique name that is used as a key, with a variety |
| of data tracked for each animal. |
| |
| Births, captures, and purchases result in insertions, while deaths, |
| releases, and sales result in deletions. |
| Because Schr\"odinger's zoo contains a large quantity of short-lived |
| animals, including mice and insects, the database must be able to |
| support a high update rate. |
| |
| Those interested in Schr\"odinger's animals can query them, however, |
| Schr\"odinger has noted extremely high rates of queries for his cat, |
| so much so that he suspects that his mice might be using the database |
| to check up on their nemesis. |
| This means that Sch\"odinger's application must be able to support a |
| high rate of queries to a single data element. |
| |
| Please keep this application in mind as various data structures are presented. |
| |
| \section{Partitionable Data Structures} |
| \label{sec:datastruct:Partitionable Data Structures} |
| |
| There are a huge number of data structures in use today, so much so |
| that there are multiple textbooks covering them. |
| This small section focuses |
| on a single data structure, namely the hash table. |
| This focused approach allows a much deeper investigation of how concurrency |
| interacts with data structures, and also focuses on a data structure |
| that is heavily used in practice. |
| Section~\ref{sec:datastruct:Hash-Table Design} |
| overviews of the design, and |
| Section~\ref{sec:datastruct:Hash-Table Implementation} |
| presents the implementation. |
| Finally, |
| Section~\ref{sec:datastruct:Hash-Table Performance} |
| discusses the resulting performance and scalability. |
| |
| \subsection{Hash-Table Design} |
| \label{sec:datastruct:Hash-Table Design} |
| |
| Chapter~\ref{cha:Partitioning and Synchronization Design} |
| emphasized the need to apply partitioning in order to attain |
| respectable performance and scalability, so partitionability |
| must be a first-class criterion when selecting data structures. |
| This criterion is well satisfied by that workhorse of parallelism, |
| the hash table. |
| Hash tables are conceptually simple, consisting of an array of |
| \emph{hash buckets}. |
| A \emph{hash function} maps from a given element's \emph{key} |
| to the hash bucket that this element will be stored in. |
| Each hash bucket therefore heads up a linked list of elements, |
| called a \emph{hash chain}. |
| When properly configured, these hash chains will be quite short, |
| permitting a hash table to access the element with a given key |
| extremely efficiently. |
| |
| \QuickQuiz{} |
| But there are many types of hash tables, of which the chained |
| hash tables described here are but one type. |
| Why the focus on chained hash tables? |
| \QuickQuizAnswer{ |
| Chained hash tables are completely partitionable, and thus |
| well-suited to concurrent use. |
| There are other completely-partitionable hash tables, for |
| example, split-ordered list~\cite{OriShalev2006SplitOrderListHash}, |
| but they are considerably more complex. |
| We therefore start with chained hash tables. |
| } \QuickQuizEnd |
| |
| In addition, each bucket can be given its own lock, so that |
| elements in different buckets of the hash table may be added, |
| deleted, and looked up completely independently. |
| A large hash table containing a large number of elements therefore |
| offers excellent scalability. |
| |
| \subsection{Hash-Table Implementation} |
| \label{sec:datastruct:Hash-Table Implementation} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct ht_elem { |
| 2 struct cds_list_head hte_next; |
| 3 unsigned long hte_hash; |
| 4 }; |
| 5 |
| 6 struct ht_bucket { |
| 7 struct cds_list_head htb_head; |
| 8 spinlock_t htb_lock; |
| 9 }; |
| 10 |
| 11 struct hashtab { |
| 12 unsigned long ht_nbuckets; |
| 13 struct ht_bucket ht_bkt[0]; |
| 14 }; |
| \end{verbatim} |
| } |
| \caption{Hash-Table Data Structures} |
| \label{fig:datastruct:Hash-Table Data Structures} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashdiagram}} |
| \end{center} |
| \caption{Hash-Table Data-Structure Diagram} |
| \label{fig:datastruct:Hash-Table Data-Structure Diagram} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Hash-Table Data Structures} |
| (\co{hash_bkt.c}) |
| shows a set of data structures used in a simple fixed-sized hash |
| table using chaining and per-hash-bucket locking, and |
| Figure~\ref{fig:datastruct:Hash-Table Data-Structure Diagram} |
| diagrams how they fit together. |
| The \co{hashtab} structure (lines~11-14 in |
| Figure~\ref{fig:datastruct:Hash-Table Data Structures}) |
| contains four \co{ht_bucket} structures (lines~6-9 in |
| Figure~\ref{fig:datastruct:Hash-Table Data Structures}), |
| with the \co{->bt_nbuckets} field controlling the number of buckets. |
| Each such bucket contains a list header \co{->htb_head} and |
| a lock \co{->htb_lock}. |
| The list headers chain \co{ht_elem} structures |
| (lines~1-4 in |
| Figure~\ref{fig:datastruct:Hash-Table Data Structures}) |
| through their |
| \co{->hte_next} fields, and each \co{ht_elem} structure also caches |
| the corresponding element's hash value in the \co{->hte_hash} field. |
| The \co{ht_elem} structure would be included in the larger structure |
| being placed in the hash table, and this larger structure might contain |
| a complex key. |
| |
| The diagram shown in |
| Figure~\ref{fig:datastruct:Hash-Table Data-Structure Diagram} |
| has bucket~0 with two elements and bucket~2 with one. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 #define HASH2BKT(htp, h) \ |
| 2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets]) |
| 3 |
| 4 static void hashtab_lock(struct hashtab *htp, |
| 5 unsigned long hash) |
| 6 { |
| 7 spin_lock(&HASH2BKT(htp, hash)->htb_lock); |
| 8 } |
| 9 |
| 10 static void hashtab_unlock(struct hashtab *htp, |
| 11 unsigned long hash) |
| 12 { |
| 13 spin_unlock(&HASH2BKT(htp, hash)->htb_lock); |
| 14 } |
| \end{verbatim} |
| } |
| \caption{Hash-Table Mapping and Locking} |
| \label{fig:datastruct:Hash-Table Mapping and Locking} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Hash-Table Mapping and Locking} |
| shows mapping and locking functions. |
| Lines~1 and~2 show the macro \co{HASH2BKT()}, which maps from a hash value |
| to the corresponding \co{ht_bucket} structure. |
| This macro uses a simple modulus: if more aggressive hashing is required, |
| the caller needs to implement it when mapping from key to hash value. |
| The remaining two functions acquire and release the \co{->htb_lock} |
| corresponding to the specified hash value. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct ht_elem * |
| 2 hashtab_lookup(struct hashtab *htp, |
| 3 unsigned long hash, |
| 4 void *key, |
| 5 int (*cmp)(struct ht_elem *htep, |
| 6 void *key)) |
| 7 { |
| 8 struct ht_bucket *htb; |
| 9 struct ht_elem *htep; |
| 10 |
| 11 htb = HASH2BKT(htp, hash); |
| 12 cds_list_for_each_entry(htep, |
| 13 &htb->htb_head, |
| 14 hte_next) { |
| 15 if (htep->hte_hash != hash) |
| 16 continue; |
| 17 if (cmp(htep, key)) |
| 18 return htep; |
| 19 } |
| 20 return NULL; |
| 21 } |
| \end{verbatim} |
| } |
| \caption{Hash-Table Lookup} |
| \label{fig:datastruct:Hash-Table Lookup} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Hash-Table Lookup} |
| shows \co{hashtab_lookup()}, |
| which returns a pointer to the element with the specified hash and key if it |
| exists, or \co{NULL} otherwise. |
| This function takes both a hash value and a pointer to the key because |
| this allows users of this function to use arbitrary keys and |
| arbitrary hash functions, with the key-comparison function passed in via |
| \co{cmp()}, in a manner similar to \co{qsort()}. |
| Line~11 maps from the hash value to a pointer to the corresponding |
| hash bucket. |
| Each pass through the loop spanning line~12-19 examines one element |
| of the bucket's hash chain. |
| Line~15 checks to see if the hash values match, and if not, line~16 |
| proceeds to the next element. |
| Line~17 checks to see if the actual key matches, and if so, |
| line~18 returns a pointer to the matching element. |
| If no element matches, line~20 returns \co{NULL}. |
| |
| \QuickQuiz{} |
| But isn't the double comparison on lines~15-18 in |
| Figure~\ref{fig:datastruct:Hash-Table Lookup} inefficient |
| in the case where the key fits into an unsigned long? |
| \QuickQuizAnswer{ |
| Indeed it is! |
| However, hash tables quite frequently store information with |
| keys such as character strings that do not necessarily fit |
| into an unsigned long. |
| Simplifying the hash-table implementation for the case where |
| keys always fit into unsigned longs is left as an exercise |
| for the reader. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 void |
| 2 hashtab_add(struct hashtab *htp, |
| 3 unsigned long hash, |
| 4 struct ht_elem *htep) |
| 5 { |
| 6 htep->hte_hash = hash; |
| 7 cds_list_add(&htep->hte_next, |
| 8 &HASH2BKT(htp, hash)->htb_head); |
| 9 } |
| 10 |
| 11 void hashtab_del(struct ht_elem *htep) |
| 12 { |
| 13 cds_list_del_init(&htep->hte_next); |
| 14 } |
| \end{verbatim} |
| } |
| \caption{Hash-Table Modification} |
| \label{fig:datastruct:Hash-Table Modification} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Hash-Table Modification} |
| shows the \co{hashtab_add()} and \co{hashtab_del()} functions |
| that add and delete elements from the hash table, respectively. |
| |
| The \co{hashtab_add()} function simply sets the element's hash |
| value on line~6, then adds it to the corresponding bucket on |
| lines~7 and~8. |
| The \co{hashtab_del()} function simply removes the specified element |
| from whatever hash chain it is on, courtesy of the doubly linked |
| nature of the hash-chain lists. |
| Before calling either of these two functions, the caller is required to |
| ensure that no other thread is accessing |
| or modifying this same bucket, for example, by invoking |
| \co{hashtab_lock()} beforehand. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct hashtab * |
| 2 hashtab_alloc(unsigned long nbuckets) |
| 3 { |
| 4 struct hashtab *htp; |
| 5 int i; |
| 6 |
| 7 htp = malloc(sizeof(*htp) + |
| 8 nbuckets * |
| 9 sizeof(struct ht_bucket)); |
| 10 if (htp == NULL) |
| 11 return NULL; |
| 12 htp->ht_nbuckets = nbuckets; |
| 13 for (i = 0; i < nbuckets; i++) { |
| 14 CDS_INIT_LIST_HEAD(&htp->ht_bkt[i].htb_head); |
| 15 spin_lock_init(&htp->ht_bkt[i].htb_lock); |
| 16 } |
| 17 return htp; |
| 18 } |
| 19 |
| 20 void hashtab_free(struct hashtab *htp) |
| 21 { |
| 22 free(htp); |
| 23 } |
| \end{verbatim} |
| } |
| \caption{Hash-Table Allocation and Free} |
| \label{fig:datastruct:Hash-Table Allocation and Free} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Hash-Table Allocation and Free} |
| shows \co{hashtab_alloc()} and \co{hashtab_free()}, |
| which do hash-table allocation and freeing, respectively. |
| Allocation begins on lines~7-9 with allocation of the underlying memory. |
| If line~10 detects that memory has been exhausted, line~11 returns |
| \co{NULL} to the caller. |
| Otherwise, line~12 initializes the number of buckets, and the loop |
| spanning lines~13-16 initializes the buckets themselves, |
| including the chain list header on line~14 and the lock on line~15. |
| Finally, line~17 returns a pointer to the newly allocated hash table. |
| The \co{hashtab_free()} function on lines~20-23 is straightforward. |
| |
| \subsection{Hash-Table Performance} |
| \label{sec:datastruct:Hash-Table Performance} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktlin8}} |
| \end{center} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo} |
| \end{figure} |
| |
| The performance results for an eight-CPU 2GHz |
| Intel\textsuperscript\textregistered |
| Xeon\textsuperscript\textregistered |
| system using a bucket-locked hash table with 1024 buckets are shown in |
| Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}. |
| The performance does scale nearly linearly, but is not much more than half |
| of the ideal performance level, even at only eight CPUs. |
| Part of this shortfall is due to the fact that the lock acquisitions and |
| releases incur no cache misses on a single CPU, but do incur misses |
| on two or more CPUs. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktlin}} |
| \end{center} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, 60 CPUs} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, 60 CPUs} |
| \end{figure} |
| |
| And things only get worse with larger number of CPUs, as can be seen in |
| Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, 60 CPUs}. |
| We do not need an additional line to show ideal performance: The performance |
| for nine CPUs and beyond is worse than abysmal. |
| This clearly underscores the dangers of extrapolating performance from a |
| modest number of CPUs. |
| |
| Of course, one possible reason for the collapse in performance might be |
| that more hash buckets are needed. |
| After all, we did not pad each hash bucket to a full cache line, so |
| there are a number of hash buckets per cache line. |
| It is possible that the resulting cache-thrashing comes into play at |
| nine CPUs. |
| This is of course easy to test by increasing the number of hash buckets. |
| |
| \QuickQuiz{} |
| Instead of simply increasing the number of hash buckets, |
| wouldn't it be better to cache-align the existing hash buckets? |
| \QuickQuizAnswer{ |
| The answer depends on a great many things. |
| If the hash table has a large number of elements per bucket, it |
| would clearly be better to increase the number of hash buckets. |
| On the other hand, if the hash table is lightly loaded, |
| the answer depends on the hardware, the effectiveness of the |
| hash function, and the workload. |
| Interested readers are encouraged to experiment. |
| } \QuickQuizEnd |
| |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktsizelin}} |
| \end{center} |
| \caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, Varying Buckets} |
| \label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, Varying Buckets} |
| \end{figure} |
| |
| However, as can be seen in |
| Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, Varying Buckets}, |
| although increasing the number of buckets does increase performance somewhat, |
| scalability is still abysmal. |
| In particular, we still see a sharp dropoff at nine CPUs and beyond. |
| Furthermore, going from 8192 buckets to 16,384 buckets produced almost |
| no increase in performance. |
| Clearly something else is going on. |
| |
| The problem is that this is a multi-socket system, with CPUs~0-7 |
| and~32-39 mapped to the first socket as shown in |
| Table~\ref{tab:datastruct:NUMA Topology of System Under Test}. |
| Test runs confined to the first eight CPUs therefore perform quite |
| well, but tests that involve socket~0's CPUs~0-7 as well as |
| socket~1's CPU~8 incur the overhead of passing data across |
| socket boundaries. |
| This can severely degrade performance, as was discussed in |
| Section~\ref{sec:cpu:Hardware System Architecture}. |
| In short, large multi-socket systems require good locality of reference |
| in addition to full partitioning. |
| |
| \QuickQuiz{} |
| Given the negative scalability of the Schr\"odinger's |
| Zoo application across sockets, why not just run multiple |
| copies of the application, with each copy having a subset |
| of the animals and confined to run on a single socket? |
| \QuickQuizAnswer{ |
| You can do just that! |
| In fact, you can extend this idea to large clustered systems, |
| running one copy of the application on each node of the cluster. |
| This practice is called ``sharding'', and is heavily used in |
| practice by large web-based |
| retailers~\cite{DeCandia:2007:DAH:1323293.1294281}. |
| |
| However, if you are going to shard on a per-socket basis within |
| a multisocket system, why not buy separate smaller and cheaper |
| single-socket systems, and then run one shard of the database |
| on each of those systems? |
| } \QuickQuizEnd |
| |
| One key property of the Schr\"odinger's-zoo runs discussed thus far is that |
| they are all read-only. |
| This makes the performance degradation due to lock-acquisition-induced |
| cache misses all the more painful. |
| Even though we are not updating the underlying hash table itself, we are |
| still paying the price for writing to memory. |
| Of course, if the hash table was never going to be updated, we could dispense |
| entirely with mutual exclusion. |
| This approach is quite straightforward and is left as an exercise for the |
| reader. |
| But even with the occasional update, avoiding writes avoids cache |
| misses, and allows the read-mostly data to be replicated across all |
| the caches, which in turn promotes locality of reference. |
| |
| The next section therefore examines optimizations that can be carried out in |
| read-mostly cases where updates are rare, but could happen at any time. |
| |
| \begin{table} |
| \begin{center} |
| \begin{tabular}{r|r|r|r|r|r|r|r|r} |
| Socket & \multicolumn{8}{|c}{Core} \\ |
| \hline |
| 0 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ |
| & 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 \\ |
| \hline |
| 1 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ |
| & 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47 \\ |
| \hline |
| 2 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 \\ |
| & 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 \\ |
| \hline |
| 3 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31 \\ |
| & 56 & 47 & 58 & 59 & 60 & 61 & 62 & 63 \\ |
| \end{tabular} |
| \end{center} |
| \caption{NUMA Topology of System Under Test} |
| \label{tab:datastruct:NUMA Topology of System Under Test} |
| \end{table} |
| |
| \section{Read-Mostly Data Structures} |
| \label{sec:datastruct:Read-Mostly Data Structures} |
| |
| Although partitioned data structures can offer excellent scalability, |
| NUMA effects can result in severe degradations of both performance and |
| scalability. |
| In addition, |
| the need for readers to exclude writers can degrade performance in |
| read-mostly situations. |
| However, we can achieve both performance and scalability by using |
| RCU, which was introduced in |
| Section~\ref{sec:defer:Read-Copy Update (RCU)}. |
| Similar results can be achieved using hazard pointers |
| (\co{hazptr.c})~\cite{MagedMichael04a}, which will be included in |
| the performance results shown in this |
| section~\cite{McKenney:2013:SDS:2483852.2483867}. |
| |
| \subsection{RCU-Protected Hash Table Implementation} |
| \label{sec:datastruct:RCU-Protected Hash Table Implementation} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 static void hashtab_lock_lookup(struct hashtab *htp, |
| 2 unsigned long hash) |
| 3 { |
| 4 rcu_read_lock(); |
| 5 } |
| 6 |
| 7 static void hashtab_unlock_lookup(struct hashtab *htp, |
| 8 unsigned long hash) |
| 9 { |
| 10 rcu_read_unlock(); |
| 11 } |
| \end{verbatim} |
| } |
| \caption{RCU-Protected Hash-Table Read-Side Concurrency Control} |
| \label{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control} |
| \end{figure} |
| |
| For an RCU-protected hash table with per-bucket locking, |
| updaters use locking exactly as described in |
| Section~\ref{sec:datastruct:Partitionable Data Structures}, |
| but readers use RCU. |
| The data structures remain as shown in |
| Figure~\ref{fig:datastruct:Hash-Table Data Structures}, |
| and the \co{HASH2BKT()}, \co{hashtab_lock()}, and \co{hashtab_unlock()} |
| functions remain as shown in |
| Figure~\ref{fig:datastruct:Hash-Table Mapping and Locking}. |
| However, readers use the lighter-weight concurrency-control embodied |
| by \co{hashtab_lock_lookup()} and \co{hashtab_unlock_lookup()} |
| shown in |
| Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct ht_elem |
| 2 *hashtab_lookup(struct hashtab *htp, |
| 3 unsigned long hash, |
| 4 void *key, |
| 5 int (*cmp)(struct ht_elem *htep, |
| 6 void *key)) |
| 7 { |
| 8 struct ht_bucket *htb; |
| 9 struct ht_elem *htep; |
| 10 |
| 11 htb = HASH2BKT(htp, hash); |
| 12 cds_list_for_each_entry_rcu(htep, |
| 13 &htb->htb_head, |
| 14 hte_next) { |
| 15 if (htep->hte_hash != hash) |
| 16 continue; |
| 17 if (cmp(htep, key)) |
| 18 return htep; |
| 19 } |
| 20 return NULL; |
| 21 } |
| \end{verbatim} |
| } |
| \caption{RCU-Protected Hash-Table Lookup} |
| \label{fig:datastruct:RCU-Protected Hash-Table Lookup} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Lookup} |
| shows \co{hashtab_lookup()} for the RCU-protected per-bucket-locked |
| hash table. |
| This is identical to that in |
| Figure~\ref{fig:datastruct:Hash-Table Lookup} |
| except that \co{cds_list_for_each_entry()} is replaced |
| by \co{cds_list_for_each_entry_rcu()}. |
| Both of these primitives sequence down the hash chain referenced |
| by \co{htb->htb_head} but \co{cds_list_for_each_entry_rcu()} also |
| correctly enforces memory ordering in case of concurrent insertion. |
| This is an important difference between these two hash-table implementations: |
| Unlike the pure per-bucket-locked implementation, the RCU protected |
| implementation allows lookups to run concurrently with insertions |
| and deletions, and RCU-aware primitives like |
| \co{cds_list_for_each_entry_rcu()} are required to correctly handle |
| this added concurrency. |
| Note also that \co{hashtab_lookup()}'s caller must be within an |
| RCU read-side critical section, for example, the caller must invoke |
| \co{hashtab_lock_lookup()} before invoking \co{hashtab_lookup()} |
| (and of course invoke \co{hashtab_unlock_lookup()} some time afterwards). |
| |
| \QuickQuiz{} |
| But if elements in a hash table can be deleted concurrently |
| with lookups, doesn't that mean that a lookup could return |
| a reference to a data element that was deleted immediately |
| after it was looked up? |
| \QuickQuizAnswer{ |
| Yes it can! |
| This is why \co{hashtab_lookup()} must be invoked within an |
| RCU read-side critical section, and it is why |
| \co{hashtab_add()} and \co{hashtab_del()} must also use |
| RCU-aware list-manipulation primitives. |
| Finally, this is why the caller of \co{hashtab_del()} must |
| wait for a grace period (e.g., by calling \co{synchronize_rcu()}) |
| before freeing the deleted element. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 void |
| 2 hashtab_add(struct hashtab *htp, |
| 3 unsigned long hash, |
| 4 struct ht_elem *htep) |
| 5 { |
| 6 htep->hte_hash = hash; |
| 7 cds_list_add_rcu(&htep->hte_next, |
| 8 &HASH2BKT(htp, hash)->htb_head); |
| 9 } |
| 10 |
| 11 void hashtab_del(struct ht_elem *htep) |
| 12 { |
| 13 cds_list_del_rcu(&htep->hte_next); |
| 14 } |
| \end{verbatim} |
| } |
| \caption{RCU-Protected Hash-Table Modification} |
| \label{fig:datastruct:RCU-Protected Hash-Table Modification} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Modification} |
| shows \co{hashtab_add()} and \co{hashtab_del()}, both of which |
| are quite similar to their counterparts in the non-RCU hash table |
| shown in |
| Figure~\ref{fig:datastruct:Hash-Table Modification}. |
| The \co{hashtab_add()} function uses \co{cds_list_add_rcu()} instead |
| of \co{cds_list_add()} in order to ensure proper ordering when |
| an element is added to the hash table at the same time that it is |
| being looked up. |
| The \co{hashtab_del()} function uses \co{cds_list_del_rcu()} instead |
| of \co{cds_list_del_init()} to allow for the case where an element is |
| looked up just before it is deleted. |
| Unlike \co{cds_list_del_init()}, \co{cds_list_del_rcu()} leaves the |
| forward pointer intact, so that \co{hashtab_lookup()} can traverse |
| to the newly deleted element's successor. |
| |
| Of course, after invoking \co{hashtab_del()}, the caller must wait for |
| an RCU grace period (e.g., by invoking \co{synchronize_rcu()}) before |
| freeing or otherwise reusing the memory for the newly deleted element. |
| |
| \subsection{RCU-Protected Hash Table Performance} |
| \label{sec:datastruct:RCU-Protected Hash Table Performance} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocpu}} |
| \end{center} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo} |
| shows the read-only performance of RCU-protected and hazard-pointer-protected |
| hash tables against the previous section's per-bucket-locked implementation. |
| As you can see, both RCU and hazard pointers achieve near-ideal performance |
| and scalability despite the larger numbers of threads and the NUMA effects. |
| Results from a globally locked implementation are also shown, and as expected |
| the results are even worse than those of the per-bucket-locked implementation. |
| RCU does slightly better than hazard pointers, but the difference is not |
| readily visible in this log-scale plot. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocpulin}} |
| \end{center} |
| \caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo, Linear Scale} |
| \label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo, Linear Scale} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo, Linear Scale} |
| shows the same data on a linear scale. |
| This drops the global-locking trace into the x-axis, but allows the |
| relative performance of RCU and hazard pointers to be more readily |
| discerned. |
| Both show a change in slope at 32 CPUs, and this is due to hardware |
| multithreading. |
| At 32 and fewer CPUs, each thread has a core to itself. |
| In this regime, RCU does better than does hazard pointers because |
| hazard pointers's read-side memory barriers result in dead time within |
| the core. |
| In short, RCU is better able to utilize a core from a single hardware |
| thread than is hazard pointers. |
| |
| This situation changes above 32 CPUs. |
| Because RCU is using more than half of each core's resources from a |
| single hardware thread, RCU gains relatively litte benefit from the |
| second hardware thread in each core. |
| The slope of hazard pointers's trace also decreases at 32 CPUs, but |
| less dramatically, |
| because the second hardware thread is able to fill in the time |
| that the first hardware thread is stalled due to memory-barrier latency. |
| As we will see in later sections, hazard pointers's second-hardware-thread |
| advantage depends on the workload. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zoocatonly}} |
| \end{center} |
| \caption{Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs} |
| \label{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| \end{figure} |
| |
| As noted earlier, Schr\"odinger is surprised by the popularity of his |
| cat~\cite{ErwinSchroedinger1935Cat}, but recognizes the need to reflect |
| this popularity in his design. |
| Figure~\ref{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| shows the results of 60-CPU runs, varying the number of CPUs that are |
| doing nothing but looking up the cat. |
| Both RCU and hazard pointers respond well to this challenge, but |
| bucket locking scales negatively, eventually performing even worse |
| than global locking. |
| This should not be a surprise because if all CPUs are doing nothing |
| but looking up the cat, the lock corresponding to the cat's bucket |
| is for all intents and purposes a global lock. |
| |
| This cat-only benchmark illustrates one potential problem with |
| fully partitioned sharding approaches. |
| Only the CPUs associated with the cat's |
| partition is able to access the cat, limiting the cat-only |
| throughput. |
| Of course, a great many applications have good load-spreading |
| properties, and for these applications sharding works |
| quite well. |
| However, sharding does not handle ``hot spots'' very well, with |
| the hot spot exemplified by Schr\"odinger's cat being but one case |
| in point. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zooupdatelu}} |
| \end{center} |
| \caption{Read-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs} |
| \label{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| \end{figure} |
| |
| Of course, if we were only ever going to read the data, we would not need |
| any concurrency control to begin with. |
| Figure~\ref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| therefore shows the effect of updates. |
| At the extreme left-hand side of this graph, all 60 CPUs are doing lookups, |
| while to the right all 60 CPUs are doing updates. |
| For all four implementations, the number of lookups per millisecond |
| decreases as the number of updating CPUs increases, of course reaching |
| zero lookups per millisecond when all 60 CPUs are updating. |
| RCU does well relative to hazard pointers due to the fact that hazard |
| pointers's read-side memory barriers incur greater overhead in the |
| presence of updates. |
| It therefore seems likely that modern hardware heavily optimizes memory-barrier |
| execution, greatly reducing memory-barrier overhead in the read-only case. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zooupdate}} |
| \end{center} |
| \caption{Update-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs} |
| \label{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| \end{figure} |
| |
| Where |
| Figure~\ref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| showed the effect of increasing update rates on lookups, |
| Figure~\ref{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs} |
| shows the effect of increasing update rates on the updates themselves. |
| Hazard pointers and RCU start off with a significant advantage because, |
| unlike bucket locking, readers do not exclude updaters. |
| However, as the number of updating CPUs increases, update-side overhead |
| starts to make its presence known, first for RCU and then for hazard |
| pointers. |
| Of course, all three of these implementations fare much better than does |
| global locking. |
| |
| Of course, it is quite possible that the differences in lookup performance |
| is affected by the differences in update rates. |
| One way to check this is to artificially throttle the update rates of |
| per-bucket locking and hazard pointers to match that of RCU. |
| Doing so does not significantly improve the lookup performace of |
| per-bucket locking, nor does it close the gap between hazard pointers |
| and RCU. |
| However, removing hazard pointers's read-side memory barriers |
| (thus resulting in an unsafe implementation of hazard pointers) |
| does nearly close the gap between hazard pointers and RCU. |
| Although this unsafe hazard-pointer implementation will |
| usually be reliable enough for benchmarking purposes, it is absolutely |
| not recommended for production use. |
| |
| \QuickQuiz{} |
| The dangers of extrapolating from eight CPUs to 60 CPUs was |
| made quite clear in |
| Section~\ref{sec:datastruct:Hash-Table Performance}. |
| But why should extrapolating up from 60 CPUs be any safer? |
| \QuickQuizAnswer{ |
| It isn't any safer, and a useful exercise would be to run these |
| programs on larger systems. |
| That said, other testing has shown that RCU read-side primitives |
| offer consistent performance and scalability up to at least 1024 CPUs. |
| } \QuickQuizEnd |
| |
| \subsection{RCU-Protected Hash Table Discussion} |
| \label{sec:datastruct:RCU-Protected Hash Table Discussion} |
| |
| One consequence of the RCU and hazard-pointer implementations is |
| that a pair of concurrent readers might disagree on the state of |
| the cat. |
| For example, one of the readers might have fetched the pointer to |
| the cat's data structure just before it was removed, while another |
| reader might have fetched this same pointer just afterwards. |
| The first reader would then believe that the cat was alive, while |
| the second reader would believe that the cat was dead. |
| |
| Of course, this situation is completely fitting for Schr\"odinger's |
| cat, but it turns out that it is quite reasonable for normal |
| non-quantum cats as well. |
| |
| The reason for this is that it is impossible to determine exactly |
| when an animal is born or dies. |
| |
| To see this, let's suppose that we detect a cat's death by heartbeat. |
| This raise the question of exactly how long we should wait after the |
| last heartbeat before declaring death. |
| It is clearly ridiculous to wait only one millisecond, because then |
| a healthy living cat would have to be declared dead---and then |
| resurrected---more than once every second. |
| It is equally ridiculous to wait a full month, because by that time |
| the poor cat's death would have made itself very clearly known |
| via olfactory means. |
| |
| \begin{figure}[htb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/2013-08-is-it-dead}} |
| \end{center} |
| \caption{Even Veterinarians Disagree!} |
| \ContributedBy{Figure}{fig:datastruct:Even Veterinarians Disagree}{Melissa Broussard} |
| \end{figure} |
| |
| Because an animal's heart can stop for some seconds and then start up |
| again, there is a tradeoff between timely recognition of death and |
| probability of false alarms. |
| It is quite possible that a pair of veterinarians might disagree on |
| the time to wait between the last heartbeat and the declaration of |
| death. |
| For example, one veterinarian might declare death thirty seconds after |
| the last heartbeat, while another might insist on waiting a full |
| minute. |
| In this case, the two veterinarians would disagree on the state of the |
| cat for the second period of thirty seconds following the last heartbeat, |
| as fancifully depicted in |
| Figure~\ref{fig:datastruct:Even Veterinarians Disagree}. |
| |
| Of course, Heisenberg taught us to live with this sort of |
| uncertainty~\cite{WeinerHeisenberg1927Uncertain}, which is a good |
| thing because computing hardware and software acts similarly. |
| For example, how do you know that a piece of computing hardware |
| has failed? |
| Often because it does not respond in a timely fashion. |
| Just like the cat's heartbeat, this results in a window of |
| uncertainty as to whether or not the hardware has failed. |
| |
| Furthermore, most computing systems are intended to interact with |
| the outside world. |
| Consistency with the outside world is therefore of paramount importance. |
| However, as we saw in |
| Figure~\ref{fig:defer:Response Time of RCU vs. Reader-Writer Locking} |
| on |
| page~\pageref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}, |
| increased internal consistency can come at the expense of external |
| consistency. |
| Techniques such as RCU and hazard pointers give up some degree of |
| internal consistency to attain improved external consistency. |
| |
| In short, internal consistency is not a natural part of all problem domains, |
| and often incurs great expense in terms of performance, |
| scalability, external consistency, or all of the above. |
| |
| \section{Non-Partitionable Data Structures} |
| \label{sec:datastruct:Non-Partitionable Data Structures} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{cartoons/2014_Hash-table-hydra}} |
| \end{center} |
| \caption{Partitioning Problems} |
| \ContributedBy{Figure}{fig:datastruct:Partitioning Problems}{Melissa Broussard} |
| \end{figure} |
| |
| Fixed-size hash tables are perfectly partitionable, but resizable hash |
| tables pose partitioning challenges when growing or shrinking, as |
| fancifully depicted in |
| Figure~\ref{fig:datastruct:Partitioning Problems}. |
| However, it turns out that it is possible to construct high-performance |
| scalable RCU-protected hash tables, as described in the following sections. |
| |
| \subsection{Resizable Hash Table Design} |
| \label{sec:datastruct:Resizable Hash Table Design} |
| |
| In happy contrast to the situation in the early 2000s, there are now |
| no fewer than three different types of scalable RCU-protected hash |
| tables. |
| The first (and simplest) was developed for the Linux kernel by |
| Herbert Xu~\cite{HerbertXu2010RCUResizeHash}, and is described in the |
| following sections. |
| The other two are covered briefly in |
| Section~\ref{sec:datastruct:Other Resizable Hash Tables}. |
| |
| The key insight behind the first hash-table implementation is that |
| each data element can have two sets of list pointers, with one set |
| currently being used by RCU readers (as well as by non-RCU updaters) |
| and the other being used to construct a new resized hash table. |
| This approach allows lookups, insertions, and deletions to all run |
| concurrently with a resize operation (as well as with each other). |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-a}} |
| \end{center} |
| \caption{Growing a Double-List Hash Table, State (a)} |
| \label{fig:datastruct:Growing a Double-List Hash Table, State (a)} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-b}} |
| \end{center} |
| \caption{Growing a Double-List Hash Table, State (b)} |
| \label{fig:datastruct:Growing a Double-List Hash Table, State (b)} |
| \end{figure} |
| |
| The resize operation proceeds as shown in |
| Figures~\ref{fig:datastruct:Growing a Double-List Hash Table, State (a)}-\ref{fig:datastruct:Growing a Double-List Hash Table, State (d)}, |
| with the initial two-bucket state shown in |
| Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (a)} |
| and with time advancing from figure to figure. |
| The initial state uses the zero-index links to chain the elements into |
| hash buckets. |
| A four-bucket array is allocated, and the one-index links are used to |
| chain the elements into these four new hash buckets. |
| This results in state~(b) shown in |
| Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (b)}, |
| with readers still using the original two-bucket array. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-c}} |
| \end{center} |
| \caption{Growing a Double-List Hash Table, State (c)} |
| \label{fig:datastruct:Growing a Double-List Hash Table, State (c)} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/hashxu-d}} |
| \end{center} |
| \caption{Growing a Double-List Hash Table, State (d)} |
| \label{fig:datastruct:Growing a Double-List Hash Table, State (d)} |
| \end{figure} |
| |
| The new four-bucket array is exposed to readers and then a grace-period |
| operation waits for all readers, resulting in state~(c), shown in |
| Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (c)}. |
| In this state, all readers are using the new four-bucket array, |
| which means that the old two-bucket array may now be freed, resulting |
| in state~(d), shown in |
| Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (d)}. |
| |
| This design leads to a relatively straightforward implementation, |
| which is the subject of the next section. |
| |
| \subsection{Resizable Hash Table Implementation} |
| \label{sec:datastruct:Resizable Hash Table Implementation} |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct ht_elem { |
| 2 struct rcu_head rh; |
| 3 struct cds_list_head hte_next[2]; |
| 4 unsigned long hte_hash; |
| 5 }; |
| 6 |
| 7 struct ht_bucket { |
| 8 struct cds_list_head htb_head; |
| 9 spinlock_t htb_lock; |
| 10 }; |
| 11 |
| 12 struct ht { |
| 13 long ht_nbuckets; |
| 14 long ht_resize_cur; |
| 15 struct ht *ht_new; |
| 16 int ht_idx; |
| 17 void *ht_hash_private; |
| 18 int (*ht_cmp)(void *hash_private, |
| 19 struct ht_elem *htep, |
| 20 void *key); |
| 21 long (*ht_gethash)(void *hash_private, |
| 22 void *key); |
| 23 void *(*ht_getkey)(struct ht_elem *htep); |
| 24 struct ht_bucket ht_bkt[0]; |
| 25 }; |
| 26 |
| 27 struct hashtab { |
| 28 struct ht *ht_cur; |
| 29 spinlock_t ht_lock; |
| 30 }; |
| \end{verbatim} |
| } |
| \caption{Resizable Hash-Table Data Structures} |
| \label{fig:datastruct:Resizable Hash-Table Data Structures} |
| \end{figure} |
| |
| Resizing is accomplished by the classic approach of inserting a level |
| of indirection, in this case, the \co{ht} structure shown on |
| lines~12-25 of |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures}. |
| The \co{hashtab} structure shown on lines~27-30 contains only a |
| pointer to the current \co{ht} structure along with a spinlock that |
| is used to serialize concurrent attempts to resize the hash table. |
| If we were to use a traditional lock- or atomic-operation-based |
| implementation, this \co{hashtab} structure could become a severe bottleneck |
| from both performance and scalability viewpoints. |
| However, because resize operations should be relatively infrequent, |
| we should be able to make good use of RCU. |
| |
| The \co{ht} structure represents a specific size of the hash table, |
| as specified by the \co{->ht_nbuckets} field on line~13. |
| The size is stored in the same structure containing the array of |
| buckets (\co{->ht_bkt[]} on line~24) in order to avoid mismatches between |
| the size and the array. |
| The \co{->ht_resize_cur} field on line~14 is equal to -1 unless a resize |
| operation |
| is in progress, in which case it indicates the index of the bucket whose |
| elements are being inserted into the new hash table, which is referenced |
| by the \co{->ht_new} field on line~15. |
| If there is no resize operation in progress, \co{->ht_new} is \co{NULL}. |
| Thus, a resize operation proceeds by allocating a new \co{ht} structure |
| and referencing it via the \co{->ht_new} pointer, then advancing |
| \co{->ht_resize_cur} through the old table's buckets. |
| When all the elements have been added to the new table, the new |
| table is linked into the \co{hashtab} structure's \co{->ht_cur} field. |
| Once all old readers have completed, the old hash table's \co{ht} structure |
| may be freed. |
| |
| The \co{->ht_idx} field on line~16 indicates which of the two sets of |
| list pointers are being used by this instantiation of the hash table, |
| and is used to index the \co{->hte_next[]} array in the \co{ht_bucket} |
| structure on line~3. |
| |
| The \co{->ht_hash_private}, \co{->ht_cmp()}, \co{->ht_gethash()}, |
| and \co{->ht_getkey()} fields on lines~17-23 |
| collectively define the per-element key and the hash function. |
| The \co{->ht_hash_private} allows the hash function to be |
| perturbed~\cite{McKenney89c,McKenney90,McKenney91}, which can be |
| used to avoid denial-of-service attacks based on statistical |
| estimation of the parameters used in the hash function. |
| The \co{->ht_cmp()} function compares a specified key with that of |
| the specified element, |
| the \co{->ht_gethash()} calculates the specified key's hash, |
| and \co{->ht_getkey()} extracts the key from the enclosing data |
| element. |
| |
| The \co{ht_bucket} structure is the same as before, and the |
| \co{ht_elem} structure differs from that of previous implementations |
| only in providing a two-element array of list pointer sets in place of |
| the prior single set of list pointers. |
| |
| In a fixed-sized hash table, bucket selection is quite straightforward: |
| Simply transform the hash value to the corresponding bucket index. |
| In contrast, when resizing, it is also necessary to determine which |
| of the old and new sets of buckets to select from. |
| If the bucket that would be selected from the old table has already |
| been distributed into the new table, then the bucket should be selected |
| from the new table. |
| Conversely, if the bucket that would be selected from the old table |
| has not yet been distributed, then the bucket should be selected from |
| the old table. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 static struct ht_bucket * |
| 2 ht_get_bucket_single(struct ht *htp, |
| 3 void *key, long *b) |
| 4 { |
| 5 *b = htp->ht_gethash(htp->ht_hash_private, |
| 6 key) % htp->ht_nbuckets; |
| 7 return &htp->ht_bkt[*b]; |
| 8 } |
| 9 |
| 10 static struct ht_bucket * |
| 11 ht_get_bucket(struct ht **htp, void *key, |
| 12 long *b, int *i) |
| 13 { |
| 14 struct ht_bucket *htbp; |
| 15 |
| 16 htbp = ht_get_bucket_single(*htp, key, b); |
| 17 if (*b <= (*htp)->ht_resize_cur) { |
| 18 *htp = (*htp)->ht_new; |
| 19 htbp = ht_get_bucket_single(*htp, key, b); |
| 20 } |
| 21 if (i) |
| 22 *i = (*htp)->ht_idx; |
| 23 return htbp; |
| 24 } |
| \end{verbatim} |
| } |
| \caption{Resizable Hash-Table Bucket Selection} |
| \label{fig:datastruct:Resizable Hash-Table Bucket Selection} |
| \end{figure} |
| |
| Bucket selection is shown in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}, |
| which shows \co{ht_get_bucket_single()} on lines~1-8 and |
| \co{ht_get_bucket()} on lines~10-24. |
| The \co{ht_get_bucket_single()} function returns a reference to the bucket |
| corresponding to the specified key in the specified hash table, without |
| making any allowances for resizing. |
| It also stores the hash value corresponding to the key into the location |
| referenced by parameter \co{b} on lines~5 and ~6. |
| Line~7 then returns a reference to the corresponding bucket. |
| |
| The \co{ht_get_bucket()} function handles hash-table selection, invoking |
| \co{ht_get_bucket_single()} on line~16 to select the bucket |
| corresponding to the hash in the current |
| hash table, storing the hash value through parameter \co{b}. |
| If line~17 determines that the table is being resized and that |
| line~16's bucket has already been distributed across the new hash |
| table, then line~18 selects the new hash table and line~19 |
| selects the bucket corresponding to the hash in the new hash table, |
| again storing the hash value through parameter \co{b}. |
| |
| \QuickQuiz{} |
| The code in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection} |
| computes the hash twice! |
| Why this blatant inefficiency? |
| \QuickQuizAnswer{ |
| The reason is that the old and new hash tables might have |
| completely different hash functions, so that a hash computed |
| for the old table might be completely irrelevant to the |
| new table. |
| } \QuickQuizEnd |
| |
| If line~21 finds that parameter \co{i} is non-\co{NULL}, then |
| line~22 stores the pointer-set index for the selected hash table. |
| Finally, line~23 returns a reference to the selected hash bucket. |
| |
| \QuickQuiz{} |
| How does the code in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection} |
| protect against the resizing process progressing past the |
| selected bucket? |
| \QuickQuizAnswer{ |
| It does not provide any such protection. |
| That is instead the job of the update-side concurrency-control |
| functions described next. |
| } \QuickQuizEnd |
| |
| This implementation of |
| \co{ht_get_bucket_single()} and \co{ht_get_bucket()} |
| will permit lookups and modifications to run concurrently |
| with a resize operation. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 void hashtab_lock_mod(struct hashtab *htp_master, |
| 2 void *key) |
| 3 { |
| 4 long b; |
| 5 struct ht *htp; |
| 6 struct ht_bucket *htbp; |
| 7 struct ht_bucket *htbp_new; |
| 8 |
| 9 rcu_read_lock(); |
| 10 htp = rcu_dereference(htp_master->ht_cur); |
| 11 htbp = ht_get_bucket_single(htp, key, &b); |
| 12 spin_lock(&htbp->htb_lock); |
| 13 if (b > htp->ht_resize_cur) |
| 14 return; |
| 15 htp = htp->ht_new; |
| 16 htbp_new = ht_get_bucket_single(htp, key, &b); |
| 17 spin_lock(&htbp_new->htb_lock); |
| 18 spin_unlock(&htbp->htb_lock); |
| 19 } |
| 20 |
| 21 void hashtab_unlock_mod(struct hashtab *htp_master, |
| 22 void *key) |
| 23 { |
| 24 long b; |
| 25 struct ht *htp; |
| 26 struct ht_bucket *htbp; |
| 27 |
| 28 htp = rcu_dereference(htp_master->ht_cur); |
| 29 htbp = ht_get_bucket(&htp, key, &b, NULL); |
| 30 spin_unlock(&htbp->htb_lock); |
| 31 rcu_read_unlock(); |
| 32 } |
| \end{verbatim} |
| } |
| \caption{Resizable Hash-Table Update-Side Concurrency Control} |
| \label{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control} |
| \end{figure} |
| |
| Read-side concurrency control is provided by RCU as was shown in |
| Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}, |
| but the update-side concurrency-control functions |
| \co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()} |
| must now deal with the possibility of a |
| concurrent resize operation as shown in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}. |
| |
| The \co{hashtab_lock_mod()} spans lines~1-19 in the figure. |
| Line~9 enters an RCU read-side critical section to prevent |
| the data structures from being freed during the traversal, |
| line~10 acquires a reference to the current hash table, and then |
| line~11 obtains a reference to the bucket in this hash table |
| corresponding to the key. |
| Line~12 acquires that bucket's lock, which will prevent any concurrent |
| resizing operation from distributing that bucket, though of course it |
| will have no effect if the resizing operation has already distributed |
| this bucket. |
| Line~13 then checks to see if a concurrent resize operation has |
| already distributed this bucket across the new hash table, and if not, |
| line~14 returns with the selected hash bucket's lock held (and also |
| within an RCU read-side critical section). |
| |
| Otherwise, a concurrent resize operation has already distributed this |
| bucket, so line~15 proceeds to the new hash table and line~16 |
| selects the bucket corresponding to the key. |
| Finally, line~17 acquires the bucket's lock and line~18 releases the |
| lock for the old hash table's bucket. |
| Once again, \co{hashtab_lock_mod()} exits within an RCU read-side critical |
| section. |
| |
| \QuickQuiz{} |
| The code in |
| Figures~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection} |
| and~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control} |
| compute the hash and execute the bucket-selection logic twice for |
| updates! |
| Why this blatant inefficiency? |
| \QuickQuizAnswer{ |
| This approach allows the \co{hashtorture.h} testing infrastructure |
| to be reused. |
| That said, a production-quality resizable hash table would likely |
| be optimized to avoid this double computation. |
| Carrying out this optimization is left as an exercise for the reader. |
| } \QuickQuizEnd |
| |
| The \co{hashtab_unlock_mod()} function releases the lock acquired by |
| \co{hashtab_lock_mod()}. |
| Line~28 picks up the current hash table, and then line~29 invokes |
| \co{ht_get_bucket()} in order to gain a reference to the bucket that |
| corresponds to the key---and of course this bucket might well in a |
| new hash table. |
| Line~30 releases the bucket's lock and finally line~31 exits the |
| RCU read-side critical section. |
| |
| \QuickQuiz{} |
| Suppose that one thread is inserting an element into the |
| new hash table during a resize operation. |
| What prevents this insertion to be lost due to a subsequent |
| resize operation completing before the insertion does? |
| \QuickQuizAnswer{ |
| The second resize operation will not be able to move beyond |
| the bucket into which the insertion is taking place due to |
| the insertion holding the lock on one of the hash buckets in |
| the new hash table (the second hash table of three in this |
| example). |
| Furthermore, the insertion operation takes place within an |
| RCU read-side critical section. |
| As we will see when we examine the \co{hashtab_resize()} |
| function, this means that the first resize operation will |
| use |
| \co{synchronize_rcu()} to wait for the insertion's read-side |
| critical section to complete. |
| } \QuickQuizEnd |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 struct ht_elem * |
| 2 hashtab_lookup(struct hashtab *htp_master, |
| 3 void *key) |
| 4 { |
| 5 long b; |
| 6 int i; |
| 7 struct ht *htp; |
| 8 struct ht_elem *htep; |
| 9 struct ht_bucket *htbp; |
| 10 |
| 11 htp = rcu_dereference(htp_master->ht_cur); |
| 12 htbp = ht_get_bucket(&htp, key, &b, &i); |
| 13 cds_list_for_each_entry_rcu(htep, |
| 14 &htbp->htb_head, |
| 15 hte_next[i]) { |
| 16 if (htp->ht_cmp(htp->ht_hash_private, |
| 17 htep, key)) |
| 18 return htep; |
| 19 } |
| 20 return NULL; |
| 21 } |
| 22 |
| 23 void |
| 24 hashtab_add(struct hashtab *htp_master, |
| 25 struct ht_elem *htep) |
| 26 { |
| 27 long b; |
| 28 int i; |
| 29 struct ht *htp; |
| 30 struct ht_bucket *htbp; |
| 31 |
| 32 htp = rcu_dereference(htp_master->ht_cur); |
| 33 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep), |
| 34 &b, &i); |
| 35 cds_list_add_rcu(&htep->hte_next[i], |
| 36 &htbp->htb_head); |
| 37 } |
| 38 |
| 39 void |
| 40 hashtab_del(struct hashtab *htp_master, |
| 41 struct ht_elem *htep) |
| 42 { |
| 43 long b; |
| 44 int i; |
| 45 struct ht *htp; |
| 46 struct ht_bucket *htbp; |
| 47 |
| 48 htp = rcu_dereference(htp_master->ht_cur); |
| 49 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep), |
| 50 &b, &i); |
| 51 cds_list_del_rcu(&htep->hte_next[i]); |
| 52 } |
| \end{verbatim} |
| } |
| \caption{Resizable Hash-Table Access Functions} |
| \label{fig:datastruct:Resizable Hash-Table Access Functions} |
| \end{figure} |
| |
| Now that we have bucket selection and concurrency control in place, |
| we are ready to search and update our resizable hash table. |
| The \co{hashtab_lookup()}, \co{hashtab_add()}, and \co{hashtab_del()} |
| functions shown in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}. |
| |
| The \co{hashtab_lookup()} function on lines~1-21 of the figure does |
| hash lookups. |
| Line~11 fetches the current hash table and line~12 obtains a reference |
| to the bucket corresponding to the specified key. |
| This bucket will be located in a new resized hash table when a |
| resize operation has progressed past the bucket in the old hash |
| table that contained the desired data element. |
| Note that line~12 also passes back the index that will be used to |
| select the correct set of pointers from the pair in each element. |
| The loop spanning lines~13-19 searches the bucket, so that if line~16 |
| detects a match, line~18 returns a pointer to the enclosing data element. |
| Otherwise, if there is no match, line~20 returns \co{NULL} to indicate |
| failure. |
| |
| \QuickQuiz{} |
| In the \co{hashtab_lookup()} function in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}, |
| the code carefully finds the right bucket in the new hash table |
| if the element to be looked up has already been distributed |
| by a concurrent resize operation. |
| This seems wasteful for RCU-protected lookups. |
| Why not just stick with the old hash table in this case? |
| \QuickQuizAnswer{ |
| Suppose that a resize operation begins and distributes half of |
| the old table's buckets to the new table. |
| Suppose further that a thread adds a new element that goes into |
| one of the already-distributed buckets, and that this same thread |
| now looks up this newly added element. |
| If lookups unconditionally traversed only the old hash table, |
| this thread would get a lookup failure for the element that it |
| just added, which certainly sounds like a bug to me! |
| } \QuickQuizEnd |
| |
| The \co{hashtab_add()} function on lines~23-37 of the figure adds |
| new data elements to the hash table. |
| Lines~32-34 obtain a pointer to the hash bucket corresponding to |
| the key (and provide the index), as before, and line~35 adds the new |
| element to the table. |
| The caller is required to handle concurrency, for example, by invoking |
| \co{hashtab_lock_mod()} before the call to \co{hashtab_add()} and invoking |
| \co{hashtab_unlock_mod()} afterwards. |
| These two concurrency-control functions will correctly synchronize with |
| a concurrent resize operation: If the resize operation has already |
| progressed beyond the bucket that this data element would have been added to, |
| then the element is added to the new table. |
| |
| The \co{hashtab_del()} function on lines~39-52 of the figure removes |
| an existing element from the hash table. |
| Lines~48-50 provide the bucket and index as before, and line~51 removes |
| the specified element. |
| As with \co{hashtab_add()}, the caller is responsible for concurrency |
| control and this concurrency control suffices for synchronizing with |
| a concurrent resize operation. |
| |
| \QuickQuiz{} |
| The \co{hashtab_del()} function in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions} |
| does not always remove the element from the old hash table. |
| Doesn't this mean that readers might access this newly removed |
| element after it has been freed? |
| \QuickQuizAnswer{ |
| No. |
| The \co{hashtab_del()} function omits removing the element |
| from the old hash table only if the resize operation has |
| already progressed beyond the bucket containing the just-deleted |
| element. |
| But this means that new \co{hashtab_lookup()} operations will |
| use the new hash table when looking up that element. |
| Therefore, only old \co{hashtab_lookup()} operations that started |
| before the \co{hashtab_del()} might encounter the newly |
| removed element. |
| This means that \co{hashtab_del()} need only wait for an |
| RCU grace period to avoid inconveniencing |
| \co{hashtab_lookup()} operations. |
| } \QuickQuizEnd |
| |
| \begin{figure*}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| 1 int hashtab_resize(struct hashtab *htp_master, |
| 2 unsigned long nbuckets, void *hash_private, |
| 3 int (*cmp)(void *hash_private, struct ht_elem *htep, void *key), |
| 4 long (*gethash)(void *hash_private, void *key), |
| 5 void *(*getkey)(struct ht_elem *htep)) |
| 6 { |
| 7 struct ht *htp; |
| 8 struct ht *htp_new; |
| 9 int i; |
| 10 int idx; |
| 11 struct ht_elem *htep; |
| 12 struct ht_bucket *htbp; |
| 13 struct ht_bucket *htbp_new; |
| 14 unsigned long hash; |
| 15 long b; |
| 16 |
| 17 if (!spin_trylock(&htp_master->ht_lock)) |
| 18 return -EBUSY; |
| 19 htp = htp_master->ht_cur; |
| 20 htp_new = ht_alloc(nbuckets, |
| 21 hash_private ? hash_private : htp->ht_hash_private, |
| 22 cmp ? cmp : htp->ht_cmp, |
| 23 gethash ? gethash : htp->ht_gethash, |
| 24 getkey ? getkey : htp->ht_getkey); |
| 25 if (htp_new == NULL) { |
| 26 spin_unlock(&htp_master->ht_lock); |
| 27 return -ENOMEM; |
| 28 } |
| 29 htp->ht_new = htp_new; |
| 30 synchronize_rcu(); |
| 31 idx = htp->ht_idx; |
| 32 htp_new->ht_idx = !idx; |
| 33 for (i = 0; i < htp->ht_nbuckets; i++) { |
| 34 htbp = &htp->ht_bkt[i]; |
| 35 spin_lock(&htbp->htb_lock); |
| 36 htp->ht_resize_cur = i; |
| 37 cds_list_for_each_entry(htep, &htbp->htb_head, hte_next[idx]) { |
| 38 htbp_new = ht_get_bucket_single(htp_new, htp_new->ht_getkey(htep), &b); |
| 39 spin_lock(&htbp_new->htb_lock); |
| 40 cds_list_add_rcu(&htep->hte_next[!idx], &htbp_new->htb_head); |
| 41 spin_unlock(&htbp_new->htb_lock); |
| 42 } |
| 43 spin_unlock(&htbp->htb_lock); |
| 44 } |
| 45 rcu_assign_pointer(htp_master->ht_cur, htp_new); |
| 46 synchronize_rcu(); |
| 47 spin_unlock(&htp_master->ht_lock); |
| 48 free(htp); |
| 49 return 0; |
| 50 } |
| \end{verbatim} |
| } |
| \caption{Resizable Hash-Table Resizing} |
| \label{fig:datastruct:Resizable Hash-Table Resizing} |
| \end{figure*} |
| |
| The actual resizing itself is carried out by \co{hashtab_resize}, shown in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Resizing} on |
| page~\pageref{fig:datastruct:Resizable Hash-Table Resizing}. |
| Line~17 conditionally acquires the top-level \co{->ht_lock}, and if |
| this acquisition fails, line~18 returns \co{-EBUSY} to indicate that |
| a resize is already in progress. |
| Otherwise, line~19 picks up a reference to the current hash table, |
| and lines~21-24 allocate a new hash table of the desired size. |
| If a new set of hash/key functions have been specified, these are |
| used for the new table, otherwise those of the old table are preserved. |
| If line~25 detects memory-allocation failure, line~26 releases \co{->htlock} |
| and line~27 returns a failure indication. |
| |
| Line~29 starts the bucket-distribution process by installing a reference |
| to the new table into the \co{->ht_new} field of the old table. |
| Line~30 ensures that all readers who are not aware of the new table |
| complete before the resize operation continues. |
| Line~31 picks up the current table's index and stores its inverse to |
| the new hash table, thus ensuring that the two hash tables avoid overwriting |
| each other's linked lists. |
| |
| Each pass through the loop spanning lines~33-44 distributes the contents |
| of one of the old hash table's buckets into the new hash table. |
| Line~34 picks up a reference to the old table's current bucket, |
| line~35 acquires that bucket's spinlock, and line~36 updates |
| \co{->ht_resize_cur} to indicate that this bucket is being distributed. |
| |
| \QuickQuiz{} |
| In the \co{hashtab_resize()} function in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}, |
| what guarantees that the update to \co{->ht_new} on line~29 |
| will be seen as happening before the update to \co{->ht_resize_cur} |
| on line~36 from the perspective of \co{hashtab_lookup()}, |
| \co{hashtab_add()}, and \co{hashtab_del()}? |
| \QuickQuizAnswer{ |
| The \co{synchronize_rcu()} on line~30 of |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions} |
| ensures that all pre-existing RCU readers have completed between |
| the time that we install the new hash-table reference on |
| line~29 and the time that we update \co{->ht_resize_cur} on |
| line~36. |
| This means that any reader that sees a non-negative value |
| of \co{->ht_resize_cur} cannot have started before the |
| assignment to \co{->ht_new}, and thus must be able to see |
| the reference to the new hash table. |
| } \QuickQuizEnd |
| |
| Each pass through the loop spanning lines~37-42 adds one data element |
| from the current old-table bucket to the corresponding new-table bucket, |
| holding the new-table bucket's lock during the add operation. |
| Finally, line~43 releases the old-table bucket lock. |
| |
| Execution reaches line~45 once all old-table buckets have been distributed |
| across the new table. |
| Line~45 installs the newly created table as the current one, and |
| line~46 waits for all old readers (who might still be referencing |
| the old table) to complete. |
| Then line~47 releases the resize-serialization lock, line~48 frees |
| the old hash table, and finally line~48 returns success. |
| |
| \subsection{Resizable Hash Table Discussion} |
| \label{sec:datastruct:Resizable Hash Table Discussion} |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/perftestresize}} |
| \end{center} |
| \caption{Overhead of Resizing Hash Tables} |
| \label{fig:datastruct:Overhead of Resizing Hash Tables} |
| \end{figure} |
| |
| Figure~\ref{fig:datastruct:Overhead of Resizing Hash Tables} |
| compares resizing hash tables to their fixed-sized counterparts |
| for 2048, 16,384, and 131,072 elements in the hash table. |
| The figure shows three traces for each element count, one |
| for a fixed-size 1024-bucket hash table, another for a |
| fixed-size 2048-bucket hash table, and a third for a resizable |
| hash table that shifts back and forth between 1024 and 2048 |
| buckets, with a one-millisecond pause between each resize operation. |
| |
| The uppermost three traces are for the 2048-element hash table. |
| The upper trace corresponds to the 2048-bucket fixed-size hash |
| table, the middle trace to the 1024-bucket fixed-size hash table, |
| and the lower trace to the resizable hash table. |
| In this case, the short hash chains cause normal lookup overhead |
| to be so low that the overhead of resizing dominates. |
| Nevertheless, the larger fixed-size hash table has a significant |
| performance advantage, so that resizing can be quite beneficial, |
| at least given sufficient time between resizing operations: One |
| millisecond is clearly too short a time. |
| |
| The middle three traces are for the 16,384-element hash table. |
| Again, the upper trace corresponds to the 2048-bucket fixed-size hash |
| table, but the middle trace now corresponds to the resizable hash |
| table and the lower trace to the 1024-bucket fixed-size hash table. |
| However, the performance difference between the resizable and the |
| 1024-bucket hash table is quite small. |
| One consequence of the eight-fold increase in number of elements |
| (and thus also in hash-chain length) is that incessant resizing |
| is now no worse than maintaining a too-small hash table. |
| |
| The lower three traces are for the 131,072-element hash table. |
| The upper trace corresponds to the 2048-bucket fixed-size hash table, |
| the middle trace to the resizable hash table, and the lower trace |
| to the 1024-bucket fixed-size hash table. |
| In this case, longer hash chains result in higher lookup overhead, |
| so that this lookup overhead dominates that of resizing the hash |
| table. |
| However, the performance of all three approaches at the 131,072-element |
| level is more than an order of magnitude worse than that at the |
| 2048-element level, suggesting that the best strategy would be |
| a single 64-fold increase in hash-table size. |
| |
| The key point from this data is that the RCU-protected resizable hash |
| table performs and scales almost as well as does its fixed-size counterpart. |
| The performance during an actual resize operation of course suffers |
| somewhat due to the cache misses causes by the updates to each element's |
| pointers, and this effect is most pronounced when the hash-tables |
| bucket lists are short. |
| This indicates that hash tables should be resized by substantial amounts, |
| and that hysteresis should be be applied to prevent performance degradation |
| due to too-frequent resize operations. |
| In memory-rich environments, hash-table sizes should furthermore |
| be increased much more aggressively than they are decreased. |
| |
| Another key point is that although the \co{hashtab} structure is |
| non-partitionable, it is also read-mostly, which suggests the use |
| of RCU. |
| Given that the performance and scalability of this resizable hash table is |
| very nearly that of RCU-protected fixed-sized hash tables, we must |
| conclude that this approach was quite successful. |
| |
| Finally, it is important to note that insertions, deletions, and |
| lookups can proceed concurrently with a resize operation. |
| This concurrency is |
| critically important when resizing large hash tables, especially |
| for applications that must meet severe response-time constraints. |
| |
| Of course, the \co{ht_elem} structure's |
| pair of pointer sets does impose some memory overhead, |
| which is taken up in the next section. |
| |
| \subsection{Other Resizable Hash Tables} |
| \label{sec:datastruct:Other Resizable Hash Tables} |
| |
| One shortcoming of the resizable hash table described earlier in this |
| section is memory consumption. |
| Each data element has two pairs of linked-list pointers rather than just |
| one. |
| Is it possible to create an RCU-protected resizable hash table that |
| makes do with just one pair? |
| |
| It turns out that the answer is ``yes.'' |
| Josh Triplett et al.~\cite{Triplett:2011:RPHash} |
| produced a \emph{relativistic hash table} that incrementally |
| splits and combines corresponding hash chains so that readers always |
| see valid hash chains at all points during the resizing operation. |
| This incremental splitting and combining relies on the fact that it |
| is harmless for a reader to see a data element that should be in some |
| other hash chain: When this happens, the reader will simply ignore the |
| extraneous data element due to key mismatches. |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zipperhashshrink}} |
| \end{center} |
| \caption{Shrinking a Relativistic Hash Table} |
| \label{fig:datastruct:Shrinking a Relativistic Hash Table} |
| \end{figure} |
| |
| The process of shrinking a relativistic hash table by a factor of two |
| is shown in |
| Figure~\ref{fig:datastruct:Shrinking a Relativistic Hash Table}, |
| in this case shrinking a two-bucket hash table into a one-bucket |
| hash table, otherwise known as a linear list. |
| This process works by coalescing pairs of buckets in the old larger hash |
| table into single buckets in the new smaller hash table. |
| For this process to work correctly, we clearly need to constrain the hash |
| functions for the two tables. |
| One such constraint is to use the same underlying hash function for |
| both tables, but to throw out the low-order bit when shrinking from |
| large to small. |
| For example, the old two-bucket hash table would |
| use the two top bits of the value, while the new one-bucket hash table |
| could use the top bit of the value. |
| In this way, a given pair of adjacent even and odd buckets in the old |
| large hash table can be coalesced into a single bucket in the new small |
| hash table, while still having a single hash value cover all of the |
| elements in that single bucket. |
| |
| The initial state is shown at the top of the figure, with time advancing |
| from top to bottom, starting with initial state~(a). |
| The shrinking process begins by allocating the new smaller array of |
| buckets, and having each bucket of this new smaller array reference |
| the first element of one of the buckets of the corresponding pair in |
| the old large hash table, resulting in state~(b). |
| |
| Then the two hash chains are linked together, resulting in state~(c). |
| In this state, readers looking up an even-numbered element see no change, |
| and readers looking up elements~1 and~3 likewise see no change. |
| However, readers looking up some other odd number will also traverse |
| elements~0 and~2. |
| This is harmless because any odd number will compare not-equal to these |
| two elements. |
| There is some performance loss, but on the other hand, this is exactly |
| the same performance loss that will be experienced once the new small |
| hash table is fully in place. |
| |
| Next, the new small hash table is made accessible to readers, resulting |
| in state~(d). |
| Note that older readers might still be traversing the old large hash |
| table, so in this state both hash tables are in use. |
| |
| The next step is to wait for all pre-existing readers to complete, |
| resulting in state~(e). |
| In this state, all readers are using the new small hash table, so that |
| the old large hash table's buckets may be freed, resulting in the final |
| state~(f). |
| |
| \begin{figure}[tb] |
| \begin{center} |
| \resizebox{3in}{!}{\includegraphics{datastruct/zipperhashgrow}} |
| \end{center} |
| \caption{Growing a Relativistic Hash Table} |
| \label{fig:datastruct:Growing a Relativistic Hash Table} |
| \end{figure} |
| |
| Growing a relativistic hash table reverses the shrinking process, |
| but requires more grace-period steps, as shown in |
| Figure~\ref{fig:datastruct:Growing a Relativistic Hash Table}. |
| The initial state~(a) is at the top of this figure, with time advancing |
| from top to bottom. |
| |
| We start by allocating the new large two-bucket hash table, resulting |
| in state~(b). |
| Note that each of these new buckets references the first element destined |
| for that bucket. |
| These new buckets are published to readers, resulting in state~(c). |
| After a grace-period operation, all readers are using the new large |
| hash table, resulting in state~(d). |
| In this state, only those readers traversing the even-values hash bucket |
| traverse element~0, which is therefore now colored white. |
| |
| At this point, the old small hash buckets may be freed, although many |
| implementations use these old buckets to track progress ``unzipping'' |
| the list of items into their respective new buckets. |
| The last even-numbered element in the first consecutive run of such |
| elements now has its pointer-to-next updated to reference the following |
| even-numbered element. |
| After a subsequent grace-period operation, the result is state~(e). |
| The vertical arrow indicates the next element to be unzipped, and |
| element~1 is now colored black to indicate that only those readers traversing |
| the odd-values hash bucket may reach it. |
| |
| Next, the last odd-numbered element in the first consecutive run of such |
| elements now has its pointer-to-next updated to reference the following |
| odd-numbered element. |
| After a subsequent grace-period operation, the result is state~(f). |
| A final unzipping operation (including a grace-period operation) |
| results in the final state~(g). |
| |
| In short, the relativistic hash table reduces the number of per-element |
| list pointers at the expense of additional grace periods incurred during |
| resizing. |
| These additional grace periods are usually not a problem because |
| insertions, deletions, and lookups may proceed concurrently with |
| a resize operation. |
| |
| It turns out that it is possible to reduce the per-element memory overhead |
| from a pair of pointers to a single pointer, while still retaining |
| $O(1)$ deletions. |
| This is accomplished by augmenting split-order |
| list~\cite{OriShalev2006SplitOrderListHash} |
| with RCU |
| protection~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUhash}. |
| The data elements in the hash table are arranged into a single |
| sorted linked list, with each hash bucket referencing the first element |
| in that bucket. |
| Elements are deleted by setting low-order bits in their pointer-to-next |
| fields, and these elements are removed from the list by later traversals |
| that encounter them. |
| |
| This RCU-protected split-order list is complex, but offers lock-free |
| progress guarantees for all insertion, deletion, and lookup operations. |
| Such guarantees can be important in real-time applications. |
| An implementation is available from recent versions of the userspace RCU |
| library~\cite{MathieuDesnoyers2009URCU}. |
| |
| \section{Other Data Structures} |
| \label{sec:datastruct:Other Data Structures} |
| |
| The preceding sections have focused on data structures that enhance |
| concurrency due to partitionability |
| (Section~\ref{sec:datastruct:Partitionable Data Structures}), |
| efficient handling of read-mostly access patterns |
| (Section~\ref{sec:datastruct:Read-Mostly Data Structures}), |
| or application of read-mostly techniques to avoid |
| non-partitionability |
| (Section~\ref{sec:datastruct:Non-Partitionable Data Structures}). |
| This section gives a brief review of other data structures. |
| |
| One of the hash table's greatest advantages for parallel use is that it |
| is fully partitionable, at least while not being resized. |
| One way of preserving the partitionability and the size independence is |
| to use a radix tree, which is also called a trie. |
| Tries partition the search key, using each successive key partition |
| to traverse the next level of the trie. |
| As such, a trie can be thought of as a set of nested hash tables, |
| thus providing the required partitionability. |
| One disadvantage of tries is that a sparse key space can result in |
| inefficient use of memory. |
| There are a number of compression techniques that may be used to |
| work around this disadvantage, including hashing the key value to |
| a smaller keyspace before the |
| traversal~\cite{RobertOlsson2006a}. |
| Radix trees are heavily used in practice, including in the Linux |
| kernel~\cite{NickPiggin2006radixtree}. |
| |
| One important special case of both a hash table and a trie is what is |
| perhaps the oldest of data structures, the array and its multi-dimensional |
| counterpart, the matrix. |
| The fully partitionable nature of matrices is exploited heavily in |
| concurrent numerical algorithms. |
| |
| Self-balancing trees are heavily used in sequential code, with |
| AVL trees and red-black trees being perhaps the most well-known |
| examples~\cite{ThomasHCorman2001Algorithms}. |
| Early attempts to parallelize AVL trees were complex and not necessarily |
| all that efficient~\cite{Ellis80}, |
| however, more recent work on red-black trees provides better |
| performance and scalability by using RCU for readers and hashed arrays |
| of locks\footnote{ |
| In the guise of swissTM~\cite{AleksandarDragovejic2011STMnotToy}, |
| which is a variant of software transactional memory in which |
| the developer flags non-shared accesses.} |
| to protect reads and updates, |
| respectively~\cite{PhilHoward2011RCUTMRBTree,PhilipWHoward2013RCUrbtree}. |
| It turns out that red-black trees rebalance aggressively, which works |
| well for sequential programs, but not necessarily so well for parallel |
| use. |
| Recent work has therefore made use of RCU-protected ``bonsai trees'' |
| that rebalance less aggressively~\cite{AustinClements2012RCULinux:mmapsem}, |
| trading off optimal tree depth to gain more efficient concurrent updates. |
| |
| Concurrent skip lists lend themselves well to RCU readers, and in fact |
| represents an early academic use of a technique resembling |
| RCU~\cite{Pugh90}. |
| |
| Concurrent double-ended queues were discussed in |
| Section~\ref{sec:SMPdesign:Double-Ended Queue}, |
| and concurrent stacks and queues have a long history~\cite{Treiber86}, |
| though not normally the most impressive performance or scalability. |
| They are nevertheless a common feature of concurrent |
| libraries~\cite{PaulMcKenney2013LWNURCUqueuestack}. |
| Researchers have recently proposed relaxing the ordering constraints |
| of stacks and queues~\cite{Shavit:2011:DSM:1897852.1897873}, |
| with some work indicating that relaxed-ordered queues actually have |
| better ordering properties than do strict FIFO |
| queues~\cite{AndreasHaas2012FIFOisnt,ChristophMKirsch2012FIFOisntTR,AndreasHaas2013CFRelaxedQueues}. |
| |
| It seems likely that continued work with concurrent data structures will |
| produce novel algorithms with surprising properties. |
| |
| \section{Micro-Optimization} |
| \label{sec:datastruct:Micro-Optimization} |
| |
| The data structures shown in this section were coded straightforwardly, |
| with no adaptation to the underlying system's cache hierarchy. |
| In addition, many of the implementations used pointers to functions |
| for key-to-hash conversions and other frequent operations. |
| Although this approach provides simplicity and portability, in many |
| cases it does give up some performance. |
| |
| The following sections touch on specialization, memory conservation, |
| and hardware considerations. |
| Please do not mistakes these short sections for a definitive treatise |
| on this subject. |
| Whole books have been written on optimizing to a specific CPU, let |
| alone to the set of CPU families in common use today. |
| |
| \subsection{Specialization} |
| \label{sec:datastruct:Specialization} |
| |
| The resizable hash table presented in |
| Section~\ref{sec:datastruct:Non-Partitionable Data Structures} |
| used an opaque type for the key. |
| This allows great flexibility, permitting any sort of key to be |
| used, but it also incurs significant overhead due to the calls via |
| of pointers to functions. |
| Now, modern hardware uses sophisticated branch-prediction techniques |
| to minimize this overhead, but on the other hand, real-world software |
| is often larger than can be accommodated even by today's large |
| hardware branch-prediction tables. |
| This is especially the case for calls via pointers, in which case |
| the branch prediction hardware must record a pointer in addition |
| to branch-taken/branch-not-taken information. |
| |
| This overhead can be eliminated by specializing a hash-table implementation |
| to a given key type and hash function. |
| Doing so eliminates the \co{->ht_cmp()}, \co{->ht_gethash()}, and |
| \co{->ht_getkey()} function pointers in the \co{ht} structure shown in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures} on |
| page~\pageref{fig:datastruct:Resizable Hash-Table Data Structures}. |
| It also eliminates the corresponding calls through these pointers, |
| which could allow the compiler to inline the resulting fixed functions, |
| eliminating not only the overhead of the call instruction, but the |
| argument marshalling as well. |
| |
| In addition, the resizable hash table is designed to fit an API |
| that segregates bucket selection from concurrency control. |
| Although this allows a single torture test to exercise all the hash-table |
| implementations in this chapter, it also means that many operations |
| must compute the hash and interact with possible resize operations twice |
| rather than just once. |
| In a performance-conscious environment, the \co{hashtab_lock_mod()} |
| function would also return a reference to the bucket selected, eliminating |
| the subsequent call to \co{ht_get_bucket()}. |
| |
| \QuickQuiz{} |
| Couldn't the \co{hashtorture.h} code be modified to accommodate |
| a version of \co{hashtab_lock_mod()} that subsumes the |
| \co{ht_get_bucket()} functionality? |
| \QuickQuizAnswer{ |
| It probably could, and doing so would benefit all of the |
| per-bucket-locked hash tables presented in this chapter. |
| Making this modification is left as an exercise for the |
| reader. |
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| How much do these specializations really save? |
| Are they really worth it? |
| \QuickQuizAnswer{ |
| The answer to the first question is left as an exercise to |
| the reader. |
| Try specializing the resizable hash table and see how much |
| performance improvement results. |
| The second question cannot be answered in general, but must |
| instead be answered with respect to a specific use case. |
| Some use cases are extremely sensitive to performance and |
| scalability, while others are less so. |
| } \QuickQuizEnd |
| |
| All that aside, one of the great benefits of modern hardware compared |
| to that available when I first started learning to program back in |
| the early 1970s is that much less specialization is required. |
| This allows much greater productivity than was possible back in the |
| days of four-kilobyte address spaces. |
| |
| \subsection{Bits and Bytes} |
| \label{sec:datastruct:Bits and Bytes} |
| |
| The hash tables discussed in this chapter made almost no attempt to conserve |
| memory. |
| For example, the \co{->ht_idx} field in the \co{ht} structure in |
| Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures} on |
| page~\pageref{fig:datastruct:Resizable Hash-Table Data Structures} |
| always has a value of either zero or one, yet takes up a full 32 bits |
| of memory. |
| It could be eliminated, for example, by stealing a bit from the |
| \co{->ht_resize_key} field. |
| This works because the \co{->ht_resize_key} field is large enough to |
| address every byte of memory and the \co{ht_bucket} structure |
| is more than one byte long, so that |
| the \co{->ht_resize_key} field must have several bits to spare. |
| |
| This sort of bit-packing trick is frequently used in data structures |
| that are highly replicated, as is the \co{page} structure in the Linux |
| kernel. |
| However, the resizable hash table's \co{ht} structure is not all that |
| highly replicated. |
| It is instead the \co{ht_bucket} structures we should focus on. |
| There are two major opportunities for shrinking the \co{ht_bucket} structure: |
| (1)~Placing the \co{->htb_lock} field in a low-order bit of one of the |
| \co{->htb_head} pointers and (2)~Reducing the number of pointers required. |
| |
| The first opportunity might make use of bit-spinlocks in the Linux |
| kernel, which are provided by the \co{include/linux/bit_spinlock.h} |
| header file. |
| These are used in space-critical data structures in the Linux kernel, |
| but are not without their disadvantages: |
| |
| \begin{enumerate} |
| \item They are significantly slower than the traditional spinlock |
| primitives. |
| \item They cannot participate in the lockdep deadlock detection |
| tooling in the Linux kernel~\cite{JonathanCorbet2006lockdep}. |
| \item They do not record lock ownership, further complicating |
| debugging. |
| \item They do not participate in priority boosting in -rt kernels, |
| which means that preemption must be disabled when holding |
| bit spinlocks, which can degrade real-time latency. |
| \end{enumerate} |
| |
| Despite these disadvantages, bit-spinlocks are extremely useful when |
| memory is at a premium. |
| |
| One aspect of the second opportunity was covered in |
| Section~\ref{sec:datastruct:Other Resizable Hash Tables}, |
| which presented resizable hash tables that require only one |
| set of bucket-list pointers in place of the pair of sets required |
| by the resizable hash table presented in |
| Section~\ref{sec:datastruct:Non-Partitionable Data Structures}. |
| Another approach would be to use singly linked bucket lists in |
| place of the doubly linked lists used in this chapter. |
| One downside of this approach is that deletion would then require |
| additional overhead, either by marking the outgoing pointer |
| for later removal |
| or by searching the bucket list for the element being deleted. |
| |
| In short, there is a tradeoff between minimal memory overhead on |
| the one hand, and performance and simplicity on the other. |
| Fortunately, the relatively large memories available on modern |
| systems have allowed us to prioritize performance and simplicity |
| over memory overhead. |
| However, even with today's large-memory systems\footnote{ |
| Smartphones with gigabytes of memory, anyone?} |
| it is sometime necessary to take extreme measures to reduce |
| memory overhead. |
| |
| \subsection{Hardware Considerations} |
| \label{sec:datastruct:Hardware Considerations} |
| |
| Modern computers typically move data between CPUs and main memory in |
| fixed-sized blocks that range in size from 32 bytes to 256 bytes. |
| These blocks are called \emph{cache lines}, and are extremely important |
| to high performance and scalability, as was discussed in |
| Section~\ref{sec:cpu:Overheads}. |
| One timeworn way to kill both performance and scalability is to |
| place incompatible variables into the same cacheline. |
| For example, suppose that a resizable hash table data element had |
| the \co{ht_elem} structure in the same cacheline as a counter that |
| was incremented quite frequently. |
| The frequent incrementing would cause the cacheline to be present at |
| the CPU doing the incrementing, but nowhere else. |
| If other CPUs attempted to traverse the hash bucket list containing |
| that element, they would incur expensive cache misses, degrading both |
| performance and scalability. |
| |
| \begin{figure}[tb] |
| { \scriptsize |
| \begin{verbatim} |
| struct hash_elem { |
| struct ht_elem e; |
| long __attribute__ ((aligned(64))) counter; |
| }; |
| \end{verbatim} |
| } |
| \caption{Alignment for 64-Byte Cache Lines} |
| \label{fig:datastruct:Alignment for 64-Byte Cache Lines} |
| \end{figure} |
| |
| One way to solve this problem on systems with 64-byte cache line is shown in |
| Figure~\ref{fig:datastruct:Alignment for 64-Byte Cache Lines}. |
| Here a gcc \co{aligned} attribute is used to force the \co{->counter} |
| and the \co{ht_elem} structure into separate cache lines. |
| This would allow CPUs to traverse the hash bucket list at full speed |
| despite the frequent incrementing. |
| |
| Of course, this raises the question ``How did we know that cache lines |
| are 64 bytes in size?'' |
| On a Linux system, this information may be obtained from the |
| \co{/sys/devices/system/cpu/cpu*/cache/} directories, and it is even |
| possible to make the installation process rebuild the application to |
| accommodate the system's hardware structure. |
| However, this would be more difficult if you wanted your application to |
| also run on non-Linux systems. |
| Furthermore, even if you were content to run only on Linux, such a |
| self-modifying installation poses validation challenges. |
| |
| Fortunately, there are some rules of thumb that work reasonably well in |
| practice, which were gathered into a 1995 |
| paper~\cite{BenjaminGamsa95a}.\footnote{ |
| A number of these rules are paraphrased and expanded on here |
| with permission from Orran Krieger.} |
| The first group of rules involve rearranging structures to accommodate |
| cache geometry: |
| |
| \begin{enumerate} |
| \item Separate read-mostly data from data that is frequently updated. |
| For example, place read-mostly data at the beginning of the |
| structure and frequently updated data at the end. |
| Where possible, place data that is rarely accessed in between. |
| \item If the structure has groups of fields such that each group is |
| updated by an independent code path, separate these groups |
| from each other. |
| Again, it can make sense to place data that is rarely accessed |
| between the groups. |
| In some cases, it might also make sense to place each such group |
| into a separate structure referenced by the original structure. |
| \item Where possible, associate update-mostly data with a CPU, thread, |
| or task. |
| We saw several very effective examples of this rule of thumb |
| in the counter implementations in |
| Chapter~\ref{chp:Counting}. |
| \item In fact, where possible, you should partition your data on |
| a per-CPU, per-thread, or per-task basis, as was discussed |
| in Chapter~\ref{chp:Data Ownership}. |
| \end{enumerate} |
| |
| There has recently been some work towards automated trace-based |
| rearrangement of structure |
| fields~\cite{Golovanevsky:2010:TDL:2174824.2174835}. |
| This work might well ease one of the more painstaking tasks |
| required to get excellent performance and scalability from |
| multithreaded software. |
| |
| An additional set of rules of thumb deal with locks: |
| |
| \begin{enumerate} |
| \item Given a heavily contended lock protecting data that is |
| frequently modified, take one of the following approaches: |
| \begin{enumerate} |
| \item Place the lock in a different cacheline than the data |
| that it protects. |
| \item Use a lock that is adapted for high contention, such |
| as a queued lock. |
| \item Redesign to reduce lock contention. |
| (This approach is best, but can require quite a bit |
| of work.) |
| \end{enumerate} |
| \item Place uncontended locks into the same cache line as the data |
| that they protect. |
| This approach means that the cache miss that brought the |
| lock to the current CPU also brought its data. |
| \item Protect read-mostly data with RCU, or, if RCU cannot be used and |
| the critical sections are of very long duration, reader-writer locks. |
| \end{enumerate} |
| |
| Of course, these are rules of thumb rather than absolute rules. |
| Some experimentation is required to work out which are most applicable |
| to your particular situation. |
| |
| \section{Summary} |
| \label{sec:datastruct:Summary} |
| |
| This chapter has focused primarily on hash tables, including resizable |
| hash tables, which are not fully partitionable. |
| Section~\ref{sec:datastruct:Other Data Structures} gave a quick |
| overview of a few non-hash-table data structures. |
| Nevertheless, this exposition of hash tables is an excellent introduction |
| to the many issues surrounding high-performance scalable data access, |
| including: |
| |
| \begin{enumerate} |
| \item Fully partitioned data structures work well on small systems, |
| for example, single-socket systems. |
| \item Larger systems require locality of reference as well as |
| full partitioning. |
| \item Read-mostly techniques, such as hazard pointers and RCU, |
| provide good locality of reference for read-mostly workloads, |
| and thus provide excellent performance and scalability even |
| on larger systems. |
| \item Read-mostly techniques also work well on some types of |
| non-partitionable data structures, such as resizable hash tables. |
| \item Additional performance and scalability can be obtained by |
| specializing the data structure to a specific workload, |
| for example, by replacing a general key with a 32-bit integer. |
| \item Although requirements for portability and for extreme performance |
| often conflict, there are some data-structure-layout techniques |
| that can strike a good balance between these two sets of |
| requirements. |
| \end{enumerate} |
| |
| That said, performance and scalability is of little use without reliability, |
| so the next chapter covers validation. |