blob: c20c2420bf4729f4aa2765c41ecba3d0174d68b6 [file] [log] [blame]
% datastruct/datastruct.tex
\QuickQuizChapter{chp:Data Structures}{Data Structures}
Efficient access to data is critically important, so that
discussions of algorithms include time complexity of the related
data structures~\cite{ThomasHCorman2001Algorithms}.
However, for parallel programs, measures of time complexity must also
include concurrency effects.
These effects can be overwhelmingly large, as shown in
Chapter~\ref{chp:Hardware and its Habits}, which means that
concurrent data structure designs must focus as much on
concurrency as they do on sequential time complexity.
Section~\ref{sec:datastruct:Motivating Application}
presents a motivating application that will be used to evaluate
the data structures presented in this chapter.
As discussed in Chapter~\ref{cha:Partitioning and Synchronization Design},
an excellent way to achieve high scalability is partitioning.
This points the way to partitionable data structures, a topic taken up by
Section~\ref{sec:datastruct:Partitionable Data Structures}.
Chapter~\ref{chp:Deferred Processing} described how deferring some
actions can greatly improve both performance and scalability.
Section~\ref{sec:defer:Read-Copy Update (RCU)} in particular showed
how to tap the awesome power of procrastination in pursuit of
performance and scalability, a topic taken up by
Section~\ref{sec:datastruct:Read-Mostly Data Structures}.
Not all data structures are partitionable.
Section~\ref{sec:datastruct:Non-Partitionable Data Structures}
looks at a mildly non-partitionable example data structure.
This section shows
how to split it into read-mostly and partitionable portions,
enabling a fast and scalable implementation.
Because this chapter cannot delve into the details of every concurrent
data structure that has ever been used
Section~\ref{sec:datastruct:Other Data Structures}
provides a brief survey of the most common and important ones.
Although the best performance and scalability results design rather
than after-the-fact micro-optimization, it is nevertheless the case
that micro-optimization has an important place in achieving the
absolute best possible performance and scalability.
This topic is therefore taken up in
Section~\ref{sec:datastruct:Micro-Optimization}.
Finally, Section~\ref{sec:datastruct:Summary}
presents a summary of this chapter.
\section{Motivating Application}
\label{sec:datastruct:Motivating Application}
We will use the Schr\"odinger's Zoo application to evaluate
performance~\cite{McKenney:2013:SDS:2483852.2483867}.
Schr\"odinger has a zoo containing a large number of animals, and
he would like to track them using an in-memory database with
each animal in the zoo represented by a data item in this database.
Each animal has a unique name that is used as a key, with a variety
of data tracked for each animal.
Births, captures, and purchases result in insertions, while deaths,
releases, and sales result in deletions.
Because Schr\"odinger's zoo contains a large quantity of short-lived
animals, including mice and insects, the database must be able to
support a high update rate.
Those interested in Schr\"odinger's animals can query them, however,
Schr\"odinger has noted extremely high rates of queries for his cat,
so much so that he suspects that his mice might be using the database
to check up on their nemesis.
This means that Sch\"odinger's application must be able to support a
high rate of queries to a single data element.
Please keep this application in mind as various data structures are presented.
\section{Partitionable Data Structures}
\label{sec:datastruct:Partitionable Data Structures}
There are a huge number of data structures in use today, so much so
that there are multiple textbooks covering them.
This small section focuses
on a single data structure, namely the hash table.
This focused approach allows a much deeper investigation of how concurrency
interacts with data structures, and also focuses on a data structure
that is heavily used in practice.
Section~\ref{sec:datastruct:Hash-Table Design}
overviews of the design, and
Section~\ref{sec:datastruct:Hash-Table Implementation}
presents the implementation.
Finally,
Section~\ref{sec:datastruct:Hash-Table Performance}
discusses the resulting performance and scalability.
\subsection{Hash-Table Design}
\label{sec:datastruct:Hash-Table Design}
Chapter~\ref{cha:Partitioning and Synchronization Design}
emphasized the need to apply partitioning in order to attain
respectable performance and scalability, so partitionability
must be a first-class criterion when selecting data structures.
This criterion is well satisfied by that workhorse of parallelism,
the hash table.
Hash tables are conceptually simple, consisting of an array of
\emph{hash buckets}.
A \emph{hash function} maps from a given element's \emph{key}
to the hash bucket that this element will be stored in.
Each hash bucket therefore heads up a linked list of elements,
called a \emph{hash chain}.
When properly configured, these hash chains will be quite short,
permitting a hash table to access the element with a given key
extremely efficiently.
\QuickQuiz{}
But there are many types of hash tables, of which the chained
hash tables described here are but one type.
Why the focus on chained hash tables?
\QuickQuizAnswer{
Chained hash tables are completely partitionable, and thus
well-suited to concurrent use.
There are other completely-partitionable hash tables, for
example, split-ordered list~\cite{OriShalev2006SplitOrderListHash},
but they are considerably more complex.
We therefore start with chained hash tables.
} \QuickQuizEnd
In addition, each bucket can be given its own lock, so that
elements in different buckets of the hash table may be added,
deleted, and looked up completely independently.
A large hash table containing a large number of elements therefore
offers excellent scalability.
\subsection{Hash-Table Implementation}
\label{sec:datastruct:Hash-Table Implementation}
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct ht_elem {
2 struct cds_list_head hte_next;
3 unsigned long hte_hash;
4 };
5
6 struct ht_bucket {
7 struct cds_list_head htb_head;
8 spinlock_t htb_lock;
9 };
10
11 struct hashtab {
12 unsigned long ht_nbuckets;
13 struct ht_bucket ht_bkt[0];
14 };
\end{verbatim}
}
\caption{Hash-Table Data Structures}
\label{fig:datastruct:Hash-Table Data Structures}
\end{figure}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/hashdiagram}}
\end{center}
\caption{Hash-Table Data-Structure Diagram}
\label{fig:datastruct:Hash-Table Data-Structure Diagram}
\end{figure}
Figure~\ref{fig:datastruct:Hash-Table Data Structures}
(\co{hash_bkt.c})
shows a set of data structures used in a simple fixed-sized hash
table using chaining and per-hash-bucket locking, and
Figure~\ref{fig:datastruct:Hash-Table Data-Structure Diagram}
diagrams how they fit together.
The \co{hashtab} structure (lines~11-14 in
Figure~\ref{fig:datastruct:Hash-Table Data Structures})
contains four \co{ht_bucket} structures (lines~6-9 in
Figure~\ref{fig:datastruct:Hash-Table Data Structures}),
with the \co{->bt_nbuckets} field controlling the number of buckets.
Each such bucket contains a list header \co{->htb_head} and
a lock \co{->htb_lock}.
The list headers chain \co{ht_elem} structures
(lines~1-4 in
Figure~\ref{fig:datastruct:Hash-Table Data Structures})
through their
\co{->hte_next} fields, and each \co{ht_elem} structure also caches
the corresponding element's hash value in the \co{->hte_hash} field.
The \co{ht_elem} structure would be included in the larger structure
being placed in the hash table, and this larger structure might contain
a complex key.
The diagram shown in
Figure~\ref{fig:datastruct:Hash-Table Data-Structure Diagram}
has bucket~0 with two elements and bucket~2 with one.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 #define HASH2BKT(htp, h) \
2 (&(htp)->ht_bkt[h % (htp)->ht_nbuckets])
3
4 static void hashtab_lock(struct hashtab *htp,
5 unsigned long hash)
6 {
7 spin_lock(&HASH2BKT(htp, hash)->htb_lock);
8 }
9
10 static void hashtab_unlock(struct hashtab *htp,
11 unsigned long hash)
12 {
13 spin_unlock(&HASH2BKT(htp, hash)->htb_lock);
14 }
\end{verbatim}
}
\caption{Hash-Table Mapping and Locking}
\label{fig:datastruct:Hash-Table Mapping and Locking}
\end{figure}
Figure~\ref{fig:datastruct:Hash-Table Mapping and Locking}
shows mapping and locking functions.
Lines~1 and~2 show the macro \co{HASH2BKT()}, which maps from a hash value
to the corresponding \co{ht_bucket} structure.
This macro uses a simple modulus: if more aggressive hashing is required,
the caller needs to implement it when mapping from key to hash value.
The remaining two functions acquire and release the \co{->htb_lock}
corresponding to the specified hash value.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp,
3 unsigned long hash,
4 void *key,
5 int (*cmp)(struct ht_elem *htep,
6 void *key))
7 {
8 struct ht_bucket *htb;
9 struct ht_elem *htep;
10
11 htb = HASH2BKT(htp, hash);
12 cds_list_for_each_entry(htep,
13 &htb->htb_head,
14 hte_next) {
15 if (htep->hte_hash != hash)
16 continue;
17 if (cmp(htep, key))
18 return htep;
19 }
20 return NULL;
21 }
\end{verbatim}
}
\caption{Hash-Table Lookup}
\label{fig:datastruct:Hash-Table Lookup}
\end{figure}
Figure~\ref{fig:datastruct:Hash-Table Lookup}
shows \co{hashtab_lookup()},
which returns a pointer to the element with the specified hash and key if it
exists, or \co{NULL} otherwise.
This function takes both a hash value and a pointer to the key because
this allows users of this function to use arbitrary keys and
arbitrary hash functions, with the key-comparison function passed in via
\co{cmp()}, in a manner similar to \co{qsort()}.
Line~11 maps from the hash value to a pointer to the corresponding
hash bucket.
Each pass through the loop spanning line~12-19 examines one element
of the bucket's hash chain.
Line~15 checks to see if the hash values match, and if not, line~16
proceeds to the next element.
Line~17 checks to see if the actual key matches, and if so,
line~18 returns a pointer to the matching element.
If no element matches, line~20 returns \co{NULL}.
\QuickQuiz{}
But isn't the double comparison on lines~15-18 in
Figure~\ref{fig:datastruct:Hash-Table Lookup} inefficient
in the case where the key fits into an unsigned long?
\QuickQuizAnswer{
Indeed it is!
However, hash tables quite frequently store information with
keys such as character strings that do not necessarily fit
into an unsigned long.
Simplifying the hash-table implementation for the case where
keys always fit into unsigned longs is left as an exercise
for the reader.
} \QuickQuizEnd
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 void
2 hashtab_add(struct hashtab *htp,
3 unsigned long hash,
4 struct ht_elem *htep)
5 {
6 htep->hte_hash = hash;
7 cds_list_add(&htep->hte_next,
8 &HASH2BKT(htp, hash)->htb_head);
9 }
10
11 void hashtab_del(struct ht_elem *htep)
12 {
13 cds_list_del_init(&htep->hte_next);
14 }
\end{verbatim}
}
\caption{Hash-Table Modification}
\label{fig:datastruct:Hash-Table Modification}
\end{figure}
Figure~\ref{fig:datastruct:Hash-Table Modification}
shows the \co{hashtab_add()} and \co{hashtab_del()} functions
that add and delete elements from the hash table, respectively.
The \co{hashtab_add()} function simply sets the element's hash
value on line~6, then adds it to the corresponding bucket on
lines~7 and~8.
The \co{hashtab_del()} function simply removes the specified element
from whatever hash chain it is on, courtesy of the doubly linked
nature of the hash-chain lists.
Before calling either of these two functions, the caller is required to
ensure that no other thread is accessing
or modifying this same bucket, for example, by invoking
\co{hashtab_lock()} beforehand.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct hashtab *
2 hashtab_alloc(unsigned long nbuckets)
3 {
4 struct hashtab *htp;
5 int i;
6
7 htp = malloc(sizeof(*htp) +
8 nbuckets *
9 sizeof(struct ht_bucket));
10 if (htp == NULL)
11 return NULL;
12 htp->ht_nbuckets = nbuckets;
13 for (i = 0; i < nbuckets; i++) {
14 CDS_INIT_LIST_HEAD(&htp->ht_bkt[i].htb_head);
15 spin_lock_init(&htp->ht_bkt[i].htb_lock);
16 }
17 return htp;
18 }
19
20 void hashtab_free(struct hashtab *htp)
21 {
22 free(htp);
23 }
\end{verbatim}
}
\caption{Hash-Table Allocation and Free}
\label{fig:datastruct:Hash-Table Allocation and Free}
\end{figure}
Figure~\ref{fig:datastruct:Hash-Table Allocation and Free}
shows \co{hashtab_alloc()} and \co{hashtab_free()},
which do hash-table allocation and freeing, respectively.
Allocation begins on lines~7-9 with allocation of the underlying memory.
If line~10 detects that memory has been exhausted, line~11 returns
\co{NULL} to the caller.
Otherwise, line~12 initializes the number of buckets, and the loop
spanning lines~13-16 initializes the buckets themselves,
including the chain list header on line~14 and the lock on line~15.
Finally, line~17 returns a pointer to the newly allocated hash table.
The \co{hashtab_free()} function on lines~20-23 is straightforward.
\subsection{Hash-Table Performance}
\label{sec:datastruct:Hash-Table Performance}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktlin8}}
\end{center}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}
\end{figure}
The performance results for an eight-CPU 2GHz
Intel\textsuperscript\textregistered
Xeon\textsuperscript\textregistered
system using a bucket-locked hash table with 1024 buckets are shown in
Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}.
The performance does scale nearly linearly, but is not much more than half
of the ideal performance level, even at only eight CPUs.
Part of this shortfall is due to the fact that the lock acquisitions and
releases incur no cache misses on a single CPU, but do incur misses
on two or more CPUs.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktlin}}
\end{center}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, 60 CPUs}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, 60 CPUs}
\end{figure}
And things only get worse with larger number of CPUs, as can be seen in
Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, 60 CPUs}.
We do not need an additional line to show ideal performance: The performance
for nine CPUs and beyond is worse than abysmal.
This clearly underscores the dangers of extrapolating performance from a
modest number of CPUs.
Of course, one possible reason for the collapse in performance might be
that more hash buckets are needed.
After all, we did not pad each hash bucket to a full cache line, so
there are a number of hash buckets per cache line.
It is possible that the resulting cache-thrashing comes into play at
nine CPUs.
This is of course easy to test by increasing the number of hash buckets.
\QuickQuiz{}
Instead of simply increasing the number of hash buckets,
wouldn't it be better to cache-align the existing hash buckets?
\QuickQuizAnswer{
The answer depends on a great many things.
If the hash table has a large number of elements per bucket, it
would clearly be better to increase the number of hash buckets.
On the other hand, if the hash table is lightly loaded,
the answer depends on the hardware, the effectiveness of the
hash function, and the workload.
Interested readers are encouraged to experiment.
} \QuickQuizEnd
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocpubktsizelin}}
\end{center}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, Varying Buckets}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, Varying Buckets}
\end{figure}
However, as can be seen in
Figure~\ref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo, Varying Buckets},
although increasing the number of buckets does increase performance somewhat,
scalability is still abysmal.
In particular, we still see a sharp dropoff at nine CPUs and beyond.
Furthermore, going from 8192 buckets to 16,384 buckets produced almost
no increase in performance.
Clearly something else is going on.
The problem is that this is a multi-socket system, with CPUs~0-7
and~32-39 mapped to the first socket as shown in
Table~\ref{tab:datastruct:NUMA Topology of System Under Test}.
Test runs confined to the first eight CPUs therefore perform quite
well, but tests that involve socket~0's CPUs~0-7 as well as
socket~1's CPU~8 incur the overhead of passing data across
socket boundaries.
This can severely degrade performance, as was discussed in
Section~\ref{sec:cpu:Hardware System Architecture}.
In short, large multi-socket systems require good locality of reference
in addition to full partitioning.
\QuickQuiz{}
Given the negative scalability of the Schr\"odinger's
Zoo application across sockets, why not just run multiple
copies of the application, with each copy having a subset
of the animals and confined to run on a single socket?
\QuickQuizAnswer{
You can do just that!
In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster.
This practice is called ``sharding'', and is heavily used in
practice by large web-based
retailers~\cite{DeCandia:2007:DAH:1323293.1294281}.
However, if you are going to shard on a per-socket basis within
a multisocket system, why not buy separate smaller and cheaper
single-socket systems, and then run one shard of the database
on each of those systems?
} \QuickQuizEnd
One key property of the Schr\"odinger's-zoo runs discussed thus far is that
they are all read-only.
This makes the performance degradation due to lock-acquisition-induced
cache misses all the more painful.
Even though we are not updating the underlying hash table itself, we are
still paying the price for writing to memory.
Of course, if the hash table was never going to be updated, we could dispense
entirely with mutual exclusion.
This approach is quite straightforward and is left as an exercise for the
reader.
But even with the occasional update, avoiding writes avoids cache
misses, and allows the read-mostly data to be replicated across all
the caches, which in turn promotes locality of reference.
The next section therefore examines optimizations that can be carried out in
read-mostly cases where updates are rare, but could happen at any time.
\begin{table}
\begin{center}
\begin{tabular}{r|r|r|r|r|r|r|r|r}
Socket & \multicolumn{8}{|c}{Core} \\
\hline
0 & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
& 32 & 33 & 34 & 35 & 36 & 37 & 38 & 39 \\
\hline
1 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
& 40 & 41 & 42 & 43 & 44 & 45 & 46 & 47 \\
\hline
2 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 \\
& 48 & 49 & 50 & 51 & 52 & 53 & 54 & 55 \\
\hline
3 & 24 & 25 & 26 & 27 & 28 & 29 & 30 & 31 \\
& 56 & 47 & 58 & 59 & 60 & 61 & 62 & 63 \\
\end{tabular}
\end{center}
\caption{NUMA Topology of System Under Test}
\label{tab:datastruct:NUMA Topology of System Under Test}
\end{table}
\section{Read-Mostly Data Structures}
\label{sec:datastruct:Read-Mostly Data Structures}
Although partitioned data structures can offer excellent scalability,
NUMA effects can result in severe degradations of both performance and
scalability.
In addition,
the need for readers to exclude writers can degrade performance in
read-mostly situations.
However, we can achieve both performance and scalability by using
RCU, which was introduced in
Section~\ref{sec:defer:Read-Copy Update (RCU)}.
Similar results can be achieved using hazard pointers
(\co{hazptr.c})~\cite{MagedMichael04a}, which will be included in
the performance results shown in this
section~\cite{McKenney:2013:SDS:2483852.2483867}.
\subsection{RCU-Protected Hash Table Implementation}
\label{sec:datastruct:RCU-Protected Hash Table Implementation}
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 static void hashtab_lock_lookup(struct hashtab *htp,
2 unsigned long hash)
3 {
4 rcu_read_lock();
5 }
6
7 static void hashtab_unlock_lookup(struct hashtab *htp,
8 unsigned long hash)
9 {
10 rcu_read_unlock();
11 }
\end{verbatim}
}
\caption{RCU-Protected Hash-Table Read-Side Concurrency Control}
\label{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}
\end{figure}
For an RCU-protected hash table with per-bucket locking,
updaters use locking exactly as described in
Section~\ref{sec:datastruct:Partitionable Data Structures},
but readers use RCU.
The data structures remain as shown in
Figure~\ref{fig:datastruct:Hash-Table Data Structures},
and the \co{HASH2BKT()}, \co{hashtab_lock()}, and \co{hashtab_unlock()}
functions remain as shown in
Figure~\ref{fig:datastruct:Hash-Table Mapping and Locking}.
However, readers use the lighter-weight concurrency-control embodied
by \co{hashtab_lock_lookup()} and \co{hashtab_unlock_lookup()}
shown in
Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct ht_elem
2 *hashtab_lookup(struct hashtab *htp,
3 unsigned long hash,
4 void *key,
5 int (*cmp)(struct ht_elem *htep,
6 void *key))
7 {
8 struct ht_bucket *htb;
9 struct ht_elem *htep;
10
11 htb = HASH2BKT(htp, hash);
12 cds_list_for_each_entry_rcu(htep,
13 &htb->htb_head,
14 hte_next) {
15 if (htep->hte_hash != hash)
16 continue;
17 if (cmp(htep, key))
18 return htep;
19 }
20 return NULL;
21 }
\end{verbatim}
}
\caption{RCU-Protected Hash-Table Lookup}
\label{fig:datastruct:RCU-Protected Hash-Table Lookup}
\end{figure}
Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Lookup}
shows \co{hashtab_lookup()} for the RCU-protected per-bucket-locked
hash table.
This is identical to that in
Figure~\ref{fig:datastruct:Hash-Table Lookup}
except that \co{cds_list_for_each_entry()} is replaced
by \co{cds_list_for_each_entry_rcu()}.
Both of these primitives sequence down the hash chain referenced
by \co{htb->htb_head} but \co{cds_list_for_each_entry_rcu()} also
correctly enforces memory ordering in case of concurrent insertion.
This is an important difference between these two hash-table implementations:
Unlike the pure per-bucket-locked implementation, the RCU protected
implementation allows lookups to run concurrently with insertions
and deletions, and RCU-aware primitives like
\co{cds_list_for_each_entry_rcu()} are required to correctly handle
this added concurrency.
Note also that \co{hashtab_lookup()}'s caller must be within an
RCU read-side critical section, for example, the caller must invoke
\co{hashtab_lock_lookup()} before invoking \co{hashtab_lookup()}
(and of course invoke \co{hashtab_unlock_lookup()} some time afterwards).
\QuickQuiz{}
But if elements in a hash table can be deleted concurrently
with lookups, doesn't that mean that a lookup could return
a reference to a data element that was deleted immediately
after it was looked up?
\QuickQuizAnswer{
Yes it can!
This is why \co{hashtab_lookup()} must be invoked within an
RCU read-side critical section, and it is why
\co{hashtab_add()} and \co{hashtab_del()} must also use
RCU-aware list-manipulation primitives.
Finally, this is why the caller of \co{hashtab_del()} must
wait for a grace period (e.g., by calling \co{synchronize_rcu()})
before freeing the deleted element.
} \QuickQuizEnd
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 void
2 hashtab_add(struct hashtab *htp,
3 unsigned long hash,
4 struct ht_elem *htep)
5 {
6 htep->hte_hash = hash;
7 cds_list_add_rcu(&htep->hte_next,
8 &HASH2BKT(htp, hash)->htb_head);
9 }
10
11 void hashtab_del(struct ht_elem *htep)
12 {
13 cds_list_del_rcu(&htep->hte_next);
14 }
\end{verbatim}
}
\caption{RCU-Protected Hash-Table Modification}
\label{fig:datastruct:RCU-Protected Hash-Table Modification}
\end{figure}
Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Modification}
shows \co{hashtab_add()} and \co{hashtab_del()}, both of which
are quite similar to their counterparts in the non-RCU hash table
shown in
Figure~\ref{fig:datastruct:Hash-Table Modification}.
The \co{hashtab_add()} function uses \co{cds_list_add_rcu()} instead
of \co{cds_list_add()} in order to ensure proper ordering when
an element is added to the hash table at the same time that it is
being looked up.
The \co{hashtab_del()} function uses \co{cds_list_del_rcu()} instead
of \co{cds_list_del_init()} to allow for the case where an element is
looked up just before it is deleted.
Unlike \co{cds_list_del_init()}, \co{cds_list_del_rcu()} leaves the
forward pointer intact, so that \co{hashtab_lookup()} can traverse
to the newly deleted element's successor.
Of course, after invoking \co{hashtab_del()}, the caller must wait for
an RCU grace period (e.g., by invoking \co{synchronize_rcu()}) before
freeing or otherwise reusing the memory for the newly deleted element.
\subsection{RCU-Protected Hash Table Performance}
\label{sec:datastruct:RCU-Protected Hash Table Performance}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocpu}}
\end{center}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
\end{figure}
Figure~\ref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
shows the read-only performance of RCU-protected and hazard-pointer-protected
hash tables against the previous section's per-bucket-locked implementation.
As you can see, both RCU and hazard pointers achieve near-ideal performance
and scalability despite the larger numbers of threads and the NUMA effects.
Results from a globally locked implementation are also shown, and as expected
the results are even worse than those of the per-bucket-locked implementation.
RCU does slightly better than hazard pointers, but the difference is not
readily visible in this log-scale plot.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocpulin}}
\end{center}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo, Linear Scale}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo, Linear Scale}
\end{figure}
Figure~\ref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo, Linear Scale}
shows the same data on a linear scale.
This drops the global-locking trace into the x-axis, but allows the
relative performance of RCU and hazard pointers to be more readily
discerned.
Both show a change in slope at 32 CPUs, and this is due to hardware
multithreading.
At 32 and fewer CPUs, each thread has a core to itself.
In this regime, RCU does better than does hazard pointers because
hazard pointers's read-side memory barriers result in dead time within
the core.
In short, RCU is better able to utilize a core from a single hardware
thread than is hazard pointers.
This situation changes above 32 CPUs.
Because RCU is using more than half of each core's resources from a
single hardware thread, RCU gains relatively litte benefit from the
second hardware thread in each core.
The slope of hazard pointers's trace also decreases at 32 CPUs, but
less dramatically,
because the second hardware thread is able to fill in the time
that the first hardware thread is stalled due to memory-barrier latency.
As we will see in later sections, hazard pointers's second-hardware-thread
advantage depends on the workload.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zoocatonly}}
\end{center}
\caption{Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs}
\label{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
\end{figure}
As noted earlier, Schr\"odinger is surprised by the popularity of his
cat~\cite{ErwinSchroedinger1935Cat}, but recognizes the need to reflect
this popularity in his design.
Figure~\ref{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
shows the results of 60-CPU runs, varying the number of CPUs that are
doing nothing but looking up the cat.
Both RCU and hazard pointers respond well to this challenge, but
bucket locking scales negatively, eventually performing even worse
than global locking.
This should not be a surprise because if all CPUs are doing nothing
but looking up the cat, the lock corresponding to the cat's bucket
is for all intents and purposes a global lock.
This cat-only benchmark illustrates one potential problem with
fully partitioned sharding approaches.
Only the CPUs associated with the cat's
partition is able to access the cat, limiting the cat-only
throughput.
Of course, a great many applications have good load-spreading
properties, and for these applications sharding works
quite well.
However, sharding does not handle ``hot spots'' very well, with
the hot spot exemplified by Schr\"odinger's cat being but one case
in point.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zooupdatelu}}
\end{center}
\caption{Read-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs}
\label{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
\end{figure}
Of course, if we were only ever going to read the data, we would not need
any concurrency control to begin with.
Figure~\ref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
therefore shows the effect of updates.
At the extreme left-hand side of this graph, all 60 CPUs are doing lookups,
while to the right all 60 CPUs are doing updates.
For all four implementations, the number of lookups per millisecond
decreases as the number of updating CPUs increases, of course reaching
zero lookups per millisecond when all 60 CPUs are updating.
RCU does well relative to hazard pointers due to the fact that hazard
pointers's read-side memory barriers incur greater overhead in the
presence of updates.
It therefore seems likely that modern hardware heavily optimizes memory-barrier
execution, greatly reducing memory-barrier overhead in the read-only case.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zooupdate}}
\end{center}
\caption{Update-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 60 CPUs}
\label{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
\end{figure}
Where
Figure~\ref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
showed the effect of increasing update rates on lookups,
Figure~\ref{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 60 CPUs}
shows the effect of increasing update rates on the updates themselves.
Hazard pointers and RCU start off with a significant advantage because,
unlike bucket locking, readers do not exclude updaters.
However, as the number of updating CPUs increases, update-side overhead
starts to make its presence known, first for RCU and then for hazard
pointers.
Of course, all three of these implementations fare much better than does
global locking.
Of course, it is quite possible that the differences in lookup performance
is affected by the differences in update rates.
One way to check this is to artificially throttle the update rates of
per-bucket locking and hazard pointers to match that of RCU.
Doing so does not significantly improve the lookup performace of
per-bucket locking, nor does it close the gap between hazard pointers
and RCU.
However, removing hazard pointers's read-side memory barriers
(thus resulting in an unsafe implementation of hazard pointers)
does nearly close the gap between hazard pointers and RCU.
Although this unsafe hazard-pointer implementation will
usually be reliable enough for benchmarking purposes, it is absolutely
not recommended for production use.
\QuickQuiz{}
The dangers of extrapolating from eight CPUs to 60 CPUs was
made quite clear in
Section~\ref{sec:datastruct:Hash-Table Performance}.
But why should extrapolating up from 60 CPUs be any safer?
\QuickQuizAnswer{
It isn't any safer, and a useful exercise would be to run these
programs on larger systems.
That said, other testing has shown that RCU read-side primitives
offer consistent performance and scalability up to at least 1024 CPUs.
} \QuickQuizEnd
\subsection{RCU-Protected Hash Table Discussion}
\label{sec:datastruct:RCU-Protected Hash Table Discussion}
One consequence of the RCU and hazard-pointer implementations is
that a pair of concurrent readers might disagree on the state of
the cat.
For example, one of the readers might have fetched the pointer to
the cat's data structure just before it was removed, while another
reader might have fetched this same pointer just afterwards.
The first reader would then believe that the cat was alive, while
the second reader would believe that the cat was dead.
Of course, this situation is completely fitting for Schr\"odinger's
cat, but it turns out that it is quite reasonable for normal
non-quantum cats as well.
The reason for this is that it is impossible to determine exactly
when an animal is born or dies.
To see this, let's suppose that we detect a cat's death by heartbeat.
This raise the question of exactly how long we should wait after the
last heartbeat before declaring death.
It is clearly ridiculous to wait only one millisecond, because then
a healthy living cat would have to be declared dead---and then
resurrected---more than once every second.
It is equally ridiculous to wait a full month, because by that time
the poor cat's death would have made itself very clearly known
via olfactory means.
\begin{figure}[htb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{cartoons/2013-08-is-it-dead}}
\end{center}
\caption{Even Veterinarians Disagree!}
\ContributedBy{Figure}{fig:datastruct:Even Veterinarians Disagree}{Melissa Broussard}
\end{figure}
Because an animal's heart can stop for some seconds and then start up
again, there is a tradeoff between timely recognition of death and
probability of false alarms.
It is quite possible that a pair of veterinarians might disagree on
the time to wait between the last heartbeat and the declaration of
death.
For example, one veterinarian might declare death thirty seconds after
the last heartbeat, while another might insist on waiting a full
minute.
In this case, the two veterinarians would disagree on the state of the
cat for the second period of thirty seconds following the last heartbeat,
as fancifully depicted in
Figure~\ref{fig:datastruct:Even Veterinarians Disagree}.
Of course, Heisenberg taught us to live with this sort of
uncertainty~\cite{WeinerHeisenberg1927Uncertain}, which is a good
thing because computing hardware and software acts similarly.
For example, how do you know that a piece of computing hardware
has failed?
Often because it does not respond in a timely fashion.
Just like the cat's heartbeat, this results in a window of
uncertainty as to whether or not the hardware has failed.
Furthermore, most computing systems are intended to interact with
the outside world.
Consistency with the outside world is therefore of paramount importance.
However, as we saw in
Figure~\ref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}
on
page~\pageref{fig:defer:Response Time of RCU vs. Reader-Writer Locking},
increased internal consistency can come at the expense of external
consistency.
Techniques such as RCU and hazard pointers give up some degree of
internal consistency to attain improved external consistency.
In short, internal consistency is not a natural part of all problem domains,
and often incurs great expense in terms of performance,
scalability, external consistency, or all of the above.
\section{Non-Partitionable Data Structures}
\label{sec:datastruct:Non-Partitionable Data Structures}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{cartoons/2014_Hash-table-hydra}}
\end{center}
\caption{Partitioning Problems}
\ContributedBy{Figure}{fig:datastruct:Partitioning Problems}{Melissa Broussard}
\end{figure}
Fixed-size hash tables are perfectly partitionable, but resizable hash
tables pose partitioning challenges when growing or shrinking, as
fancifully depicted in
Figure~\ref{fig:datastruct:Partitioning Problems}.
However, it turns out that it is possible to construct high-performance
scalable RCU-protected hash tables, as described in the following sections.
\subsection{Resizable Hash Table Design}
\label{sec:datastruct:Resizable Hash Table Design}
In happy contrast to the situation in the early 2000s, there are now
no fewer than three different types of scalable RCU-protected hash
tables.
The first (and simplest) was developed for the Linux kernel by
Herbert Xu~\cite{HerbertXu2010RCUResizeHash}, and is described in the
following sections.
The other two are covered briefly in
Section~\ref{sec:datastruct:Other Resizable Hash Tables}.
The key insight behind the first hash-table implementation is that
each data element can have two sets of list pointers, with one set
currently being used by RCU readers (as well as by non-RCU updaters)
and the other being used to construct a new resized hash table.
This approach allows lookups, insertions, and deletions to all run
concurrently with a resize operation (as well as with each other).
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-a}}
\end{center}
\caption{Growing a Double-List Hash Table, State (a)}
\label{fig:datastruct:Growing a Double-List Hash Table, State (a)}
\end{figure}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-b}}
\end{center}
\caption{Growing a Double-List Hash Table, State (b)}
\label{fig:datastruct:Growing a Double-List Hash Table, State (b)}
\end{figure}
The resize operation proceeds as shown in
Figures~\ref{fig:datastruct:Growing a Double-List Hash Table, State (a)}-\ref{fig:datastruct:Growing a Double-List Hash Table, State (d)},
with the initial two-bucket state shown in
Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (a)}
and with time advancing from figure to figure.
The initial state uses the zero-index links to chain the elements into
hash buckets.
A four-bucket array is allocated, and the one-index links are used to
chain the elements into these four new hash buckets.
This results in state~(b) shown in
Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (b)},
with readers still using the original two-bucket array.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-c}}
\end{center}
\caption{Growing a Double-List Hash Table, State (c)}
\label{fig:datastruct:Growing a Double-List Hash Table, State (c)}
\end{figure}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-d}}
\end{center}
\caption{Growing a Double-List Hash Table, State (d)}
\label{fig:datastruct:Growing a Double-List Hash Table, State (d)}
\end{figure}
The new four-bucket array is exposed to readers and then a grace-period
operation waits for all readers, resulting in state~(c), shown in
Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (c)}.
In this state, all readers are using the new four-bucket array,
which means that the old two-bucket array may now be freed, resulting
in state~(d), shown in
Figure~\ref{fig:datastruct:Growing a Double-List Hash Table, State (d)}.
This design leads to a relatively straightforward implementation,
which is the subject of the next section.
\subsection{Resizable Hash Table Implementation}
\label{sec:datastruct:Resizable Hash Table Implementation}
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct ht_elem {
2 struct rcu_head rh;
3 struct cds_list_head hte_next[2];
4 unsigned long hte_hash;
5 };
6
7 struct ht_bucket {
8 struct cds_list_head htb_head;
9 spinlock_t htb_lock;
10 };
11
12 struct ht {
13 long ht_nbuckets;
14 long ht_resize_cur;
15 struct ht *ht_new;
16 int ht_idx;
17 void *ht_hash_private;
18 int (*ht_cmp)(void *hash_private,
19 struct ht_elem *htep,
20 void *key);
21 long (*ht_gethash)(void *hash_private,
22 void *key);
23 void *(*ht_getkey)(struct ht_elem *htep);
24 struct ht_bucket ht_bkt[0];
25 };
26
27 struct hashtab {
28 struct ht *ht_cur;
29 spinlock_t ht_lock;
30 };
\end{verbatim}
}
\caption{Resizable Hash-Table Data Structures}
\label{fig:datastruct:Resizable Hash-Table Data Structures}
\end{figure}
Resizing is accomplished by the classic approach of inserting a level
of indirection, in this case, the \co{ht} structure shown on
lines~12-25 of
Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures}.
The \co{hashtab} structure shown on lines~27-30 contains only a
pointer to the current \co{ht} structure along with a spinlock that
is used to serialize concurrent attempts to resize the hash table.
If we were to use a traditional lock- or atomic-operation-based
implementation, this \co{hashtab} structure could become a severe bottleneck
from both performance and scalability viewpoints.
However, because resize operations should be relatively infrequent,
we should be able to make good use of RCU.
The \co{ht} structure represents a specific size of the hash table,
as specified by the \co{->ht_nbuckets} field on line~13.
The size is stored in the same structure containing the array of
buckets (\co{->ht_bkt[]} on line~24) in order to avoid mismatches between
the size and the array.
The \co{->ht_resize_cur} field on line~14 is equal to -1 unless a resize
operation
is in progress, in which case it indicates the index of the bucket whose
elements are being inserted into the new hash table, which is referenced
by the \co{->ht_new} field on line~15.
If there is no resize operation in progress, \co{->ht_new} is \co{NULL}.
Thus, a resize operation proceeds by allocating a new \co{ht} structure
and referencing it via the \co{->ht_new} pointer, then advancing
\co{->ht_resize_cur} through the old table's buckets.
When all the elements have been added to the new table, the new
table is linked into the \co{hashtab} structure's \co{->ht_cur} field.
Once all old readers have completed, the old hash table's \co{ht} structure
may be freed.
The \co{->ht_idx} field on line~16 indicates which of the two sets of
list pointers are being used by this instantiation of the hash table,
and is used to index the \co{->hte_next[]} array in the \co{ht_bucket}
structure on line~3.
The \co{->ht_hash_private}, \co{->ht_cmp()}, \co{->ht_gethash()},
and \co{->ht_getkey()} fields on lines~17-23
collectively define the per-element key and the hash function.
The \co{->ht_hash_private} allows the hash function to be
perturbed~\cite{McKenney89c,McKenney90,McKenney91}, which can be
used to avoid denial-of-service attacks based on statistical
estimation of the parameters used in the hash function.
The \co{->ht_cmp()} function compares a specified key with that of
the specified element,
the \co{->ht_gethash()} calculates the specified key's hash,
and \co{->ht_getkey()} extracts the key from the enclosing data
element.
The \co{ht_bucket} structure is the same as before, and the
\co{ht_elem} structure differs from that of previous implementations
only in providing a two-element array of list pointer sets in place of
the prior single set of list pointers.
In a fixed-sized hash table, bucket selection is quite straightforward:
Simply transform the hash value to the corresponding bucket index.
In contrast, when resizing, it is also necessary to determine which
of the old and new sets of buckets to select from.
If the bucket that would be selected from the old table has already
been distributed into the new table, then the bucket should be selected
from the new table.
Conversely, if the bucket that would be selected from the old table
has not yet been distributed, then the bucket should be selected from
the old table.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 static struct ht_bucket *
2 ht_get_bucket_single(struct ht *htp,
3 void *key, long *b)
4 {
5 *b = htp->ht_gethash(htp->ht_hash_private,
6 key) % htp->ht_nbuckets;
7 return &htp->ht_bkt[*b];
8 }
9
10 static struct ht_bucket *
11 ht_get_bucket(struct ht **htp, void *key,
12 long *b, int *i)
13 {
14 struct ht_bucket *htbp;
15
16 htbp = ht_get_bucket_single(*htp, key, b);
17 if (*b <= (*htp)->ht_resize_cur) {
18 *htp = (*htp)->ht_new;
19 htbp = ht_get_bucket_single(*htp, key, b);
20 }
21 if (i)
22 *i = (*htp)->ht_idx;
23 return htbp;
24 }
\end{verbatim}
}
\caption{Resizable Hash-Table Bucket Selection}
\label{fig:datastruct:Resizable Hash-Table Bucket Selection}
\end{figure}
Bucket selection is shown in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection},
which shows \co{ht_get_bucket_single()} on lines~1-8 and
\co{ht_get_bucket()} on lines~10-24.
The \co{ht_get_bucket_single()} function returns a reference to the bucket
corresponding to the specified key in the specified hash table, without
making any allowances for resizing.
It also stores the hash value corresponding to the key into the location
referenced by parameter \co{b} on lines~5 and ~6.
Line~7 then returns a reference to the corresponding bucket.
The \co{ht_get_bucket()} function handles hash-table selection, invoking
\co{ht_get_bucket_single()} on line~16 to select the bucket
corresponding to the hash in the current
hash table, storing the hash value through parameter \co{b}.
If line~17 determines that the table is being resized and that
line~16's bucket has already been distributed across the new hash
table, then line~18 selects the new hash table and line~19
selects the bucket corresponding to the hash in the new hash table,
again storing the hash value through parameter \co{b}.
\QuickQuiz{}
The code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
computes the hash twice!
Why this blatant inefficiency?
\QuickQuizAnswer{
The reason is that the old and new hash tables might have
completely different hash functions, so that a hash computed
for the old table might be completely irrelevant to the
new table.
} \QuickQuizEnd
If line~21 finds that parameter \co{i} is non-\co{NULL}, then
line~22 stores the pointer-set index for the selected hash table.
Finally, line~23 returns a reference to the selected hash bucket.
\QuickQuiz{}
How does the code in
Figure~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
protect against the resizing process progressing past the
selected bucket?
\QuickQuizAnswer{
It does not provide any such protection.
That is instead the job of the update-side concurrency-control
functions described next.
} \QuickQuizEnd
This implementation of
\co{ht_get_bucket_single()} and \co{ht_get_bucket()}
will permit lookups and modifications to run concurrently
with a resize operation.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 void hashtab_lock_mod(struct hashtab *htp_master,
2 void *key)
3 {
4 long b;
5 struct ht *htp;
6 struct ht_bucket *htbp;
7 struct ht_bucket *htbp_new;
8
9 rcu_read_lock();
10 htp = rcu_dereference(htp_master->ht_cur);
11 htbp = ht_get_bucket_single(htp, key, &b);
12 spin_lock(&htbp->htb_lock);
13 if (b > htp->ht_resize_cur)
14 return;
15 htp = htp->ht_new;
16 htbp_new = ht_get_bucket_single(htp, key, &b);
17 spin_lock(&htbp_new->htb_lock);
18 spin_unlock(&htbp->htb_lock);
19 }
20
21 void hashtab_unlock_mod(struct hashtab *htp_master,
22 void *key)
23 {
24 long b;
25 struct ht *htp;
26 struct ht_bucket *htbp;
27
28 htp = rcu_dereference(htp_master->ht_cur);
29 htbp = ht_get_bucket(&htp, key, &b, NULL);
30 spin_unlock(&htbp->htb_lock);
31 rcu_read_unlock();
32 }
\end{verbatim}
}
\caption{Resizable Hash-Table Update-Side Concurrency Control}
\label{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}
\end{figure}
Read-side concurrency control is provided by RCU as was shown in
Figure~\ref{fig:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control},
but the update-side concurrency-control functions
\co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()}
must now deal with the possibility of a
concurrent resize operation as shown in
Figure~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}.
The \co{hashtab_lock_mod()} spans lines~1-19 in the figure.
Line~9 enters an RCU read-side critical section to prevent
the data structures from being freed during the traversal,
line~10 acquires a reference to the current hash table, and then
line~11 obtains a reference to the bucket in this hash table
corresponding to the key.
Line~12 acquires that bucket's lock, which will prevent any concurrent
resizing operation from distributing that bucket, though of course it
will have no effect if the resizing operation has already distributed
this bucket.
Line~13 then checks to see if a concurrent resize operation has
already distributed this bucket across the new hash table, and if not,
line~14 returns with the selected hash bucket's lock held (and also
within an RCU read-side critical section).
Otherwise, a concurrent resize operation has already distributed this
bucket, so line~15 proceeds to the new hash table and line~16
selects the bucket corresponding to the key.
Finally, line~17 acquires the bucket's lock and line~18 releases the
lock for the old hash table's bucket.
Once again, \co{hashtab_lock_mod()} exits within an RCU read-side critical
section.
\QuickQuiz{}
The code in
Figures~\ref{fig:datastruct:Resizable Hash-Table Bucket Selection}
and~\ref{fig:datastruct:Resizable Hash-Table Update-Side Concurrency Control}
compute the hash and execute the bucket-selection logic twice for
updates!
Why this blatant inefficiency?
\QuickQuizAnswer{
This approach allows the \co{hashtorture.h} testing infrastructure
to be reused.
That said, a production-quality resizable hash table would likely
be optimized to avoid this double computation.
Carrying out this optimization is left as an exercise for the reader.
} \QuickQuizEnd
The \co{hashtab_unlock_mod()} function releases the lock acquired by
\co{hashtab_lock_mod()}.
Line~28 picks up the current hash table, and then line~29 invokes
\co{ht_get_bucket()} in order to gain a reference to the bucket that
corresponds to the key---and of course this bucket might well in a
new hash table.
Line~30 releases the bucket's lock and finally line~31 exits the
RCU read-side critical section.
\QuickQuiz{}
Suppose that one thread is inserting an element into the
new hash table during a resize operation.
What prevents this insertion to be lost due to a subsequent
resize operation completing before the insertion does?
\QuickQuizAnswer{
The second resize operation will not be able to move beyond
the bucket into which the insertion is taking place due to
the insertion holding the lock on one of the hash buckets in
the new hash table (the second hash table of three in this
example).
Furthermore, the insertion operation takes place within an
RCU read-side critical section.
As we will see when we examine the \co{hashtab_resize()}
function, this means that the first resize operation will
use
\co{synchronize_rcu()} to wait for the insertion's read-side
critical section to complete.
} \QuickQuizEnd
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
1 struct ht_elem *
2 hashtab_lookup(struct hashtab *htp_master,
3 void *key)
4 {
5 long b;
6 int i;
7 struct ht *htp;
8 struct ht_elem *htep;
9 struct ht_bucket *htbp;
10
11 htp = rcu_dereference(htp_master->ht_cur);
12 htbp = ht_get_bucket(&htp, key, &b, &i);
13 cds_list_for_each_entry_rcu(htep,
14 &htbp->htb_head,
15 hte_next[i]) {
16 if (htp->ht_cmp(htp->ht_hash_private,
17 htep, key))
18 return htep;
19 }
20 return NULL;
21 }
22
23 void
24 hashtab_add(struct hashtab *htp_master,
25 struct ht_elem *htep)
26 {
27 long b;
28 int i;
29 struct ht *htp;
30 struct ht_bucket *htbp;
31
32 htp = rcu_dereference(htp_master->ht_cur);
33 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep),
34 &b, &i);
35 cds_list_add_rcu(&htep->hte_next[i],
36 &htbp->htb_head);
37 }
38
39 void
40 hashtab_del(struct hashtab *htp_master,
41 struct ht_elem *htep)
42 {
43 long b;
44 int i;
45 struct ht *htp;
46 struct ht_bucket *htbp;
47
48 htp = rcu_dereference(htp_master->ht_cur);
49 htbp = ht_get_bucket(&htp, htp->ht_getkey(htep),
50 &b, &i);
51 cds_list_del_rcu(&htep->hte_next[i]);
52 }
\end{verbatim}
}
\caption{Resizable Hash-Table Access Functions}
\label{fig:datastruct:Resizable Hash-Table Access Functions}
\end{figure}
Now that we have bucket selection and concurrency control in place,
we are ready to search and update our resizable hash table.
The \co{hashtab_lookup()}, \co{hashtab_add()}, and \co{hashtab_del()}
functions shown in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}.
The \co{hashtab_lookup()} function on lines~1-21 of the figure does
hash lookups.
Line~11 fetches the current hash table and line~12 obtains a reference
to the bucket corresponding to the specified key.
This bucket will be located in a new resized hash table when a
resize operation has progressed past the bucket in the old hash
table that contained the desired data element.
Note that line~12 also passes back the index that will be used to
select the correct set of pointers from the pair in each element.
The loop spanning lines~13-19 searches the bucket, so that if line~16
detects a match, line~18 returns a pointer to the enclosing data element.
Otherwise, if there is no match, line~20 returns \co{NULL} to indicate
failure.
\QuickQuiz{}
In the \co{hashtab_lookup()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
the code carefully finds the right bucket in the new hash table
if the element to be looked up has already been distributed
by a concurrent resize operation.
This seems wasteful for RCU-protected lookups.
Why not just stick with the old hash table in this case?
\QuickQuizAnswer{
Suppose that a resize operation begins and distributes half of
the old table's buckets to the new table.
Suppose further that a thread adds a new element that goes into
one of the already-distributed buckets, and that this same thread
now looks up this newly added element.
If lookups unconditionally traversed only the old hash table,
this thread would get a lookup failure for the element that it
just added, which certainly sounds like a bug to me!
} \QuickQuizEnd
The \co{hashtab_add()} function on lines~23-37 of the figure adds
new data elements to the hash table.
Lines~32-34 obtain a pointer to the hash bucket corresponding to
the key (and provide the index), as before, and line~35 adds the new
element to the table.
The caller is required to handle concurrency, for example, by invoking
\co{hashtab_lock_mod()} before the call to \co{hashtab_add()} and invoking
\co{hashtab_unlock_mod()} afterwards.
These two concurrency-control functions will correctly synchronize with
a concurrent resize operation: If the resize operation has already
progressed beyond the bucket that this data element would have been added to,
then the element is added to the new table.
The \co{hashtab_del()} function on lines~39-52 of the figure removes
an existing element from the hash table.
Lines~48-50 provide the bucket and index as before, and line~51 removes
the specified element.
As with \co{hashtab_add()}, the caller is responsible for concurrency
control and this concurrency control suffices for synchronizing with
a concurrent resize operation.
\QuickQuiz{}
The \co{hashtab_del()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
does not always remove the element from the old hash table.
Doesn't this mean that readers might access this newly removed
element after it has been freed?
\QuickQuizAnswer{
No.
The \co{hashtab_del()} function omits removing the element
from the old hash table only if the resize operation has
already progressed beyond the bucket containing the just-deleted
element.
But this means that new \co{hashtab_lookup()} operations will
use the new hash table when looking up that element.
Therefore, only old \co{hashtab_lookup()} operations that started
before the \co{hashtab_del()} might encounter the newly
removed element.
This means that \co{hashtab_del()} need only wait for an
RCU grace period to avoid inconveniencing
\co{hashtab_lookup()} operations.
} \QuickQuizEnd
\begin{figure*}[tb]
{ \scriptsize
\begin{verbatim}
1 int hashtab_resize(struct hashtab *htp_master,
2 unsigned long nbuckets, void *hash_private,
3 int (*cmp)(void *hash_private, struct ht_elem *htep, void *key),
4 long (*gethash)(void *hash_private, void *key),
5 void *(*getkey)(struct ht_elem *htep))
6 {
7 struct ht *htp;
8 struct ht *htp_new;
9 int i;
10 int idx;
11 struct ht_elem *htep;
12 struct ht_bucket *htbp;
13 struct ht_bucket *htbp_new;
14 unsigned long hash;
15 long b;
16
17 if (!spin_trylock(&htp_master->ht_lock))
18 return -EBUSY;
19 htp = htp_master->ht_cur;
20 htp_new = ht_alloc(nbuckets,
21 hash_private ? hash_private : htp->ht_hash_private,
22 cmp ? cmp : htp->ht_cmp,
23 gethash ? gethash : htp->ht_gethash,
24 getkey ? getkey : htp->ht_getkey);
25 if (htp_new == NULL) {
26 spin_unlock(&htp_master->ht_lock);
27 return -ENOMEM;
28 }
29 htp->ht_new = htp_new;
30 synchronize_rcu();
31 idx = htp->ht_idx;
32 htp_new->ht_idx = !idx;
33 for (i = 0; i < htp->ht_nbuckets; i++) {
34 htbp = &htp->ht_bkt[i];
35 spin_lock(&htbp->htb_lock);
36 htp->ht_resize_cur = i;
37 cds_list_for_each_entry(htep, &htbp->htb_head, hte_next[idx]) {
38 htbp_new = ht_get_bucket_single(htp_new, htp_new->ht_getkey(htep), &b);
39 spin_lock(&htbp_new->htb_lock);
40 cds_list_add_rcu(&htep->hte_next[!idx], &htbp_new->htb_head);
41 spin_unlock(&htbp_new->htb_lock);
42 }
43 spin_unlock(&htbp->htb_lock);
44 }
45 rcu_assign_pointer(htp_master->ht_cur, htp_new);
46 synchronize_rcu();
47 spin_unlock(&htp_master->ht_lock);
48 free(htp);
49 return 0;
50 }
\end{verbatim}
}
\caption{Resizable Hash-Table Resizing}
\label{fig:datastruct:Resizable Hash-Table Resizing}
\end{figure*}
The actual resizing itself is carried out by \co{hashtab_resize}, shown in
Figure~\ref{fig:datastruct:Resizable Hash-Table Resizing} on
page~\pageref{fig:datastruct:Resizable Hash-Table Resizing}.
Line~17 conditionally acquires the top-level \co{->ht_lock}, and if
this acquisition fails, line~18 returns \co{-EBUSY} to indicate that
a resize is already in progress.
Otherwise, line~19 picks up a reference to the current hash table,
and lines~21-24 allocate a new hash table of the desired size.
If a new set of hash/key functions have been specified, these are
used for the new table, otherwise those of the old table are preserved.
If line~25 detects memory-allocation failure, line~26 releases \co{->htlock}
and line~27 returns a failure indication.
Line~29 starts the bucket-distribution process by installing a reference
to the new table into the \co{->ht_new} field of the old table.
Line~30 ensures that all readers who are not aware of the new table
complete before the resize operation continues.
Line~31 picks up the current table's index and stores its inverse to
the new hash table, thus ensuring that the two hash tables avoid overwriting
each other's linked lists.
Each pass through the loop spanning lines~33-44 distributes the contents
of one of the old hash table's buckets into the new hash table.
Line~34 picks up a reference to the old table's current bucket,
line~35 acquires that bucket's spinlock, and line~36 updates
\co{->ht_resize_cur} to indicate that this bucket is being distributed.
\QuickQuiz{}
In the \co{hashtab_resize()} function in
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions},
what guarantees that the update to \co{->ht_new} on line~29
will be seen as happening before the update to \co{->ht_resize_cur}
on line~36 from the perspective of \co{hashtab_lookup()},
\co{hashtab_add()}, and \co{hashtab_del()}?
\QuickQuizAnswer{
The \co{synchronize_rcu()} on line~30 of
Figure~\ref{fig:datastruct:Resizable Hash-Table Access Functions}
ensures that all pre-existing RCU readers have completed between
the time that we install the new hash-table reference on
line~29 and the time that we update \co{->ht_resize_cur} on
line~36.
This means that any reader that sees a non-negative value
of \co{->ht_resize_cur} cannot have started before the
assignment to \co{->ht_new}, and thus must be able to see
the reference to the new hash table.
} \QuickQuizEnd
Each pass through the loop spanning lines~37-42 adds one data element
from the current old-table bucket to the corresponding new-table bucket,
holding the new-table bucket's lock during the add operation.
Finally, line~43 releases the old-table bucket lock.
Execution reaches line~45 once all old-table buckets have been distributed
across the new table.
Line~45 installs the newly created table as the current one, and
line~46 waits for all old readers (who might still be referencing
the old table) to complete.
Then line~47 releases the resize-serialization lock, line~48 frees
the old hash table, and finally line~48 returns success.
\subsection{Resizable Hash Table Discussion}
\label{sec:datastruct:Resizable Hash Table Discussion}
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/perftestresize}}
\end{center}
\caption{Overhead of Resizing Hash Tables}
\label{fig:datastruct:Overhead of Resizing Hash Tables}
\end{figure}
Figure~\ref{fig:datastruct:Overhead of Resizing Hash Tables}
compares resizing hash tables to their fixed-sized counterparts
for 2048, 16,384, and 131,072 elements in the hash table.
The figure shows three traces for each element count, one
for a fixed-size 1024-bucket hash table, another for a
fixed-size 2048-bucket hash table, and a third for a resizable
hash table that shifts back and forth between 1024 and 2048
buckets, with a one-millisecond pause between each resize operation.
The uppermost three traces are for the 2048-element hash table.
The upper trace corresponds to the 2048-bucket fixed-size hash
table, the middle trace to the 1024-bucket fixed-size hash table,
and the lower trace to the resizable hash table.
In this case, the short hash chains cause normal lookup overhead
to be so low that the overhead of resizing dominates.
Nevertheless, the larger fixed-size hash table has a significant
performance advantage, so that resizing can be quite beneficial,
at least given sufficient time between resizing operations: One
millisecond is clearly too short a time.
The middle three traces are for the 16,384-element hash table.
Again, the upper trace corresponds to the 2048-bucket fixed-size hash
table, but the middle trace now corresponds to the resizable hash
table and the lower trace to the 1024-bucket fixed-size hash table.
However, the performance difference between the resizable and the
1024-bucket hash table is quite small.
One consequence of the eight-fold increase in number of elements
(and thus also in hash-chain length) is that incessant resizing
is now no worse than maintaining a too-small hash table.
The lower three traces are for the 131,072-element hash table.
The upper trace corresponds to the 2048-bucket fixed-size hash table,
the middle trace to the resizable hash table, and the lower trace
to the 1024-bucket fixed-size hash table.
In this case, longer hash chains result in higher lookup overhead,
so that this lookup overhead dominates that of resizing the hash
table.
However, the performance of all three approaches at the 131,072-element
level is more than an order of magnitude worse than that at the
2048-element level, suggesting that the best strategy would be
a single 64-fold increase in hash-table size.
The key point from this data is that the RCU-protected resizable hash
table performs and scales almost as well as does its fixed-size counterpart.
The performance during an actual resize operation of course suffers
somewhat due to the cache misses causes by the updates to each element's
pointers, and this effect is most pronounced when the hash-tables
bucket lists are short.
This indicates that hash tables should be resized by substantial amounts,
and that hysteresis should be be applied to prevent performance degradation
due to too-frequent resize operations.
In memory-rich environments, hash-table sizes should furthermore
be increased much more aggressively than they are decreased.
Another key point is that although the \co{hashtab} structure is
non-partitionable, it is also read-mostly, which suggests the use
of RCU.
Given that the performance and scalability of this resizable hash table is
very nearly that of RCU-protected fixed-sized hash tables, we must
conclude that this approach was quite successful.
Finally, it is important to note that insertions, deletions, and
lookups can proceed concurrently with a resize operation.
This concurrency is
critically important when resizing large hash tables, especially
for applications that must meet severe response-time constraints.
Of course, the \co{ht_elem} structure's
pair of pointer sets does impose some memory overhead,
which is taken up in the next section.
\subsection{Other Resizable Hash Tables}
\label{sec:datastruct:Other Resizable Hash Tables}
One shortcoming of the resizable hash table described earlier in this
section is memory consumption.
Each data element has two pairs of linked-list pointers rather than just
one.
Is it possible to create an RCU-protected resizable hash table that
makes do with just one pair?
It turns out that the answer is ``yes.''
Josh Triplett et al.~\cite{Triplett:2011:RPHash}
produced a \emph{relativistic hash table} that incrementally
splits and combines corresponding hash chains so that readers always
see valid hash chains at all points during the resizing operation.
This incremental splitting and combining relies on the fact that it
is harmless for a reader to see a data element that should be in some
other hash chain: When this happens, the reader will simply ignore the
extraneous data element due to key mismatches.
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zipperhashshrink}}
\end{center}
\caption{Shrinking a Relativistic Hash Table}
\label{fig:datastruct:Shrinking a Relativistic Hash Table}
\end{figure}
The process of shrinking a relativistic hash table by a factor of two
is shown in
Figure~\ref{fig:datastruct:Shrinking a Relativistic Hash Table},
in this case shrinking a two-bucket hash table into a one-bucket
hash table, otherwise known as a linear list.
This process works by coalescing pairs of buckets in the old larger hash
table into single buckets in the new smaller hash table.
For this process to work correctly, we clearly need to constrain the hash
functions for the two tables.
One such constraint is to use the same underlying hash function for
both tables, but to throw out the low-order bit when shrinking from
large to small.
For example, the old two-bucket hash table would
use the two top bits of the value, while the new one-bucket hash table
could use the top bit of the value.
In this way, a given pair of adjacent even and odd buckets in the old
large hash table can be coalesced into a single bucket in the new small
hash table, while still having a single hash value cover all of the
elements in that single bucket.
The initial state is shown at the top of the figure, with time advancing
from top to bottom, starting with initial state~(a).
The shrinking process begins by allocating the new smaller array of
buckets, and having each bucket of this new smaller array reference
the first element of one of the buckets of the corresponding pair in
the old large hash table, resulting in state~(b).
Then the two hash chains are linked together, resulting in state~(c).
In this state, readers looking up an even-numbered element see no change,
and readers looking up elements~1 and~3 likewise see no change.
However, readers looking up some other odd number will also traverse
elements~0 and~2.
This is harmless because any odd number will compare not-equal to these
two elements.
There is some performance loss, but on the other hand, this is exactly
the same performance loss that will be experienced once the new small
hash table is fully in place.
Next, the new small hash table is made accessible to readers, resulting
in state~(d).
Note that older readers might still be traversing the old large hash
table, so in this state both hash tables are in use.
The next step is to wait for all pre-existing readers to complete,
resulting in state~(e).
In this state, all readers are using the new small hash table, so that
the old large hash table's buckets may be freed, resulting in the final
state~(f).
\begin{figure}[tb]
\begin{center}
\resizebox{3in}{!}{\includegraphics{datastruct/zipperhashgrow}}
\end{center}
\caption{Growing a Relativistic Hash Table}
\label{fig:datastruct:Growing a Relativistic Hash Table}
\end{figure}
Growing a relativistic hash table reverses the shrinking process,
but requires more grace-period steps, as shown in
Figure~\ref{fig:datastruct:Growing a Relativistic Hash Table}.
The initial state~(a) is at the top of this figure, with time advancing
from top to bottom.
We start by allocating the new large two-bucket hash table, resulting
in state~(b).
Note that each of these new buckets references the first element destined
for that bucket.
These new buckets are published to readers, resulting in state~(c).
After a grace-period operation, all readers are using the new large
hash table, resulting in state~(d).
In this state, only those readers traversing the even-values hash bucket
traverse element~0, which is therefore now colored white.
At this point, the old small hash buckets may be freed, although many
implementations use these old buckets to track progress ``unzipping''
the list of items into their respective new buckets.
The last even-numbered element in the first consecutive run of such
elements now has its pointer-to-next updated to reference the following
even-numbered element.
After a subsequent grace-period operation, the result is state~(e).
The vertical arrow indicates the next element to be unzipped, and
element~1 is now colored black to indicate that only those readers traversing
the odd-values hash bucket may reach it.
Next, the last odd-numbered element in the first consecutive run of such
elements now has its pointer-to-next updated to reference the following
odd-numbered element.
After a subsequent grace-period operation, the result is state~(f).
A final unzipping operation (including a grace-period operation)
results in the final state~(g).
In short, the relativistic hash table reduces the number of per-element
list pointers at the expense of additional grace periods incurred during
resizing.
These additional grace periods are usually not a problem because
insertions, deletions, and lookups may proceed concurrently with
a resize operation.
It turns out that it is possible to reduce the per-element memory overhead
from a pair of pointers to a single pointer, while still retaining
$O(1)$ deletions.
This is accomplished by augmenting split-order
list~\cite{OriShalev2006SplitOrderListHash}
with RCU
protection~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUhash}.
The data elements in the hash table are arranged into a single
sorted linked list, with each hash bucket referencing the first element
in that bucket.
Elements are deleted by setting low-order bits in their pointer-to-next
fields, and these elements are removed from the list by later traversals
that encounter them.
This RCU-protected split-order list is complex, but offers lock-free
progress guarantees for all insertion, deletion, and lookup operations.
Such guarantees can be important in real-time applications.
An implementation is available from recent versions of the userspace RCU
library~\cite{MathieuDesnoyers2009URCU}.
\section{Other Data Structures}
\label{sec:datastruct:Other Data Structures}
The preceding sections have focused on data structures that enhance
concurrency due to partitionability
(Section~\ref{sec:datastruct:Partitionable Data Structures}),
efficient handling of read-mostly access patterns
(Section~\ref{sec:datastruct:Read-Mostly Data Structures}),
or application of read-mostly techniques to avoid
non-partitionability
(Section~\ref{sec:datastruct:Non-Partitionable Data Structures}).
This section gives a brief review of other data structures.
One of the hash table's greatest advantages for parallel use is that it
is fully partitionable, at least while not being resized.
One way of preserving the partitionability and the size independence is
to use a radix tree, which is also called a trie.
Tries partition the search key, using each successive key partition
to traverse the next level of the trie.
As such, a trie can be thought of as a set of nested hash tables,
thus providing the required partitionability.
One disadvantage of tries is that a sparse key space can result in
inefficient use of memory.
There are a number of compression techniques that may be used to
work around this disadvantage, including hashing the key value to
a smaller keyspace before the
traversal~\cite{RobertOlsson2006a}.
Radix trees are heavily used in practice, including in the Linux
kernel~\cite{NickPiggin2006radixtree}.
One important special case of both a hash table and a trie is what is
perhaps the oldest of data structures, the array and its multi-dimensional
counterpart, the matrix.
The fully partitionable nature of matrices is exploited heavily in
concurrent numerical algorithms.
Self-balancing trees are heavily used in sequential code, with
AVL trees and red-black trees being perhaps the most well-known
examples~\cite{ThomasHCorman2001Algorithms}.
Early attempts to parallelize AVL trees were complex and not necessarily
all that efficient~\cite{Ellis80},
however, more recent work on red-black trees provides better
performance and scalability by using RCU for readers and hashed arrays
of locks\footnote{
In the guise of swissTM~\cite{AleksandarDragovejic2011STMnotToy},
which is a variant of software transactional memory in which
the developer flags non-shared accesses.}
to protect reads and updates,
respectively~\cite{PhilHoward2011RCUTMRBTree,PhilipWHoward2013RCUrbtree}.
It turns out that red-black trees rebalance aggressively, which works
well for sequential programs, but not necessarily so well for parallel
use.
Recent work has therefore made use of RCU-protected ``bonsai trees''
that rebalance less aggressively~\cite{AustinClements2012RCULinux:mmapsem},
trading off optimal tree depth to gain more efficient concurrent updates.
Concurrent skip lists lend themselves well to RCU readers, and in fact
represents an early academic use of a technique resembling
RCU~\cite{Pugh90}.
Concurrent double-ended queues were discussed in
Section~\ref{sec:SMPdesign:Double-Ended Queue},
and concurrent stacks and queues have a long history~\cite{Treiber86},
though not normally the most impressive performance or scalability.
They are nevertheless a common feature of concurrent
libraries~\cite{PaulMcKenney2013LWNURCUqueuestack}.
Researchers have recently proposed relaxing the ordering constraints
of stacks and queues~\cite{Shavit:2011:DSM:1897852.1897873},
with some work indicating that relaxed-ordered queues actually have
better ordering properties than do strict FIFO
queues~\cite{AndreasHaas2012FIFOisnt,ChristophMKirsch2012FIFOisntTR,AndreasHaas2013CFRelaxedQueues}.
It seems likely that continued work with concurrent data structures will
produce novel algorithms with surprising properties.
\section{Micro-Optimization}
\label{sec:datastruct:Micro-Optimization}
The data structures shown in this section were coded straightforwardly,
with no adaptation to the underlying system's cache hierarchy.
In addition, many of the implementations used pointers to functions
for key-to-hash conversions and other frequent operations.
Although this approach provides simplicity and portability, in many
cases it does give up some performance.
The following sections touch on specialization, memory conservation,
and hardware considerations.
Please do not mistakes these short sections for a definitive treatise
on this subject.
Whole books have been written on optimizing to a specific CPU, let
alone to the set of CPU families in common use today.
\subsection{Specialization}
\label{sec:datastruct:Specialization}
The resizable hash table presented in
Section~\ref{sec:datastruct:Non-Partitionable Data Structures}
used an opaque type for the key.
This allows great flexibility, permitting any sort of key to be
used, but it also incurs significant overhead due to the calls via
of pointers to functions.
Now, modern hardware uses sophisticated branch-prediction techniques
to minimize this overhead, but on the other hand, real-world software
is often larger than can be accommodated even by today's large
hardware branch-prediction tables.
This is especially the case for calls via pointers, in which case
the branch prediction hardware must record a pointer in addition
to branch-taken/branch-not-taken information.
This overhead can be eliminated by specializing a hash-table implementation
to a given key type and hash function.
Doing so eliminates the \co{->ht_cmp()}, \co{->ht_gethash()}, and
\co{->ht_getkey()} function pointers in the \co{ht} structure shown in
Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures} on
page~\pageref{fig:datastruct:Resizable Hash-Table Data Structures}.
It also eliminates the corresponding calls through these pointers,
which could allow the compiler to inline the resulting fixed functions,
eliminating not only the overhead of the call instruction, but the
argument marshalling as well.
In addition, the resizable hash table is designed to fit an API
that segregates bucket selection from concurrency control.
Although this allows a single torture test to exercise all the hash-table
implementations in this chapter, it also means that many operations
must compute the hash and interact with possible resize operations twice
rather than just once.
In a performance-conscious environment, the \co{hashtab_lock_mod()}
function would also return a reference to the bucket selected, eliminating
the subsequent call to \co{ht_get_bucket()}.
\QuickQuiz{}
Couldn't the \co{hashtorture.h} code be modified to accommodate
a version of \co{hashtab_lock_mod()} that subsumes the
\co{ht_get_bucket()} functionality?
\QuickQuizAnswer{
It probably could, and doing so would benefit all of the
per-bucket-locked hash tables presented in this chapter.
Making this modification is left as an exercise for the
reader.
} \QuickQuizEnd
\QuickQuiz{}
How much do these specializations really save?
Are they really worth it?
\QuickQuizAnswer{
The answer to the first question is left as an exercise to
the reader.
Try specializing the resizable hash table and see how much
performance improvement results.
The second question cannot be answered in general, but must
instead be answered with respect to a specific use case.
Some use cases are extremely sensitive to performance and
scalability, while others are less so.
} \QuickQuizEnd
All that aside, one of the great benefits of modern hardware compared
to that available when I first started learning to program back in
the early 1970s is that much less specialization is required.
This allows much greater productivity than was possible back in the
days of four-kilobyte address spaces.
\subsection{Bits and Bytes}
\label{sec:datastruct:Bits and Bytes}
The hash tables discussed in this chapter made almost no attempt to conserve
memory.
For example, the \co{->ht_idx} field in the \co{ht} structure in
Figure~\ref{fig:datastruct:Resizable Hash-Table Data Structures} on
page~\pageref{fig:datastruct:Resizable Hash-Table Data Structures}
always has a value of either zero or one, yet takes up a full 32 bits
of memory.
It could be eliminated, for example, by stealing a bit from the
\co{->ht_resize_key} field.
This works because the \co{->ht_resize_key} field is large enough to
address every byte of memory and the \co{ht_bucket} structure
is more than one byte long, so that
the \co{->ht_resize_key} field must have several bits to spare.
This sort of bit-packing trick is frequently used in data structures
that are highly replicated, as is the \co{page} structure in the Linux
kernel.
However, the resizable hash table's \co{ht} structure is not all that
highly replicated.
It is instead the \co{ht_bucket} structures we should focus on.
There are two major opportunities for shrinking the \co{ht_bucket} structure:
(1)~Placing the \co{->htb_lock} field in a low-order bit of one of the
\co{->htb_head} pointers and (2)~Reducing the number of pointers required.
The first opportunity might make use of bit-spinlocks in the Linux
kernel, which are provided by the \co{include/linux/bit_spinlock.h}
header file.
These are used in space-critical data structures in the Linux kernel,
but are not without their disadvantages:
\begin{enumerate}
\item They are significantly slower than the traditional spinlock
primitives.
\item They cannot participate in the lockdep deadlock detection
tooling in the Linux kernel~\cite{JonathanCorbet2006lockdep}.
\item They do not record lock ownership, further complicating
debugging.
\item They do not participate in priority boosting in -rt kernels,
which means that preemption must be disabled when holding
bit spinlocks, which can degrade real-time latency.
\end{enumerate}
Despite these disadvantages, bit-spinlocks are extremely useful when
memory is at a premium.
One aspect of the second opportunity was covered in
Section~\ref{sec:datastruct:Other Resizable Hash Tables},
which presented resizable hash tables that require only one
set of bucket-list pointers in place of the pair of sets required
by the resizable hash table presented in
Section~\ref{sec:datastruct:Non-Partitionable Data Structures}.
Another approach would be to use singly linked bucket lists in
place of the doubly linked lists used in this chapter.
One downside of this approach is that deletion would then require
additional overhead, either by marking the outgoing pointer
for later removal
or by searching the bucket list for the element being deleted.
In short, there is a tradeoff between minimal memory overhead on
the one hand, and performance and simplicity on the other.
Fortunately, the relatively large memories available on modern
systems have allowed us to prioritize performance and simplicity
over memory overhead.
However, even with today's large-memory systems\footnote{
Smartphones with gigabytes of memory, anyone?}
it is sometime necessary to take extreme measures to reduce
memory overhead.
\subsection{Hardware Considerations}
\label{sec:datastruct:Hardware Considerations}
Modern computers typically move data between CPUs and main memory in
fixed-sized blocks that range in size from 32 bytes to 256 bytes.
These blocks are called \emph{cache lines}, and are extremely important
to high performance and scalability, as was discussed in
Section~\ref{sec:cpu:Overheads}.
One timeworn way to kill both performance and scalability is to
place incompatible variables into the same cacheline.
For example, suppose that a resizable hash table data element had
the \co{ht_elem} structure in the same cacheline as a counter that
was incremented quite frequently.
The frequent incrementing would cause the cacheline to be present at
the CPU doing the incrementing, but nowhere else.
If other CPUs attempted to traverse the hash bucket list containing
that element, they would incur expensive cache misses, degrading both
performance and scalability.
\begin{figure}[tb]
{ \scriptsize
\begin{verbatim}
struct hash_elem {
struct ht_elem e;
long __attribute__ ((aligned(64))) counter;
};
\end{verbatim}
}
\caption{Alignment for 64-Byte Cache Lines}
\label{fig:datastruct:Alignment for 64-Byte Cache Lines}
\end{figure}
One way to solve this problem on systems with 64-byte cache line is shown in
Figure~\ref{fig:datastruct:Alignment for 64-Byte Cache Lines}.
Here a gcc \co{aligned} attribute is used to force the \co{->counter}
and the \co{ht_elem} structure into separate cache lines.
This would allow CPUs to traverse the hash bucket list at full speed
despite the frequent incrementing.
Of course, this raises the question ``How did we know that cache lines
are 64 bytes in size?''
On a Linux system, this information may be obtained from the
\co{/sys/devices/system/cpu/cpu*/cache/} directories, and it is even
possible to make the installation process rebuild the application to
accommodate the system's hardware structure.
However, this would be more difficult if you wanted your application to
also run on non-Linux systems.
Furthermore, even if you were content to run only on Linux, such a
self-modifying installation poses validation challenges.
Fortunately, there are some rules of thumb that work reasonably well in
practice, which were gathered into a 1995
paper~\cite{BenjaminGamsa95a}.\footnote{
A number of these rules are paraphrased and expanded on here
with permission from Orran Krieger.}
The first group of rules involve rearranging structures to accommodate
cache geometry:
\begin{enumerate}
\item Separate read-mostly data from data that is frequently updated.
For example, place read-mostly data at the beginning of the
structure and frequently updated data at the end.
Where possible, place data that is rarely accessed in between.
\item If the structure has groups of fields such that each group is
updated by an independent code path, separate these groups
from each other.
Again, it can make sense to place data that is rarely accessed
between the groups.
In some cases, it might also make sense to place each such group
into a separate structure referenced by the original structure.
\item Where possible, associate update-mostly data with a CPU, thread,
or task.
We saw several very effective examples of this rule of thumb
in the counter implementations in
Chapter~\ref{chp:Counting}.
\item In fact, where possible, you should partition your data on
a per-CPU, per-thread, or per-task basis, as was discussed
in Chapter~\ref{chp:Data Ownership}.
\end{enumerate}
There has recently been some work towards automated trace-based
rearrangement of structure
fields~\cite{Golovanevsky:2010:TDL:2174824.2174835}.
This work might well ease one of the more painstaking tasks
required to get excellent performance and scalability from
multithreaded software.
An additional set of rules of thumb deal with locks:
\begin{enumerate}
\item Given a heavily contended lock protecting data that is
frequently modified, take one of the following approaches:
\begin{enumerate}
\item Place the lock in a different cacheline than the data
that it protects.
\item Use a lock that is adapted for high contention, such
as a queued lock.
\item Redesign to reduce lock contention.
(This approach is best, but can require quite a bit
of work.)
\end{enumerate}
\item Place uncontended locks into the same cache line as the data
that they protect.
This approach means that the cache miss that brought the
lock to the current CPU also brought its data.
\item Protect read-mostly data with RCU, or, if RCU cannot be used and
the critical sections are of very long duration, reader-writer locks.
\end{enumerate}
Of course, these are rules of thumb rather than absolute rules.
Some experimentation is required to work out which are most applicable
to your particular situation.
\section{Summary}
\label{sec:datastruct:Summary}
This chapter has focused primarily on hash tables, including resizable
hash tables, which are not fully partitionable.
Section~\ref{sec:datastruct:Other Data Structures} gave a quick
overview of a few non-hash-table data structures.
Nevertheless, this exposition of hash tables is an excellent introduction
to the many issues surrounding high-performance scalable data access,
including:
\begin{enumerate}
\item Fully partitioned data structures work well on small systems,
for example, single-socket systems.
\item Larger systems require locality of reference as well as
full partitioning.
\item Read-mostly techniques, such as hazard pointers and RCU,
provide good locality of reference for read-mostly workloads,
and thus provide excellent performance and scalability even
on larger systems.
\item Read-mostly techniques also work well on some types of
non-partitionable data structures, such as resizable hash tables.
\item Additional performance and scalability can be obtained by
specializing the data structure to a specific workload,
for example, by replacing a general key with a 32-bit integer.
\item Although requirements for portability and for extreme performance
often conflict, there are some data-structure-layout techniques
that can strike a good balance between these two sets of
requirements.
\end{enumerate}
That said, performance and scalability is of little use without reliability,
so the next chapter covers validation.