% datastruct/datastruct.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\QuickQuizChapter{chp:Data Structures}{Data Structures}{qqzdatastruct}
%
\Epigraph{Bad programmers worry about the code.
Good programmers worry about data structures and their
relationships.}
{Linus Torvalds}
Serious discussions of algorithms include the time complexity of their
data structures~\cite{ThomasHCorman2001Algorithms}.
However, for parallel programs, time complexity must also include
concurrency effects, because these effects can be overwhelmingly large,
as shown in
\cref{chp:Hardware and its Habits}.
In other words, a good programmer's data-structure relationships
include those aspects related to concurrency.
This chapter will expose a number of complications:
\begin{enumerate}
\item Data structures designed in full accordance with the
good advice given in
\cref{chp:Partitioning and Synchronization Design}
can nonetheless abjectly fail to scale on some types of systems.
\label{sec:datastruct:NUMA-induced failure to scale}
\item Data structures designed in full accordance with the
good advice given in both
\cref{chp:Partitioning and Synchronization Design}
and
\cref{chp:Deferred Processing}
can \emph{still} abjectly fail to scale on some types of systems.
\label{sec:datastruct:Cache-size-induced failure to scale}
\item Even \emph{read-only synchronization-free} data-structure traversal
can fail to scale on some types of systems.
\label{sec:datastruct:Cache-size-induced failure to scale redux}
\item Data-structure traversals avoiding the aforementioned complications
can still be impeded by concurrent updates.
\label{sec:datastruct:Update-activity failure to scale}
\end{enumerate}
This chapter will investigate these complications and demonstrate some
ways of unraveling them.
\Cref{sec:datastruct:Motivating Application}
presents the motivating application for this chapter's data structures.
\Cref{chp:Partitioning and Synchronization Design} showed how
partitioning improves scalability, so
\cref{sec:datastruct:Partitionable Data Structures}
discusses partitionable data structures.
\Cref{chp:Deferred Processing} described how deferring some
actions can greatly improve both performance and scalability,
a topic taken up by
\cref{sec:datastruct:Read-Mostly Data Structures}.
\Cref{sec:datastruct:Non-Partitionable Data Structures}
looks at a non-partitionable data structure, splitting
it into read-mostly and partitionable portions,
which improves both performance and scalability.
Because this chapter cannot delve into the details of every concurrent
data structure,
\cref{sec:datastruct:Other Data Structures}
surveys a few of the important ones.
Finally, \cref{sec:datastruct:Summary}
presents a summary of this chapter.
\section{Motivating Application}
\label{sec:datastruct:Motivating Application}
%
\epigraph{The art of doing mathematics consists in finding that special
case which contains all the germs of generality.}
{David Hilbert}
We will use the Schr\"odinger's Zoo application to evaluate
performance~\cite{McKenney:2013:SDS:2483852.2483867}.
Schr\"odinger has a zoo containing a large number of animals, and
he would like to track them using an in-memory database with
each animal in the zoo represented by a data item in this database.
Each animal has a unique name that is used as a key, with a variety
of data tracked for each animal.
Births, captures, and purchases result in insertions, while deaths,
releases, and sales result in deletions.
Because Schr\"odinger's zoo contains a large quantity of short-lived
animals, including mice and insects, the database must handle
high update rates.
Those interested in Schr\"odinger's animals can query them, and
Schr\"odinger has noted suspiciously query rates for his cat, so much
so that he suspects that his mice might be checking up on their nemesis.
Whatever their source, Schr\"odinger's application must handle high
query rates to a single data element.
As we will see, this simple application can pose a challenge to
traditional concurrent data structures.
\section{Partitionable Data Structures}
\label{sec:datastruct:Partitionable Data Structures}
%
\epigraph{Finding a way to live the simple life today is the most
complicated task.}
{Henry A. Courtney, updated}
There are a huge number of data structures in use today, so many, in fact,
that there are multiple textbooks covering them.
This section focuses on a single data structure, namely the hash table.
This focused approach allows a much deeper investigation of how concurrency
interacts with data structures, and also focuses on a data structure
that is heavily used in practice.
\Cref{sec:datastruct:Hash-Table Design}
overviews the design, and
\cref{sec:datastruct:Hash-Table Implementation}
presents the implementation.
Finally,
\cref{sec:datastruct:Hash-Table Performance}
discusses the resulting performance and scalability.
\subsection{Hash-Table Design}
\label{sec:datastruct:Hash-Table Design}
\Cref{chp:Partitioning and Synchronization Design}
emphasized the need to apply partitioning in order to attain
respectable performance and scalability, so partitionability
must be a first-class criterion when selecting data structures.
This criterion is well satisfied by that workhorse of parallelism,
the hash table.
Hash tables are conceptually simple, consisting of an array of
\emph{hash buckets}.
A \emph{hash function} maps from a given element's \emph{key}
to the hash bucket that this element will be stored in.
Each hash bucket therefore heads up a linked list of elements,
called a \emph{hash chain}.
When properly configured, these hash chains will be quite short,
permitting a hash table to access its elements extremely efficiently.
\QuickQuiz{
But chained hash tables are but one type among many.
Why the focus on chained hash tables?
}\QuickQuizAnswer{
Chained hash tables are completely partitionable, and thus
well-suited to concurrent use.
There are other completely-partitionable hash tables, for
example, the split-ordered list~\cite{OriShalev2006SplitOrderListHash},
but they are considerably more complex.
We therefore focus on chained hash tables.
}\QuickQuizEnd
In addition, each bucket has its own lock, so that elements in different
buckets of the hash table may be added, deleted, and looked up completely
independently.
A large hash table with a large number of buckets (and thus locks),
with each bucket containing a small number of elements, should therefore
provide excellent scalability.
\subsection{Hash-Table Implementation}
\label{sec:datastruct:Hash-Table Implementation}
\begin{fcvref}[ln:datastruct:hash_bkt:struct]
\Cref{lst:datastruct:Hash-Table Data Structures}
(\path{hash_bkt.c})
shows a set of data structures used in a simple fixed-sized hash
table using chaining and per-hash-bucket locking, and
\cref{fig:datastruct:Hash-Table Data-Structure Diagram}
diagrams how they fit together.
Note that the \co{cds_} functions and data structures may be found
in the userspace RCU
library~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUlist}.
The \co{hashtab} structure (\clnrefrange{tab:b}{tab:e} in
\cref{lst:datastruct:Hash-Table Data Structures})
contains four \co{ht_bucket} structures
(\clnrefrange{bucket:b}{bucket:e} in
\cref{lst:datastruct:Hash-Table Data Structures}),
with the \co{->ht_nbuckets} field controlling the number of buckets
and the \co{->ht_cmp} field holding the pointer to the key-comparison
function.
Each such bucket contains a list header \co{->htb_head} and
a lock \co{->htb_lock}.
The list headers chain \co{ht_elem} structures
(\clnrefrange{elem:b}{elem:e} in
\cref{lst:datastruct:Hash-Table Data Structures})
through their
\co{->hte_next} fields, and each \co{ht_elem} structure also caches
the corresponding element's hash value in the \co{->hte_hash} field.
The \co{ht_elem} structure is included in a larger structure
which might contain a complex key.
\end{fcvref}
\Cref{fig:datastruct:Hash-Table Data-Structure Diagram}
shows bucket~0 containing two elements and bucket~2 containing one.
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@struct.fcv}
\caption{Hash-Table Data Structures}
\label{lst:datastruct:Hash-Table Data Structures}
\end{listing}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/hashdiagram}}
\caption{Hash-Table Data-Structure Diagram}
\label{fig:datastruct:Hash-Table Data-Structure Diagram}
\end{figure}
\begin{fcvref}[ln:datastruct:hash_bkt:map_lock:map]
\Cref{lst:datastruct:Hash-Table Mapping and Locking}
shows mapping and locking functions.
\Clnref{b,e}
show the macro \co{HASH2BKT()}, which maps from a hash value
to the corresponding \co{ht_bucket} structure.
This macro uses a simple modulus:
If more aggressive hashing is required,
the caller needs to implement it when mapping from key to hash value.
The remaining two functions acquire and release the \co{->htb_lock}
corresponding to the specified hash value.
\end{fcvref}
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@map_lock.fcv}
\caption{Hash-Table Mapping and Locking}
\label{lst:datastruct:Hash-Table Mapping and Locking}
\end{listing}
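
The key-to-hash mapping is thus entirely up to the caller.
For string keys such as Schr\"odinger's animal names, one possible
mapping is a simple multiplicative string hash, as in the following
sketch, which is purely illustrative and not part of \path{hash_bkt.c}.
Later usage sketches in this chapter assume this function.

\begin{verbatim}
/* Illustrative djb2-style key-to-hash mapping for string keys. */
unsigned long simple_strhash(const char *s)
{
	unsigned long h = 5381;

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}
\end{verbatim}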
\begin{fcvref}[ln:datastruct:hash_bkt:lookup]
\Cref{lst:datastruct:Hash-Table Lookup}
shows \co{hashtab_lookup()},
which returns a pointer to the element with the specified hash and key,
or \co{NULL} if that element does not exist.
This function takes both a hash value and a pointer to the key because
this allows users of this function to use arbitrary keys and
arbitrary hash functions.
\Clnref{map} maps from the hash value to a pointer to the corresponding
hash bucket.
Each pass through the loop spanning
\clnrefrange{loop:b}{loop:e} examines one element
of the bucket's hash chain.
\Clnref{hashmatch} checks to see if the hash values match, and if not,
\clnref{next}
proceeds to the next element.
\Clnref{keymatch} checks to see if the actual key matches, and if so,
\clnref{return} returns a pointer to the matching element.
If no element matches, \clnref{ret_NULL} returns \co{NULL}.
\end{fcvref}
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@lookup.fcv}
\caption{Hash-Table Lookup}
\label{lst:datastruct:Hash-Table Lookup}
\end{listing}
\QuickQuiz{
\begin{fcvref}[ln:datastruct:hash_bkt:lookup]
But isn't the double comparison on
\clnrefrange{hashmatch}{return} in
\cref{lst:datastruct:Hash-Table Lookup} inefficient
in the case where the key fits into an unsigned long?
\end{fcvref}
}\QuickQuizAnswer{
Indeed it is!
However, hash tables quite frequently store information with
keys such as character strings that do not necessarily fit
into an unsigned long.
Simplifying the hash-table implementation for the case where
keys always fit into unsigned longs is left as an exercise
for the reader.
}\QuickQuizEnd
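
To make this calling convention concrete, the following sketch checks
whether an animal with a given name is present, assuming the
\co{simple_strhash()} mapping shown earlier and the function signatures
suggested by the listings (hypothetical code, not part of
\path{hash_bkt.c}):

\begin{verbatim}
/* Does an animal with this name exist? (Illustrative only.) */
int animal_exists(struct hashtab *htp, char *name)
{
	unsigned long h = simple_strhash(name);
	int ret;

	hashtab_lock(htp, h);
	ret = hashtab_lookup(htp, h, name) != NULL;
	hashtab_unlock(htp, h);
	return ret;
}
\end{verbatim}

Note that any dereferencing of the element returned by
\co{hashtab_lookup()} must be carried out before the corresponding
\co{hashtab_unlock()}, because releasing the lock permits that element
to be deleted and freed.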
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@add_del.fcv}
\caption{Hash-Table Modification}
\label{lst:datastruct:Hash-Table Modification}
\end{listing}
\Cref{lst:datastruct:Hash-Table Modification}
shows the \co{hashtab_add()} and \co{hashtab_del()} functions
that add and delete elements from the hash table, respectively.
\begin{fcvref}[ln:datastruct:hash_bkt:add_del:add]
The \co{hashtab_add()} function simply sets the element's hash
value on \clnref{set}, then adds it to the corresponding bucket on
\clnref{add:b,add:e}.
\end{fcvref}
The \co{hashtab_del()} function simply removes the specified element
from whatever hash chain it is on, courtesy of the doubly linked
nature of the hash-chain lists.
Before calling either of these two functions, the caller is required to
ensure that no other thread is accessing
or modifying this same bucket, for example, by invoking
\co{hashtab_lock()} beforehand.
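
For example, an updater might bracket these functions with the locking
primitives as in the following sketch, which again assumes the
\co{simple_strhash()} mapping and is purely illustrative:

\begin{verbatim}
/* Illustrative addition and deletion wrappers. */
void animal_add(struct hashtab *htp, struct ht_elem *htep, char *name)
{
	unsigned long h = simple_strhash(name);

	hashtab_lock(htp, h);  /* Exclude other users of this bucket. */
	hashtab_add(htp, h, htep);
	hashtab_unlock(htp, h);
}

void animal_del(struct hashtab *htp, struct ht_elem *htep)
{
	hashtab_lock(htp, htep->hte_hash);  /* Use the cached hash. */
	hashtab_del(htep);
	hashtab_unlock(htp, htep->hte_hash);
}
\end{verbatim}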
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@alloc_free.fcv}
\caption{Hash-Table Allocation and Free}
\label{lst:datastruct:Hash-Table Allocation and Free}
\end{listing}
\Cref{lst:datastruct:Hash-Table Allocation and Free}
shows \co{hashtab_alloc()} and \co{hashtab_free()},
which do hash-table allocation and freeing, respectively.
\begin{fcvref}[ln:datastruct:hash_bkt:alloc_free:alloc]
Allocation begins on
\clnrefrange{alloc:b}{alloc:e} with allocation of the underlying memory.
If \clnref{chk_NULL} detects that memory has been exhausted,
\clnref{ret_NULL} returns
\co{NULL} to the caller.
Otherwise, \clnref{set_nbck,set_cmp} initialize
the number of buckets and the pointer to key-comparison function,
and the loop
spanning \clnrefrange{loop:b}{loop:e} initializes the buckets themselves,
including the chain list header on
\clnref{init_head} and the lock on \clnref{init_lock}.
Finally, \clnref{return} returns a pointer to the newly allocated hash table.
\end{fcvref}
\begin{fcvref}[ln:datastruct:hash_bkt:alloc_free:free]
The \co{hashtab_free()} function on
\clnrefrange{b}{e} is straightforward.
\end{fcvref}
\subsection{Hash-Table Performance}
\label{sec:datastruct:Hash-Table Performance}
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktlin8}}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}
\end{figure}
The performance results for a single 28-core socket of a 2.1\,GHz
Intel Xeon system using a bucket-locked hash table
with 262,144 buckets are shown in
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}.
The performance does scale nearly linearly, but falls far short of ideal,
even at only 28~CPUs.
Part of this shortfall is due to the fact that lock acquisitions and
releases incur no communications cache misses on a single CPU, but do
incur such misses when two or more CPUs are in use.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktlin}}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, 448 CPUs}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs}
\end{figure}
And things only get worse with more CPUs, as can be seen in
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs}.
We do not need to show ideal performance:
The performance for 29~CPUs and beyond is abysmal.
This clearly underscores the dangers of extrapolating performance from a
modest number of CPUs.
Of course, one possible reason for the collapse in performance might be
that more hash buckets are needed.
We can test this by increasing the number of hash buckets.
\QuickQuiz{
Instead of simply increasing the number of hash buckets,
wouldn't it be better to cache-align the existing hash buckets?
}\QuickQuizAnswer{
The answer depends on a great many things.
If the hash table has a large number of elements per bucket, it
would clearly be better to increase the number of hash buckets.
On the other hand, if the hash table is lightly loaded,
the answer depends on the hardware, the effectiveness of the
hash function, and the workload.
Interested readers are encouraged to experiment.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpubktsizelin}}
\caption{Read-Only Hash-Table Performance For Schr\"odinger's Zoo, Varying Buckets}
\label{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}
\end{figure}
However, as can be seen in
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets},
changing the number of buckets has almost no effect:
Scalability is still abysmal.
In particular, we still see a sharp dropoff at 29~CPUs and beyond,
clearly demonstrating the complication put forward by
\cref{sec:datastruct:NUMA-induced failure to scale}
of the list of complications on
\cpageref{sec:datastruct:NUMA-induced failure to scale}.
Clearly, something else is going on.
The problem is that this is a multi-socket system, with CPUs~0--27
and~224--251 mapped to the first socket as shown in
\cref{fig:datastruct:NUMA Topology of System Under Test}.
Test runs confined to the first 28~CPUs therefore perform quite
well, but tests that involve socket~0's CPUs~0--27 as well as
socket~1's CPU~28 incur the overhead of passing data across
socket boundaries.
This can severely degrade performance, as was discussed in
\cref{sec:cpu:Hardware System Architecture}.
In short, large multi-socket systems require good locality of reference
in addition to full partitioning.
The remainder of this chapter will discuss ways of providing good
locality of reference within the hash table itself, but in the
meantime please note that one other way to provide good locality
of reference would be to place large data elements in the hash
table.
For example, Schr\"odinger might attain excellent cache locality by
placing photographs or even videos of his animals in each element of
the hash table.
But for those needing hash tables containing small data elements,
please read on!
\QuickQuiz{
Given the negative scalability of the Schr\"odinger's
Zoo application across sockets, why not just run multiple
copies of the application, with each copy having a subset
of the animals and confined to run on a single socket?
}\QuickQuizAnswer{
You can do just that!
In fact, you can extend this idea to large clustered systems,
running one copy of the application on each node of the cluster.
This practice is called ``sharding'', and is heavily used in
practice by large web-based
retailers~\cite{DeCandia:2007:DAH:1323293.1294281}.
However, if you are going to shard on a per-socket basis within
a multisocket system, why not buy separate smaller and cheaper
single-socket systems, and then run one shard of the database
on each of those systems?
}\QuickQuizEnd
One key property of these Schr\"odinger's-zoo experiments is they are
all read-only.
This makes the performance degradation due to lock-acquisition-induced
cache misses all the more painful.
Even though we are not updating the underlying hash table itself, we are
still paying the price for writing to memory.
Of course, if the hash table were never going to be updated, we could
dispense entirely with mutual exclusion.
This approach is quite straightforward and is left as an exercise for the
reader.
But even with the occasional update, avoiding writes avoids cache
misses, and allows the read-mostly data to be replicated across all
the caches, which in turn promotes locality of reference.
The next section therefore examines optimizations that can be carried out in
read-mostly cases where updates are rare, but could happen at any time.
\setlength\dashlinedash{1pt}
\setlength\dashlinegap{2pt}
\begin{table}
\renewcommand*{\arraystretch}{1.2}
\centering
\begin{tabular}{r||c:c}
& \multicolumn{2}{c}{Hyperthread} \\
\cline{2-3}
Socket & 0 & 1 \\
\hline
\hline
0 & 0--27 & 224--251 \\
\cdashline{2-3}
1 & 28--55 & 252--279 \\
\cdashline{2-3}
2 & 56--83 & 280--307 \\
\cdashline{2-3}
3 & 84--111 & 308--335 \\
\cdashline{2-3}
4 & 112--139 & 336--363 \\
\cdashline{2-3}
5 & 140--167 & 364--391 \\
\cdashline{2-3}
6 & 168--195 & 392--419 \\
\cdashline{2-3}
7 & 196--223 & 420--447 \\
\end{tabular}
\caption{NUMA Topology of System Under Test}
\label{fig:datastruct:NUMA Topology of System Under Test}
\end{table}
\section{Read-Mostly Data Structures}
\label{sec:datastruct:Read-Mostly Data Structures}
%
\epigraph{Adapt the remedy to the disease.}{Chinese proverb}
Although partitioned data structures can offer excellent scalability,
\IXacr{numa} effects can result in severe degradations of both performance and
scalability.
In addition,
the need for read-side synchronization can degrade performance in
read-mostly situations.
However, we can achieve both performance and scalability by using
RCU, which was introduced in
\cref{sec:defer:Read-Copy Update (RCU)}.
Similar results can be achieved using \IXpl{hazard pointer}
(\path{hazptr.c})~\cite{MagedMichael04a}, which will be included in
the performance results shown in this
section~\cite{McKenney:2013:SDS:2483852.2483867}.
\subsection{RCU-Protected Hash Table Implementation}
\label{sec:datastruct:RCU-Protected Hash Table Implementation}
For an RCU-protected hash table with per-bucket locking,
updaters use locking as shown in
\cref{sec:datastruct:Partitionable Data Structures},
but readers use RCU\@.
The data structures remain as shown in
\cref{lst:datastruct:Hash-Table Data Structures},
and the \co{HASH2BKT()}, \co{hashtab_lock()}, and \co{hashtab_unlock()}
functions remain as shown in
\cref{lst:datastruct:Hash-Table Mapping and Locking}.
However, readers use the lighter-weight concurrency-control embodied
by \co{hashtab_lock_lookup()} and \co{hashtab_unlock_lookup()}
shown in
\cref{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}.
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt_rcu@lock_unlock.fcv}
\caption{RCU-Protected Hash-Table Read-Side Concurrency Control}
\label{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control}
\end{listing}
\Cref{lst:datastruct:RCU-Protected Hash-Table Lookup}
shows \co{hashtab_lookup()} for the RCU-protected per-bucket-locked
hash table.
This is identical to that in
\cref{lst:datastruct:Hash-Table Lookup}
except that \apiur{cds_list_for_each_entry()} is replaced
by \apiur{cds_list_for_each_entry_rcu()}.
Both of these primitives traverse the hash chain referenced
by \co{htb->htb_head} but \co{cds_list_for_each_entry_rcu()} also
correctly enforces memory ordering in case of concurrent insertion.
This is an important difference between these two hash-table implementations:
Unlike the pure per-bucket-locked implementation, the RCU protected
implementation allows lookups to run concurrently with insertions
and deletions, and RCU-aware primitives like
\co{cds_list_for_each_entry_rcu()} are required to correctly handle
this added concurrency.
Note also that \co{hashtab_lookup()}'s caller must be within an
RCU read-side critical section, for example, the caller must invoke
\co{hashtab_lock_lookup()} before invoking \co{hashtab_lookup()}
(and of course invoke \co{hashtab_unlock_lookup()} some time afterwards).
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt_rcu@lookup.fcv}
\caption{RCU-Protected Hash-Table Lookup}
\label{lst:datastruct:RCU-Protected Hash-Table Lookup}
\end{listing}
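
The resulting read-side calling convention might therefore look as
follows, an illustrative sketch along the lines of the earlier
\co{animal_exists()} example:

\begin{verbatim}
/* Illustrative RCU-protected existence check. */
int animal_exists_rcu(struct hashtab *htp, char *name)
{
	unsigned long h = simple_strhash(name);
	int ret;

	hashtab_lock_lookup(htp, h);   /* Enter RCU read-side critical section. */
	ret = hashtab_lookup(htp, h, name) != NULL;
	hashtab_unlock_lookup(htp, h); /* Exit that critical section. */
	return ret;
}
\end{verbatim}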
\QuickQuiz{
But if elements in a hash table can be removed concurrently
with lookups, doesn't that mean that a lookup could return
a reference to a data element that was removed immediately
after it was looked up?
}\QuickQuizAnswer{
Yes it can!
This is why \co{hashtab_lookup()} must be invoked within an
RCU read-side critical section, and it is why
\co{hashtab_add()} and \co{hashtab_del()} must also use
RCU-aware list-manipulation primitives.
Finally, this is why the caller of \co{hashtab_del()} must
wait for a grace period (e.g., by calling \co{synchronize_rcu()})
before freeing the removed element.
This will ensure that all RCU readers that might reference
the newly removed element have completed before that element
is freed.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt_rcu@add_del.fcv}
\caption{RCU-Protected Hash-Table Modification}
\label{lst:datastruct:RCU-Protected Hash-Table Modification}
\end{listing}
\Cref{lst:datastruct:RCU-Protected Hash-Table Modification}
shows \co{hashtab_add()} and \co{hashtab_del()}, both of which
are quite similar to their counterparts in the non-RCU hash table
shown in
\cref{lst:datastruct:Hash-Table Modification}.
The \co{hashtab_add()} function uses \apiur{cds_list_add_rcu()} instead
of \apiur{cds_list_add()} in order to ensure proper ordering when
an element is added to the hash table at the same time that it is
being looked up.
The \co{hashtab_del()} function uses \apiur{cds_list_del_rcu()} instead
of \apiur{cds_list_del_init()} to allow for the case where an element is
looked up just before it is deleted.
Unlike \co{cds_list_del_init()}, \co{cds_list_del_rcu()} leaves the
forward pointer intact, so that \co{hashtab_lookup()} can traverse
to the newly deleted element's successor.
Of course, after invoking \co{hashtab_del()}, the caller must wait for
an RCU \IX{grace period} (e.g., by invoking \co{synchronize_rcu()}) before
freeing or otherwise reusing the memory for the newly deleted element.
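
A deletion that also frees the element might therefore be sketched as
follows, again purely for illustration:

\begin{verbatim}
/* Illustrative delete-and-free, assuming that the ht_elem
 * structure is the first field of its enclosing structure. */
void animal_del_free(struct hashtab *htp, struct ht_elem *htep)
{
	hashtab_lock(htp, htep->hte_hash);
	hashtab_del(htep);
	hashtab_unlock(htp, htep->hte_hash);
	synchronize_rcu(); /* Wait for pre-existing readers to complete. */
	free(htep);        /* No reader can now hold a reference. */
}
\end{verbatim}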
\subsection{RCU-Protected Hash Table Validation}
\label{sec:datastruct:RCU-Protected Hash Table Validation}
Although the topic of validation is covered in detail in
\cref{chp:Validation}, the fact is that a hash table with lockless
RCU-protected lookups needs special attention to validation sooner rather
than later.
The test suite (``\path{hashtorture.h}'') contains a \co{smoketest()}
function that verifies that a specific series of single-threaded
additions, deletions, and lookups give the expected results.
Concurrent test runs put each updater thread in control of its portion of
the elements, which allows assertions checking for the following issues
(sketched in code after this list):
\begin{enumerate}
\item A just-now-to-be-added element already being in the table
according to \co{hashtab_lookup()}.
\item A just-now-to-be-added element being marked as being in the
table by its \co{->in_table} flag.
\item A just-now-to-be-deleted element not being in the table according
to \co{hashtab_lookup()}.
\item A just-now-to-be-deleted element being marked as not being in
the table by its \co{->in_table} flag.
\end{enumerate}
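
The first two of these checks might be sketched as follows, assuming a
hypothetical element type whose \co{->in_table} flag is updated only by
its owning updater thread:

\begin{verbatim}
#include <assert.h>

/* Hypothetical element with the ->in_table flag described above. */
struct test_elem {
	struct ht_elem te_he;
	char *te_key;
	int in_table;
};

/* Illustrative updater-side checked addition. */
void checked_add(struct hashtab *htp, struct test_elem *tep)
{
	unsigned long h = simple_strhash(tep->te_key);

	hashtab_lock(htp, h);
	assert(!hashtab_lookup(htp, h, tep->te_key)); /* Issue 1. */
	assert(!tep->in_table);                       /* Issue 2. */
	hashtab_add(htp, h, &tep->te_he);
	tep->in_table = 1;
	hashtab_unlock(htp, h);
}
\end{verbatim}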
In addition, concurrent test runs run lookups concurrently with updates
in order to catch all manner of data-structure corruption problems.
Some runs also continually resize the hash table concurrently with both
lookups and updates to verify correct behavior, and also to verify that
resizes do not unduly delay either readers or updaters.
Finally, the concurrent tests output statistics that can be used to
track down performance and scalability issues, which provides the raw
data used by
\cref{sec:datastruct:RCU-Protected Hash Table Performance}.
\QuickQuiz{
The \path{hashtorture.h} file contains more than 1,000 lines!
Is that a comprehensive test or what???
}\QuickQuizAnswer{
What.
The \path{hashtorture.h} tests are a good start and suffice
for a textbook algorithm.
If this code was to be used in production, much more testing
would be required:
\begin{enumerate}
\item Have some subset of elements that always reside in the
table, and verify that lookups always find these elements
regardless of the number and type of concurrent updates
in flight.
\item Pair an updater with one or more readers, verifying that
after an element is added, once a reader successfully
looks up that element, all later lookups succeed.
The definition of ``later'' will depend on the table's
consistency requirements.
\item Pair an updater with one or more readers, verifying that
after an element is deleted, once a reader's lookup
of that element fails, all later lookups also fail.
\end{enumerate}
There are many more tests where those came from, the exact
nature of which depends on the details of the requirements
on your particular hash table.
}\QuickQuizEnd
All code requires significant validation effort, and high-performance
concurrent code requires more validation than most.
\subsection{RCU-Protected Hash Table Performance}
\label{sec:datastruct:RCU-Protected Hash Table Performance}
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpu}}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
\end{figure}
\Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
shows the read-only performance of RCU-protected and hazard-pointer-protected
hash tables against the previous section's per-bucket-locked implementation.
As you can see, both RCU and hazard pointers perform and scale
much better than per-bucket locking because read-only
replication avoids NUMA effects.
The difference increases with larger numbers of threads.
Results from a globally locked implementation are also shown, and as expected
the results are even worse than those of the per-bucket-locked implementation.
RCU does slightly better than hazard pointers.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulin}}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo, Linear Scale}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo; Linear Scale}
\end{figure}
\Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo; Linear Scale}
shows the same data on a linear scale.
This drops the global-locking trace into the x-axis, but allows the
non-ideal performance of RCU and hazard pointers to be more readily
discerned.
Both show a change in slope at 224~CPUs, and this is due to hardware
multithreading.
At 224 and fewer CPUs, each thread has a core to itself.
In this regime, RCU does better than does hazard pointers because the
latter's read-side \IXpl{memory barrier} result in dead time within the core.
In short, RCU is better able to utilize a core from a single hardware
thread than is hazard pointers.
This situation changes above 224~CPUs.
Because RCU is using more than half of each core's resources from a
single hardware thread, RCU gains relatively little benefit from the
second hardware thread in each core.
The slope of the hazard-pointers trace also decreases at 224~CPUs, but
less dramatically,
because the second hardware thread is able to fill in the time
that the first hardware thread is stalled due to \IXh{memory-barrier}{latency}.
As we will see in later sections, this second-hardware-thread
advantage depends on the workload.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulinqsbr}}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo including QSBR, Linear Scale}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR; Linear Scale}
\end{figure}
But why is RCU's performance a factor of five less than ideal?
One possibility is that the per-thread counters manipulated by
\co{rcu_read_lock()} and \co{rcu_read_unlock()} are slowing things down.
\Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR; Linear Scale}
therefore adds the results for the \IXacr{qsbr} variant of RCU, whose read-side
primitives do nothing.
And although QSBR does perform slightly better than does RCU, it is still
about a factor of five short of ideal.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocpulinqsbrunsync}}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo including QSBR and Unsynchronized, Linear Scale}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale}
\end{figure}
\Cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale}
adds completely unsynchronized results, which works because this
is a read-only benchmark with nothing to synchronize.
Even with no synchronization whatsoever, performance still falls far
short of ideal, thus demonstrating
\cref{sec:datastruct:Cache-size-induced failure to scale,sec:datastruct:Cache-size-induced failure to scale redux}
of the list of complications on
\cpageref{sec:datastruct:Cache-size-induced failure to scale redux}.
The problem is that this system has sockets with 28 cores, which have
the modest cache sizes shown in
\cref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz}
on \cpageref{tab:cpu:Cache Geometry for 8-Socket System With Intel Xeon Platinum 8176 CPUs @ 2.10GHz}.
Each hash bucket (\co{struct ht_bucket}) occupies 56~bytes and each
element (\co{struct zoo_he}) occupies 72~bytes for the RCU and QSBR runs.
The benchmark generating
\cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo including QSBR and Unsynchronized; Linear Scale}
used 262,144 buckets and up to 262,144 elements, for a total of
33,554,448~bytes, which not only overflows the 1,048,576-byte L2 caches
by more than a factor of thirty, but is also uncomfortably close to the
L3 cache size of 40,370,176~bytes, especially given that this cache has
only 11~ways.
This means that L2 cache collisions will be the rule and also that L3
cache collisions will not be uncommon, so that the resulting cache misses
will degrade performance.
In this case, the bottleneck is not in the CPU, but rather in the
memory system.
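
For the curious, the 33,554,448-byte total works out as
$262{,}144 \times (56 + 72) = 33{,}554{,}432$~bytes, with the remaining
16~bytes apparently contributed by the enclosing \co{hashtab} structure
itself.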
% This data was generated with 262,144 buckets and elements.
% RCU hash ht_elem structure is 40 bytes plus 32 more from zoo_he for 72.
% struct ht_bucket is 56 bytes.
% For "# B" RCU runs, hash buckets weigh in at 14,680,080 bytes and the hash
% elements up to 18,874,368 bytes. The total is 33,554,448 bytes, which
% in theory fits into the 40,370,176-byte L3 cache, but definitely not
% the 1,048,576-byte L2 cache, much less the 32,768-byte L0 and L1 caches.
% The L3 is 11-way set associative.
% So we have 8192 4K pages of data (buckets and elements) that must
% fit into 896 11-way set-associated cache sets.
Additional evidence for this memory-system bottleneck may be found by
examining the unsynchronized code.
This code does not need locks, so each hash bucket occupies only 16~bytes
compared to the 56~bytes for RCU and QSBR\@.
Similarly, each hash-table element occupies only 56~bytes compared to the
72~bytes for RCU and QSBR\@.
So it is unsurprising that the single-CPU unsynchronized run performs up
to about half again faster than either QSBR or RCU\@.
\QuickQuiz{
How can we be so sure that the hash-table size is at fault here,
especially given that
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}
on \cpageref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}
shows that varying hash-table size has almost
no effect?
Might the problem instead be something like \IX{false sharing}?
}\QuickQuizAnswer{
Excellent question!
False sharing requires writes, which are not featured in the
unsynchronized and RCU runs of this lookup-only
benchmark.
The problem is therefore not false sharing.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf-hashsize.2020.12.29a/zoohashsize}}
\caption{Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs, Varying Table Size}
\label{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size}
\end{figure}
Still unconvinced?
Then look at the log-log plot in
\cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size},
which shows performance for 448~CPUs as a function of the
hash-table size, that is, number of buckets and maximum number
of elements.
A hash-table of size 1,024 has 1,024~buckets and contains
at most 1,024~elements, with the average occupancy being
512~elements.
Because this is a read-only benchmark, the actual occupancy is
always equal to the average occupancy.
This figure shows near-ideal performance below about 8,000~elements,
that is, when the hash table comprises less than 1\,MB of data.
This near-ideal performance is consistent with that for the
pre-BSD routing table shown in
\cref{fig:defer:Pre-BSD Routing Table Protected by RCU}
on \cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU},
even at 448~CPUs.
However, the performance drops significantly (this is a log-log
plot) at about 8,000~elements, which is where the 1,048,576-byte
L2 cache overflows.
Performance falls off a cliff (even on this log-log plot) at about
300,000~elements, where the 40,370,176-byte L3 cache overflows.
This demonstrates that the memory-system bottleneck is profound,
degrading performance by well in excess of an order of magnitude
for the large hash tables.
This should not be a surprise, as the size-8,388,608 hash table
occupies about 1\,GB of memory, overflowing the L3 caches by
a factor of 25.
The reason that
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}
on \cpageref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; Varying Buckets}
shows little effect is that its data was gathered from
bucket-locked hash tables, where locking overhead and contention
drowned out cache-capacity effects.
In contrast, both RCU and hazard-pointers readers avoid stores
to shared data, which means that the cache-capacity effects come
to the fore.
Still not satisfied?
Find a multi-socket system and run this code, making use of
whatever performance-counter hardware is available.
This hardware should allow you to track down the precise cause
of any slowdowns exhibited on your particular system.
The experience gained by doing this exercise will be extremely
valuable, giving you a significant advantage over those whose
understanding of this issue is strictly theoretical.\footnote{
Of course, a theoretical understanding beats no
understanding.
Most of the time, anyway.}
}\QuickQuizEnd
What if the memory footprint is reduced still further?
\Cref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head}
on \cpageref{fig:defer:Pre-BSD Routing Table Protected by RCU QSBR With Non-Initial rcu-head}
shows that RCU attains very nearly ideal performance on the much smaller
data structure represented by the pre-BSD routing table.
\QuickQuiz{
The memory system is a serious bottleneck on this big system.
Why bother putting 448~CPUs on a system without giving them
enough memory bandwidth to do something useful???
}\QuickQuizAnswer{
It would indeed be a bad idea to use this large and expensive
system for a workload consisting solely of simple hash-table
lookups of small data elements.
However, this system is extremely useful for a great many
workloads that feature more processing and less memory accessing.
For example, some in-memory databases run extremely well on
this class of system, albeit when running much more complex
sets of queries than performed by the benchmarks in this chapter.
For example, such systems might be processing images or video
streams stored in each element, providing further performance
benefits due to the fact that the resulting sequential memory
accesses will make better use of the available memory bandwidth
than will a pure pointer-following workload.
But let this be a lesson to you.
Modern computer systems come in a great many shapes and sizes,
and great care is frequently required to select one that suits
your application.
And perhaps even more frequently, significant care and work is
required to adjust your application to the specific computer
systems at hand.
}\QuickQuizEnd
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zoocatonly}}
\caption{Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 64 CPUs}
\label{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 64 CPUs}
\end{figure}
As noted earlier, Schr\"odinger is surprised by the popularity of his
cat~\cite{ErwinSchroedinger1935Cat}, but recognizes the need to reflect
this popularity in his design.
\Cref{fig:datastruct:Read-Side Cat-Only RCU-Protected Hash-Table Performance For Schroedinger's Zoo at 64 CPUs}
shows the results of 64-CPU runs, varying the number of CPUs that are
doing nothing but looking up the cat.
Both RCU and hazard pointers respond well to this challenge, but bucket
locking scales negatively, eventually performing as badly as global
locking.
This should not be a surprise because if all CPUs are doing nothing
but looking up the cat, the lock corresponding to the cat's bucket
is for all intents and purposes a global lock.
This cat-only benchmark illustrates one potential problem with
fully partitioned sharding approaches.
Only the CPUs associated with the cat's
partition are able to access the cat, limiting the cat-only
throughput.
Of course, a great many applications have good load-spreading
properties, and for these applications sharding works
quite well.
However, sharding does not handle ``\IXpl{hot spot}'' very well, with
the hot spot exemplified by Schr\"odinger's cat being but one case
in point.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zooupdatelu}}
\caption{Read-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo in the Presence of Updates}
\label{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
\end{figure}
If we were only ever going to read the data, we would not need any
concurrency control to begin with.
\Cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
therefore shows the effect of updates on readers.
At the extreme left-hand side of this graph, all but one of the CPUs
are doing lookups, while to the right all 448~CPUs are doing updates.
For all four implementations, the number of lookups per millisecond
decreases as the number of updating CPUs increases, of course reaching
zero lookups per millisecond when all 448~CPUs are updating.
Both hazard pointers and RCU do well compared to per-bucket locking
because their readers do not increase update-side lock contention.
RCU does well relative to hazard pointers as the number of updaters
increases due to the latter's read-side memory barriers, which incur
greater overhead, especially in the presence of updates, and particularly
when execution involves more than one socket.
It therefore seems likely that modern hardware heavily optimizes memory-barrier
execution, greatly reducing memory-barrier overhead in the read-only case.
% @@@ When there is a section covering trees, reference sharding with
% @@@ CPUs being permitted to directly look up popular entries.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.perf.2020.11.26a/zooupdate}}
\caption{Update-Side RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo}
\label{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
\end{figure}
Where
\cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
showed the effect of increasing update rates on lookups,
\cref{fig:datastruct:Update-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo}
shows the effect of increasing update rates on the updates themselves.
Again, at the left-hand side of the figure all but one of the CPUs are
doing lookups and at the right-hand side of the figure all 448~CPUs are
doing updates.
Hazard pointers and RCU start off with a significant advantage because,
unlike bucket locking, readers do not exclude updaters.
However, as the number of updating CPUs increases, the update-side
deferred-execution overhead starts to make its presence known, first
for RCU and then for hazard pointers.
Of course, all three of these implementations beat global locking, and
by more than an order of magnitude.
It is quite possible that the differences in lookup performance observed in
\cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
are affected by the differences in update rates.
One way to check this is to artificially throttle the update rates of
per-bucket locking and hazard pointers to match that of RCU\@.
Doing so does not significantly improve the lookup performance of
per-bucket locking, nor does it close the gap between hazard pointers
and RCU\@.
However, removing the read-side memory barriers from hazard pointers
(thus resulting in an unsafe implementation) does nearly close the gap
between hazard pointers and RCU\@.
Although this unsafe hazard-pointer implementation will
usually be reliable enough for benchmarking purposes, it is absolutely
not recommended for production use.
\QuickQuiz{
The dangers of extrapolating from 28~CPUs to 448~CPUs were
made quite clear in
\cref{sec:datastruct:Hash-Table Performance}.
Would extrapolating up from 448~CPUs be any safer?
}\QuickQuizAnswer{
In theory, no, it isn't any safer, and a useful exercise would be
to run these programs on larger systems.
In practice, there are only a very few systems with more than
448~CPUs, in contrast to the huge number having more than 28~CPUs.
This means that although it is dangerous to extrapolate beyond
448~CPUs, there is very little need to do so.
In addition, other testing has shown that RCU read-side primitives
offer consistent performance and scalability up to at least 1024~CPUs.
However, it is useful to review
\cref{fig:datastruct:Read-Only RCU-Protected Hash-Table Performance For Schr\"odinger's Zoo at 448 CPUs; Varying Table Size}
and its associated commentary.
You see, unlike the 448-CPU system that provided this data,
the system enjoying linear scalability up to 1024~CPUs boasted
excellent memory bandwidth.
}\QuickQuizEnd
And this situation demonstrates
\cref{sec:datastruct:Update-activity failure to scale}
in the list of complications on
\cpageref{sec:datastruct:Update-activity failure to scale}.
% @@@ Testing strategy. Summarize hashtorture, add QQ for additional
% @@@ things that could be done. Note difference between textbook
% @@@ and production algorithms, also the need for clear understanding
% @@@ of what level of consistency is required.
\subsection{RCU-Protected Hash Table Discussion}
\label{sec:datastruct:RCU-Protected Hash Table Discussion}
One consequence of the RCU and hazard-pointer implementations is
that a pair of concurrent readers might disagree on the state of
the cat.
For example, one of the readers might have fetched the pointer to
the cat's data structure just before it was removed, while another
reader might have fetched this same pointer just afterwards.
The first reader would then believe that the cat was alive, while
the second reader would believe that the cat was dead.
This situation is completely fitting for Schr\"odinger's
cat, but it turns out that it is quite reasonable for normal
non-quantum cats as well.
After all, it is impossible to determine exactly when an animal is born
or dies.
To see this, let's suppose that we detect a cat's death by heartbeat.
This raises the question of exactly how long we should wait after the
last heartbeat before declaring death.
It is clearly ridiculous to wait only one millisecond, because then
a healthy living cat would have to be declared dead---and then
resurrected---more than once per second.
It is equally ridiculous to wait a full month, because by that time the
poor cat's death would be unmistakably apparent to olfactory sensors.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/2013-08-is-it-dead}}
\caption{Even Veterinarians Disagree!}
\ContributedBy{fig:datastruct:Even Veterinarians Disagree}{Melissa Broussard}
\end{figure}
Because an animal's heart can stop for some seconds and then start up
again, there is a tradeoff between timely recognition of death and
probability of false alarms.
It is quite possible that a pair of veterinarians might disagree on
the time to wait between the last heartbeat and the declaration of
death.
For example, one veterinarian might declare death thirty seconds after
the last heartbeat, while another might insist on waiting a full
minute.
In this case, the two veterinarians would disagree on the state of the
cat for the second period of thirty seconds following the last heartbeat,
as fancifully depicted in
\cref{fig:datastruct:Even Veterinarians Disagree}.
\pplsur{Werner}{Heisenberg} taught us to live with this sort of
uncertainty~\cite{WeinerHeisenberg1927Uncertain}, which is a good
thing because computing hardware and software acts similarly.
For example, how do you know that a piece of computing hardware
has failed?
Often because it does not respond in a timely fashion.
Just like the cat's heartbeat, this results in a window of
uncertainty as to whether or not the hardware has really failed,
as opposed to just being slow.
Furthermore, most computing systems are intended to interact with
the outside world.
Consistency with the outside world is therefore of paramount importance.
However, as we saw in
\cref{fig:defer:Response Time of RCU vs. Reader-Writer Locking}
on
\cpageref{fig:defer:Response Time of RCU vs. Reader-Writer Locking},
increased internal consistency can come at the expense of degraded
external consistency.
Techniques such as RCU and hazard pointers give up some degree of
internal consistency to attain improved external consistency.
In short, internal consistency is not necessarily a natural part of all
problem domains, and often incurs great expense in terms of performance,
scalability, consistency with the outside
world~\cite{AndreasHaas2012FIFOisnt,AndreasHaas2013CFRelaxedQueues,10.5555/3241639.3241645},
or all of the above.
\section{Non-Partitionable Data Structures}
\label{sec:datastruct:Non-Partitionable Data Structures}
%
\epigraph{Don't be afraid to take a big step if one is indicated.
You can't cross a chasm in two small steps.}
{David Lloyd George}
Fixed-size hash tables are perfectly partitionable, but resizable hash
tables pose partitioning challenges when growing or shrinking, as
fancifully depicted in
\cref{fig:datastruct:Partitioning Problems}.
However, it turns out that it is possible to construct high-performance
scalable RCU-protected hash tables, as described in the following sections.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/2014_Hash-table-hydra}}
\caption{Partitioning Problems}
\ContributedBy{fig:datastruct:Partitioning Problems}{Melissa Broussard}
\end{figure}
\subsection{Resizable Hash Table Design}
\label{sec:datastruct:Resizable Hash Table Design}
In happy contrast to the situation in the early 2000s, there are now
several different types of scalable RCU-protected hash tables.
The first (and simplest) was developed for the Linux kernel by
\ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash}, and is described in the
following sections.
Two others are covered briefly in
\cref{sec:datastruct:Other Resizable Hash Tables}.
The key insight behind the first hash-table implementation is that
each data element can have two sets of list pointers, with one set
currently being used by RCU readers (as well as by non-RCU updaters)
and the other being used to construct a new resized hash table.
This approach allows lookups, insertions, and deletions to all run
concurrently with a resize operation (as well as with each other).
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-a}}
\caption{Growing a Two-List Hash Table, State~(a)}
\label{fig:datastruct:Growing a Two-List Hash Table; State (a)}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-b}}
\caption{Growing a Two-List Hash Table, State~(b)}
\label{fig:datastruct:Growing a Two-List Hash Table; State (b)}
\end{figure}
The resize operation proceeds as shown in
\crefrange{fig:datastruct:Growing a Two-List Hash Table; State (a)}
{fig:datastruct:Growing a Two-List Hash Table; State (d)},
with the initial two-bucket state shown in
\cref{fig:datastruct:Growing a Two-List Hash Table; State (a)}
and with time advancing from figure to figure.
The initial state uses the zero-index links to chain the elements into
hash buckets.
A four-bucket array is allocated, and the one-index links are used to
chain the elements into these four new hash buckets.
This results in state~(b) shown in
\cref{fig:datastruct:Growing a Two-List Hash Table; State (b)},
with readers still using the original two-bucket array.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-c}}
\caption{Growing a Two-List Hash Table, State~(c)}
\label{fig:datastruct:Growing a Two-List Hash Table; State (c)}
\end{figure}
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/hashxu-d}}
\caption{Growing a Two-List Hash Table, State~(d)}
\label{fig:datastruct:Growing a Two-List Hash Table; State (d)}
\end{figure}
The new four-bucket array is exposed to readers and then a grace-period
operation waits for all readers, resulting in state~(c), shown in
\cref{fig:datastruct:Growing a Two-List Hash Table; State (c)}.
In this state, all readers are using the new four-bucket array,
which means that the old two-bucket array may now be freed, resulting
in state~(d), shown in
\cref{fig:datastruct:Growing a Two-List Hash Table; State (d)}.
This design leads to a relatively straightforward implementation,
which is the subject of the next section.
\subsection{Resizable Hash Table Implementation}
\label{sec:datastruct:Resizable Hash Table Implementation}
\begin{fcvref}[ln:datastruct:hash_resize:data]
Resizing is accomplished by the classic approach of inserting a level
of indirection, in this case, the \co{ht} structure shown on
\clnrefrange{ht:b}{ht:e} of
\cref{lst:datastruct:Resizable Hash-Table Data Structures}
(\path{hash_resize.c}).
The \co{hashtab} structure shown on
\clnrefrange{hashtab:b}{hashtab:e} contains only a
pointer to the current \co{ht} structure along with a spinlock that
is used to serialize concurrent attempts to resize the hash table.
If we were to use a traditional lock- or atomic-operation-based
implementation, this \co{hashtab} structure could become a severe bottleneck
from both performance and scalability viewpoints.
However, because resize operations should be relatively infrequent,
we should be able to make good use of RCU\@.
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize@data.fcv}
\caption{Resizable Hash-Table Data Structures}
\label{lst:datastruct:Resizable Hash-Table Data Structures}
\end{listing}
The \co{ht} structure represents a specific size of the hash table,
as specified by the \co{->ht_nbuckets} field on \clnref{ht:nbuckets}.
The size is stored in the same structure containing the array of
buckets (\co{->ht_bkt[]} on
\clnref{ht:bkt}) in order to avoid mismatches between
the size and the array.
The \co{->ht_resize_cur} field on
\clnref{ht:resize_cur} is equal to $-1$ unless a resize
operation
is in progress, in which case it indicates the index of the bucket whose
elements are being inserted into the new hash table, which is referenced
by the \co{->ht_new} field on \clnref{ht:new}.
If there is no resize operation in progress, \co{->ht_new} is \co{NULL}.
Thus, a resize operation proceeds by allocating a new \co{ht} structure
and referencing it via the \co{->ht_new} pointer, then advancing
\co{->ht_resize_cur} through the old table's buckets.
When all the elements have been added to the new table, the new
table is linked into the \co{hashtab} structure's \co{->ht_cur} field.
Once all old readers have completed, the old hash table's \co{ht} structure
may be freed.
The \co{->ht_idx} field on
\clnref{ht:idx} indicates which of the two sets of
list pointers are being used by this instantiation of the hash table,
and is used to index the \co{->hte_next[]} array in the \co{ht_elem}
structure on \clnref{ht_elem:next}.
The \co{->ht_cmp()}, \co{->ht_gethash()}, and \co{->ht_getkey()} fields on
\clnrefrange{ht:cmp}{ht:getkey}
collectively define the per-element key and the hash function.
The \co{->ht_cmp()} function compares a specified key with that of
the specified element,
the \co{->ht_gethash()} calculates the specified key's hash,
and \co{->ht_getkey()} extracts the key from the enclosing data
element.
The \co{ht_lock_state} structure shown on \clnrefrange{hls:b}{hls:e}
is used to communicate lock state from a new \co{hashtab_lock_mod()}
to \co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}.
This state prevents the algorithm from being redirected to the wrong
bucket during concurrent resize operations.
The \co{ht_bucket} structure is the same as before, and the
\co{ht_elem} structure differs from that of previous implementations
only in providing a two-element array of list pointer sets in place of
the prior single set of list pointers.
In a fixed-sized hash table, bucket selection is quite straightforward:
Simply transform the hash value to the corresponding bucket index.
In contrast, when resizing, updaters must also determine which of the
old and new sets of buckets to select from.
If the bucket that would be selected from the old table has already
been distributed into the new table, then the bucket should be selected
from the new table as well as from the old table.
Conversely, if the bucket that would be selected from the old table
has not yet been distributed, then the bucket should be selected from
the old table.
\end{fcvref}
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize@get_bucket.fcv}
\caption{Resizable Hash-Table Bucket Selection}
\label{lst:datastruct:Resizable Hash-Table Bucket Selection}
\end{listing}
\begin{fcvref}[ln:datastruct:hash_resize:get_bucket]
Bucket selection is shown in
\cref{lst:datastruct:Resizable Hash-Table Bucket Selection},
with \co{ht_get_bucket()} on
\clnrefrange{single:b}{single:e} and \co{ht_search_bucket()} on
\clnrefrange{hsb:b}{hsb:e}.
The \co{ht_get_bucket()} function returns a reference to the bucket
corresponding to the specified key in the specified hash table, without
making any allowances for resizing.
It also stores the bucket index corresponding to the key into the location
referenced by parameter~\co{b} on
\clnref{single:gethash}, and the key's hash value into the location
referenced by parameter~\co{h} (if non-\co{NULL}) on \clnref{single:h}.
\Clnref{single:return} then returns a reference to the corresponding bucket.
The \co{ht_search_bucket()} function searches for the specified key
within the specified hash-table version.
\Clnref{hsb:get_curbkt} obtains a reference to the bucket corresponding
to the specified key.
The loop spanning \clnrefrange{hsb:loop:b}{hsb:loop:e} searches
that bucket, so that if \clnref{hsb:match} detects a match,
\clnref{hsb:ret_match} returns a pointer to the enclosing data element.
Otherwise, if there is no match,
\clnref{hsb:ret_NULL} returns \co{NULL} to indicate
failure.
\end{fcvref}
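
For readers without the listing close to hand, the logic just described
can be sketched as follows.
This is a simplified rendition rather than the verbatim contents of
\path{hash_resize.c}; the \co{cds_list_for_each_entry_rcu()} iterator
is liburcu's RCU-protected list traversal.

\begin{VerbatimL}
static struct ht_bucket *ht_get_bucket(struct ht *htp, void *key,
                                       long *b, unsigned long *h)
{
	unsigned long hash = htp->ht_gethash(key);

	*b = hash % htp->ht_nbuckets;	/* Store the bucket index... */
	if (h)
		*h = hash;		/* ...and the hash, if requested. */
	return &htp->ht_bkt[*b];	/* Return the bucket's address. */
}

static struct ht_elem *ht_search_bucket(struct ht *htp, void *key)
{
	long b;
	struct ht_elem *htep;
	struct ht_bucket *htbp = ht_get_bucket(htp, key, &b, NULL);

	cds_list_for_each_entry_rcu(htep, &htbp->htb_head,
				    hte_next[htp->ht_idx]) {
		if (htp->ht_cmp(htep, key))
			return htep;	/* Match: return the element. */
	}
	return NULL;			/* No match: indicate failure. */
}
\end{VerbatimL}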
\QuickQuiz{
How does the code in
\cref{lst:datastruct:Resizable Hash-Table Bucket Selection}
protect against the resizing process progressing past the
selected bucket?
}\QuickQuizAnswer{
It does not provide any such protection.
That is instead the job of the update-side concurrency-control
functions described next.
}\QuickQuizEnd
This implementation of \co{ht_get_bucket()} and \co{ht_search_bucket()}
permits lookups and modifications to run concurrently with a resize
operation.
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize@lock_unlock_mod.fcv}
\caption{Resizable Hash-Table Update-Side Concurrency Control}
\label{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}
\end{listing}
Read-side concurrency control is provided by RCU as was shown in
\cref{lst:datastruct:RCU-Protected Hash-Table Read-Side Concurrency Control},
but the update-side concurrency\-/control functions
\co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()}
must now deal with the possibility of a
concurrent resize operation as shown in
\cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}.
\begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod:l]
The \co{hashtab_lock_mod()} function spans
\clnrefrange{b}{e} in the listing.
\Clnref{rcu_lock} enters an RCU read-side critical section to prevent
the data structures from being freed during the traversal,
\clnref{refhashtbl} acquires a reference to the current hash table, and then
\clnref{refbucket} obtains a reference to the bucket in this hash table
corresponding to the key.
\Clnref{acq_bucket} acquires that bucket's lock, which will prevent any concurrent
resizing operation from distributing that bucket, though of course it
will have no effect if that bucket has already been distributed.
\Clnrefrange{lsp0b}{lsp0e} store the bucket pointer and
pointer-set index into their respective fields in the
\co{ht_lock_state} structure, which communicates the information to
\co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}.
\Clnref{ifresized} then checks to see if a concurrent resize
operation has already distributed this bucket across the new hash table,
and if not, \clnref{lsp1_1} indicates that there is no
already-resized hash bucket and
\clnref{fastret1} returns with the selected hash bucket's
lock held (thus preventing a concurrent resize operation from distributing
this bucket) and also within an RCU read-side critical section.
\IX{Deadlock} is avoided because the old table's locks are always acquired
before those of the new table, and because the use of RCU prevents more
than two versions from existing at a given time, thus preventing a
deadlock cycle.
Otherwise, a concurrent resize operation has already distributed this
bucket, so \clnref{new_hashtbl} proceeds to the new hash table,
\clnref{get_newbkt} selects the bucket corresponding to the key,
and \clnref{acq_newbkt} acquires the bucket's lock.
\Clnrefrange{lsp1b}{lsp1e} store the new-table bucket pointer and
pointer-set index into their respective fields in the
\co{ht_lock_state} structure, which again communicates this information to
\co{hashtab_add()}, \co{hashtab_del()}, and \co{hashtab_unlock_mod()}.
Because this bucket has already been resized and because
\co{hashtab_add()} and \co{hashtab_del()} affect both the old and the
new \co{ht_bucket} structures, two locks are held, one on each of the
two buckets.
Additionally, both elements of each array in the \co{ht_lock_state} structure
are used, with the \co{[0]} element pertaining to the old \co{ht_bucket}
structure and the \co{[1]} element pertaining to the new structure.
Once again, \co{hashtab_lock_mod()} exits within an RCU read-side critical
section.
\end{fcvref}
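
The following sketch summarizes this control flow.
As before, it is a simplified rendition whose locking and RCU primitives
are assumptions, not code copied from \path{hash_resize.c}.

\begin{VerbatimL}
static void hashtab_lock_mod(struct hashtab *htp_master, void *key,
			     struct ht_lock_state *lsp)
{
	long b;
	struct ht *htp;
	struct ht_bucket *htbp;

	rcu_read_lock();		/* Prevent tables being freed. */
	htp = rcu_dereference(htp_master->ht_cur);
	htbp = ht_get_bucket(htp, key, &b, NULL);
	spin_lock(&htbp->htb_lock);	/* Block resizing of this bucket. */
	lsp->hbp[0] = htbp;
	lsp->hls_idx[0] = htp->ht_idx;
	if (b > READ_ONCE(htp->ht_resize_cur)) {
		lsp->hbp[1] = NULL;	/* Bucket not yet distributed. */
		return;			/* Fastpath: one lock held. */
	}
	/* Bucket already distributed: old lock first, then new. */
	htp = rcu_dereference(htp->ht_new);
	htbp = ht_get_bucket(htp, key, &b, NULL);
	spin_lock(&htbp->htb_lock);
	lsp->hbp[1] = htbp;
	lsp->hls_idx[1] = htp->ht_idx;
}
\end{VerbatimL}

In the common no-resize case, only the first lock is acquired,
keeping the fastpath cheap.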
\begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod:ul]
The \co{hashtab_unlock_mod()} function releases the lock(s) acquired by
\co{hashtab_lock_mod()}.
\Clnref{relbkt0} releases the lock on the old \co{ht_bucket} structure.
In the unlikely event that \clnref{ifbkt1} determines that a resize
operation is in progress, \clnref{relbkt1} releases the lock on the
new \co{ht_bucket} structure.
Either way, \clnref{rcu_unlock} exits the RCU read-side critical
section.
\end{fcvref}
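
In sketch form (again simplified):

\begin{VerbatimL}
static void hashtab_unlock_mod(struct ht_lock_state *lsp)
{
	spin_unlock(&lsp->hbp[0]->htb_lock);	/* Old-table bucket. */
	if (lsp->hbp[1])			/* Resize in progress? */
		spin_unlock(&lsp->hbp[1]->htb_lock);
	rcu_read_unlock();
}
\end{VerbatimL}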
\QuickQuiz{
Suppose that one thread is inserting an element into the
hash table during a resize operation.
What prevents this insertion from being lost due to a subsequent
resize operation completing before the insertion does?
}\QuickQuizAnswer{
The second resize operation will not be able to move beyond
the bucket into which the insertion is taking place due to
the insertion holding the lock(s) on one or both of the hash
buckets in the hash tables.
Furthermore, the insertion operation takes place within an
RCU read-side critical section.
As we will see when we examine the \co{hashtab_resize()}
function, this allows each resize operation to use
\co{synchronize_rcu()} invocations to wait for the insertion's
read-side critical section to complete.
	These \co{synchronize_rcu()} invocations prevent pre-existing
	insertion operations from outliving the resize operation.
}\QuickQuizEnd
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize@access.fcv}
\caption{Resizable Hash-Table Access Functions}
\label{lst:datastruct:Resizable Hash-Table Access Functions}
\end{listing}
Now that we have bucket selection and concurrency control in place,
we are ready to search and update our resizable hash table.
The \co{hashtab_lookup()}, \co{hashtab_add()}, and \co{hashtab_del()}
functions are shown in
\cref{lst:datastruct:Resizable Hash-Table Access Functions}.
\begin{fcvref}[ln:datastruct:hash_resize:access:lkp]
The \co{hashtab_lookup()} function on
\clnrefrange{b}{e} of the listing does
hash lookups.
\Clnref{get_curtbl} fetches the current hash table and
\clnref{get_curbkt} searches the bucket corresponding to the
specified key.
\Clnref{ret} returns a pointer to the searched-for element
or \co{NULL} when the search fails.
The caller must be within an RCU read-side critical section.
\end{fcvref}
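
In sketch form, \co{hashtab_lookup()} reduces to little more than a
version dereference followed by a bucket search (simplified, as with
the previous sketches):

\begin{VerbatimL}
struct ht_elem *hashtab_lookup(struct hashtab *htp_master, void *key)
{
	struct ht *htp = rcu_dereference(htp_master->ht_cur);

	/* Caller must be within an RCU read-side critical section. */
	return ht_search_bucket(htp, key);	/* NULL if not found. */
}
\end{VerbatimL}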
\QuickQuiz{
The \co{hashtab_lookup()} function in
\cref{lst:datastruct:Resizable Hash-Table Access Functions}
ignores concurrent resize operations.
Doesn't this mean that readers might miss an element that was
previously added during a resize operation?
}\QuickQuizAnswer{
No.
As we will see soon,
the \co{hashtab_add()} and \co{hashtab_del()} functions
keep the old hash table up-to-date while a resize operation
is in progress.
}\QuickQuizEnd
\begin{fcvref}[ln:datastruct:hash_resize:access:add]
The \co{hashtab_add()} function on \clnrefrange{b}{e} of the listing adds
new data elements to the hash table.
\Clnref{htbp} picks up the current \co{ht_bucket} structure into which the
new element is to be added, and \clnref{i} picks up the index of
the pointer pair.
\Clnref{add} adds the new element to the current hash bucket.
If \clnref{ifnew} determines that this bucket has been distributed
to a new version of the hash table, then \clnref{addnew} also adds the
new element to the corresponding new bucket.
The caller is required to handle concurrency, for example, by invoking
\co{hashtab_lock_mod()} before the call to \co{hashtab_add()} and invoking
\co{hashtab_unlock_mod()} afterwards.
\end{fcvref}
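
A simplified sketch, again assuming the liburcu \co{cds_list} API:

\begin{VerbatimL}
void hashtab_add(struct ht_elem *htep, struct ht_lock_state *lsp)
{
	struct ht_bucket *htbp = lsp->hbp[0];	/* Current bucket. */
	int i = lsp->hls_idx[0];		/* Current pointer-set index. */

	cds_list_add_rcu(&htep->hte_next[i], &htbp->htb_head);
	if ((htbp = lsp->hbp[1]))		/* Bucket distributed? */
		cds_list_add_rcu(&htep->hte_next[!i], &htbp->htb_head);
}
\end{VerbatimL}

Because \co{i} and \co{!i} select different elements of the
\co{->hte_next[]} array, a given element can appear in both the old
and new versions of the hash table without the two lists interfering.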
\begin{fcvref}[ln:datastruct:hash_resize:access:del]
The \co{hashtab_del()} function on
\clnrefrange{b}{e} of the listing removes
an existing element from the hash table.
\Clnref{i} picks up the index of the pointer pair
and \clnref{del} removes the specified element from the current table.
If \clnref{ifnew} determines that this bucket has been distributed
to a new version of the hash table, then \clnref{delnew} also removes
the specified element from the corresponding new bucket.
As with \co{hashtab_add()}, the caller is responsible for concurrency
control and this concurrency control suffices for synchronizing with
a concurrent resize operation.
\end{fcvref}
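
The corresponding simplified sketch:

\begin{VerbatimL}
void hashtab_del(struct ht_elem *htep, struct ht_lock_state *lsp)
{
	int i = lsp->hls_idx[0];	/* Current pointer-set index. */

	cds_list_del_rcu(&htep->hte_next[i]);
	if (lsp->hbp[1])		/* Bucket distributed? */
		cds_list_del_rcu(&htep->hte_next[!i]);
}
\end{VerbatimL}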
\QuickQuiz{
The \co{hashtab_add()} and \co{hashtab_del()} functions in
\cref{lst:datastruct:Resizable Hash-Table Access Functions}
	can update two hash buckets while a resize operation is in progress.
	This might cause poor performance if the frequency of resize operations
	is not negligible.
Isn't it possible to reduce the cost of updates in such cases?
}\QuickQuizAnswer{
Yes, at least assuming that a slight increase in the cost of
\co{hashtab_lookup()} is acceptable.
One approach is shown in
\cref{lst:datastruct:Resizable Hash-Table Access Functions (Fewer Updates),%
lst:datastruct:Resizable Hash-Table Update-Side Locking Function (Fewer Updates)}
(\path{hash_resize_s.c}).
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize_s@access.fcv}
\caption{Resizable Hash-Table Access Functions (Fewer Updates)}
\label{lst:datastruct:Resizable Hash-Table Access Functions (Fewer Updates)}
\end{listing}
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_resize_s@lock_mod.fcv}
\caption{Resizable Hash-Table Update-Side Locking Function (Fewer Updates)}
\label{lst:datastruct:Resizable Hash-Table Update-Side Locking Function (Fewer Updates)}
\end{listing}
	This version of \co{hashtab_add()} adds an element either to
	the old bucket, if that bucket has not yet been distributed, or
	to the new bucket, if it has, and \co{hashtab_del()} removes
	the specified element from any bucket into which it has been inserted.
The \co{hashtab_lookup()} function searches the new bucket
if the search of the old bucket fails, which has the disadvantage
of adding overhead to the lookup fastpath.
The alternative \co{hashtab_lock_mod()} returns the locking
state of the new bucket in \co{->hbp[0]} and \co{->hls_idx[0]}
	if a resize operation is in progress, instead of the perhaps
more natural choice of \co{->hbp[1]} and \co{->hls_idx[1]}.
However, this less-natural choice has the advantage of simplifying
\co{hashtab_add()}.
Further analysis of the code is an exercise for the reader.
}\QuickQuizEnd
\begin{listing*}
\input{CodeSamples/datastruct/hash/hash_resize@resize.fcv}
\caption{Resizable Hash-Table Resizing}
\label{lst:datastruct:Resizable Hash-Table Resizing}
\end{listing*}
\begin{fcvref}[ln:datastruct:hash_resize:resize]
The actual resizing itself is carried out by \co{hashtab_resize()}, shown in
\cref{lst:datastruct:Resizable Hash-Table Resizing} on
\cpageref{lst:datastruct:Resizable Hash-Table Resizing}.
\Clnref{trylock} conditionally acquires the top-level \co{->ht_lock}, and if
this acquisition fails, \clnref{ret_busy} returns \co{-EBUSY} to indicate that
a resize is already in progress.
Otherwise, \clnref{get_curtbl} picks up a reference to the current hash table,
and \clnrefrange{alloc:b}{alloc:e} allocate a new hash table of the desired size.
If a new set of hash/key functions has been specified, these functions
are used for the new table; otherwise, those of the old table are preserved.
If \clnref{chk_nomem} detects memory-allocation failure,
\clnref{rel_nomem} releases \co{->ht_lock}
and \clnref{ret_nomem} returns a failure indication.
\Clnref{get_curidx} picks up the current table's index and
\clnref{put_curidx} stores its inverse to
the new hash table, thus ensuring that the two hash tables avoid overwriting
each other's linked lists.
\Clnref{set_newtbl} then starts the bucket-distribution process by
installing a reference to the new table into the \co{->ht_new} field of
the old table.
\Clnref{sync_rcu} ensures that all readers who are not aware of the
new table complete before the resize operation continues.
Each pass through the loop spanning \clnrefrange{loop:b}{loop:e} distributes the contents
of one of the old hash table's buckets into the new hash table.
\Clnref{get_oldcur} picks up a reference to the old table's current bucket
and \clnref{acq_oldcur} acquires that bucket's spinlock.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:datastruct:hash_resize:resize]
In the \co{hashtab_resize()} function in
\cref{lst:datastruct:Resizable Hash-Table Resizing},
what guarantees that the update to \co{->ht_new} on \clnref{set_newtbl}
will be seen as happening before the update to \co{->ht_resize_cur}
on \clnref{update_resize} from the perspective of
\co{hashtab_add()} and \co{hashtab_del()}?
In other words, what prevents \co{hashtab_add()}
and \co{hashtab_del()} from dereferencing
a \co{NULL} pointer loaded from \co{->ht_new}?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:datastruct:hash_resize:resize]
The \co{synchronize_rcu()} on \clnref{sync_rcu} of
\cref{lst:datastruct:Resizable Hash-Table Resizing}
ensures that all pre-existing RCU readers have completed between
the time that we install the new hash-table reference on
\clnref{set_newtbl} and the time that we update \co{->ht_resize_cur} on
\clnref{update_resize}.
This means that any reader that sees a non-negative value
of \co{->ht_resize_cur} cannot have started before the
assignment to \co{->ht_new}, and thus must be able to see
the reference to the new hash table.
And this is why the update-side \co{hashtab_add()} and
\co{hashtab_del()} functions must be enclosed
in RCU read-side critical sections, courtesy of
\co{hashtab_lock_mod()} and \co{hashtab_unlock_mod()} in
\cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control}.
\end{fcvref}
}\QuickQuizEnd
\begin{fcvref}[ln:datastruct:hash_resize:resize]
Each pass through the loop spanning
\clnrefrange{loop_list:b}{loop_list:e} adds one data element
from the current old-table bucket to the corresponding new-table bucket,
holding the new-table bucket's lock during the add operation.
\Clnref{update_resize} updates
\co{->ht_resize_cur} to indicate that this bucket has been distributed.
Finally, \clnref{rel_oldcur} releases the old-table bucket lock.
Execution reaches \clnref{rcu_assign} once all old-table buckets have been distributed
across the new table.
\Clnref{rcu_assign} installs the newly created table as the current one, and
\clnref{sync_rcu_2} waits for all old readers (who might still be referencing
the old table) to complete.
Then \clnref{rel_master} releases the resize-serialization lock,
\clnref{free} frees
the old hash table, and finally \clnref{ret_success} returns success.
\end{fcvref}
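
Pulling these steps together yields the following sketch of
\co{hashtab_resize()}.
The \co{ht_alloc()} helper, assumed to allocate and initialize a table
with the specified number of buckets and functions, is hypothetical,
and as with the previous sketches, this rendition is simplified rather
than copied from \path{hash_resize.c}.

\begin{VerbatimL}
int hashtab_resize(struct hashtab *htp_master, unsigned long nbuckets,
		   int (*cmp)(struct ht_elem *htep, void *key),
		   unsigned long (*gethash)(void *key),
		   void *(*getkey)(struct ht_elem *htep))
{
	long b;
	int i;
	int idx;
	struct ht *htp;
	struct ht *htp_new;
	struct ht_elem *htep;
	struct ht_bucket *htbp;
	struct ht_bucket *htbp_new;

	if (!spin_trylock(&htp_master->ht_lock))
		return -EBUSY;		/* Resize already in progress. */
	htp = htp_master->ht_cur;
	htp_new = ht_alloc(nbuckets,	/* Hypothetical allocator. */
			   cmp ? cmp : htp->ht_cmp,
			   gethash ? gethash : htp->ht_gethash,
			   getkey ? getkey : htp->ht_getkey);
	if (htp_new == NULL) {
		spin_unlock(&htp_master->ht_lock);
		return -ENOMEM;
	}
	idx = htp->ht_idx;
	htp_new->ht_idx = !idx;		/* Use the other pointer set. */
	rcu_assign_pointer(htp->ht_new, htp_new);
	synchronize_rcu();		/* Wait for unaware readers. */
	for (i = 0; i < htp->ht_nbuckets; i++) {
		htbp = &htp->ht_bkt[i];
		spin_lock(&htbp->htb_lock);
		cds_list_for_each_entry(htep, &htbp->htb_head,
					hte_next[idx]) {
			htbp_new = ht_get_bucket(htp_new,
						 htp_new->ht_getkey(htep),
						 &b, NULL);
			spin_lock(&htbp_new->htb_lock);
			cds_list_add_rcu(&htep->hte_next[!idx],
					 &htbp_new->htb_head);
			spin_unlock(&htbp_new->htb_lock);
		}
		WRITE_ONCE(htp->ht_resize_cur, i); /* Bucket distributed. */
		spin_unlock(&htbp->htb_lock);
	}
	rcu_assign_pointer(htp_master->ht_cur, htp_new);
	synchronize_rcu();		/* Wait for old-table readers. */
	spin_unlock(&htp_master->ht_lock);
	free(htp);			/* Free the old table. */
	return 0;
}
\end{VerbatimL}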
\QuickQuiz{
\begin{fcvref}[ln:datastruct:hash_resize:resize]
Why is there a \co{WRITE_ONCE()} on \clnref{update_resize}
in \cref{lst:datastruct:Resizable Hash-Table Resizing}?
\end{fcvref}
}\QuickQuizAnswer{
\begin{fcvref}[ln:datastruct:hash_resize:lock_unlock_mod]
Together with the \co{READ_ONCE()}
on \clnref{l:ifresized} in \co{hashtab_lock_mod()}
of \cref{lst:datastruct:Resizable Hash-Table Update-Side Concurrency Control},
	it tells the compiler that the non-initialization accesses
	to \co{->ht_resize_cur} must be preserved, because reads
	from \co{->ht_resize_cur} really can race with writes,
	just not in a way that changes the outcomes of the ``if'' conditions.
\end{fcvref}
}\QuickQuizEnd
\subsection{Resizable Hash Table Discussion}
\label{sec:datastruct:Resizable Hash Table Discussion}
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.resize.2020.09.05a/perftestresize}}
\caption{Overhead of Resizing Hash Tables Between 262,144 and 524,288 Buckets vs.\@ Total Number of Elements}
\label{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}
\end{figure}
\Cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}
compares resizable hash tables to their fixed-sized counterparts
for 262,144 and 2,097,152 elements in the hash table.
The figure shows three traces for each element count, one
for a fixed-size 262,144-bucket hash table, another for a
fixed-size 524,288-bucket hash table, and a third for a resizable
hash table that shifts back and forth between 262,144 and 524,288
buckets, with a one-millisecond pause between each resize operation.
The uppermost three traces are for the 262,144-element hash table.\footnote{
You see only two traces?
The dashed one is composed of two traces that differ
only slightly, hence the irregular-looking dash pattern.}
The dashed trace corresponds to the two fixed-size hash tables,
and the solid trace to the resizable hash table.
In this case, the short hash chains cause normal lookup overhead
to be so low that the overhead of resizing dominates over most
of the range.
In particular, the entire hash table fits into L3 cache.
% Nevertheless, the larger fixed-size hash table has a significant
% performance advantage, so that resizing can be quite beneficial,
% at least given sufficient time between resizing operations: One
% millisecond is clearly too short a time.
The lower three traces are for the 2,097,152-element hash table.
The upper dashed trace corresponds to the 262,144-bucket fixed-size
hash table; the solid trace, which runs in the middle at low CPU counts
and at the bottom at high CPU counts, corresponds to the resizable hash
table; and the remaining trace corresponds to the 524,288-bucket
fixed-size hash table.
The fact that there is now an average of eight elements per bucket
can only be expected to produce a sharp decrease in performance,
as indeed the graph shows.
But worse yet, the hash-table elements occupy 128\,MB, which overflows
each socket's 39\,MB L3 cache, with performance consequences analogous
to those described in \cref{sec:cpu:Costs of Operations}.
The resulting cache overflow means that the memory system is involved
even for a read-only benchmark, and as you can see from the sublinear
portions of the lower three traces, the memory system can be a serious
bottleneck.
\QuickQuiz{
How much of the difference in performance between the large and
small hash tables shown in
\cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}
was due to long hash chains and how much was due to
memory-system bottlenecks?
}\QuickQuizAnswer{
The easy way to answer this question is to do another run with
2,097,152 elements, but this time also with 2,097,152 buckets,
thus bringing the average number of elements per bucket back down
to unity.
\begin{figure}
\centering
\resizebox{2.7in}{!}{\includegraphics{CodeSamples/datastruct/hash/data/hps.resize.2020.09.27a/perftestresizebig}}
\caption{Effect of Memory-System Bottlenecks on Hash Tables}
\label{fig:datastruct:Effect of Memory-System Bottlenecks on Hash Tables}
\end{figure}
The results are shown by the triple-dashed new trace in
the middle of
\cref{fig:datastruct:Effect of Memory-System Bottlenecks on Hash Tables}.
The other six traces are identical to their counterparts in
\cref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}
on \cpageref{fig:datastruct:Overhead of Resizing Hash Tables Between 262;144 and 524;288 Buckets vs. Total Number of Elements}.
The gap between this new trace and the lower set of three
traces is a rough measure of how much of the difference in
performance was due to hash-chain length, and the gap between
the new trace and the upper set of three traces is a rough measure
of how much of that difference was due to memory-system bottlenecks.
The new trace starts out slightly below its 262,144-element
counterpart at a single CPU, showing that cache capacity is
degrading performance slightly even on that single CPU\@.\footnote{
Yes, as far as hardware architects are concerned,
caches are part of the memory system.}
This is to be expected, given that unlike its smaller counterpart,
the 2,097,152-bucket hash table does not fit into the L3 cache.
This new trace rises just past 28~CPUs, which is also to be
expected.
This rise is due to the fact that the 29\textsuperscript{th}
CPU is on another socket, which brings with it an additional
39\,MB of cache as well as additional memory bandwidth.
	But the large hash table's advantage over the hash table
	with 524,288~buckets (but still 2,097,152 elements) decreases
with additional CPUs, which is consistent with the bottleneck
residing in the memory system.
Above about 400~CPUs, the 2,097,152-bucket hash table is
actually outperformed slightly by the 524,288-bucket hash
table.
This should not be a surprise because the memory system is
the bottleneck and the larger number of buckets increases this
workload's memory footprint.
The alert reader will have noted the word ``rough'' above
and might be interested in a more detailed analysis.
Such readers are invited to run similar benchmarks, using
whatever performance counters or hardware-analysis tools
they might have available.
This can be a long and complex journey, but those brave enough
to embark on it will be rewarded with detailed knowledge of
hardware performance and its effect on software.
}\QuickQuizEnd
Referring to the last column of
\cref{tab:cpu:CPU 0 View of Synchronization Mechanisms on 8-Socket System With Intel Xeon Platinum 8176 CPUs at 2.10GHz},
we recall that the first 28~CPUs are in the first socket, on a
one-CPU-per-core basis, which explains the sharp decrease in performance
of the resizable hash table beyond 28~CPUs.
Sharp though this decrease is, please recall that it is due to constant
resizing back and forth.
It would clearly be better to resize once to 524,288 buckets,
or, even better, do a single eight-fold resize to 2,097,152 buckets,
thus dropping the average number of elements per bucket down to the
level enjoyed by the runs producing the upper three traces.
The key point from this data is that the RCU-protected resizable hash
table performs and scales almost as well as does its fixed-size counterpart.
The performance during an actual resize operation of course suffers
somewhat due to the cache misses caused by the updates to each element's
pointers, and this effect is most pronounced when the memory system
becomes a bottleneck.
This indicates that hash tables should be resized by substantial amounts,
and that hysteresis should be applied to prevent performance degradation
due to too-frequent resize operations.
In memory-rich environments, hash-table sizes should furthermore
be increased much more aggressively than they are decreased.
Another key point is that although the \co{hashtab} structure is
non-partitionable, it is also read-mostly, which suggests the use
of RCU\@.
Given that the performance and scalability of this resizable hash table are
very nearly those of RCU-protected fixed-sized hash tables, we must
conclude that this approach was quite successful.
Finally, it is important to note that insertions, deletions, and
lookups can proceed concurrently with a resize operation.
This concurrency is
critically important when resizing large hash tables, especially
for applications that must meet severe response-time constraints.
Of course, the \co{ht_elem} structure's
pair of pointer sets does impose some memory overhead,
which is taken up in the next section.
\subsection{Other Resizable Hash Tables}
\label{sec:datastruct:Other Resizable Hash Tables}
One shortcoming of the resizable hash table described earlier in this
section is memory consumption.
Each data element has two pairs of linked-list pointers rather than just
one.
Is it possible to create an RCU-protected resizable hash table that
makes do with just one pair?
It turns out that the answer is ``yes''.
Josh Triplett et al.~\cite{Triplett:2011:RPHash}
produced a \emph{relativistic hash table} that incrementally
splits and combines corresponding hash chains so that readers always
see valid hash chains at all points during the resizing operation.
This incremental splitting and combining relies on the fact that it
is harmless for a reader to see a data element that should be in some
other hash chain:
When this happens, the reader will simply ignore the extraneous data
element due to key mismatches.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/zipperhashshrink}}
\caption{Shrinking a Relativistic Hash Table}
\label{fig:datastruct:Shrinking a Relativistic Hash Table}
\end{figure}
The process of shrinking a relativistic hash table by a factor of two
is shown in
\cref{fig:datastruct:Shrinking a Relativistic Hash Table},
in this case shrinking a two-bucket hash table into a one-bucket
hash table, otherwise known as a linear list.
This process works by coalescing pairs of buckets in the old larger hash
table into single buckets in the new smaller hash table.
For this process to work correctly, we clearly need to constrain the hash
functions for the two tables.
One such constraint is to use the same underlying hash function for
both tables, but to throw out the low-order bit when shrinking from
large to small.
For example, the old two-bucket hash table would use the low-order bit
of the hash value, so that even-numbered elements go to one bucket and
odd-numbered elements to the other, while the new one-bucket hash table
ignores that bit entirely, placing all elements in its single bucket.
In this way, a given pair of adjacent even and odd buckets in the old
large hash table can be coalesced into a single bucket in the new small
hash table, while still having a single hash value cover all of the
elements in that single bucket.
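
For example, given small-integer keys, the following pair of hash
functions meets this constraint.
The function names are hypothetical, invented for this illustration:

\begin{VerbatimL}
/* Old two-bucket table: select on the key's low-order bit. */
static unsigned long gethash_two_buckets(unsigned long key)
{
	return key & 0x1;	/* Even: bucket 0, odd: bucket 1. */
}

/* New one-bucket table: the low-order bit is thrown out. */
static unsigned long gethash_one_bucket(unsigned long key)
{
	(void)key;		/* Only one bucket remains. */
	return 0;
}
\end{VerbatimL}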
The initial state~(a) is shown at the top of the figure, with time
advancing from top to bottom.
The shrinking process begins by allocating the new smaller array of
buckets, and having each bucket of this new smaller array reference
the first element of one of the buckets of the corresponding pair in
the old large hash table, resulting in state~(b).
Then the two hash chains are linked together, resulting in state~(c).
In this state, readers looking up an even-numbered element see no change,
and readers looking up elements~1 and~3 likewise see no change.
However, readers looking up some other odd number will also traverse
elements~0 and~2.
This is harmless because any odd number will compare not-equal to these
two elements.
There is some performance loss, but on the other hand, this is exactly
the same performance loss that will be experienced once the new small
hash table is fully in place.
Next, the new small hash table is made accessible to readers, resulting
in state~(d).
Note that older readers might still be traversing the old large hash
table, so in this state both hash tables are in use.
The next step is to wait for all pre-existing readers to complete,
resulting in state~(e).
In this state, all readers are using the new small hash table, so that
the old large hash table's buckets may be freed, resulting in the final
state~(f).
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{datastruct/zipperhashgrow}}
\caption{Growing a Relativistic Hash Table}
\label{fig:datastruct:Growing a Relativistic Hash Table}
\end{figure}
Growing a relativistic hash table reverses the shrinking process,
but requires more grace-period steps, as shown in
\cref{fig:datastruct:Growing a Relativistic Hash Table}.
The initial state~(a) is at the top of this figure, with time advancing
from top to bottom.
We start by allocating the new large two-bucket hash table, resulting
in state~(b).
Note that each of these new buckets references the first element destined
for that bucket.
These new buckets are published to readers, resulting in state~(c).
After a grace-period operation, all readers are using the new large
hash table, resulting in state~(d).
In this state, only those readers traversing the even-valued hash bucket
traverse element~0, which is therefore now colored white.
At this point, the old small hash buckets may be freed, although many
implementations use these old buckets to track progress ``unzipping''
the list of items into their respective new buckets.
The last even-numbered element in the first consecutive run of such
elements now has its pointer-to-next updated to reference the following
even-numbered element.
After a subsequent grace-period operation, the result is state~(e).
The vertical arrow indicates the next element to be unzipped, and
element~1 is now colored black to indicate that only those readers traversing
the odd-valued hash bucket may reach it.
Next, the last odd-numbered element in the first consecutive run of such
elements now has its pointer-to-next updated to reference the following
odd-numbered element.
After a subsequent grace-period operation, the result is state~(f).
A final unzipping operation (including a grace-period operation)
results in the final state~(g).
In short, the relativistic hash table reduces the number of per-element
list pointers at the expense of additional grace periods incurred during
resizing.
These additional grace periods are usually not a problem because
insertions, deletions, and lookups may proceed concurrently with
a resize operation.
However, when a hash table is growing quickly, the large number of
grace periods might delay the beginning of the next resize operation.
One way to reduce the number of grace periods is to unzip all the
buckets concurrently, so that the number of grace periods will be
limited by the largest number of elements in a single bucket instead
of by the total number of elements in the hash table.
Another approach, also by Josh Triplett, is to maintain a separate
pointer to the element currently being moved from the old bucket to
the new bucket.
Readers then check the old bucket, the separate pointer, and the new
bucket in order.
This requires read-side memory ordering, which slows readers, but
it greatly reduces the required number of resize-time grace periods.
This approach is used within the Linux kernel.
It turns out that it is possible to reduce the per-element memory overhead
from a pair of pointers to a single pointer, while still retaining
$\O{1}$ deletions.
This is accomplished by augmenting a split-order
list~\cite{OriShalev2006SplitOrderListHash}
with RCU
protection~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUhash}.
The data elements in the hash table are arranged into a single
sorted linked list, with each hash bucket referencing the first element
in that bucket.
Elements are deleted by setting low-order bits in their pointer-to-next
fields, and these elements are removed from the list by later traversals
that encounter them.
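
The following hypothetical sketch illustrates the pointer-marking idea,
which relies on the fact that aligned allocation leaves the low-order
bits of pointers zero.
The \co{so_elem} structure and helper names are invented for
illustration, and a production-quality implementation would use atomic
read-modify-write operations rather than plain stores:

\begin{VerbatimL}
struct so_elem {
	struct so_elem *next;	/* Single pointer-to-next field. */
	unsigned long so_key;	/* Bit-reversed hash key. */
};

#define SO_DELETED 0x1UL	/* Low-order mark bit. */

/* Logically delete an element by marking its ->next field. */
static void so_mark_deleted(struct so_elem *sep)
{
	/* Production code would use compare-and-swap here. */
	sep->next = (struct so_elem *)
		    ((unsigned long)sep->next | SO_DELETED);
}

/* Recover the real successor, ignoring the mark bit. */
static struct so_elem *so_next(struct so_elem *sep)
{
	return (struct so_elem *)
	       ((unsigned long)sep->next & ~SO_DELETED);
}

/* Has this element been logically deleted? */
static int so_is_deleted(struct so_elem *sep)
{
	return (unsigned long)sep->next & SO_DELETED;
}
\end{VerbatimL}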
This RCU-protected split-order list is complex, but offers lock-free
progress guarantees for all insertion, deletion, and lookup operations.
Such guarantees can be important in real-time applications.
However, one downside is that the keys must be bit-reversed during
lookup, which slows down readers.
But for those for whom this slowdown is not a problem, an
implementation is available from recent versions of the userspace RCU
library~\cite{MathieuDesnoyers2009URCU}.
\section{Other Data Structures}
\label{sec:datastruct:Other Data Structures}
%
\epigraph{All life is an experiment.
The more experiments you make the better.}
{Ralph Waldo Emerson}
The preceding sections have focused on data structures that enhance
concurrency due to partitionability
(\cref{sec:datastruct:Partitionable Data Structures}),
efficient handling of read-mostly access patterns
(\cref{sec:datastruct:Read-Mostly Data Structures}),
or application of read-mostly techniques to avoid
non-partitionability
(\cref{sec:datastruct:Non-Partitionable Data Structures}).
This section gives a brief review of other data structures.
One of the hash table's greatest advantages for parallel use is that it
is fully partitionable, at least while not being resized.
One way of preserving the partitionability and the size independence is
to use a radix tree, which is also called a trie.
Tries partition the search key, using each successive key partition
to traverse the next level of the trie.
As such, a trie can be thought of as a set of nested hash tables,
thus providing the required partitionability.
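
For example, a trie over 32-bit keys that consumes four bits per level
might be traversed as follows.
This hypothetical sketch is in no way intended to represent the Linux
kernel's radix tree:

\begin{VerbatimL}
#define TRIE_BITS	4		   /* Key bits per level. */
#define TRIE_FANOUT	(1 << TRIE_BITS)   /* Children per node. */

struct trie_node {
	void *slot[TRIE_FANOUT];  /* Inner levels: child nodes; */
};				  /* leaf level: data pointers. */

/* Use each successive key partition to index the next level. */
static void *trie_lookup(struct trie_node *root, unsigned int key)
{
	struct trie_node *node = root;
	int shift;

	for (shift = 32 - TRIE_BITS; shift > 0; shift -= TRIE_BITS) {
		node = node->slot[(key >> shift) & (TRIE_FANOUT - 1)];
		if (node == NULL)
			return NULL;	/* No such key. */
	}
	return node->slot[key & (TRIE_FANOUT - 1)];
}
\end{VerbatimL}

Each level is a small array, in effect a perfectly partitionable hash
table with a trivial hash function.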
One disadvantage of tries is that a sparse key space can result in
inefficient use of memory.
There are a number of compression techniques that may be used to
work around this disadvantage, including hashing the key value to
a smaller keyspace before the
traversal~\cite{RobertOlsson2007Trash}.
Radix trees are heavily used in practice, including in the Linux
kernel~\cite{NickPiggin2006radixtree}.
One important special case of both a hash table and a trie is what is
perhaps the oldest of data structures, the array and its multi-dimensional
counterpart, the matrix.
The fully partitionable nature of matrices is exploited heavily in
concurrent numerical algorithms.
Self-balancing trees are heavily used in sequential code, with
AVL trees and red-black trees being perhaps the most well-known
examples~\cite{ThomasHCorman2001Algorithms}.
Early attempts to parallelize AVL trees were complex and not necessarily
all that efficient~\cite{Ellis80}.
However, more recent work on red-black trees provides better
performance and scalability by using RCU for readers and hashed arrays
of locks\footnote{
In the guise of swissTM~\cite{AleksandarDragovejic2011STMnotToy},
which is a variant of software transactional memory in which
the developer flags non-shared accesses.}
to protect reads and updates,
respectively~\cite{PhilHoward2011RCUTMRBTree,PhilipWHoward2013RCUrbtree}.
It turns out that red-black trees rebalance aggressively, which works
well for sequential programs, but not necessarily so well for parallel
use.
Recent work has therefore made use of RCU-protected ``bonsai trees''
that rebalance less aggressively~\cite{AustinClements2012RCULinux:mmapsem},
trading off optimal tree depth to gain more efficient concurrent updates.
Concurrent skip lists lend themselves well to RCU readers, and in fact
represent an early academic use of a technique resembling
RCU~\cite{Pugh90}.
% @@@ Reference skiplist chapter once added.
Concurrent double-ended queues were discussed in
\cref{sec:SMPdesign:Double-Ended Queue},
and concurrent stacks and queues have a long history~\cite{Treiber86},
though they do not normally provide the most impressive performance or
scalability.
They are nevertheless a common feature of concurrent
libraries~\cite{PaulMcKenney2013LWNURCUqueuestack}.
Researchers have recently proposed relaxing the ordering constraints
of stacks and queues~\cite{Shavit:2011:DSM:1897852.1897873},
with some work indicating that relaxed-ordered queues actually have
better ordering properties than do strict FIFO
queues~\cite{AndreasHaas2012FIFOisnt,ChristophMKirsch2012FIFOisntTR,AndreasHaas2013CFRelaxedQueues}.
It seems likely that continued work with concurrent data structures will
produce novel algorithms with surprising properties.
\section{Summary}
\label{sec:datastruct:Summary}
%
\epigraph{There's only one thing more painful than learning from
experience, and that is not learning from experience.}
{Archibald MacLeish}
This chapter has focused primarily on hash tables, including resizable
hash tables, which are not fully partitionable.
\Cref{sec:datastruct:Other Data Structures} gave a quick
overview of a few non-hash-table data structures.
Nevertheless, this exposition of hash tables is an excellent introduction
to the many issues surrounding high-performance scalable data access,
including:
\begin{enumerate}
\item Fully partitioned data structures work well on small systems,
for example, single-socket systems.
\item Larger systems require locality of reference as well as
full partitioning.
\item Read-mostly techniques, such as hazard pointers and RCU,
provide good locality of reference for read-mostly workloads,
and thus provide excellent performance and scalability even
on larger systems.
\item Read-mostly techniques also work well on some types of
non-partitionable data structures, such as resizable hash tables.
\item Large data structures can overflow CPU caches, reducing performance
and scalability.
\item Additional performance and scalability can be obtained by
specializing the data structure to a specific workload,
for example, by replacing a general key with a 32-bit integer.
\item Although requirements for portability and for extreme performance
often conflict, there are some data-structure-layout techniques
that can strike a good balance between these two sets of
requirements.
\end{enumerate}
That said, performance and scalability are of little use without reliability,
so the next chapter covers validation.
\QuickQuizAnswersChp{qqzdatastruct}