| % glossary.tex |
| % SPDX-License-Identifier: CC-BY-SA-3.0 |
| |
| \chapter{Glossary and Bibliography} |
| % |
| \Epigraph{Dictionaries are inherently circular in nature.} |
| {\emph{``Self Reference in word definitions'', |
| David~Levary~et~al.}} |
| |
| \begin{description} |
| \item[Associativity:] |
| The number of cache lines that can be held simultaneously in |
| a given cache, when all of these cache lines hash identically |
| in that cache. |
| A cache that could hold four cache lines for each possible |
| hash value would be termed a ``four-way set-associative'' cache, |
| while a cache that could hold only one cache line for each |
| possible hash value would be termed a ``direct-mapped'' cache. |
| A cache whose associativity was equal to its capacity would |
| be termed a ``fully associative'' cache. |
| Fully associative caches have the advantage of eliminating |
| associativity misses, but, due to hardware limitations, |
| fully associative caches are normally quite limited in size. |
| The associativity of the large caches found on modern microprocessors |
| typically range from two-way to eight-way. |
| \item[Associativity Miss:] |
| A cache miss incurred because the corresponding CPU has recently |
| accessed more data hashing to a given set of the cache than will |
| fit in that set. |
| Fully associative caches are not subject to associativity misses |
| (or, equivalently, in fully associative caches, associativity |
| and capacity misses are identical). |
| \item[Atomic:] |
| An operation is considered ``atomic'' if it is not possible to |
| observe any intermediate state. |
| For example, on most CPUs, a store to a properly aligned pointer |
| is atomic, because other CPUs will see either the old value or |
| the new value, but are guaranteed not to see some mixed value |
| containing some pieces of the new and old values. |
| \item[Cache:] |
| In modern computer systems, CPUs have caches in which to hold |
| frequently used data. |
| These caches can be thought of as hardware hash tables with very |
| simple hash functions, |
| but in which each hash bucket (termed a ``set'' by hardware types) |
| can hold only a limited number of data items. |
| The number of data items that can be held by each of a cache's hash |
| buckets is termed the cache's ``associativity''. |
| These data items are normally called ``cache lines'', which |
| can be thought of a fixed-length blocks of data that circulate |
| among the CPUs and memory. |
| \item[Cache Coherence:] |
| A property of most modern SMP machines where all CPUs will |
| observe a sequence of values for a given variable that is |
| consistent with at least one global order of values for |
| that variable. |
| Cache coherence also guarantees that at the end of a group |
| of stores to a given variable, all CPUs will agree |
| on the final value for that variable. |
| Note that cache coherence applies only to the series of values |
| taken on by a single variable. |
| In contrast, the memory consistency model for a given machine |
| describes the order in which loads and stores to groups of |
| variables will appear to occur. |
| See Section~\ref{sec:memorder:Cache Coherence} |
| for more information. |
| \item[Cache Coherence Protocol:] |
| A communications protocol, normally implemented in hardware, |
| that enforces memory consistency and ordering, preventing |
| different CPUs from seeing inconsistent views of data held |
| in their caches. |
| \item[Cache Geometry:] |
| The size and associativity of a cache is termed its geometry. |
| Each cache may be thought of as a two-dimensional array, |
| with rows of cache lines (``sets'') that have the same hash |
| value, and columns of cache lines (``ways'') in which every |
| cache line has a different hash value. |
| The associativity of a given cache is its number of |
| columns (hence the name ``way''---a two-way set-associative |
| cache has two ``ways''), and the size of the cache is its |
| number of rows multiplied by its number of columns. |
| \item[Cache Line:] |
| (1) The unit of data that circulates among the CPUs and memory, |
| usually a moderate power of two in size. |
| Typical cache-line sizes range from 16 to 256 bytes. \\ |
| (2) A physical location in a CPU cache capable of holding |
| one cache-line unit of data. \\ |
| (3) A physical location in memory capable of holding one |
| cache-line unit of data, but that it also aligned |
| on a cache-line boundary. |
| For example, the address of the first word of a cache line |
| in memory will end in 0x00 on systems with 256-byte cache lines. |
| \item[Cache Miss:] |
| A cache miss occurs when data needed by the CPU is not in |
| that CPU's cache. |
| The data might be missing because of a number of reasons, |
| including: |
| (1) this CPU has never accessed the data before |
| (``startup'' or ``warmup'' miss), |
| (2) this CPU has recently accessed more |
| data than would fit in its cache, so that some of the older |
| data had to be removed (``capacity'' miss), |
| (3) this CPU |
| has recently accessed more data in a given set\footnote{ |
| In hardware-cache terminology, the word ``set'' |
| is used in the same way that the word ``bucket'' |
| is used when discussing software caches.} |
| than that set could hold (``associativity'' miss), |
| (4) some other CPU has written to the data (or some other |
| data in the same cache line) since this CPU has accessed it |
| (``communication miss''), or |
| (5) this CPU attempted to write to a cache line that is |
| currently read-only, possibly due to that line being replicated |
| in other CPUs' caches. |
| \item[Capacity Miss:] |
| A cache miss incurred because the corresponding CPU has recently |
| accessed more data than will fit into the cache. |
| \item[Code Locking:] |
| A simple locking design in which a ``global lock'' is used to protect |
| a set of critical sections, so that access by a given thread |
| to that set is |
| granted or denied based only on the set of threads currently |
| occupying the set of critical sections, not based on what |
| data the thread intends to access. |
| The scalability of a code-locked program is limited by the code; |
| increasing the size of the data set will normally not increase |
| scalability (in fact, will typically \emph{decrease} scalability |
| by increasing ``lock contention''). |
| Contrast with ``data locking''. |
| \item[Communication Miss:] |
| A cache miss incurred because some other CPU has written to |
| the cache line since the last time this CPU accessed it. |
| \item[Critical Section:] |
| A section of code guarded by some synchronization mechanism, |
| so that its execution constrained by that primitive. |
| For example, if a set of critical sections are guarded by |
| the same global lock, then only one of those critical sections |
| may be executing at a given time. |
| If a thread is executing in one such critical section, |
| any other threads must wait until the first thread completes |
| before executing any of the critical sections in the set. |
| \item[Data Locking:] |
| A scalable locking design in which each instance of a given |
| data structure has its own lock. |
| If each thread is using a different instance of the |
| data structure, then all of the threads may be executing in |
| the set of critical sections simultaneously. |
| Data locking has the advantage of automatically scaling to |
| increasing numbers of CPUs as the number of instances of |
| data grows. |
| Contrast with ``code locking''. |
| \item[Direct-Mapped Cache:] |
| A cache with only one way, so that it may hold only one cache |
| line with a given hash value. |
| \item[Embarrassingly Parallel:] |
| A problem or algorithm where adding threads does not significantly |
| increase the overall cost of the computation, resulting in |
| linear speedups as threads are added (assuming sufficient |
| CPUs are available). |
| \item[Exclusive Lock:] |
| An exclusive lock is a mutual-exclusion mechanism that |
| permits only one thread at a time into the |
| set of critical sections guarded by that lock. |
| \item[False Sharing:] |
| If two CPUs each frequently write to one of a pair of data items, |
| but the pair of data items are located in the same cache line, |
| this cache line will be repeatedly invalidated, ``ping-ponging'' |
| back and forth between the two CPUs' caches. |
| This is a common cause of ``cache thrashing'', also called |
| ``cacheline bouncing'' (the latter most commonly in the Linux |
| community). |
| False sharing can dramatically reduce both performance and |
| scalability. |
| \item[Fragmentation:] |
| A memory pool that has a large amount of unused memory, but |
| not laid out to permit satisfying a relatively small request |
| is said to be fragmented. |
| External fragmentation occurs when the space is divided up |
| into small fragments lying between allocated blocks of memory, |
| while internal fragmentation occurs when specific requests or |
| types of requests have been allotted more memory than they |
| actually requested. |
| \item[Fully Associative Cache:] |
| A fully associative cache contains only |
| one set, so that it can hold any subset of |
| memory that fits within its capacity. |
| \item[Grace Period:] |
| A grace period is any contiguous time interval such that |
| any RCU read-side critical section that began before the |
| start of that interval has |
| completed before the end of that same interval. |
| Many RCU implementations define a grace period to be a |
| time interval during which each thread has passed through at |
| least one quiescent state. |
| Since RCU read-side critical sections by definition cannot |
| contain quiescent states, these two definitions are almost |
| always interchangeable. |
| \item[Heisenbug:] |
| A timing-sensitive bug that disappears from sight when you |
| add print statements or tracing in an attempt to track it |
| down. |
| \item[Hot Spot:] |
| Data structure that is very heavily used, resulting in high |
| levels of contention on the corresponding lock. |
| One example of this situation would be a hash table with |
| a poorly chosen hash function. |
| \item[Humiliatingly Parallel:] |
| A problem or algorithm where adding threads significantly |
| \emph{decreases} the overall cost of the computation, resulting in |
| large superlinear speedups as threads are added (assuming sufficient |
| CPUs are available). |
| \item[Invalidation:] |
| When a CPU wishes to write to a data item, it must first ensure |
| that this data item is not present in any other CPUs' cache. |
| If necessary, the item is removed from the other CPUs' caches |
| via ``invalidation'' messages from the writing CPUs to any |
| CPUs having a copy in their caches. |
| \item[IPI:] |
| Inter-processor interrupt, which is an |
| interrupt sent from one CPU to another. |
| IPIs are used heavily in the Linux kernel, for example, within |
| the scheduler to alert CPUs that a high-priority process is now |
| runnable. |
| \item[IRQ:] |
| Interrupt request, often used as an abbreviation for ``interrupt'' |
| within the Linux kernel community, as in ``irq handler''. |
| \item[Linearizable:] |
| A sequence of operations is ``linearizable'' if there is at |
| least one global ordering of the sequence that is consistent |
| with the observations of all CPUs and/or threads. |
| Linearizability is much prized by many researchers, but less |
| useful in practice than one might |
| expect~\cite{AndreasHaas2012FIFOisnt}. |
| \item[Lock:] |
| A software abstraction that can be used to guard critical sections, |
| as such, an example of a ``mutual exclusion mechanism''. |
| An ``exclusive lock'' permits only one thread at a time into the |
| set of critical sections guarded by that lock, while a |
| ``reader-writer lock'' permits any number of reading |
| threads, or but one writing thread, into the set of critical |
| sections guarded by that lock. (Just to be clear, the presence |
| of a writer thread in any of a given reader-writer lock's |
| critical sections will prevent any reader from entering |
| any of that lock's critical sections and vice versa.) |
| \item[Lock Contention:] |
| A lock is said to be suffering contention when it is being |
| used so heavily that there is often a CPU waiting on it. |
| Reducing lock contention is often a concern when designing |
| parallel algorithms and when implementing parallel programs. |
| \item[Memory Consistency:] |
| A set of properties that impose constraints on the order in |
| which accesses to groups of variables appear to occur. |
| Memory consistency models range from sequential consistency, |
| a very constraining model popular in academic circles, through |
| process consistency, release consistency, and weak consistency. |
| \item[MESI Protocol:] |
| The |
| cache-coherence protocol featuring |
| modified, exclusive, shared, and invalid (MESI) states, |
| so that this protocol is named after the states that the |
| cache lines in a given cache can take on. |
| A modified line has been recently written to by this CPU, |
| and is the sole representative of the current value of |
| the corresponding memory location. |
| An exclusive cache line has not been written to, but this |
| CPU has the right to write to it at any time, as the line |
| is guaranteed not to be replicated into any other CPU's cache |
| (though the corresponding location in main memory is up to date). |
| A shared cache line is (or might be) replicated in some other |
| CPUs' cache, meaning that this CPU must interact with those other |
| CPUs before writing to this cache line. |
| An invalid cache line contains no value, instead representing |
| ``empty space'' in the cache into which data from memory might |
| be loaded. |
| \item[Mutual-Exclusion Mechanism:] |
| A software abstraction that regulates threads' access to |
| ``critical sections'' and corresponding data. |
| \item[NMI:] |
| Non-maskable interrupt. |
| As the name indicates, this is an extremely high-priority |
| interrupt that cannot be masked. |
| These are used for hardware-specific purposes such as profiling. |
| The advantage of using NMIs for profiling is that it allows you |
| to profile code that runs with interrupts disabled. |
| \item[NUCA:] |
| Non-uniform cache architecture, where groups of CPUs share |
| caches and/or store buffers. |
| CPUs in a group can therefore exchange cache lines with each |
| other much more quickly than they can with CPUs in other groups. |
| Systems comprised of CPUs with hardware threads will generally |
| have a NUCA architecture. |
| \item[NUMA:] |
| Non-uniform memory architecture, where memory is split into |
| banks and each such bank is ``close'' to a group of CPUs, |
| the group being termed a ``NUMA node''. |
| An example NUMA machine is Sequent's NUMA-Q system, where |
| each group of four CPUs had a bank of memory near by. |
| The CPUs in a given group can access their memory much |
| more quickly than another group's memory. |
| \item[NUMA Node:] |
| A group of closely placed CPUs and associated memory within |
| a larger NUMA machines. |
| Note that a NUMA node might well have a NUCA architecture. |
| \item[Pipelined CPU:] |
| A CPU with a pipeline, which is |
| an internal flow of instructions internal to the CPU that |
| is in some way similar to an assembly line, with many of |
| the same advantages and disadvantages. |
| In the 1960s through the early 1980s, pipelined CPUs were the |
| province of supercomputers, but started appearing in microprocessors |
| (such as the 80486) in the late 1980s. |
| \item[Process Consistency:] |
| A memory-consistency model in which each CPU's stores appear to |
| occur in program order, but in which different CPUs might see |
| accesses from more than one CPU as occurring in different orders. |
| \item[Program Order:] |
| The order in which a given thread's instructions |
| would be executed by a now-mythical ``in-order'' CPU that |
| completely executed each instruction before proceeding to |
| the next instruction. |
| (The reason such CPUs are now the stuff of ancient myths |
| and legends is that they were extremely slow. |
| These dinosaurs were one of the many victims of |
| Moore's-Law-driven increases in CPU clock frequency. |
| Some claim that these beasts will roam the earth once again, |
| others vehemently disagree.) |
| \item[Quiescent State:] |
| In RCU, a point in the code where there can be no references held |
| to RCU-protected data structures, which is normally any point |
| outside of an RCU read-side critical section. |
| Any interval of time during which all threads pass through at |
| least one quiescent state each is termed a ``grace period''. |
| \item[Read-Copy Update (RCU):] |
| A synchronization mechanism that can be thought of as a replacement |
| for reader-writer locking or reference counting. |
| RCU provides extremely low-overhead access for readers, while |
| writers incur additional overhead maintaining old versions |
| for the benefit of pre-existing readers. |
| Readers neither block nor spin, and thus cannot participate in |
| deadlocks, however, they also can see stale data and can |
| run concurrently with updates. |
| RCU is thus best-suited for read-mostly situations where |
| stale data can either be tolerated (as in routing tables) |
| or avoided (as in the Linux kernel's System V IPC implementation). |
| \item[Read-Side Critical Section:] |
| A section of code guarded by read-acquisition of |
| some reader-writer synchronization mechanism. |
| For example, if one set of critical sections are guarded by |
| read-acquisition of |
| a given global reader-writer lock, while a second set of critical |
| section are guarded by write-acquisition of that same reader-writer |
| lock, then the first set of critical sections will be the |
| read-side critical sections for that lock. |
| Any number of threads may concurrently execute the read-side |
| critical sections, but only if no thread is executing one of |
| the write-side critical sections. |
| \item[Reader-Writer Lock:] |
| A reader-writer lock is a mutual-exclusion mechanism that |
| permits any number of reading |
| threads, or but one writing thread, into the set of critical |
| sections guarded by that lock. |
| Threads attempting to write must wait until all pre-existing |
| reading threads release the lock, and, similarly, if there |
| is a pre-existing writer, any threads attempting to write must |
| wait for the writer to release the lock. |
| A key concern for reader-writer locks is ``fairness'': |
| can an unending stream of readers starve a writer or vice versa. |
| \item[Sequential Consistency:] |
| A memory-consistency model where all memory references appear to occur |
| in an order consistent with |
| a single global order, and where each CPU's memory references |
| appear to all CPUs to occur in program order. |
| \item[Store Buffer:] |
| A small set of internal registers used by a given CPU |
| to record pending stores |
| while the corresponding cache lines are making their |
| way to that CPU. |
| Also called ``store queue''. |
| \item[Store Forwarding:] |
| An arrangement where a given CPU refers to its store buffer |
| as well as its cache so as to ensure that the software sees |
| the memory operations performed by this CPU as if they |
| were carried out in program order. |
| \item[Super-Scalar CPU:] |
| A scalar (non-vector) CPU capable of executing multiple instructions |
| concurrently. |
| This is a step up from a pipelined CPU that executes multiple |
| instructions in an assembly-line fashion---in a super-scalar |
| CPU, each stage of the pipeline would be capable of handling |
| more than one instruction. |
| For example, if the conditions were exactly right, |
| the Intel Pentium Pro CPU from the mid-1990s could |
| execute two (and sometimes three) instructions per clock cycle. |
| Thus, a 200\,MHz Pentium Pro CPU could ``retire'', or complete the |
| execution of, up to 400 million instructions per second. |
| \item[Teachable:] |
| A topic, concept, method, or mechanism that the teacher understands |
| completely and is therefore comfortable teaching. |
| \item[Transactional Memory (TM):] |
| Shared-memory synchronization scheme featuring ``transactions'', |
| each of which is an atomic sequence of operations |
| that offers atomicity, consistency, isolation, but differ from |
| classic transactions in that they do not offer |
| durability. |
| Transactional memory may be implemented either in hardware |
| (hardware transactional memory, or HTM), in software (software |
| transactional memory, or STM), or in a combination of hardware |
| and software (``unbounded'' transactional memory, or UTM). |
| \item[Unteachable:] |
| A topic, concept, method, or mechanism that the teacher does |
| not understand well is therefore uncomfortable teaching. |
| \item[Vector CPU:] |
| A CPU that can apply a single instruction to multiple items of |
| data concurrently. |
| In the 1960s through the 1980s, only supercomputers had vector |
| capabilities, but the advent of MMX in x86 CPUs and VMX in |
| PowerPC CPUs brought vector processing to the masses. |
| \item[Write Miss:] |
| A cache miss incurred because the corresponding CPU attempted |
| to write to a cache line that is read-only, most likely due |
| to its being replicated in other CPUs' caches. |
| \item[Write-Side Critical Section:] |
| A section of code guarded by write-acquisition of |
| some reader-writer synchronization mechanism. |
| For example, if one set of critical sections are guarded by |
| write-acquisition of |
| a given global reader-writer lock, while a second set of critical |
| section are guarded by read-acquisition of that same reader-writer |
| lock, then the first set of critical sections will be the |
| write-side critical sections for that lock. |
| Only one thread may execute in the write-side critical section |
| at a time, and even then only if there are no threads are |
| executing concurrently in any of the corresponding read-side |
| critical sections. |
| \end{description} |