| % glossary.tex | 
 |  | 
 | \chapter{Glossary and Bibliography} | 
 |  | 
 | \begin{description} | 
 | \item[Associativity:] | 
 | 	The number of cache lines that can be held simultaneously in | 
 | 	a given cache, when all of these cache lines hash identically | 
 | 	in that cache. | 
 | 	A cache that could hold four cache lines for each possible | 
 | 	hash value would be termed a ``four-way set-associative'' cache, | 
 | 	while a cache that could hold only one cache line for each | 
 | 	possible hash value would be termed a ``direct-mapped'' cache. | 
 | 	A cache whose associativity was equal to its capacity would | 
 | 	be termed a ``fully associative'' cache. | 
 | 	Fully associative caches have the advantage of eliminating | 
 | 	associativity misses, but, due to hardware limitations, | 
 | 	fully associative caches are normally quite limited in size. | 
	The associativity of the large caches found on modern microprocessors
	typically ranges from two-way to eight-way.
 | \item[Associativity Miss:] | 
 | 	A cache miss incurred because the corresponding CPU has recently | 
 | 	accessed more data hashing to a given set of the cache than will | 
 | 	fit in that set. | 
 | 	Fully associative caches are not subject to associativity misses | 
 | 	(or, equivalently, in fully associative caches, associativity | 
 | 	and capacity misses are identical). | 
 | \item[Atomic:] | 
 | 	An operation is considered ``atomic'' if it is not possible to | 
 | 	observe any intermediate state. | 
 | 	For example, on most CPUs, a store to a properly aligned pointer | 
 | 	is atomic, because other CPUs will see either the old value or | 
 | 	the new value, but are guaranteed not to see some mixed value | 
 | 	containing some pieces of the new and old values. | 
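
	As a minimal sketch, such an atomic pointer update might be
	expressed using the C11 atomics API (the names \texttt{gp},
	\texttt{update()}, and \texttt{get()} are hypothetical):
\begin{verbatim}
#include <stdatomic.h>

struct foo { int a; };
static struct foo *_Atomic gp; /* shared pointer */

/* Writer: other CPUs see either the old or the new pointer. */
void update(struct foo *newp)
{
        atomic_store(&gp, newp);
}

/* Reader: guaranteed never to see a half-updated pointer. */
struct foo *get(void)
{
        return atomic_load(&gp);
}
\end{verbatim}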
 | \item[Cache:] | 
 | 	In modern computer systems, CPUs have caches in which to hold | 
 | 	frequently used data. | 
 | 	These caches can be thought of as hardware hash tables with very | 
 | 	simple hash functions, | 
 | 	but in which each hash bucket (termed a ``set'' by hardware types) | 
 | 	can hold only a limited number of data items. | 
 | 	The number of data items that can be held by each of a cache's hash | 
 | 	buckets is termed the cache's ``associativity''. | 
 | 	These data items are normally called ``cache lines'', which | 
	can be thought of as fixed-length blocks of data that circulate
 | 	among the CPUs and memory. | 
 | \item[Cache Coherence:] | 
 | 	A property of most modern SMP machines where all CPUs will | 
 | 	observe a sequence of values for a given variable that is | 
 | 	consistent with at least one global order of values for | 
 | 	that variable. | 
 | 	Cache coherence also guarantees that at the end of a group | 
 | 	of stores to a given variable, all CPUs will agree | 
 | 	on the final value for that variable. | 
 | 	Note that cache coherence applies only to the series of values | 
 | 	taken on by a single variable. | 
 | 	In contrast, the memory consistency model for a given machine | 
 | 	describes the order in which loads and stores to groups of | 
 | 	variables will appear to occur. | 
 | 	See Section~\ref{sec:advsync:Single-Variable Memory Consistency} | 
 | 	for more information. | 
 | \item[Cache Coherence Protocol:] | 
 | 	A communications protocol, normally implemented in hardware, | 
 | 	that enforces memory consistency and ordering, preventing | 
 | 	different CPUs from seeing inconsistent views of data held | 
 | 	in their caches. | 
 | \item[Cache Geometry:] | 
	The size and associativity of a cache are collectively termed
	its geometry.
 | 	Each cache may be thought of as a two-dimensional array, | 
 | 	with rows of cache lines (``sets'') that have the same hash | 
 | 	value, and columns of cache lines (``ways'') in which every | 
 | 	cache line has a different hash value. | 
 | 	The associativity of a given cache is its number of | 
 | 	columns (hence the name ``way'' -- a two-way set-associative | 
	cache has two ``ways''), and the capacity of the cache, in
	cache lines, is its number of rows multiplied by its number
	of columns.
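	For example, a hypothetical cache with 64-byte cache lines,
	1,024 sets (rows), and eight ways (columns) would hold
	$1024 \times 8 = 8192$ cache lines, for a capacity of
	$8192 \times 64$ bytes, or 512\,KB.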
 | \item[Cache Line:] | 
 | 	(1) The unit of data that circulates among the CPUs and memory, | 
 | 	usually a moderate power of two in size. | 
 | 	Typical cache-line sizes range from 16 to 256 bytes. \\ | 
 | 	(2) A physical location in a CPU cache capable of holding | 
 | 	one cache-line unit of data. \\ | 
 | 	(3) A physical location in memory capable of holding one | 
	cache-line unit of data, and that is also aligned
 | 	on a cache-line boundary. | 
 | 	For example, the address of the first word of a cache line | 
 | 	in memory will end in 0x00 on systems with 256-byte cache lines. | 
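
	As a minimal sketch, assuming the 256-byte line size from the
	example above (and that line sizes are powers of two), the base
	address of the cache line containing a given address may be
	computed by masking off the low-order bits:
\begin{verbatim}
#define CACHE_LINE_SIZE 256UL /* assumed line size */

/* Round an address down to the start of its cache line. */
static inline unsigned long cache_line_base(unsigned long addr)
{
        return addr & ~(CACHE_LINE_SIZE - 1);
}
\end{verbatim}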
 | \item[Cache Miss:] | 
 | 	A cache miss occurs when data needed by the CPU is not in | 
 | 	that CPU's cache. | 
	The data might be missing for any of a number of reasons,
 | 	including: | 
 | 	(1) this CPU has never accessed the data before | 
 | 	(``startup'' or ``warmup'' miss), | 
 | 	(2) this CPU has recently accessed more | 
 | 	data than would fit in its cache, so that some of the older | 
 | 	data had to be removed (``capacity'' miss), | 
 | 	(3) this CPU | 
 | 	has recently accessed more data in a given set\footnote{ | 
 | 		In hardware-cache terminology, the word ``set'' | 
 | 		is used in the same way that the word ``bucket'' | 
 | 		is used when discussing software caches.} | 
 | 	than that set could hold (``associativity'' miss), | 
 | 	(4) some other CPU has written to the data (or some other | 
 | 	data in the same cache line) since this CPU has accessed it | 
 | 	(``communication miss''), or | 
 | 	(5) this CPU attempted to write to a cache line that is | 
 | 	currently read-only, possibly due to that line being replicated | 
 | 	in other CPUs' caches. | 
 | \item[Capacity Miss:] | 
 | 	A cache miss incurred because the corresponding CPU has recently | 
 | 	accessed more data than will fit into the cache. | 
 | \item[Code Locking:] | 
 | 	A simple locking design in which a ``global lock'' is used to protect | 
 | 	a set of critical sections, so that access by a given thread | 
 | 	to that set is | 
 | 	granted or denied based only on the set of threads currently | 
 | 	occupying the set of critical sections, not based on what | 
 | 	data the thread intends to access. | 
 | 	The scalability of a code-locked program is limited by the code; | 
	increasing the size of the data set will normally not increase
	scalability (in fact, doing so will typically \emph{decrease}
	scalability by increasing ``lock contention'').
 | 	Contrast with ``data locking''. | 
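
	For example, a minimal POSIX-threads sketch of code locking
	might guard every hash table in the program with a single
	global lock (\texttt{do\_insert()} is a hypothetical helper):
\begin{verbatim}
#include <pthread.h>

struct hash_table; /* hypothetical structure */
struct element;
void do_insert(struct hash_table *h, struct element *e);

/* One global lock guards every hash table in the program. */
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

void hash_insert(struct hash_table *h, struct element *e)
{
        pthread_mutex_lock(&hash_lock); /* all threads contend here */
        do_insert(h, e);
        pthread_mutex_unlock(&hash_lock);
}
\end{verbatim}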
 | \item[Communication Miss:] | 
	A cache miss incurred because some other CPU has written to
	the cache line since the last time this CPU accessed it.
 | \item[Critical Section:] | 
 | 	A section of code guarded by some synchronization mechanism, | 
	so that its execution is constrained by that mechanism.
 | 	For example, if a set of critical sections are guarded by | 
 | 	the same global lock, then only one of those critical sections | 
 | 	may be executing at a given time. | 
 | 	If a thread is executing in one such critical section, | 
 | 	any other threads must wait until the first thread completes | 
 | 	before executing any of the critical sections in the set. | 
 | \item[Data Locking:] | 
 | 	A scalable locking design in which each instance of a given | 
 | 	data structure has its own lock. | 
 | 	If each thread is using a different instance of the | 
 | 	data structure, then all of the threads may be executing in | 
 | 	the set of critical sections simultaneously. | 
 | 	Data locking has the advantage of automatically scaling to | 
 | 	increasing numbers of CPUs as the number of instances of | 
 | 	data grows. | 
 | 	Contrast with ``code locking''. | 
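
	In contrast with the code-locking sketch above, a data-locked
	version might place the lock into each instance (again with a
	hypothetical \texttt{do\_insert()} helper):
\begin{verbatim}
#include <pthread.h>

struct element;

struct hash_table {
        pthread_mutex_t lock; /* one lock per table instance */
        /* ... buckets ... */
};

void do_insert(struct hash_table *h, struct element *e);

void hash_insert(struct hash_table *h, struct element *e)
{
        pthread_mutex_lock(&h->lock); /* per-instance contention only */
        do_insert(h, e);
        pthread_mutex_unlock(&h->lock);
}
\end{verbatim}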
 | \item[Direct-Mapped Cache:] | 
 | 	A cache with only one way, so that it may hold only one cache | 
 | 	line with a given hash value. | 
 | \item[Embarrassingly Parallel:] | 
 | 	A problem or algorithm where adding threads does not significantly | 
 | 	increase the overall cost of the computation, resulting in | 
 | 	linear speedups as threads are added (assuming sufficient | 
 | 	CPUs are available). | 
 | \item[Exclusive Lock:] | 
 | 	An exclusive lock is a mutual-exclusion mechanism that | 
 | 	permits only one thread at a time into the | 
 | 	set of critical sections guarded by that lock. | 
 | \item[False Sharing:] | 
 | 	If two CPUs each frequently write to one of a pair of data items, | 
 | 	but the pair of data items are located in the same cache line, | 
 | 	this cache line will be repeatedly invalidated, ``ping-ponging'' | 
 | 	back and forth between the two CPUs' caches. | 
 | 	This is a common cause of ``cache thrashing'', also called | 
 | 	``cacheline bouncing'' (the latter most commonly in the Linux | 
 | 	community). | 
 | 	False sharing can dramatically reduce both performance and | 
 | 	scalability. | 
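
	A minimal sketch, assuming 64-byte cache lines and GCC-style
	alignment attributes:
\begin{verbatim}
/* Bad: both counters share a cache line, which will ping-pong. */
struct counters_bad {
        unsigned long a; /* frequently written by CPU 0 */
        unsigned long b; /* frequently written by CPU 1 */
};

/* Better: give each counter its own (assumed 64-byte) line. */
struct counters_good {
        unsigned long a __attribute__((aligned(64)));
        unsigned long b __attribute__((aligned(64)));
};
\end{verbatim}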
 | \item[Fragmentation:] | 
	A memory pool that has a large amount of unused memory, but
	whose layout prevents that memory from satisfying a relatively
	small request, is said to be fragmented.
 | 	External fragmentation occurs when the space is divided up | 
 | 	into small fragments lying between allocated blocks of memory, | 
 | 	while internal fragmentation occurs when specific requests or | 
 | 	types of requests have been allotted more memory than they | 
 | 	actually requested. | 
 | \item[Fully Associative Cache:] | 
 | 	A fully associative cache contains only | 
 | 	one set, so that it can hold any subset of | 
 | 	memory that fits within its capacity. | 
 | \item[Grace Period:] | 
 | 	A grace period is any contiguous time interval such that | 
 | 	any RCU read-side critical section that began before the | 
 | 	start of that interval has | 
 | 	completed before the end of that same interval. | 
 | 	Many RCU implementations define a grace period to be a | 
 | 	time interval during which each thread has passed through at | 
 | 	least one quiescent state. | 
 | 	Since RCU read-side critical sections by definition cannot | 
 | 	contain quiescent states, these two definitions are almost | 
 | 	always interchangeable. | 
 | \item[Heisenbug:] | 
 | 	A timing-sensitive bug that disappears from sight when you | 
 | 	add print statements or tracing in an attempt to track it | 
 | 	down. | 
 | \item[Hot Spot:] | 
 | 	Data structure that is very heavily used, resulting in high | 
 | 	levels of contention on the corresponding lock. | 
 | 	One example of this situation would be a hash table with | 
 | 	a poorly chosen hash function. | 
 | \item[Humiliatingly Parallel:] | 
 | 	A problem or algorithm where adding threads significantly | 
 | 	\emph{decreases} the overall cost of the computation, resulting in | 
 | 	large superlinear speedups as threads are added (assuming sufficient | 
 | 	CPUs are available). | 
 | \item[Invalidation:] | 
 | 	When a CPU wishes to write to a data item, it must first ensure | 
 | 	that this data item is not present in any other CPUs' cache. | 
 | 	If necessary, the item is removed from the other CPUs' caches | 
	via ``invalidation'' messages from the writing CPU to any
	CPUs having a copy in their caches.
 | \item[IPI:] | 
 | 	Inter-processor interrupt, which is an | 
 | 	interrupt sent from one CPU to another. | 
 | 	IPIs are used heavily in the Linux kernel, for example, within | 
 | 	the scheduler to alert CPUs that a high-priority process is now | 
 | 	runnable. | 
 | \item[IRQ:] | 
 | 	Interrupt request, often used as an abbreviation for ``interrupt'' | 
 | 	within the Linux kernel community, as in ``irq handler''. | 
 | \item[Linearizable:] | 
 | 	A sequence of operations is ``linearizable'' if there is at | 
 | 	least one global ordering of the sequence that is consistent | 
 | 	with the observations of all CPUs/threads. | 
 | \item[Lock:] | 
	A software abstraction that can be used to guard critical sections,
	and, as such, an example of a ``mutual-exclusion mechanism''.
 | 	An ``exclusive lock'' permits only one thread at a time into the | 
 | 	set of critical sections guarded by that lock, while a | 
 | 	``reader-writer lock'' permits any number of reading | 
 | 	threads, or but one writing thread, into the set of critical | 
 | 	sections guarded by that lock.  (Just to be clear, the presence | 
 | 	of a writer thread in any of a given reader-writer lock's | 
 | 	critical sections will prevent any reader from entering | 
 | 	any of that lock's critical sections and vice versa.) | 
 | \item[Lock Contention:] | 
 | 	A lock is said to be suffering contention when it is being | 
 | 	used so heavily that there is often a CPU waiting on it. | 
 | 	Reducing lock contention is often a concern when designing | 
 | 	parallel algorithms and when implementing parallel programs. | 
 | \item[Memory Consistency:] | 
 | 	A set of properties that impose constraints on the order in | 
 | 	which accesses to groups of variables appear to occur. | 
 | 	Memory consistency models range from sequential consistency, | 
 | 	a very constraining model popular in academic circles, through | 
 | 	process consistency, release consistency, and weak consistency. | 
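
	The classic ``store buffering'' litmus test illustrates the
	difference between these models; both variables initially zero,
	and the sketch ignores the volatile or atomic accesses that
	real code would require:
\begin{verbatim}
int x = 0, y = 0;
int r1, r2;

void cpu0(void) { x = 1; r1 = y; }
void cpu1(void) { y = 1; r2 = x; }

/*
 * Sequential consistency forbids the outcome r1 == 0 && r2 == 0,
 * but weaker models (and real hardware with store buffers)
 * permit it.
 */
\end{verbatim}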
 | \item[MESI Protocol:] | 
 | 	The | 
 | 	cache-coherence protocol featuring | 
 | 	modified, exclusive, shared, and invalid (MESI) states, | 
 | 	so that this protocol is named after the states that the | 
 | 	cache lines in a given cache can take on. | 
 | 	A modified line has been recently written to by this CPU, | 
 | 	and is the sole representative of the current value of | 
 | 	the corresponding memory location. | 
 | 	An exclusive cache line has not been written to, but this | 
 | 	CPU has the right to write to it at any time, as the line | 
 | 	is guaranteed not to be replicated into any other CPU's cache | 
 | 	(though the corresponding location in main memory is up to date). | 
	A shared cache line is (or might be) replicated in some other
	CPUs' caches, meaning that this CPU must interact with those other
 | 	CPUs before writing to this cache line. | 
 | 	An invalid cache line contains no value, instead representing | 
 | 	``empty space'' in the cache into which data from memory might | 
 | 	be loaded. | 
 | \item[Mutual-Exclusion Mechanism:] | 
 | 	A software abstraction that regulates threads' access to | 
 | 	``critical sections'' and corresponding data. | 
 | \item[NMI:] | 
 | 	Non-maskable interrupt. | 
 | 	As the name indicates, this is an extremely high-priority | 
 | 	interrupt that cannot be masked. | 
 | 	These are used for hardware-specific purposes such as profiling. | 
 | 	The advantage of using NMIs for profiling is that it allows you | 
 | 	to profile code that runs with interrupts disabled. | 
 | \item[NUCA:] | 
 | 	Non-uniform cache architecture, where groups of CPUs share | 
 | 	caches. | 
 | 	CPUs in a group can therefore exchange cache lines with each | 
 | 	other much more quickly than they can with CPUs in other groups. | 
	Systems composed of CPUs with hardware threads will generally
	have a NUCA architecture.
 | \item[NUMA:] | 
 | 	Non-uniform memory architecture, where memory is split into | 
 | 	banks and each such bank is ``close'' to a group of CPUs, | 
 | 	the group being termed a ``NUMA node''. | 
	An example NUMA machine was Sequent's NUMA-Q system, in which
	each group of four CPUs had a bank of memory nearby.
 | 	The CPUs in a given group can access their memory much | 
 | 	more quickly than another group's memory. | 
 | \item[NUMA Node:] | 
 | 	A group of closely placed CPUs and associated memory within | 
	a larger NUMA machine.
 | 	Note that a NUMA node might well have a NUCA architecture. | 
 | \item[Pipelined CPU:] | 
	A CPU with a pipeline, which is an internal flow of
	instructions that is in some ways similar to an assembly line,
	with many of the same advantages and disadvantages.
 | 	In the 1960s through the early 1980s, pipelined CPUs were the | 
 | 	province of supercomputers, but started appearing in microprocessors | 
 | 	(such as the 80486) in the late 1980s. | 
 | \item[Process Consistency:] | 
 | 	A memory-consistency model in which each CPU's stores appear to | 
 | 	occur in program order, but in which different CPUs might see | 
 | 	accesses from more than one CPU as occurring in different orders. | 
 | \item[Program Order:] | 
 | 	The order in which a given thread's instructions | 
 | 	would be executed by a now-mythical ``in-order'' CPU that | 
 | 	completely executed each instruction before proceeding to | 
 | 	the next instruction. | 
 | 	(The reason such CPUs are now the stuff of ancient myths | 
 | 	and legends is that they were extremely slow. | 
 | 	These dinosaurs were one of the many victims of | 
 | 	Moore's-Law-driven increases in CPU clock frequency. | 
 | 	Some claim that these beasts will roam the earth once again, | 
 | 	others vehemently disagree.) | 
 | \item[Quiescent State:] | 
 | 	In RCU, a point in the code where there can be no references held | 
 | 	to RCU-protected data structures, which is normally any point | 
 | 	outside of an RCU read-side critical section. | 
 | 	Any interval of time during which all threads pass through at | 
 | 	least one quiescent state each is termed a ``grace period''. | 
 | \item[Read-Copy Update (RCU):] | 
 | 	A synchronization mechanism that can be thought of as a replacement | 
 | 	for reader-writer locking or reference counting. | 
 | 	RCU provides extremely low-overhead access for readers, while | 
 | 	writers incur additional overhead maintaining old versions | 
 | 	for the benefit of pre-existing readers. | 
	Readers neither block nor spin, and thus cannot participate in
	deadlocks; however, they can also see stale data and can
	run concurrently with updates.
 | 	RCU is thus best-suited for read-mostly situations where | 
 | 	stale data can either be tolerated (as in routing tables) | 
 | 	or avoided (as in the Linux kernel's System V IPC implementation). | 
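
	A minimal sketch using the Linux-kernel RCU API (the
	\texttt{route} structure and the function names are
	hypothetical, and error handling is omitted):
\begin{verbatim}
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct route { int dst; int nexthop; };
static struct route __rcu *route_ptr;

/* Reader: low overhead, but might see a slightly stale route. */
int lookup_nexthop(void)
{
        int nh;

        rcu_read_lock();
        nh = rcu_dereference(route_ptr)->nexthop;
        rcu_read_unlock();
        return nh;
}

/*
 * Updater: publish the new version, then wait a grace period
 * before freeing the old version.
 */
void update_route(struct route *newp)
{
        struct route *oldp = rcu_dereference_protected(route_ptr, 1);

        rcu_assign_pointer(route_ptr, newp);
        synchronize_rcu(); /* wait for pre-existing readers */
        kfree(oldp);
}
\end{verbatim}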
 | \item[Read-Side Critical Section:] | 
 | 	A section of code guarded by read-acquisition of | 
 | 	some reader-writer synchronization mechanism. | 
	For example, if one set of critical sections is guarded by
	read-acquisition of
	a given global reader-writer lock, while a second set of critical
	sections is guarded by write-acquisition of that same reader-writer
	lock, then the first set of critical sections will be the
	read-side critical sections for that lock.
 | 	Any number of threads may concurrently execute the read-side | 
 | 	critical sections, but only if no thread is executing one of | 
 | 	the write-side critical sections. | 
 | \item[Reader-Writer Lock:] | 
 | 	A reader-writer lock is a mutual-exclusion mechanism that | 
 | 	permits any number of reading | 
 | 	threads, or but one writing thread, into the set of critical | 
 | 	sections guarded by that lock. | 
 | 	Threads attempting to write must wait until all pre-existing | 
 | 	reading threads release the lock, and, similarly, if there | 
 | 	is a pre-existing writer, any threads attempting to write must | 
 | 	wait for the writer to release the lock. | 
 | 	A key concern for reader-writer locks is ``fairness'': | 
	can an unending stream of readers starve a writer, or vice versa?
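
	For example, using the POSIX \texttt{pthread\_rwlock} API
	(the variable and function names are hypothetical):
\begin{verbatim}
#include <pthread.h>

static pthread_rwlock_t rwl = PTHREAD_RWLOCK_INITIALIZER;
static int shared_config;

int read_config(void)
{
        int v;

        pthread_rwlock_rdlock(&rwl); /* many readers may hold this */
        v = shared_config;
        pthread_rwlock_unlock(&rwl);
        return v;
}

void write_config(int v)
{
        pthread_rwlock_wrlock(&rwl); /* excludes readers and writers */
        shared_config = v;
        pthread_rwlock_unlock(&rwl);
}
\end{verbatim}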
 | \item[Sequential Consistency:] | 
 | 	A memory-consistency model where all memory references appear to occur | 
 | 	in an order consistent with | 
 | 	a single global order, and where each CPU's memory references | 
 | 	appear to all CPUs to occur in program order. | 
 | \item[Store Buffer:] | 
 | 	A small set of internal registers used by a given CPU | 
 | 	to record pending stores | 
 | 	while the corresponding cache lines are making their | 
 | 	way to that CPU. | 
 | 	Also called ``store queue''. | 
 | \item[Store Forwarding:] | 
 | 	An arrangement where a given CPU refers to its store buffer | 
 | 	as well as its cache so as to ensure that the software sees | 
 | 	the memory operations performed by this CPU as if they | 
 | 	were carried out in program order. | 
 | \item[Super-Scalar CPU:] | 
 | 	A scalar (non-vector) CPU capable of executing multiple instructions | 
 | 	concurrently. | 
 | 	This is a step up from a pipelined CPU that executes multiple | 
 | 	instructions in an assembly-line fashion --- in a super-scalar | 
 | 	CPU, each stage of the pipeline would be capable of handling | 
 | 	more than one instruction. | 
 | 	For example, if the conditions were exactly right, | 
 | 	the Intel Pentium Pro CPU from the mid-1990s could | 
 | 	execute two (and sometimes three) instructions per clock cycle. | 
 | 	Thus, a 200MHz Pentium Pro CPU could ``retire'', or complete the | 
 | 	execution of, up to 400 million instructions per second. | 
 | \item[Teachable:] | 
 | 	A topic, concept, method, or mechanism that the teacher understands | 
 | 	completely and is therefore comfortable teaching. | 
 | \item[Transactional Memory (TM):] | 
	A shared-memory synchronization scheme featuring ``transactions'',
	each of which is an atomic sequence of operations that offers
	atomicity, consistency, and isolation, but differs from
	classic transactions in that it does not offer durability.
 | 	Transactional memory may be implemented either in hardware | 
	(hardware transactional memory, or HTM), in software (software
 | 	transactional memory, or STM), or in a combination of hardware | 
 | 	and software (``unbounded'' transactional memory, or UTM). | 
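
	As one possible sketch, GCC's \texttt{-fgnu-tm} extension
	provides an STM implementation (the account variables are
	hypothetical):
\begin{verbatim}
/* Build with: gcc -fgnu-tm ... */
static int account_a, account_b;

void transfer(int amount)
{
        __transaction_atomic { /* commits atomically, or not at all */
                account_a -= amount;
                account_b += amount;
        }
}
\end{verbatim}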
 | \item[Unteachable:] | 
 | 	A topic, concept, method, or mechanism that the teacher does | 
	not understand well and is therefore uncomfortable teaching.
 | \item[Vector CPU:] | 
 | 	A CPU that can apply a single instruction to multiple items of | 
 | 	data concurrently. | 
 | 	In the 1960s through the 1980s, only supercomputers had vector | 
 | 	capabilities, but the advent of MMX in x86 CPUs and VMX in | 
 | 	PowerPC CPUs brought vector processing to the masses. | 
 | \item[Write Miss:] | 
 | 	A cache miss incurred because the corresponding CPU attempted | 
 | 	to write to a cache line that is read-only, most likely due | 
 | 	to its being replicated in other CPUs' caches. | 
 | \item[Write-Side Critical Section:] | 
 | 	A section of code guarded by write-acquisition of | 
 | 	some reader-writer synchronization mechanism. | 
	For example, if one set of critical sections is guarded by
	write-acquisition of
	a given global reader-writer lock, while a second set of critical
	sections is guarded by read-acquisition of that same reader-writer
	lock, then the first set of critical sections will be the
 | 	write-side critical sections for that lock. | 
	Only one thread may execute in a write-side critical section
	at a given time, and even then only if no threads are
	executing concurrently in any of the corresponding read-side
	critical sections.
 | \end{description} |