.. _mm_concepts:

=================
Concepts overview
=================

Memory management in Linux is a complex system that has evolved over
the years, gaining more and more functionality to support a variety
of systems from MMU-less microcontrollers to supercomputers. Memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which hopefully will
eventually be written. Yet, although some of the concepts are the
same, here we assume that an MMU is available and a CPU can translate
a virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different
views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex
and, to avoid this complexity, the concept of virtual memory was
developed.

Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
the physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads from (or
writes to) the system memory, it translates the `virtual` address
encoded in that instruction to a `physical` address that the memory
controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.

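For example, a user space program can query the page size of the
running kernel at run time with sysconf(3)::

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* _SC_PAGESIZE reports the base page size the kernel uses */
            long page_size = sysconf(_SC_PAGESIZE);

            printf("page size: %ld bytes\n", page_size);
            return 0;
    }
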
Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of the actual pages used by the software. The tables at
higher levels contain physical addresses of the pages belonging to
the lower levels. The pointer to the top level page table resides in
a register. When the CPU performs the address translation, it uses
this register to access the top level page table. The high bits of
the virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index
into that level's page table. The lowest bits in the virtual address
define the offset inside the actual page.

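As an illustration, the sketch below shows how a virtual address
could be split into page table indices. It assumes x86-64 four-level
paging with 4KiB pages; the exact number of levels and the bit layout
are architecture specific::

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t vaddr = 0x00007f3acd42b123ULL; /* arbitrary example */

            /* With 4KiB pages, each table level is indexed by 9 bits */
            unsigned int offset = vaddr & 0xfff;          /* bits 0-11  */
            unsigned int pte = (vaddr >> 12) & 0x1ff;     /* bits 12-20 */
            unsigned int pmd = (vaddr >> 21) & 0x1ff;     /* bits 21-29 */
            unsigned int pud = (vaddr >> 30) & 0x1ff;     /* bits 30-38 */
            unsigned int pgd = (vaddr >> 39) & 0x1ff;     /* bits 39-47 */

            printf("pgd=%u pud=%u pmd=%u pte=%u offset=%u\n",
                   pgd, pud, pmd, pte, offset);
            return 0;
    }
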
Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to the CPU speed. To avoid spending
precious processor cycles on the address translation, CPUs maintain
a cache of such translations called Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications
with a large memory working set will experience a performance hit
because of TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit rate and thus improves overall system
performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described
in :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

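For illustration, here is a minimal sketch of requesting an anonymous
mapping backed by the same hugetlb pool with the ``MAP_HUGETLB`` flag
of mmap(2); it assumes the administrator has reserved huge pages
beforehand, for instance via ``/proc/sys/vm/nr_hugepages``::

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;   /* one 2M huge page on x86 */

            /* Fails unless huge pages were reserved in the pool */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                           -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            munmap(p, len);
            return 0;
    }
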
Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

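When THP is configured to work only on hinted regions (the
``madvise`` mode), an application can mark a range as a candidate for
huge pages with madvise(2). A minimal sketch::

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 16 * 1024 * 1024;
            void *buf;

            /* A large, huge-page aligned anonymous allocation */
            if (posix_memalign(&buf, 2 * 1024 * 1024, len))
                    return 1;

            /*
             * Hint that this range is a good candidate for transparent
             * huge pages; the kernel may or may not honour the hint.
             */
            madvise(buf, len, MADV_HUGEPAGE);

            free(buf);
            return 0;
    }
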
Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL
will contain normally addressed pages.

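The zone an allocation is served from is determined by the allocation
flags. As a hedged kernel-side sketch, a driver that must feed a
legacy DMA-limited device could restrict an allocation to ZONE_DMA by
passing ``GFP_DMA`` (most drivers should use the DMA mapping API
instead)::

    #include <linux/slab.h>

    static void *alloc_legacy_dma_buffer(size_t size)
    {
            /* GFP_DMA confines the allocation to ZONE_DMA */
            return kmalloc(size, GFP_KERNEL | GFP_DMA);
    }
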
The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/mm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

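As a small kernel-side sketch, code that already knows which node
will use the memory can ask the slab allocator to allocate from that
node (the allocator may still fall back to other nodes)::

    #include <linux/slab.h>

    static void *alloc_near_consumer(size_t size, int node)
    {
            /* Prefer memory from the given NUMA node */
            return kmalloc_node(size, GFP_KERNEL, node);
    }
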
Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
subsequent reads. Similarly, when one writes to a file, the data is
placed in the page cache and eventually gets into the backing storage
device. The written pages are marked as `dirty` and when Linux decides
to reuse them for other purposes, it makes sure to synchronize the
file contents on the device with the updated data.

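For example (using a made-up file name), the write(2) call below
completes once the data is in the page cache, and fsync(2) forces the
dirty pages to the backing device::

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("example.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0)
                    return 1;

            /* Lands in the page cache; the written pages become dirty */
            write(fd, "hello\n", 6);

            /* Force writeback of the dirty pages to the storage device */
            fsync(fd);

            close(fd);
            return 0;
    }
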
Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. Read accesses
will result in the creation of a page table entry that references a
special physical page filled with zeroes. When the program performs a
write, a regular physical page will be allocated to hold the written
data. The page will be marked dirty and if the kernel decides to
repurpose it, the dirty page will be swapped out.

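For example, an anonymous mapping can be created explicitly with
mmap(2); in the sketch below the read is served from the shared zero
page, while the write makes the kernel allocate a private physical
page::

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4 * 4096;
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            printf("%d\n", p[0]);   /* read: backed by the zero page */
            p[0] = 1;               /* write: a real page is allocated */

            munmap(p, len);
            return 0;
    }
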
Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for device drivers' use, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache data available elsewhere, for instance, on a hard
disk, or because they can be swapped out, again, to the hard disk,
are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
free pages supply. As the load increases, the amount of the free
pages goes down and when it reaches a certain threshold (low
watermark), an allocation request will awaken the ``kswapd`` daemon.
It will asynchronously scan memory pages and either just free them if
the data they contain is available elsewhere, or evict them to the
backing storage device (remember those dirty pages?). As memory usage
increases even more and reaches another threshold - the min
watermark - an allocation will trigger `direct reclaim`. In this case
the allocation stalls until enough memory pages are reclaimed to
satisfy the request.

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory areas. Such
a need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and
the kernel will be unable to reclaim enough memory to continue to
operate. In order to save the rest of the system, it invokes the
`OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the
overall system health. The selected task is killed in the hope that
after it exits enough memory will be freed to continue normal
operation.