| From: Yu Zhao <yuzhao@google.com> |
| Subject: mm: multi-gen LRU: design doc |
| Date: Sun, 18 Sep 2022 02:00:11 -0600 |
| |
| Add a design doc. |
| |
| Link: https://lkml.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com |
| Signed-off-by: Yu Zhao <yuzhao@google.com> |
| Acked-by: Brian Geffon <bgeffon@google.com> |
| Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org> |
| Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name> |
| Acked-by: Steven Barrett <steven@liquorix.net> |
| Acked-by: Suleiman Souhlal <suleiman@google.com> |
| Tested-by: Daniel Byrne <djbyrne@mtu.edu> |
| Tested-by: Donald Carr <d@chaos-reins.com> |
| Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> |
| Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru> |
| Tested-by: Shuang Zhai <szhai2@cs.rochester.edu> |
| Tested-by: Sofia Trinh <sofia.trinh@edi.works> |
| Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com> |
| Cc: Andi Kleen <ak@linux.intel.com> |
| Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> |
| Cc: Barry Song <baohua@kernel.org> |
| Cc: Catalin Marinas <catalin.marinas@arm.com> |
| Cc: Dave Hansen <dave.hansen@linux.intel.com> |
| Cc: Hillf Danton <hdanton@sina.com> |
| Cc: Jens Axboe <axboe@kernel.dk> |
| Cc: Johannes Weiner <hannes@cmpxchg.org> |
| Cc: Jonathan Corbet <corbet@lwn.net> |
| Cc: Linus Torvalds <torvalds@linux-foundation.org> |
| Cc: Matthew Wilcox <willy@infradead.org> |
| Cc: Mel Gorman <mgorman@suse.de> |
| Cc: Miaohe Lin <linmiaohe@huawei.com> |
| Cc: Michael Larabel <Michael@MichaelLarabel.com> |
| Cc: Michal Hocko <mhocko@kernel.org> |
| Cc: Mike Rapoport <rppt@kernel.org> |
| Cc: Mike Rapoport <rppt@linux.ibm.com> |
| Cc: Peter Zijlstra <peterz@infradead.org> |
| Cc: Qi Zheng <zhengqi.arch@bytedance.com> |
| Cc: Tejun Heo <tj@kernel.org> |
| Cc: Vlastimil Babka <vbabka@suse.cz> |
| Cc: Will Deacon <will@kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| Documentation/mm/index.rst | 1 |
| Documentation/mm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++ |
| 2 files changed, 160 insertions(+) |
| |
| --- a/Documentation/mm/index.rst~mm-multi-gen-lru-design-doc |
| +++ a/Documentation/mm/index.rst |
| @@ -51,6 +51,7 @@ above structured documentation, or delet |
| ksm |
| memory-model |
| mmu_notifier |
| + multigen_lru |
| numa |
| overcommit-accounting |
| page_migration |
| --- /dev/null |
| +++ a/Documentation/mm/multigen_lru.rst |
| @@ -0,0 +1,159 @@ |
| +.. SPDX-License-Identifier: GPL-2.0 |
| + |
| +============= |
| +Multi-Gen LRU |
| +============= |
| +The multi-gen LRU is an alternative LRU implementation that optimizes |
| +page reclaim and improves performance under memory pressure. Page |
| +reclaim decides the kernel's caching policy and ability to overcommit |
| +memory. It directly impacts the kswapd CPU usage and RAM efficiency. |
| + |
| +Design overview |
| +=============== |
| +Objectives |
| +---------- |
| +The design objectives are: |
| + |
| +* Good representation of access recency |
| +* Try to profit from spatial locality |
| +* Fast paths to make obvious choices |
| +* Simple self-correcting heuristics |
| + |
| +The representation of access recency is at the core of all LRU |
| +implementations. In the multi-gen LRU, each generation represents a |
| +group of pages with similar access recency. Generations establish a |
| +(time-based) common frame of reference and therefore help make better |
| +choices, e.g., between different memcgs on a computer or different |
| +computers in a data center (for job scheduling). |
| + |
| +Exploiting spatial locality improves efficiency when gathering the |
| +accessed bit. A rmap walk targets a single page and does not try to |
| +profit from discovering a young PTE. A page table walk can sweep all |
| +the young PTEs in an address space, but the address space can be too |
| +sparse to make a profit. The key is to optimize both methods and use |
| +them in combination. |
| + |
| +Fast paths reduce code complexity and runtime overhead. Unmapped pages |
| +do not require TLB flushes; clean pages do not require writeback. |
| +These facts are only helpful when other conditions, e.g., access |
| +recency, are similar. With generations as a common frame of reference, |
| +additional factors stand out. But obvious choices might not be good |
| +choices; thus self-correction is necessary. |
| + |
| +The benefits of simple self-correcting heuristics are self-evident. |
| +Again, with generations as a common frame of reference, this becomes |
| +attainable. Specifically, pages in the same generation can be |
| +categorized based on additional factors, and a feedback loop can |
| +statistically compare the refault percentages across those categories |
| +and infer which of them are better choices. |
| + |
| +Assumptions |
| +----------- |
| +The protection of hot pages and the selection of cold pages are based |
| +on page access channels and patterns. There are two access channels: |
| + |
| +* Accesses through page tables |
| +* Accesses through file descriptors |
| + |
| +The protection of the former channel is by design stronger because: |
| + |
| +1. The uncertainty in determining the access patterns of the former |
| + channel is higher due to the approximation of the accessed bit. |
| +2. The cost of evicting the former channel is higher due to the TLB |
| + flushes required and the likelihood of encountering the dirty bit. |
| +3. The penalty of underprotecting the former channel is higher because |
| + applications usually do not prepare themselves for major page |
| + faults like they do for blocked I/O. E.g., GUI applications |
| + commonly use dedicated I/O threads to avoid blocking rendering |
| + threads. |
| + |
| +There are also two access patterns: |
| + |
| +* Accesses exhibiting temporal locality |
| +* Accesses not exhibiting temporal locality |
| + |
| +For the reasons listed above, the former channel is assumed to follow |
| +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is |
| +present, and the latter channel is assumed to follow the latter |
| +pattern unless outlying refaults have been observed. |
| + |
| +Workflow overview |
| +================= |
| +Evictable pages are divided into multiple generations for each |
| +``lruvec``. The youngest generation number is stored in |
| +``lrugen->max_seq`` for both anon and file types as they are aged on |
| +an equal footing. The oldest generation numbers are stored in |
| +``lrugen->min_seq[]`` separately for anon and file types as clean file |
| +pages can be evicted regardless of swap constraints. These three |
| +variables are monotonically increasing. |
| + |
| +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` |
| +bits in order to fit into the gen counter in ``folio->flags``. Each |
| +truncated generation number is an index to ``lrugen->lists[]``. The |
| +sliding window technique is used to track at least ``MIN_NR_GENS`` and |
| +at most ``MAX_NR_GENS`` generations. The gen counter stores a value |
| +within ``[1, MAX_NR_GENS]`` while a page is on one of |
| +``lrugen->lists[]``; otherwise it stores zero. |
| + |
| +Each generation is divided into multiple tiers. A page accessed ``N`` |
| +times through file descriptors is in tier ``order_base_2(N)``. Unlike |
| +generations, tiers do not have dedicated ``lrugen->lists[]``. In |
| +contrast to moving across generations, which requires the LRU lock, |
| +moving across tiers only involves atomic operations on |
| +``folio->flags`` and therefore has a negligible cost. A feedback loop |
| +modeled after the PID controller monitors refaults over all the tiers |
| +from anon and file types and decides which tiers from which types to |
| +evict or protect. |
| + |
| +There are two conceptually independent procedures: the aging and the |
| +eviction. They form a closed-loop system, i.e., the page reclaim. |
| + |
| +Aging |
| +----- |
| +The aging produces young generations. Given an ``lruvec``, it |
| +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches |
| +``MIN_NR_GENS``. The aging promotes hot pages to the youngest |
| +generation when it finds them accessed through page tables; the |
| +demotion of cold pages happens consequently when it increments |
| +``max_seq``. The aging uses page table walks and rmap walks to find |
| +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` |
| +and calls ``walk_page_range()`` with each ``mm_struct`` on this list |
| +to scan PTEs, and after each iteration, it increments ``max_seq``. For |
| +the latter, when the eviction walks the rmap and finds a young PTE, |
| +the aging scans the adjacent PTEs. For both, on finding a young PTE, |
| +the aging clears the accessed bit and updates the gen counter of the |
| +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. |
| + |
| +Eviction |
| +-------- |
| +The eviction consumes old generations. Given an ``lruvec``, it |
| +increments ``min_seq`` when ``lrugen->lists[]`` indexed by |
| +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to |
| +evict from, it first compares ``min_seq[]`` to select the older type. |
| +If both types are equally old, it selects the one whose first tier has |
| +a lower refault percentage. The first tier contains single-use |
| +unmapped clean pages, which are the best bet. The eviction sorts a |
| +page according to its gen counter if the aging has found this page |
| +accessed through page tables and updated its gen counter. It also |
| +moves a page to the next generation, i.e., ``min_seq+1``, if this page |
| +was accessed multiple times through file descriptors and the feedback |
| +loop has detected outlying refaults from the tier this page is in. To |
| +this end, the feedback loop uses the first tier as the baseline, for |
| +the reason stated earlier. |
| + |
| +Summary |
| +------- |
| +The multi-gen LRU can be disassembled into the following parts: |
| + |
| +* Generations |
| +* Rmap walks |
| +* Page table walks |
| +* Bloom filters |
| +* PID controller |
| + |
| +The aging and the eviction form a producer-consumer model; |
| +specifically, the latter drives the former by the sliding window over |
| +generations. Within the aging, rmap walks drive page table walks by |
| +inserting hot densely populated page tables to the Bloom filters. |
| +Within the eviction, the PID controller uses refaults as the feedback |
| +to select types to evict and tiers to protect. |
| _ |