| From: Nico Pache <npache@redhat.com> |
| Subject: mm: document (m)THP defer usage |
| Date: Mon, 28 Apr 2025 12:29:02 -0600 |
| |
| The new defer option for (m)THPs allows for a more conservative approach |
| to (m)THPs. Document its usage in the transhuge admin-guide. |
| |
| Link: https://lkml.kernel.org/r/20250428182904.93989-3-npache@redhat.com |
| Signed-off-by: Nico Pache <npache@redhat.com> |
| Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> |
| Cc: Andrea Arcangeli <aarcange@redhat.com> |
| Cc: Anshuman Khandual <anshuman.khandual@arm.com> |
| Cc: Baolin Wang <baolin.wang@linux.alibaba.com> |
| Cc: Barry Song <baohua@kernel.org> |
| Cc: Catalin Marinas <catalin.marinas@arm.com> |
| Cc: Christoph Lameter (Ampere) <cl@gentwo.org> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: David Rientjes <rientjes@google.com> |
| Cc: Dev Jain <dev.jain@arm.com> |
| Cc: Jan Kara <jack@suse.cz> |
| Cc: Johannes Weiner <hannes@cmpxchg.org> |
| Cc: Jonathan Corbet <corbet@lwn.net> |
| Cc: Kefeng Wang <wangkefeng.wang@huawei.com> |
| Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> |
| Cc: Liam Howlett <liam.howlett@oracle.com> |
| Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> |
| Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> |
| Cc: Matthew Wilcox (Oracle) <willy@infradead.org> |
| Cc: Michal Hocko <mhocko@suse.com> |
| Cc: Nanyong Sun <sunnanyong@huawei.com> |
| Cc: Peter Xu <peterx@redhat.com> |
| Cc: Rafael Aquini <raquini@redhat.com> |
| Cc: Randy Dunlap <rdunlap@infradead.org> |
| Cc: Reported-by:Takashi Iwai <tiwai@suse.de> |
| Cc: Ryan Roberts <ryan.roberts@arm.com> |
| Cc: Shuah Khan <shuah@kernel.org> |
| Cc: Steven Rostedt <rostedt@goodmis.org> |
| Cc: Suren Baghdasaryan <surenb@google.com> |
| Cc: Thomas Hellstrom <thomas.hellstrom@linux.intel.com> |
| Cc: Usama Arif <usamaarif642@gmail.com> |
| Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> |
| Cc: Will Deacon <will@kernel.org> |
| Cc: Yang Shi <yang@os.amperecomputing.com> |
| Cc: Zach O'Keefe <zokeefe@google.com> |
| Cc: Zi Yan <ziy@nvidia.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| Documentation/admin-guide/mm/transhuge.rst | 31 ++++++++++++++----- |
| 1 file changed, 23 insertions(+), 8 deletions(-) |
| |
| --- a/Documentation/admin-guide/mm/transhuge.rst~mm-document-mthp-defer-usage |
| +++ a/Documentation/admin-guide/mm/transhuge.rst |
| @@ -88,8 +88,9 @@ In certain cases when hugepages are enab |
| may end up allocating more memory resources. An application may mmap a |
| large region but only touch 1 byte of it, in that case a 2M page might |
| be allocated instead of a 4k page for no good. This is why it's |
| -possible to disable hugepages system-wide and to only have them inside |
| -MADV_HUGEPAGE madvise regions. |
| +possible to disable hugepages system-wide, only have them inside |
| +MADV_HUGEPAGE madvise regions, or defer them away from the page fault |
| +handler to khugepaged. |
| |
| Embedded systems should enable hugepages only inside madvise regions |
| to eliminate any risk of wasting any precious byte of memory and to |
| @@ -99,6 +100,15 @@ Applications that gets a lot of benefit |
| risk to lose memory by using hugepages, should use |
| madvise(MADV_HUGEPAGE) on their critical mmapped regions. |
| |
| +Applications that would like to benefit from THPs but would still like a |
| +more memory conservative approach can choose 'defer'. This avoids |
| +inserting THPs at the page fault handler unless they are MADV_HUGEPAGE. |
| +Khugepaged will then scan the mappings for potential collapses into (m)THP |
| +pages. Admins using this the 'defer' setting should consider |
| +tweaking khugepaged/max_ptes_none. The current default of 511 may |
| +aggressively collapse your PTEs into PMDs. Lower this value to conserve |
| +more memory (i.e., max_ptes_none=64). |
| + |
| .. _thp_sysfs: |
| |
| sysfs |
| @@ -109,11 +119,14 @@ Global THP controls |
| |
| Transparent Hugepage Support for anonymous memory can be entirely disabled |
| (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE |
| -regions (to avoid the risk of consuming more memory resources) or enabled |
| -system wide. This can be achieved per-supported-THP-size with one of:: |
| +regions (to avoid the risk of consuming more memory resources), deferred to |
| +khugepaged, or enabled system wide. |
| + |
| +This can be achieved per-supported-THP-size with one of:: |
| |
| echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled |
| echo madvise >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled |
| + echo defer >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled |
| echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled |
| |
| where <size> is the hugepage size being addressed, the available sizes |
| @@ -136,6 +149,7 @@ The top-level setting (for use with "inh |
| one of the following commands:: |
| |
| echo always >/sys/kernel/mm/transparent_hugepage/enabled |
| + echo defer >/sys/kernel/mm/transparent_hugepage/enabled |
| echo madvise >/sys/kernel/mm/transparent_hugepage/enabled |
| echo never >/sys/kernel/mm/transparent_hugepage/enabled |
| |
| @@ -274,7 +288,8 @@ of small pages into one large page:: |
| A higher value leads to use additional memory for programs. |
| A lower value leads to gain less thp performance. Value of |
| max_ptes_none can waste cpu time very little, you can |
| -ignore it. |
| +ignore it. Consider lowering this value when using |
| +``transparent_hugepage=defer`` |
| |
| ``max_ptes_swap`` specifies how many pages can be brought in from |
| swap when collapsing a group of pages into a transparent huge page:: |
| @@ -299,14 +314,14 @@ Boot parameters |
| |
| You can change the sysfs boot time default for the top-level "enabled" |
| control by passing the parameter ``transparent_hugepage=always`` or |
| -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the |
| -kernel command line. |
| +``transparent_hugepage=madvise`` or ``transparent_hugepage=defer`` or |
| +``transparent_hugepage=never`` to the kernel command line. |
| |
| Alternatively, each supported anonymous THP size can be controlled by |
| passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``, |
| where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and |
| supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, |
| -``never`` or ``inherit``. |
| +``defer``, ``never`` or ``inherit``. |
| |
| For example, the following will set 16K, 32K, 64K THP to ``always``, |
| set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M |
| _ |