From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Subject: mm: hugetlb: improve parallel huge page allocation time
Date: Thu, 27 Feb 2025 23:45:05 +0100

Patch series "Add a command line option that enables control of how many
threads should be used to allocate huge pages", v2.

Allocating huge pages can take a very long time on servers with
terabytes of memory, even when they are allocated at boot time and the
allocation happens in parallel.

Before this series, the kernel used a hard-coded value of 2 threads per
NUMA node for these allocations. This value might have been good enough
in the past, but it is not sufficient to fully utilize newer systems.

This series changes the default so the kernel uses 25% of the available
hardware threads for these allocations. In addition, users who wish to
micro-optimize the allocation time can override this value via a new
kernel parameter.
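
Concretely, the new default derives the worker count from the number of
online CPUs, clamped to at least one thread. A minimal sketch of the
computation, mirroring the hunk in mm/hugetlb.c below:

	/* Use 25% of the online hardware threads, but at least 1. */
	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);

Here num_online_cpus() comes from <linux/cpumask.h> and max() from
<linux/minmax.h>, which is why the patch adds those two includes.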

We tested this on 2 generations of Xeon CPUs, and the results show a
significant improvement in the overall allocation time. The table below
shows how long it takes to allocate 1 TiB of memory with 2MiB huge
pages.

+-----------------------+-------+-------+-------+-------+-------+
| threads               |   8   |  16   |  32   |  64   |  128  |
+-----------------------+-------+-------+-------+-------+-------+
| skylake 144 cpus      |  44s  |  22s  |  16s  |  19s  |  20s  |
| cascade lake 192 cpus |  39s  |  20s  |  11s  |  10s  |   9s  |
+-----------------------+-------+-------+-------+-------+-------+

On Skylake, we see an improvement of 2.75x when using 32 threads; on
Cascade Lake we can do even better, reaching 4.3x with 128 threads.

This speedup is quite significant, and users of large machines like
these should have the option to make their machines boot as fast as
possible.


This patch (of 3):

Before this patch, the kernel used a hard-coded value of 2 threads per
NUMA node for these allocations.

This patch changes that policy: the kernel now uses 25% of the
available hardware threads for these allocations. On the 144-CPU
Skylake system above, for example, this yields 36 allocation threads.

Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de
Link: https://lkml.kernel.org/r/20250227-hugepage-parameter-v2-1-7db8c6dc0453@cyberus-technology.de
Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-improve-parallel-huge-page-allocation-time
+++ a/mm/hugetlb.c
@@ -14,9 +14,11 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/compiler.h>
+#include <linux/cpumask.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
 #include <linux/memblock.h>
+#include <linux/minmax.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
@@ -3605,31 +3607,31 @@ static unsigned long __init hugetlb_page
 		.numa_aware = true
 	};
 
+	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);
+
 	job.thread_fn = hugetlb_pages_alloc_boot_node;
 	job.start = 0;
 	job.size = h->max_huge_pages;
 
 	/*
-	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 * job.max_threads is 25% of the available cpu threads by default.
 	 *
-	 * Tests below indicate that a multiplier of 2 significantly improves
-	 * performance, and although larger values also provide improvements,
-	 * the gains are marginal.
+	 * On large servers with terabytes of memory, huge page allocation
+	 * can consume a considerable amount of time.
 	 *
-	 * Therefore, choosing 2 as the multiplier strikes a good balance between
-	 * enhancing parallel processing capabilities and maintaining efficient
-	 * resource management.
+	 * Tests below show how long it takes to allocate 1 TiB of memory with
+	 * 2MiB huge pages. Using more threads can significantly improve allocation time.
 	 *
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
-	 * | 2T 4node   | 979ms | 679ms | 543ms | 489ms | 481ms |
-	 * | 50G 2node  |  71ms |  44ms |  37ms |  30ms |  31ms |
-	 * +------------+-------+-------+-------+-------+-------+
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | threads               |   8   |  16   |  32   |  64   |  128  |
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | skylake 144 cpus      |  44s  |  22s  |  16s  |  19s  |  20s  |
+	 * | cascade lake 192 cpus |  39s  |  20s  |  11s  |  10s  |   9s  |
+	 * +-----------------------+-------+-------+-------+-------+-------+
 	 */
-	job.max_threads = num_node_state(N_MEMORY) * 2;
-	job.min_chunk = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+
+	job.max_threads = num_allocation_threads;
+	job.min_chunk = h->max_huge_pages / num_allocation_threads;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;
_