| From: Mel Gorman <mgorman@techsingularity.net> |
| Subject: sched/numa: apply the scan delay to every new vma |
| Date: Wed, 1 Mar 2023 17:49:00 +0530 |
| |
| Patch series "sched/numa: Enhance vma scanning", v3. |
| |
| This patchset proposes one of the enhancements to NUMA vma scanning |
| suggested by Mel. It is a continuation of [3]. |
| |
| Reposting the patchset rebased onto the akpm mm-unstable tree (March 1). |
| |
| In the existing mechanism, the scan period is derived from per-thread |
| stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats |
| at the per-process level to capture application behaviour better. |
| |
| During the course of that discussion, Mel proposed several ideas to |
| enhance current NUMA balancing. One of the suggestions was: |
| |
| Track what threads access a VMA. The suggestion was to use an unsigned |
| long pid_mask and use the lower bits to tag approximately what threads |
| access a VMA. Skip VMAs that did not trap a fault. This would be |
| approximate because of PID collisions but would reduce scanning of areas |
| the thread is not interested in. The suggestion intends not to penalize |
| threads that have no interest in the VMA, thus reducing scanning |
| overhead. |
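| |
| For illustration only, here is a minimal sketch of such lower-bit PID |
| tagging (PID_MASK_BITS, vma_mark_access() and vma_was_accessed_by() are |
| hypothetical names used for exposition, not interfaces added by this |
| series): |
| |
| 	#include <linux/bits.h> |
| 	#include <linux/types.h> |
| |
| 	#define PID_MASK_BITS	BITS_PER_LONG |
| |
| 	/* Record that the thread with this PID faulted on the VMA. */ |
| 	static inline void vma_mark_access(unsigned long *pid_mask, int pid) |
| 	{ |
| 		*pid_mask |= BIT(pid % PID_MASK_BITS); |
| 	} |
| |
| 	/* Approximate check; PID collisions can give false positives. */ |
| 	static inline bool vma_was_accessed_by(unsigned long pid_mask, int pid) |
| 	{ |
| 		return pid_mask & BIT(pid % PID_MASK_BITS); |
| 	} |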
| |
| V3 changes are mostly based on PeterZ's comments (details below in the |
| changelog). |
| |
| Summary of patchset: |
| |
| Current patchset implements: |
| |
| 1. Delay the vma scanning logic for newly created VMAs so that the |
| additional overhead of scanning is not incurred for short-lived tasks |
| (implementation by Mel) |
| |
| 2. Store the information about tasks accessing a VMA in 2 windows. It is |
| regularly cleared at a (4*sysctl_numa_balancing_scan_delay) interval. |
| This interval was derived from experiments (suggested by PeterZ) to |
| balance frequent clearing against obsolete access data (see the sketch |
| after this list) |
| |
| 3. hash_32() is used to encode the accessing task's index in the VMA |
| access information |
| |
| 4. A VMA's access information is used to skip scanning for tasks which |
| have not accessed the VMA |
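| |
| A simplified sketch of the two-window access tracking described in items |
| 2-4 above (all names here are hypothetical and for illustration only; the |
| real implementation is in patches 2-4 of this series): |
| |
| 	#include <linux/bits.h> |
| 	#include <linux/hash.h> |
| 	#include <linux/jiffies.h> |
| 	#include <linux/log2.h> |
| |
| 	struct access_pids_sketch { |
| 		unsigned long pids[2];		/* current and previous window */ |
| 		unsigned long next_reset;	/* jiffies of the next rotation */ |
| 	}; |
| |
| 	/* hash_32() spreads PIDs over the bits of one unsigned long. */ |
| 	static inline unsigned int pid_bit(int pid) |
| 	{ |
| 		return hash_32(pid, ilog2(BITS_PER_LONG)); |
| 	} |
| |
| 	static inline void record_access(struct access_pids_sketch *ap, int pid) |
| 	{ |
| 		ap->pids[0] |= BIT(pid_bit(pid)); |
| 	} |
| |
| 	/* A task is allowed to scan if it shows up in either window. */ |
| 	static inline bool task_may_scan(struct access_pids_sketch *ap, int pid) |
| 	{ |
| 		return (ap->pids[0] | ap->pids[1]) & BIT(pid_bit(pid)); |
| 	} |
| |
| 	/* Rotate roughly every 4 * sysctl_numa_balancing_scan_delay. */ |
| 	static inline void maybe_rotate_windows(struct access_pids_sketch *ap, |
| 						unsigned int scan_delay_ms) |
| 	{ |
| 		if (time_after(jiffies, ap->next_reset)) { |
| 			ap->pids[1] = ap->pids[0]; |
| 			ap->pids[0] = 0; |
| 			ap->next_reset = jiffies + |
| 				msecs_to_jiffies(4 * scan_delay_ms); |
| 		} |
| 	} |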
| |
| Changes since V2: |
| Patch1: |
| - Rename the structure, convert the macro to a function |
| - Add an explanation of the heuristics |
| - Add more details from the results (PeterZ) |
| Patch2: |
| - Use test-and-set bit (PeterZ) |
| - Move storing of access PID info to numa_migrate_prep() |
| - Add a note on fairness among tasks allowed to scan |
| (PeterZ) |
| Patch3: |
| - Maintain two windows of access PID information |
| (PeterZ supported the implementation and gave the idea to |
| extend it to N windows if needed) |
| Patch4: |
| - Apply the hash_32 function to track VMA-accessing PIDs (PeterZ) |
| |
| Changes since RFC V1: |
| - Include Mel's vma scan delay patch |
| - Change the accessing PID store logic (thanks, Mel) |
| - Fence the structure / code with NUMA_BALANCING (David, Mel) |
| - Add logic for clearing access PIDs (Mel) |
| - More descriptive changelog (Mike Rapoport) |
| |
| Things to ponder over: |
| ========================================== |
| |
| - Improvement to the logic for clearing accessed PIDs (discussed in |
| detail in patch 3 itself; done in this patchset by implementing a |
| 2-window history) |
| |
| - The scan period is not changed in this patchset, so we still see |
| frequent scan attempts. Relaxing the scan period dynamically could |
| improve results further. |
| |
| [1] sched/numa: Process Adaptive autoNUMA |
| Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/ |
| |
| [2] RFC V1 Link: |
| https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/ |
| |
| [3] V2 Link: |
| https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/ |
| |
| |
| Results: |
| Summary: A large AutoNUMA cost reduction is seen in mmtests. The kernbench |
| improvement is more than 5%, and there is a large (80%+) system-time |
| improvement in the mmtests autonuma benchmark. (dbench results had too |
| large a standard deviation to post.) |
| |
| kernbench |
| =========== |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Amean user-256 22002.51 ( 0.00%) 22649.95 * -2.94%* |
| Amean syst-256 10162.78 ( 0.00%) 8214.13 * 19.17%* |
| Amean elsp-256 160.74 ( 0.00%) 156.92 * 2.38%* |
| |
| Duration User 66017.43 67959.84 |
| Duration System 30503.15 24657.03 |
| Duration Elapsed 504.61 493.12 |
| |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Ops NUMA alloc hit 1738835089.00 1738780310.00 |
| Ops NUMA alloc local 1738834448.00 1738779711.00 |
| Ops NUMA base-page range updates 477310.00 392566.00 |
| Ops NUMA PTE updates 477310.00 392566.00 |
| Ops NUMA hint faults 96817.00 87555.00 |
| Ops NUMA hint local faults % 10150.00 2192.00 |
| Ops NUMA hint local percent 10.48 2.50 |
| Ops NUMA pages migrated 86660.00 85363.00 |
| Ops AutoNUMA cost 489.07 442.14 |
| |
| autonumabench |
| =============== |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Amean syst-NUMA01 399.50 ( 0.00%) 52.05 * 86.97%* |
| Amean syst-NUMA01_THREADLOCAL 0.21 ( 0.00%) 0.22 * -5.41%* |
| Amean syst-NUMA02 0.80 ( 0.00%) 0.78 * 2.68%* |
| Amean syst-NUMA02_SMT 0.65 ( 0.00%) 0.68 * -3.95%* |
| Amean elsp-NUMA01 313.26 ( 0.00%) 313.11 * 0.05%* |
| Amean elsp-NUMA01_THREADLOCAL 1.06 ( 0.00%) 1.08 * -1.76%* |
| Amean elsp-NUMA02 3.19 ( 0.00%) 3.24 * -1.52%* |
| Amean elsp-NUMA02_SMT 3.72 ( 0.00%) 3.61 * 2.92%* |
| |
| Duration User 396433.47 324835.96 |
| Duration System 2808.70 376.66 |
| Duration Elapsed 2258.61 2258.12 |
| |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Ops NUMA alloc hit 59921806.00 49623489.00 |
| Ops NUMA alloc miss 0.00 0.00 |
| Ops NUMA interleave hit 0.00 0.00 |
| Ops NUMA alloc local 59920880.00 49622594.00 |
| Ops NUMA base-page range updates 152259275.00 50075.00 |
| Ops NUMA PTE updates 152259275.00 50075.00 |
| Ops NUMA PMD updates 0.00 0.00 |
| Ops NUMA hint faults 154660352.00 39014.00 |
| Ops NUMA hint local faults % 138550501.00 23139.00 |
| Ops NUMA hint local percent 89.58 59.31 |
| Ops NUMA pages migrated 8179067.00 14147.00 |
| Ops AutoNUMA cost 774522.98 195.69 |
| |
| |
| This patch (of 4): |
| |
| Currently, whenever a new task is created, we wait for |
| sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead. |
| Extend the same logic to new or very short-lived VMAs. |
| |
| [raghavendra.kt@amd.com: add initialization in vm_area_dup()] |
| Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com |
| Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com |
| Signed-off-by: Mel Gorman <mgorman@techsingularity.net> |
| Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> |
| Cc: Bharata B Rao <bharata@amd.com> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Ingo Molnar <mingo@redhat.com> |
| Cc: Mike Rapoport <rppt@kernel.org> |
| Cc: Peter Zijlstra <peterz@infradead.org> |
| Cc: Disha Talreja <dishaa.talreja@amd.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| include/linux/mm.h | 16 ++++++++++++++++ |
| include/linux/mm_types.h | 7 +++++++ |
| kernel/fork.c | 2 ++ |
| kernel/sched/fair.c | 19 +++++++++++++++++++ |
| 4 files changed, 44 insertions(+) |
| |
| --- a/include/linux/mm.h~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/include/linux/mm.h |
| @@ -29,6 +29,7 @@ |
| #include <linux/pgtable.h> |
| #include <linux/kasan.h> |
| #include <linux/memremap.h> |
| +#include <linux/slab.h> |
| |
| struct mempolicy; |
| struct anon_vma; |
| @@ -627,6 +628,20 @@ struct vm_operations_struct { |
| unsigned long addr); |
| }; |
| |
| +#ifdef CONFIG_NUMA_BALANCING |
| +static inline void vma_numab_state_init(struct vm_area_struct *vma) |
| +{ |
| + vma->numab_state = NULL; |
| +} |
| +static inline void vma_numab_state_free(struct vm_area_struct *vma) |
| +{ |
| + kfree(vma->numab_state); |
| +} |
| +#else |
| +static inline void vma_numab_state_init(struct vm_area_struct *vma) {} |
| +static inline void vma_numab_state_free(struct vm_area_struct *vma) {} |
| +#endif /* CONFIG_NUMA_BALANCING */ |
| + |
| #ifdef CONFIG_PER_VMA_LOCK |
| /* |
| * Try to read-lock a vma. The function is allowed to occasionally yield false |
| @@ -747,6 +762,7 @@ static inline void vma_init(struct vm_ar |
| vma->vm_ops = &dummy_vm_ops; |
| INIT_LIST_HEAD(&vma->anon_vma_chain); |
| vma_mark_detached(vma, false); |
| + vma_numab_state_init(vma); |
| } |
| |
| /* Use when VMA is not part of the VMA tree and needs no locking */ |
| --- a/include/linux/mm_types.h~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/include/linux/mm_types.h |
| @@ -475,6 +475,10 @@ struct vma_lock { |
| struct rw_semaphore lock; |
| }; |
| |
| +struct vma_numab_state { |
| + unsigned long next_scan; |
| +}; |
| + |
| /* |
| * This struct describes a virtual memory area. There is one of these |
| * per VM-area/task. A VM area is any part of the process virtual memory |
| @@ -561,6 +565,9 @@ struct vm_area_struct { |
| #ifdef CONFIG_NUMA |
| struct mempolicy *vm_policy; /* NUMA policy for the VMA */ |
| #endif |
| +#ifdef CONFIG_NUMA_BALANCING |
| + struct vma_numab_state *numab_state; /* NUMA Balancing state */ |
| +#endif |
| struct vm_userfaultfd_ctx vm_userfaultfd_ctx; |
| } __randomize_layout; |
| |
| --- a/kernel/fork.c~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/kernel/fork.c |
| @@ -516,6 +516,7 @@ struct vm_area_struct *vm_area_dup(struc |
| return NULL; |
| } |
| INIT_LIST_HEAD(&new->anon_vma_chain); |
| + vma_numab_state_init(new); |
| dup_anon_vma_name(orig, new); |
| |
| return new; |
| @@ -523,6 +524,7 @@ struct vm_area_struct *vm_area_dup(struc |
| |
| void __vm_area_free(struct vm_area_struct *vma) |
| { |
| + vma_numab_state_free(vma); |
| free_anon_vma_name(vma); |
| vma_lock_free(vma); |
| kmem_cache_free(vm_area_cachep, vma); |
| --- a/kernel/sched/fair.c~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/kernel/sched/fair.c |
| @@ -3027,6 +3027,25 @@ static void task_numa_work(struct callba |
| if (!vma_is_accessible(vma)) |
| continue; |
| |
| + /* Initialise new per-VMA NUMAB state. */ |
| + if (!vma->numab_state) { |
| + vma->numab_state = kzalloc(sizeof(struct vma_numab_state), |
| + GFP_KERNEL); |
| + if (!vma->numab_state) |
| + continue; |
| + |
| + vma->numab_state->next_scan = now + |
| + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); |
| + } |
| + |
| + /* |
| + * Scanning the VMAs of short-lived tasks adds more overhead. So |
| + * delay the scan for new VMAs. |
| + */ |
| + if (mm->numa_scan_seq && time_before(jiffies, |
| + vma->numab_state->next_scan)) |
| + continue; |
| + |
| do { |
| start = max(start, vma->vm_start); |
| end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); |
| _ |