| From: Mel Gorman <mgorman@techsingularity.net> |
| Subject: sched/numa: apply the scan delay to every new vma |
| Date: Wed, 1 Mar 2023 17:49:00 +0530 |
| |
| Patch series "sched/numa: Enhance vma scanning", v3. |
| |
| This patchset proposes one of the enhancements to NUMA vma scanning |
| suggested by Mel. It is a continuation of [3]. |
| |
| Reposting the patchset rebased onto the akpm mm-unstable tree (March 1). |
| |
| In the existing mechanism, the scan period is derived from per-thread |
| stats. Process Adaptive autoNUMA [1] proposed gathering NUMA fault stats |
| at the per-process level to capture application behaviour better. |
| |
| During the course of that discussion, Mel proposed several ideas to |
| enhance current NUMA balancing. One of the suggestions was: |
| |
| Track what threads access a VMA. The suggestion was to use an unsigned |
| long pid_mask and use the lower bits to tag approximately what threads |
| access a VMA. Skip VMAs that did not trap a fault. This would be |
| approximate because of PID collisions but would reduce scanning of areas |
| the thread is not interested in. The suggestion intends not to penalize |
| threads that have no interest in the VMA, thus reducing scanning |
| overhead. |
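| |
| For illustration only, here is a minimal sketch of such lower-bit PID |
| tagging (PID_MASK_BITS, vma_mark_access() and vma_was_accessed_by() are |
| hypothetical names used for exposition, not interfaces added by this |
| series): |
| |
| 	#include <linux/bits.h> |
| 	#include <linux/types.h> |
| |
| 	#define PID_MASK_BITS	BITS_PER_LONG |
| |
| 	/* Record that the thread with this PID faulted on the VMA. */ |
| 	static inline void vma_mark_access(unsigned long *pid_mask, int pid) |
| 	{ |
| 		*pid_mask |= BIT(pid % PID_MASK_BITS); |
| 	} |
| |
| 	/* Approximate check; PID collisions can give false positives. */ |
| 	static inline bool vma_was_accessed_by(unsigned long pid_mask, int pid) |
| 	{ |
| 		return pid_mask & BIT(pid % PID_MASK_BITS); |
| 	} |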
| |
| V3 changes are mostly based on PeterZ's comments (details below in the |
| changelog). |
| |
| Summary of patchset: |
| |
| Current patchset implements: |
| |
| 1. Delay the vma scanning logic for newly created VMAs so that the |
| additional overhead of scanning is not incurred for short-lived tasks |
| (implementation by Mel) |
| |
| 2. Store the information about tasks accessing a VMA in 2 windows. It is |
| regularly cleared at a (4*sysctl_numa_balancing_scan_delay) interval. |
| This interval was derived from experiments (suggested by PeterZ) to |
| balance frequent clearing against obsolete access data (see the sketch |
| after this list) |
| |
| 3. hash_32() is used to encode the accessing task's index in the VMA |
| access information |
| |
| 4. A VMA's access information is used to skip scanning for tasks which |
| have not accessed the VMA |
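| |
| A simplified sketch of the two-window access tracking described in items |
| 2-4 above (all names here are hypothetical and for illustration only; the |
| real implementation is in patches 2-4 of this series): |
| |
| 	#include <linux/bits.h> |
| 	#include <linux/hash.h> |
| 	#include <linux/jiffies.h> |
| 	#include <linux/log2.h> |
| |
| 	struct access_pids_sketch { |
| 		unsigned long pids[2];		/* current and previous window */ |
| 		unsigned long next_reset;	/* jiffies of the next rotation */ |
| 	}; |
| |
| 	/* hash_32() spreads PIDs over the bits of one unsigned long. */ |
| 	static inline unsigned int pid_bit(int pid) |
| 	{ |
| 		return hash_32(pid, ilog2(BITS_PER_LONG)); |
| 	} |
| |
| 	static inline void record_access(struct access_pids_sketch *ap, int pid) |
| 	{ |
| 		ap->pids[0] |= BIT(pid_bit(pid)); |
| 	} |
| |
| 	/* A task is allowed to scan if it shows up in either window. */ |
| 	static inline bool task_may_scan(struct access_pids_sketch *ap, int pid) |
| 	{ |
| 		return (ap->pids[0] | ap->pids[1]) & BIT(pid_bit(pid)); |
| 	} |
| |
| 	/* Rotate roughly every 4 * sysctl_numa_balancing_scan_delay. */ |
| 	static inline void maybe_rotate_windows(struct access_pids_sketch *ap, |
| 						unsigned int scan_delay_ms) |
| 	{ |
| 		if (time_after(jiffies, ap->next_reset)) { |
| 			ap->pids[1] = ap->pids[0]; |
| 			ap->pids[0] = 0; |
| 			ap->next_reset = jiffies + |
| 				msecs_to_jiffies(4 * scan_delay_ms); |
| 		} |
| 	} |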
| |
| Changes since V2: |
| Patch1: |
| - Rename the structure, convert the macro to a function |
| - Add an explanation of the heuristics |
| - Add more details from the results (PeterZ) |
| Patch2: |
| - Use test-and-set bit (PeterZ) |
| - Move storing of access PID info to numa_migrate_prep() |
| - Add a note on fairness among tasks allowed to scan |
| (PeterZ) |
| Patch3: |
| - Maintain two windows of access PID information |
| (PeterZ supported the implementation and gave the idea to |
| extend it to N windows if needed) |
| Patch4: |
| - Apply the hash_32 function to track VMA-accessing PIDs (PeterZ) |
| |
| Changes since RFC V1: |
| - Include Mel's vma scan delay patch |
| - Change the accessing PID store logic (thanks, Mel) |
| - Fence the structure / code with NUMA_BALANCING (David, Mel) |
| - Add logic for clearing access PIDs (Mel) |
| - More descriptive changelog (Mike Rapoport) |
| |
| Things to ponder over: |
| ========================================== |
| |
| - Improvement to the logic for clearing accessed PIDs (discussed in |
| detail in patch 3 itself; done in this patchset by implementing a |
| 2-window history) |
| |
| - The scan period is not changed in this patchset, so we still see |
| frequent scan attempts. Relaxing the scan period dynamically could |
| improve results further. |
| |
| [1] sched/numa: Process Adaptive autoNUMA |
| Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/ |
| |
| [2] RFC V1 Link: |
| https://lore.kernel.org/all/cover.1673610485.git.raghavendra.kt@amd.com/ |
| |
| [3] V2 Link: |
| https://lore.kernel.org/lkml/cover.1675159422.git.raghavendra.kt@amd.com/ |
| |
| |
| Results: |
| Summary: A large AutoNUMA cost reduction is seen in mmtests. The kernbench |
| improvement is more than 5%, and there is a large (80%+) system-time |
| improvement in the mmtests autonuma benchmark. (dbench results had too |
| large a standard deviation to post.) |
| |
| kernbench |
| =========== |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Amean user-256 22002.51 ( 0.00%) 22649.95 * -2.94%* |
| Amean syst-256 10162.78 ( 0.00%) 8214.13 * 19.17%* |
| Amean elsp-256 160.74 ( 0.00%) 156.92 * 2.38%* |
| |
| Duration User 66017.43 67959.84 |
| Duration System 30503.15 24657.03 |
| Duration Elapsed 504.61 493.12 |
| |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Ops NUMA alloc hit 1738835089.00 1738780310.00 |
| Ops NUMA alloc local 1738834448.00 1738779711.00 |
| Ops NUMA base-page range updates 477310.00 392566.00 |
| Ops NUMA PTE updates 477310.00 392566.00 |
| Ops NUMA hint faults 96817.00 87555.00 |
| Ops NUMA hint local faults % 10150.00 2192.00 |
| Ops NUMA hint local percent 10.48 2.50 |
| Ops NUMA pages migrated 86660.00 85363.00 |
| Ops AutoNUMA cost 489.07 442.14 |
| |
| autonumabench |
| =============== |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Amean syst-NUMA01 399.50 ( 0.00%) 52.05 * 86.97%* |
| Amean syst-NUMA01_THREADLOCAL 0.21 ( 0.00%) 0.22 * -5.41%* |
| Amean syst-NUMA02 0.80 ( 0.00%) 0.78 * 2.68%* |
| Amean syst-NUMA02_SMT 0.65 ( 0.00%) 0.68 * -3.95%* |
| Amean elsp-NUMA01 313.26 ( 0.00%) 313.11 * 0.05%* |
| Amean elsp-NUMA01_THREADLOCAL 1.06 ( 0.00%) 1.08 * -1.76%* |
| Amean elsp-NUMA02 3.19 ( 0.00%) 3.24 * -1.52%* |
| Amean elsp-NUMA02_SMT 3.72 ( 0.00%) 3.61 * 2.92%* |
| |
| Duration User 396433.47 324835.96 |
| Duration System 2808.70 376.66 |
| Duration Elapsed 2258.61 2258.12 |
| |
| 6.2.0-mmunstable-base 6.2.0-mmunstable-patched |
| Ops NUMA alloc hit 59921806.00 49623489.00 |
| Ops NUMA alloc miss 0.00 0.00 |
| Ops NUMA interleave hit 0.00 0.00 |
| Ops NUMA alloc local 59920880.00 49622594.00 |
| Ops NUMA base-page range updates 152259275.00 50075.00 |
| Ops NUMA PTE updates 152259275.00 50075.00 |
| Ops NUMA PMD updates 0.00 0.00 |
| Ops NUMA hint faults 154660352.00 39014.00 |
| Ops NUMA hint local faults % 138550501.00 23139.00 |
| Ops NUMA hint local percent 89.58 59.31 |
| Ops NUMA pages migrated 8179067.00 14147.00 |
| Ops AutoNUMA cost 774522.98 195.69 |
| |
| |
| This patch (of 4): |
| |
| Currently, whenever a new task is created, we wait for |
| sysctl_numa_balancing_scan_delay to avoid unnecessary scanning overhead. |
| Extend the same logic to new or very short-lived VMAs. |
| |
| [raghavendra.kt@amd.com: add initialization in vm_area_dup()] |
| Link: https://lkml.kernel.org/r/cover.1677672277.git.raghavendra.kt@amd.com |
| Link: https://lkml.kernel.org/r/7a6fbba87c8b51e67efd3e74285bb4cb311a16ca.1677672277.git.raghavendra.kt@amd.com |
| Signed-off-by: Mel Gorman <mgorman@techsingularity.net> |
| Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com> |
| Cc: Bharata B Rao <bharata@amd.com> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Ingo Molnar <mingo@redhat.com> |
| Cc: Mike Rapoport <rppt@kernel.org> |
| Cc: Peter Zijlstra <peterz@infradead.org> |
| Cc: Disha Talreja <dishaa.talreja@amd.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| include/linux/mm.h | 16 ++++++++++++++++ |
| include/linux/mm_types.h | 7 +++++++ |
| kernel/fork.c | 2 ++ |
| kernel/sched/fair.c | 19 +++++++++++++++++++ |
| 4 files changed, 44 insertions(+) |
| |
| --- a/include/linux/mm.h~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/include/linux/mm.h |
| @@ -29,6 +29,7 @@ |
| #include <linux/pgtable.h> |
| #include <linux/kasan.h> |
| #include <linux/memremap.h> |
| +#include <linux/slab.h> |
| |
| struct mempolicy; |
| struct anon_vma; |
| @@ -627,6 +628,20 @@ struct vm_operations_struct { |
| unsigned long addr); |
| }; |
| |
| +#ifdef CONFIG_NUMA_BALANCING |
| +static inline void vma_numab_state_init(struct vm_area_struct *vma) |
| +{ |
| + vma->numab_state = NULL; |
| +} |
| +static inline void vma_numab_state_free(struct vm_area_struct *vma) |
| +{ |
| + kfree(vma->numab_state); |
| +} |
| +#else |
| +static inline void vma_numab_state_init(struct vm_area_struct *vma) {} |
| +static inline void vma_numab_state_free(struct vm_area_struct *vma) {} |
| +#endif /* CONFIG_NUMA_BALANCING */ |
| + |
| #ifdef CONFIG_PER_VMA_LOCK |
| /* |
| * Try to read-lock a vma. The function is allowed to occasionally yield false |
| @@ -747,6 +762,7 @@ static inline void vma_init(struct vm_ar |
| vma->vm_ops = &dummy_vm_ops; |
| INIT_LIST_HEAD(&vma->anon_vma_chain); |
| vma_mark_detached(vma, false); |
| + vma_numab_state_init(vma); |
| } |
| |
| /* Use when VMA is not part of the VMA tree and needs no locking */ |
| --- a/include/linux/mm_types.h~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/include/linux/mm_types.h |
| @@ -475,6 +475,10 @@ struct vma_lock { |
| struct rw_semaphore lock; |
| }; |
| |
| +struct vma_numab_state { |
| + unsigned long next_scan; |
| +}; |
| + |
| /* |
| * This struct describes a virtual memory area. There is one of these |
| * per VM-area/task. A VM area is any part of the process virtual memory |
| @@ -561,6 +565,9 @@ struct vm_area_struct { |
| #ifdef CONFIG_NUMA |
| struct mempolicy *vm_policy; /* NUMA policy for the VMA */ |
| #endif |
| +#ifdef CONFIG_NUMA_BALANCING |
| + struct vma_numab_state *numab_state; /* NUMA Balancing state */ |
| +#endif |
| struct vm_userfaultfd_ctx vm_userfaultfd_ctx; |
| } __randomize_layout; |
| |
| --- a/kernel/fork.c~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/kernel/fork.c |
| @@ -516,6 +516,7 @@ struct vm_area_struct *vm_area_dup(struc |
| return NULL; |
| } |
| INIT_LIST_HEAD(&new->anon_vma_chain); |
| + vma_numab_state_init(new); |
| dup_anon_vma_name(orig, new); |
| |
| return new; |
| @@ -523,6 +524,7 @@ struct vm_area_struct *vm_area_dup(struc |
| |
| void __vm_area_free(struct vm_area_struct *vma) |
| { |
| + vma_numab_state_free(vma); |
| free_anon_vma_name(vma); |
| vma_lock_free(vma); |
| kmem_cache_free(vm_area_cachep, vma); |
| --- a/kernel/sched/fair.c~sched-numa-apply-the-scan-delay-to-every-new-vma |
| +++ a/kernel/sched/fair.c |
| @@ -3027,6 +3027,25 @@ static void task_numa_work(struct callba |
| if (!vma_is_accessible(vma)) |
| continue; |
| |
| + /* Initialise new per-VMA NUMAB state. */ |
| + if (!vma->numab_state) { |
| + vma->numab_state = kzalloc(sizeof(struct vma_numab_state), |
| + GFP_KERNEL); |
| + if (!vma->numab_state) |
| + continue; |
| + |
| + vma->numab_state->next_scan = now + |
| + msecs_to_jiffies(sysctl_numa_balancing_scan_delay); |
| + } |
| + |
| + /* |
| + * Scanning the VMAs of short-lived tasks adds more overhead. So |
| + * delay the scan for new VMAs. |
| + */ |
| + if (mm->numa_scan_seq && time_before(jiffies, |
| + vma->numab_state->next_scan)) |
| + continue; |
| + |
| do { |
| start = max(start, vma->vm_start); |
| end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE); |
| _ |