 From: Yujie Liu <yujie.liu@intel.com>
 Subject: sched/numa: Fix the vma scan starving issue
 Date: Tue, 27 Aug 2024 19:29:58 +0800

 Problem statement:
 Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
 NUMA vma scan overhead has been reduced a lot.  However, scanning fewer
 vmas may also generate less NUMA page fault information, and the
 insufficient information makes it harder for the NUMA balancer to make
 decisions.  Later, commit b7a5b537c55c08 ("sched/numa: Complete scanning of
 partial VMAs regardless of PID activity") and commit 84db47ca7146d7
 ("sched/numa: Fix mm numa_scan_seq based unconditional scan") were found to
 bring back part of the performance.

 Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system, a
 long period of remote NUMA node reads was observed via PMU events: a few
 cores had ~500MB/s of remote memory access for ~20 seconds.  This causes
 high core-to-core variance and a performance penalty.  Investigation
 showed that many vmas are skipped due to the active PID check.  According
 to the trace events, in most cases vma_is_accessed() returns false because
 the access history stored in the pids_active array has been cleared.

 Proposal:
 The main idea is to adjust vma_is_accessed() so that it returns true more
 easily: compare the difference between mm->numa_scan_seq and
 vma->numab_state->prev_scan_seq, and if the difference exceeds the
 threshold (the number of threads in the process), scan the vma.
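
 The comparison described above can be sketched in userspace C.  This is
 only an illustration, not the kernel code: struct vma_model and
 scan_is_overdue() are hypothetical names, and the plain int fields stand
 in for mm->numa_scan_seq, vma->numab_state->prev_scan_seq and the value
 of get_nr_threads(current).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vma NUMA balancing state. */
struct vma_model {
	int prev_scan_seq;	/* models vma->numab_state->prev_scan_seq */
};

/*
 * Return true when the vma has gone unscanned for more than nr_threads
 * scan sequences.  mm_scan_seq models mm->numa_scan_seq.
 */
static bool scan_is_overdue(int mm_scan_seq, const struct vma_model *vma,
			    int nr_threads)
{
	assert(nr_threads >= 1);	/* a process has at least one thread */
	return mm_scan_seq > vma->prev_scan_seq + nr_threads;
}
```

 With prev_scan_seq = 10, a single-threaded process would be forced to
 scan once the sequence reaches 12, while a 64-thread process would wait
 until 75, on the assumption that other threads are likely to scan the vma
 in the meantime.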

 This patch especially helps the cases where there is a small number of
 threads, like the process-based SPECcpu.  Without this patch, if the
 SPECcpu process accesses the vma at the beginning and then sleeps for a
 long time, the pids_active array will be cleared.  As a result, when this
 process is woken up again, it never has a chance to set prot_none anymore,
 because a vma scan is granted unconditionally only for the first 2 scan
 sequences:
 (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
 To make things worse, no other threads within the task can help set
 prot_none.  This causes information loss.
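
 The starvation scenario can be reproduced with a simplified userspace
 model.  This is a sketch under stated assumptions: it keeps only the
 first-two-sequences rule, a pids_active bitmask test and the new
 threshold rule, and it omits the other checks in the real
 vma_is_accessed().  All names (vma_model, model_vma_is_accessed,
 count_scans) are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the per-vma state referenced in this changelog. */
struct vma_model {
	int start_scan_seq;		/* sequence at which the vma was first seen */
	int prev_scan_seq;		/* sequence of the last scan of this vma */
	unsigned long pids_active;	/* 0 once the access history is cleared */
};

/* Simplified decision: should this vma be scanned at scan_seq? */
static bool model_vma_is_accessed(int scan_seq, const struct vma_model *vma,
				  unsigned long pid_bit, int nr_threads,
				  bool with_patch)
{
	/* Unconditional scan during the first two scan sequences. */
	if (scan_seq - vma->start_scan_seq < 2)
		return true;
	/* The task recently faulted on this vma. */
	if (vma->pids_active & pid_bit)
		return true;
	/* New rule: force a scan once the vma has been skipped for too long. */
	if (with_patch && scan_seq > vma->prev_scan_seq + nr_threads)
		return true;
	return false;
}

/* Count how many scans happen for sequences from_seq..to_seq inclusive. */
static int count_scans(struct vma_model vma, int from_seq, int to_seq,
		       int nr_threads, bool with_patch)
{
	int scans = 0;

	for (int seq = from_seq; seq <= to_seq; seq++) {
		if (model_vma_is_accessed(seq, &vma, 1UL, nr_threads,
					  with_patch)) {
			vma.prev_scan_seq = seq;
			scans++;
		}
	}
	return scans;
}
```

 For a single-threaded process with start_scan_seq = 0, prev_scan_seq = 1
 and a cleared pids_active, the model never scans the vma over sequences
 2..11 without the new rule, and scans it at every other sequence with it.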

 Raghavendra helped test the current patch and got positive results
 on the AMD platform:

 autonumabench NUMA01
                             base                  patched
 Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
 Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

 Duration User      380345.36   368252.04
 Duration System      1358.89     1156.23
 Duration Elapsed     2277.45     2213.25

 autonumabench NUMA02

 Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
 Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

 Duration User        1513.23     1575.48
 Duration System         8.33        8.13
 Duration Elapsed       28.59       29.71

 kernbench

 Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
 Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
 Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

 Duration User       68816.41    67615.74
 Duration System     21873.94    22848.08
 Duration Elapsed      506.66      504.55

 Intel 256 CPUs/2 Sockets:
 autonuma benchmark also shows improvements:

                                                v6.10-rc5              v6.10-rc5
                                                                          +patch
 Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
 Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
 Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
 Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
 Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
 Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
 Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
 Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*

                    v6.10-rc5   v6.10-rc5
                                   +patch
 Duration User     1064010.16  1075416.23
 Duration System      3307.64     3104.66
 Duration Elapsed     4537.54     4604.73

 The SPECcpu remote node access issue disappears with the patch applied.

 Link: https://lkml.kernel.org/r/20240827112958.181388-1-yu.c.chen@intel.com
 Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
 Signed-off-by: Chen Yu <yu.c.chen@intel.com>
 Co-developed-by: Chen Yu <yu.c.chen@intel.com>
 Signed-off-by: Yujie Liu <yujie.liu@intel.com>
 Reported-by: Xiaoping Zhou <xiaoping.zhou@intel.com>
 Reviewed-and-tested-by: Raghavendra K T <raghavendra.kt@amd.com>
 Acked-by: Mel Gorman <mgorman@techsingularity.net>
 Cc: "Chen, Tim C" <tim.c.chen@intel.com>
 Cc: Ingo Molnar <mingo@redhat.com>
 Cc: Juri Lelli <juri.lelli@redhat.com>
 Cc: Peter Zijlstra <peterz@infradead.org>
 Cc: Raghavendra K T <raghavendra.kt@amd.com>
 Cc: Vincent Guittot <vincent.guittot@linaro.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---

  kernel/sched/fair.c |    9 +++++++++
  1 file changed, 9 insertions(+)

 --- a/kernel/sched/fair.c~sched-numa-fix-the-vma-scan-starving-issue
 +++ a/kernel/sched/fair.c
 @@ -3187,6 +3187,15 @@ static bool vma_is_accessed(struct mm_st
  		return true;
  	}

 +	/*
 +	 * This vma has not been accessed for a while, and if the number of
 +	 * threads in the same process is low, which means no other
 +	 * threads can help scan this vma, force a vma scan.
 +	 */
 +	if (READ_ONCE(mm->numa_scan_seq) >
 +	   (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
 +		return true;
 +
  	return false;
  }

 _