| From: Johannes Weiner <hannes@cmpxchg.org> |
| Subject: mm: vmscan: restore incremental cgroup iteration |
| Date: Tue, 14 May 2024 16:26:41 -0400 |
| |
| Currently, reclaim always walks the entire cgroup tree in order to ensure |
| fairness between groups. While overreclaim is limited in shrink_lruvec(), |
| many of our systems have a sizable number of active groups, and an even |
| bigger number of idle cgroups with cache left behind by previous jobs; the |
| mere act of walking all these cgroups can impose significant latency on |
| direct reclaimers. |
| |
| In the past, we used a save-and-restore iterator that enabled |
| incremental tree walks over multiple reclaim invocations. This ensured |
| fairness, while keeping the work of individual reclaimers small. |
| |
| However, in edge cases with a lot of reclaim concurrency, individual |
| reclaimers would sometimes not see enough of the cgroup tree to make |
| forward progress and (prematurely) declare OOM. Consequently we switched |
| to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not share cgroup |
| iteration between reclaimers"). |
| |
| To address the latency problem without bringing back the premature OOM |
| issue, reinstate the shared iteration, but with a restart condition to do |
| the full walk in the OOM case - similar to what we do for memory.low |
| enforcement and active page protection. |
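| |
| Condensed, the mechanism amounts to two small pieces of logic, shown |
| here as a simplified sketch of the diff below (tracing, memcg |
| protection checks and other details are omitted): |
| |
|     /* shrink_node_memcgs(): prefer a shared, resumable walk */ |
|     struct mem_cgroup_reclaim_cookie reclaim = { .pgdat = pgdat }; |
|     struct mem_cgroup_reclaim_cookie *partial = &reclaim; |
| |
|     if (current_is_kswapd() || sc->memcg_full_walk) |
|         partial = NULL;    /* full walk for kswapd and the OOM retry */ |
| |
|     memcg = mem_cgroup_iter(target_memcg, NULL, partial); |
|     do { |
|         shrink_lruvec(mem_cgroup_lruvec(memcg, pgdat), sc); |
|         /* partial walks may stop once the reclaim goal is reached */ |
|         if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) { |
|             mem_cgroup_iter_break(target_memcg, memcg); |
|             break; |
|         } |
|     } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial))); |
| |
|     /* do_try_to_free_pages(): before giving up, retry once with a |
|      * full tree walk so a crowded shared iterator cannot cause a |
|      * premature OOM */ |
|     if (!sc->memcg_full_walk) { |
|         sc->memcg_full_walk = 1; |
|         goto retry; |
|     } |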
| |
| In the worst case, we do one more full tree walk before declaring |
| OOM. But the vast majority of direct reclaim scans can then finish |
| much quicker, while fairness across the tree is maintained: |
| |
| - Before this patch, we observed that direct reclaim always takes more |
| than 100us and most direct reclaim time is spent in reclaim cycles |
| lasting between 1ms and 1 second. Almost 40% of direct reclaim time |
| was spent on reclaim cycles exceeding 100ms. |
| |
| - With this patch, almost all page reclaim cycles last less than 10ms, |
| and a good amount of direct page reclaim finishes in under 100us. No |
| page reclaim cycles lasting over 100ms were observed anymore. |
| |
| The shared iterator state is maintained inside the target cgroup, so |
| fair and incremental walks are performed during both global reclaim |
| and cgroup limit reclaim of complex subtrees. |
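| |
| For reference, the cursor that makes these walks resumable is the |
| per-node reclaim iterator embedded in the target cgroup, which |
| mem_cgroup_iter() updates whenever it is handed a cookie. Roughly |
| (paraphrased from the memcg definitions of this kernel generation, |
| shown only to illustrate where the shared state lives): |
| |
|     /* passed in by reclaimers that want a resumable, shared walk */ |
|     struct mem_cgroup_reclaim_cookie { |
|         pg_data_t *pgdat; |
|         unsigned int generation; |
|     }; |
| |
|     /* per-node cursor in the target cgroup's mem_cgroup_per_node; |
|      * mem_cgroup_iter() saves and restores its tree position here, |
|      * so each reclaimer continues where the previous one left off */ |
|     struct mem_cgroup_reclaim_iter { |
|         struct mem_cgroup *position; |
|         /* scan generation, increased on every round-trip */ |
|         unsigned int generation; |
|     }; |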
| |
| Link: https://lkml.kernel.org/r/20240514202641.2821494-1-hannes@cmpxchg.org |
| Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> |
| Signed-off-by: Rik van Riel <riel@surriel.com> |
| Reported-by: Rik van Riel <riel@surriel.com> |
| Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> |
| Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> |
| Cc: Facebook Kernel Team <kernel-team@fb.com> |
| Cc: Michal Hocko <mhocko@kernel.org> |
| Cc: Rik van Riel <riel@surriel.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/vmscan.c | 42 ++++++++++++++++++++++++++++++++++++++++-- |
| 1 file changed, 40 insertions(+), 2 deletions(-) |
| |
| --- a/mm/vmscan.c~mm-vmscan-restore-incremental-cgroup-iteration |
| +++ a/mm/vmscan.c |
| @@ -128,6 +128,9 @@ struct scan_control { |
| unsigned int memcg_low_reclaim:1; |
| unsigned int memcg_low_skipped:1; |
| |
| + /* Shared cgroup tree walk failed, rescan the whole tree */ |
| + unsigned int memcg_full_walk:1; |
| + |
| unsigned int hibernation_mode:1; |
| |
| /* One of the zones is ready for compaction */ |
| @@ -5845,9 +5848,25 @@ static inline bool should_continue_recla |
| static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc) |
| { |
| struct mem_cgroup *target_memcg = sc->target_mem_cgroup; |
| + struct mem_cgroup_reclaim_cookie reclaim = { |
| + .pgdat = pgdat, |
| + }; |
| + struct mem_cgroup_reclaim_cookie *partial = &reclaim; |
| struct mem_cgroup *memcg; |
| |
| - memcg = mem_cgroup_iter(target_memcg, NULL, NULL); |
| + /* |
| + * In most cases, direct reclaimers can do partial walks |
| + * through the cgroup tree, using an iterator state that |
| + * persists across invocations. This strikes a balance between |
| + * fairness and allocation latency. |
| + * |
| + * For kswapd, reliable forward progress is more important |
| + * than a quick return to idle. Always do full walks. |
| + */ |
| + if (current_is_kswapd() || sc->memcg_full_walk) |
| + partial = NULL; |
| + |
| + memcg = mem_cgroup_iter(target_memcg, NULL, partial); |
| do { |
| struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat); |
| unsigned long reclaimed; |
| @@ -5897,7 +5916,12 @@ static void shrink_node_memcgs(pg_data_t |
| sc->nr_scanned - scanned, |
| sc->nr_reclaimed - reclaimed); |
| |
| - } while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL))); |
| + /* If partial walks are allowed, bail once goal is reached */ |
| + if (partial && sc->nr_reclaimed >= sc->nr_to_reclaim) { |
| + mem_cgroup_iter_break(target_memcg, memcg); |
| + break; |
| + } |
| + } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial))); |
| } |
| |
| static void shrink_node(pg_data_t *pgdat, struct scan_control *sc) |
| @@ -6271,6 +6295,20 @@ retry: |
| return 1; |
| |
| /* |
| + * In most cases, direct reclaimers can do partial walks |
| + * through the cgroup tree to meet the reclaim goal while |
| + * keeping latency low. Since the iterator state is shared |
| + * among all direct reclaim invocations (to retain fairness |
| + * among cgroups), though, high concurrency can result in |
| + * individual threads not seeing enough cgroups to make |
| + * meaningful forward progress. Avoid false OOMs in this case. |
| + */ |
| + if (!sc->memcg_full_walk) { |
| + sc->memcg_full_walk = 1; |
| + goto retry; |
| + } |
| + |
| + /* |
| * We make inactive:active ratio decisions based on the node's |
| * composition of memory, but a restrictive reclaim_idx or a |
| * memory.low cgroup setting can exempt large amounts of |
| _ |