releases/5.9.5/xfs-change-the-order-in-which-child-and-parent-defer.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From c3e274af6835f3dfea379f89e3c0f46cc4c7a29b Mon Sep 17 00:00:00 2001
 From: Sasha Levin <sashal@kernel.org>
 Date: Fri, 25 Sep 2020 17:39:51 -0700
 Subject: xfs: change the order in which child and parent defer ops are
  finished

 From: Darrick J. Wong <darrick.wong@oracle.com>

 [ Upstream commit 27dada070d59c28a441f1907d2cec891b17dcb26 ]

 The defer ops code has been finishing items in the wrong order -- if a
 top level defer op creates items A and B, and finishing item A creates
 more defer ops A1 and A2, we'll put the new items on the end of the
 chain and process them in the order A B A1 A2.  This is kind of weird,
 since it's convenient for programmers to be able to think of A and B as
 an ordered sequence where all the sub-tasks for A must finish before we
 move on to B, e.g. A A1 A2 D.

 Right now, our log intent items are not so complex that this matters,
 but this will become important for the atomic extent swapping patchset.
 In order to maintain correct reference counting of extents, we have to
 unmap and remap extents in that order, and we want to complete that work
 before moving on to the next range that the user wants to swap.  This
 patch fixes defer ops to satsify that requirement.

 The primary symptom of the incorrect order was noticed in an early
 performance analysis of the atomic extent swap code.  An astonishingly
 large number of deferred work items accumulated when userspace requested
 an atomic update of two very fragmented files.  The cause of this was
 traced to the same ordering bug in the inner loop of
 xfs_defer_finish_noroll.

 If the ->finish_item method of a deferred operation queues new deferred
 operations, those new deferred ops are appended to the tail of the
 pending work list.  To illustrate, say that a caller creates a
 transaction t0 with four deferred operations D0-D3.  The first thing
 defer ops does is roll the transaction to t1, leaving us with:

 t1: D0(t0), D1(t0), D2(t0), D3(t0)

 Let's say that finishing each of D0-D3 will create two new deferred ops.
 After finish D0 and roll, we'll have the following chain:

 t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

 d4 and d5 were logged to t1.  Notice that while we're about to start
 work on D1, we haven't actually completed all the work implied by D0
 being finished.  So far we've been careful (or lucky) to structure the
 dfops callers such that D1 doesn't depend on d4 or d5 being finished,
 but this is a potential logic bomb.

 There's a second problem lurking.  Let's see what happens as we finish
 D1-D3:

 t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
 t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
 t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

 Let's say that d4-d11 are simple work items that don't queue any other
 operations, which means that we can complete each d4 and roll to t6:

 t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
 t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
 ...
 t11: d10(t4), d11(t4)
 t12: d11(t4)
 <done>

 When we try to roll to transaction #12, we're holding defer op d11,
 which we logged way back in t4.  This means that the tail of the log is
 pinned at t4.  If the log is very small or there are a lot of other
 threads updating metadata, this means that we might have wrapped the log
 and cannot get roll to t11 because there isn't enough space left before
 we'd run into t4.

 Let's shift back to the original failure.  I mentioned before that I
 discovered this flaw while developing the atomic file update code.  In
 that scenario, we have a defer op (D0) that finds a range of file blocks
 to remap, creates a handful of new defer ops to do that, and then asks
 to be continued with however much work remains.

 So, D0 is the original swapext deferred op.  The first thing defer ops
 does is rolls to t1:

 t1: D0(t0)

 We try to finish D0, logging d1 and d2 in the process, but can't get all
 the work done.  We log a done item and a new intent item for the work
 that D0 still has to do, and roll to t2:

 t2: D0'(t1), d1(t1), d2(t1)

 We roll and try to finish D0', but still can't get all the work done, so
 we log a done item and a new intent item for it, requeue D0 a second
 time, and roll to t3:

 t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

 If it takes 48 more rolls to complete D0, then we'll finally dispense
 with D0 in t50:

 t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)

 We then try to roll again to get a chain like this:

 t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
 ...
 t152: d102(t50)
 <done>

 Notice that in rolling to transaction #51, we're holding on to a log
 intent item for d1 that was logged in transaction #1.  This means that
 the tail of the log is pinned at t1.  If the log is very small or there
 are a lot of other threads updating metadata, this means that we might
 have wrapped the log and cannot roll to t51 because there isn't enough
 space left before we'd run into t1.  This is of course problem #2 again.

 But notice the third problem with this scenario: we have 102 defer ops
 tied to this transaction!  Each of these items are backed by pinned
 kernel memory, which means that we risk OOM if the chains get too long.

 Yikes.  Problem #1 is a subtle logic bomb that could hit someone in the
 future; problem #2 applies (rarely) to the current upstream, and problem
 #3 applies to work under development.

 This is not how incremental deferred operations were supposed to work.
 The dfops design of logging in the same transaction an intent-done item
 and a new intent item for the work remaining was to make it so that we
 only have to juggle enough deferred work items to finish that one small
 piece of work.  Deferred log item recovery will find that first
 unfinished work item and restart it, no matter how many other intent
 items might follow it in the log.  Therefore, it's ok to put the new
 intents at the start of the dfops chain.

 For the first example, the chains look like this:

 t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
 t3: d5(t1), D1(t0), D2(t0), D3(t0)
 ...
 t9: d9(t7), D3(t0)
 t10: D3(t0)
 t11: d10(t10), d11(t10)
 t12: d11(t10)

 For the second example, the chains look like this:

 t1: D0(t0)
 t2: d1(t1), d2(t1), D0'(t1)
 t3: d2(t1), D0'(t1)
 t4: D0'(t1)
 t5: d1(t4), d2(t4), D0''(t4)
 ...
 t148: D0<50 primes>(t147)
 t149: d101(t148), d102(t148)
 t150: d102(t148)
 <done>

 This actually sucks more for pinning the log tail (we try to roll to t10
 while holding an intent item that was logged in t1) but we've solved
 problem #1.  We've also reduced the maximum chain length from:

     sum(all the new items) + nr_original_items

 to:

     max(new items that each original item creates) + nr_original_items

 This solves problem #3 by sharply reducing the number of defer ops that
 can be attached to a transaction at any given time.  The change makes
 the problem of log tail pinning worse, but is improvement we need to
 solve problem #2.  Actually solving #2, however, is left to the next
 patch.

 Note that a subsequent analysis of some hard-to-trigger reflink and COW
 livelocks on extremely fragmented filesystems (or systems running a lot
 of IO threads) showed the same symptoms -- uncomfortably large numbers
 of incore deferred work items and occasional stalls in the transaction
 grant code while waiting for log reservations.  I think this patch and
 the next one will also solve these problems.

 As originally written, the code used list_splice_tail_init instead of
 list_splice_init, so change that, and leave a short comment explaining
 our actions.

 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
 Reviewed-by: Dave Chinner <dchinner@redhat.com>
 Reviewed-by: Brian Foster <bfoster@redhat.com>
 Signed-off-by: Sasha Levin <sashal@kernel.org>
 ---
  fs/xfs/libxfs/xfs_defer.c | 11 ++++++++++-
  1 file changed, 10 insertions(+), 1 deletion(-)

 diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
 index 29e9762f3b77c..4959d8a32b606 100644
 --- a/fs/xfs/libxfs/xfs_defer.c
 +++ b/fs/xfs/libxfs/xfs_defer.c
 @@ -430,8 +430,17 @@ xfs_defer_finish_noroll(

  	/* Until we run out of pending work to finish... */
  	while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) {
 +		/*
 +		 * Deferred items that are created in the process of finishing
 +		 * other deferred work items should be queued at the head of
 +		 * the pending list, which puts them ahead of the deferred work
 +		 * that was created by the caller.  This keeps the number of
 +		 * pending work items to a minimum, which decreases the amount
 +		 * of time that any one intent item can stick around in memory,
 +		 * pinning the log tail.
 +		 */
  		xfs_defer_create_intents(*tp);
 -		list_splice_tail_init(&(*tp)->t_dfops, &dop_pending);
 +		list_splice_init(&(*tp)->t_dfops, &dop_pending);

  		error = xfs_defer_trans_roll(tp);
  		if (error)
 --
 2.27.0
	From c3e274af6835f3dfea379f89e3c0f46cc4c7a29b Mon Sep 17 00:00:00 2001
	From: Sasha Levin <sashal@kernel.org>
	Date: Fri, 25 Sep 2020 17:39:51 -0700
	Subject: xfs: change the order in which child and parent defer ops are
	finished

	From: Darrick J. Wong <darrick.wong@oracle.com>

	[ Upstream commit 27dada070d59c28a441f1907d2cec891b17dcb26 ]

	The defer ops code has been finishing items in the wrong order -- if a
	top level defer op creates items A and B, and finishing item A creates
	more defer ops A1 and A2, we'll put the new items on the end of the
	chain and process them in the order A B A1 A2. This is kind of weird,
	since it's convenient for programmers to be able to think of A and B as
	an ordered sequence where all the sub-tasks for A must finish before we
	move on to B, e.g. A A1 A2 D.

	Right now, our log intent items are not so complex that this matters,
	but this will become important for the atomic extent swapping patchset.
	In order to maintain correct reference counting of extents, we have to
	unmap and remap extents in that order, and we want to complete that work
	before moving on to the next range that the user wants to swap. This
	patch fixes defer ops to satsify that requirement.

	The primary symptom of the incorrect order was noticed in an early
	performance analysis of the atomic extent swap code. An astonishingly
	large number of deferred work items accumulated when userspace requested
	an atomic update of two very fragmented files. The cause of this was
	traced to the same ordering bug in the inner loop of
	xfs_defer_finish_noroll.

	If the ->finish_item method of a deferred operation queues new deferred
	operations, those new deferred ops are appended to the tail of the
	pending work list. To illustrate, say that a caller creates a
	transaction t0 with four deferred operations D0-D3. The first thing
	defer ops does is roll the transaction to t1, leaving us with:

	t1: D0(t0), D1(t0), D2(t0), D3(t0)

	Let's say that finishing each of D0-D3 will create two new deferred ops.
	After finish D0 and roll, we'll have the following chain:

	t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1)

	d4 and d5 were logged to t1. Notice that while we're about to start
	work on D1, we haven't actually completed all the work implied by D0
	being finished. So far we've been careful (or lucky) to structure the
	dfops callers such that D1 doesn't depend on d4 or d5 being finished,
	but this is a potential logic bomb.

	There's a second problem lurking. Let's see what happens as we finish
	D1-D3:

	t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2)
	t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3)
	t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)

	Let's say that d4-d11 are simple work items that don't queue any other
	operations, which means that we can complete each d4 and roll to t6:

	t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
	t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4)
	...
	t11: d10(t4), d11(t4)
	t12: d11(t4)
	<done>

	When we try to roll to transaction #12, we're holding defer op d11,
	which we logged way back in t4. This means that the tail of the log is
	pinned at t4. If the log is very small or there are a lot of other
	threads updating metadata, this means that we might have wrapped the log
	and cannot get roll to t11 because there isn't enough space left before
	we'd run into t4.

	Let's shift back to the original failure. I mentioned before that I
	discovered this flaw while developing the atomic file update code. In
	that scenario, we have a defer op (D0) that finds a range of file blocks
	to remap, creates a handful of new defer ops to do that, and then asks
	to be continued with however much work remains.

	So, D0 is the original swapext deferred op. The first thing defer ops
	does is rolls to t1:

	t1: D0(t0)

	We try to finish D0, logging d1 and d2 in the process, but can't get all
	the work done. We log a done item and a new intent item for the work
	that D0 still has to do, and roll to t2:

	t2: D0'(t1), d1(t1), d2(t1)

	We roll and try to finish D0', but still can't get all the work done, so
	we log a done item and a new intent item for it, requeue D0 a second
	time, and roll to t3:

	t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2)

	If it takes 48 more rolls to complete D0, then we'll finally dispense
	with D0 in t50:

	t50: D<fifty primes>(t49), d1(t1), ..., d102(t50)

	We then try to roll again to get a chain like this:

	t51: d1(t1), d2(t1), ..., d101(t50), d102(t50)
	...
	t152: d102(t50)
	<done>

	Notice that in rolling to transaction #51, we're holding on to a log
	intent item for d1 that was logged in transaction #1. This means that
	the tail of the log is pinned at t1. If the log is very small or there
	are a lot of other threads updating metadata, this means that we might
	have wrapped the log and cannot roll to t51 because there isn't enough
	space left before we'd run into t1. This is of course problem #2 again.

	But notice the third problem with this scenario: we have 102 defer ops
	tied to this transaction! Each of these items are backed by pinned
	kernel memory, which means that we risk OOM if the chains get too long.

	Yikes. Problem #1 is a subtle logic bomb that could hit someone in the
	future; problem #2 applies (rarely) to the current upstream, and problem
	#3 applies to work under development.

	This is not how incremental deferred operations were supposed to work.
	The dfops design of logging in the same transaction an intent-done item
	and a new intent item for the work remaining was to make it so that we
	only have to juggle enough deferred work items to finish that one small
	piece of work. Deferred log item recovery will find that first
	unfinished work item and restart it, no matter how many other intent
	items might follow it in the log. Therefore, it's ok to put the new
	intents at the start of the dfops chain.

	For the first example, the chains look like this:

	t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0)
	t3: d5(t1), D1(t0), D2(t0), D3(t0)
	...
	t9: d9(t7), D3(t0)
	t10: D3(t0)
	t11: d10(t10), d11(t10)
	t12: d11(t10)

	For the second example, the chains look like this:

	t1: D0(t0)
	t2: d1(t1), d2(t1), D0'(t1)
	t3: d2(t1), D0'(t1)
	t4: D0'(t1)
	t5: d1(t4), d2(t4), D0''(t4)
	...
	t148: D0<50 primes>(t147)
	t149: d101(t148), d102(t148)
	t150: d102(t148)
	<done>

	This actually sucks more for pinning the log tail (we try to roll to t10
	while holding an intent item that was logged in t1) but we've solved
	problem #1. We've also reduced the maximum chain length from:

	sum(all the new items) + nr_original_items

	to:

	max(new items that each original item creates) + nr_original_items

	This solves problem #3 by sharply reducing the number of defer ops that
	can be attached to a transaction at any given time. The change makes
	the problem of log tail pinning worse, but is improvement we need to
	solve problem #2. Actually solving #2, however, is left to the next
	patch.

	Note that a subsequent analysis of some hard-to-trigger reflink and COW
	livelocks on extremely fragmented filesystems (or systems running a lot
	of IO threads) showed the same symptoms -- uncomfortably large numbers
	of incore deferred work items and occasional stalls in the transaction
	grant code while waiting for log reservations. I think this patch and
	the next one will also solve these problems.

	As originally written, the code used list_splice_tail_init instead of
	list_splice_init, so change that, and leave a short comment explaining
	our actions.

	Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
	Reviewed-by: Dave Chinner <dchinner@redhat.com>
	Reviewed-by: Brian Foster <bfoster@redhat.com>
	Signed-off-by: Sasha Levin <sashal@kernel.org>
	---
	fs/xfs/libxfs/xfs_defer.c \| 11 ++++++++++-
	1 file changed, 10 insertions(+), 1 deletion(-)

	diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
	index 29e9762f3b77c..4959d8a32b606 100644
	--- a/fs/xfs/libxfs/xfs_defer.c
	+++ b/fs/xfs/libxfs/xfs_defer.c
	@@ -430,8 +430,17 @@ xfs_defer_finish_noroll(

	/* Until we run out of pending work to finish... */
	while (!list_empty(&dop_pending) \|\| !list_empty(&(*tp)->t_dfops)) {
	+ /*
	+ * Deferred items that are created in the process of finishing
	+ * other deferred work items should be queued at the head of
	+ * the pending list, which puts them ahead of the deferred work
	+ * that was created by the caller. This keeps the number of
	+ * pending work items to a minimum, which decreases the amount
	+ * of time that any one intent item can stick around in memory,
	+ * pinning the log tail.
	+ */
	xfs_defer_create_intents(*tp);
	- list_splice_tail_init(&(*tp)->t_dfops, &dop_pending);
	+ list_splice_init(&(*tp)->t_dfops, &dop_pending);

	error = xfs_defer_trans_roll(tp);
	if (error)
	--
	2.27.0