| From c3e274af6835f3dfea379f89e3c0f46cc4c7a29b Mon Sep 17 00:00:00 2001 |
| From: Sasha Levin <sashal@kernel.org> |
| Date: Fri, 25 Sep 2020 17:39:51 -0700 |
| Subject: xfs: change the order in which child and parent defer ops are |
| finished |
| |
| From: Darrick J. Wong <darrick.wong@oracle.com> |
| |
| [ Upstream commit 27dada070d59c28a441f1907d2cec891b17dcb26 ] |
| |
| The defer ops code has been finishing items in the wrong order -- if a |
| top level defer op creates items A and B, and finishing item A creates |
| more defer ops A1 and A2, we'll put the new items on the end of the |
| chain and process them in the order A B A1 A2. This is kind of weird, |
| since it's convenient for programmers to be able to think of A and B as |
| an ordered sequence where all the sub-tasks for A must finish before we |
| move on to B, e.g. A A1 A2 D. |
| |
| Right now, our log intent items are not so complex that this matters, |
| but this will become important for the atomic extent swapping patchset. |
| In order to maintain correct reference counting of extents, we have to |
| unmap and remap extents in that order, and we want to complete that work |
| before moving on to the next range that the user wants to swap. This |
| patch fixes defer ops to satsify that requirement. |
| |
| The primary symptom of the incorrect order was noticed in an early |
| performance analysis of the atomic extent swap code. An astonishingly |
| large number of deferred work items accumulated when userspace requested |
| an atomic update of two very fragmented files. The cause of this was |
| traced to the same ordering bug in the inner loop of |
| xfs_defer_finish_noroll. |
| |
| If the ->finish_item method of a deferred operation queues new deferred |
| operations, those new deferred ops are appended to the tail of the |
| pending work list. To illustrate, say that a caller creates a |
| transaction t0 with four deferred operations D0-D3. The first thing |
| defer ops does is roll the transaction to t1, leaving us with: |
| |
| t1: D0(t0), D1(t0), D2(t0), D3(t0) |
| |
| Let's say that finishing each of D0-D3 will create two new deferred ops. |
| After finish D0 and roll, we'll have the following chain: |
| |
| t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1) |
| |
| d4 and d5 were logged to t1. Notice that while we're about to start |
| work on D1, we haven't actually completed all the work implied by D0 |
| being finished. So far we've been careful (or lucky) to structure the |
| dfops callers such that D1 doesn't depend on d4 or d5 being finished, |
| but this is a potential logic bomb. |
| |
| There's a second problem lurking. Let's see what happens as we finish |
| D1-D3: |
| |
| t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2) |
| t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3) |
| t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) |
| |
| Let's say that d4-d11 are simple work items that don't queue any other |
| operations, which means that we can complete each d4 and roll to t6: |
| |
| t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) |
| t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) |
| ... |
| t11: d10(t4), d11(t4) |
| t12: d11(t4) |
| <done> |
| |
| When we try to roll to transaction #12, we're holding defer op d11, |
| which we logged way back in t4. This means that the tail of the log is |
| pinned at t4. If the log is very small or there are a lot of other |
| threads updating metadata, this means that we might have wrapped the log |
| and cannot get roll to t11 because there isn't enough space left before |
| we'd run into t4. |
| |
| Let's shift back to the original failure. I mentioned before that I |
| discovered this flaw while developing the atomic file update code. In |
| that scenario, we have a defer op (D0) that finds a range of file blocks |
| to remap, creates a handful of new defer ops to do that, and then asks |
| to be continued with however much work remains. |
| |
| So, D0 is the original swapext deferred op. The first thing defer ops |
| does is rolls to t1: |
| |
| t1: D0(t0) |
| |
| We try to finish D0, logging d1 and d2 in the process, but can't get all |
| the work done. We log a done item and a new intent item for the work |
| that D0 still has to do, and roll to t2: |
| |
| t2: D0'(t1), d1(t1), d2(t1) |
| |
| We roll and try to finish D0', but still can't get all the work done, so |
| we log a done item and a new intent item for it, requeue D0 a second |
| time, and roll to t3: |
| |
| t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2) |
| |
| If it takes 48 more rolls to complete D0, then we'll finally dispense |
| with D0 in t50: |
| |
| t50: D<fifty primes>(t49), d1(t1), ..., d102(t50) |
| |
| We then try to roll again to get a chain like this: |
| |
| t51: d1(t1), d2(t1), ..., d101(t50), d102(t50) |
| ... |
| t152: d102(t50) |
| <done> |
| |
| Notice that in rolling to transaction #51, we're holding on to a log |
| intent item for d1 that was logged in transaction #1. This means that |
| the tail of the log is pinned at t1. If the log is very small or there |
| are a lot of other threads updating metadata, this means that we might |
| have wrapped the log and cannot roll to t51 because there isn't enough |
| space left before we'd run into t1. This is of course problem #2 again. |
| |
| But notice the third problem with this scenario: we have 102 defer ops |
| tied to this transaction! Each of these items are backed by pinned |
| kernel memory, which means that we risk OOM if the chains get too long. |
| |
| Yikes. Problem #1 is a subtle logic bomb that could hit someone in the |
| future; problem #2 applies (rarely) to the current upstream, and problem |
| #3 applies to work under development. |
| |
| This is not how incremental deferred operations were supposed to work. |
| The dfops design of logging in the same transaction an intent-done item |
| and a new intent item for the work remaining was to make it so that we |
| only have to juggle enough deferred work items to finish that one small |
| piece of work. Deferred log item recovery will find that first |
| unfinished work item and restart it, no matter how many other intent |
| items might follow it in the log. Therefore, it's ok to put the new |
| intents at the start of the dfops chain. |
| |
| For the first example, the chains look like this: |
| |
| t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) |
| t3: d5(t1), D1(t0), D2(t0), D3(t0) |
| ... |
| t9: d9(t7), D3(t0) |
| t10: D3(t0) |
| t11: d10(t10), d11(t10) |
| t12: d11(t10) |
| |
| For the second example, the chains look like this: |
| |
| t1: D0(t0) |
| t2: d1(t1), d2(t1), D0'(t1) |
| t3: d2(t1), D0'(t1) |
| t4: D0'(t1) |
| t5: d1(t4), d2(t4), D0''(t4) |
| ... |
| t148: D0<50 primes>(t147) |
| t149: d101(t148), d102(t148) |
| t150: d102(t148) |
| <done> |
| |
| This actually sucks more for pinning the log tail (we try to roll to t10 |
| while holding an intent item that was logged in t1) but we've solved |
| problem #1. We've also reduced the maximum chain length from: |
| |
| sum(all the new items) + nr_original_items |
| |
| to: |
| |
| max(new items that each original item creates) + nr_original_items |
| |
| This solves problem #3 by sharply reducing the number of defer ops that |
| can be attached to a transaction at any given time. The change makes |
| the problem of log tail pinning worse, but is improvement we need to |
| solve problem #2. Actually solving #2, however, is left to the next |
| patch. |
| |
| Note that a subsequent analysis of some hard-to-trigger reflink and COW |
| livelocks on extremely fragmented filesystems (or systems running a lot |
| of IO threads) showed the same symptoms -- uncomfortably large numbers |
| of incore deferred work items and occasional stalls in the transaction |
| grant code while waiting for log reservations. I think this patch and |
| the next one will also solve these problems. |
| |
| As originally written, the code used list_splice_tail_init instead of |
| list_splice_init, so change that, and leave a short comment explaining |
| our actions. |
| |
| Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> |
| Reviewed-by: Dave Chinner <dchinner@redhat.com> |
| Reviewed-by: Brian Foster <bfoster@redhat.com> |
| Signed-off-by: Sasha Levin <sashal@kernel.org> |
| --- |
| fs/xfs/libxfs/xfs_defer.c | 11 ++++++++++- |
| 1 file changed, 10 insertions(+), 1 deletion(-) |
| |
| diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c |
| index 29e9762f3b77c..4959d8a32b606 100644 |
| --- a/fs/xfs/libxfs/xfs_defer.c |
| +++ b/fs/xfs/libxfs/xfs_defer.c |
| @@ -430,8 +430,17 @@ xfs_defer_finish_noroll( |
| |
| /* Until we run out of pending work to finish... */ |
| while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) { |
| + /* |
| + * Deferred items that are created in the process of finishing |
| + * other deferred work items should be queued at the head of |
| + * the pending list, which puts them ahead of the deferred work |
| + * that was created by the caller. This keeps the number of |
| + * pending work items to a minimum, which decreases the amount |
| + * of time that any one intent item can stick around in memory, |
| + * pinning the log tail. |
| + */ |
| xfs_defer_create_intents(*tp); |
| - list_splice_tail_init(&(*tp)->t_dfops, &dop_pending); |
| + list_splice_init(&(*tp)->t_dfops, &dop_pending); |
| |
| error = xfs_defer_trans_roll(tp); |
| if (error) |
| -- |
| 2.27.0 |
| |