| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2022-48664: btrfs: fix hang during unmount when stopping a space reclaim worker |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: fix hang during unmount when stopping a space reclaim worker |
| |
| Often when running generic/562 from fstests we can hang during unmount, |
| resulting in a trace like this: |
| |
| Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00 |
| Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds. |
| Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1 |
| Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000 |
| Sep 07 11:55:32 debian9 kernel: Call Trace: |
| Sep 07 11:55:32 debian9 kernel: <TASK> |
| Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0 |
| Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70 |
| Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0 |
| Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130 |
| Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0 |
| Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420 |
| Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0 |
| Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200 |
| Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0 |
| Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530 |
| Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140 |
| Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30 |
| Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0 |
| Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs] |
| Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170 |
| Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs] |
| Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0 |
| Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120 |
| Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30 |
| Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs] |
| Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0 |
| Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160 |
| Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0 |
| Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0 |
| Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40 |
| Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90 |
| Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd |
| Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7 |
| Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 |
| Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7 |
| Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0 |
| Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570 |
| Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 |
| Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000 |
| Sep 07 11:55:32 debian9 kernel: </TASK> |
| |
| What happens is the following: |
| |
| 1) The cleaner kthread tries to start a transaction to delete an unused |
| block group, but the metadata reservation cannot be satisfied right |
| away, so a reservation ticket is created and it starts the async |
| metadata reclaim task (fs_info->async_reclaim_work); |
| |
| 2) Writeback for all the filler inodes with an i_size of 2K starts |
| (generic/562 creates a lot of 2K files with the goal of filling |
| metadata space). We try to create an inline extent for them, but we |
| fail when trying to insert the inline extent with -ENOSPC (at |
| cow_file_range_inline()) - since this is not critical, we fall back |
| to non-inline mode (back to cow_file_range()), reserve extents, create |
| extent maps and create the ordered extents; |
| |
| 3) An unmount starts, enters close_ctree(); |
| |
| 4) The async reclaim task is flushing stuff, entering the flush states one |
| by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current |
| delayed iputs. |
| |
| After running the delayed iputs and before calling |
| btrfs_wait_on_delayed_iputs(), one or more ordered extents complete, |
| and btrfs_add_delayed_iput() is called for each one through |
| btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results |
| in bumping fs_info->nr_delayed_iputs from 0 to some positive value. |
| |
| So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting |
| for fs_info->nr_delayed_iputs to become 0; |
| |
| 5) The current transaction is committed by the transaction kthread, we then |
| start unpinning extents and end up calling btrfs_try_granting_tickets() |
| through unpin_extent_range(), since we released some space. |
| This results in satisfying the ticket created by the cleaner kthread at |
| step 1, waking up the cleaner kthread; |
| |
| 6) At close_ctree() we ask the cleaner kthread to park; |
| |
| 7) The cleaner kthread starts the transaction, deletes the unused block |
| group, and then calls kthread_should_park(), which returns true, so it |
| parks. And at this point we have the delayed iputs added by the |
| completion of the ordered extents still pending; |
| |
| 8) Then later at close_ctree(), when we call: |
| |
| cancel_work_sync(&fs_info->async_reclaim_work); |
| |
| We hang forever, since the cleaner was parked and no one else can run |
| delayed iputs after that, while the reclaim task is waiting for the |
| remaining delayed iputs to be completed. |
| |
| Fix this by waiting for all ordered extents to complete and running the |
| delayed iputs before attempting to stop the async reclaim tasks. Note that |
| we cannot wait for ordered extents with btrfs_wait_ordered_roots() (or |
| other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE |
| flag to be set on an ordered extent, but the delayed iput is added after |
| that, when doing the final btrfs_put_ordered_extent(). So instead wait for |
| the work queues used for executing ordered extent completion to be empty, |
| which works because we do the final put on an ordered extent at |
| btrfs_finish_ordered_io() (while we are in the unmount context). |
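| |
| The ordering problem and the fix can be sketched with a small user-space |
| model in C. This is an illustration only, not kernel code: struct fs_model |
| and every function in it are hypothetical names, and the model reduces the |
| race to one counter and one flag. A return value of false stands for the |
| hang at cancel_work_sync(). |
| |

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the state described above; not kernel code. */
struct fs_model {
    int ordered_extents_in_flight; /* ordered extents whose completion has not run yet */
    int nr_delayed_iputs;          /* stands in for fs_info->nr_delayed_iputs */
    bool cleaner_parked;           /* once true, the cleaner no longer runs delayed iputs */
};

/* Each completing ordered extent queues one delayed iput on its final put. */
static void finish_ordered_extents(struct fs_model *fs)
{
    assert(fs->ordered_extents_in_flight >= 0);
    fs->nr_delayed_iputs += fs->ordered_extents_in_flight;
    fs->ordered_extents_in_flight = 0;
}

/* Running the delayed iputs drops the counter back to zero. */
static void run_delayed_iputs(struct fs_model *fs)
{
    fs->nr_delayed_iputs = 0;
}

/*
 * The reclaim worker waits for nr_delayed_iputs to reach 0. Stopping it
 * synchronously can only return if that wait can end: either nothing is
 * pending, or the cleaner is still around to run the pending iputs.
 */
static bool stop_reclaim_worker(const struct fs_model *fs)
{
    return fs->nr_delayed_iputs == 0 || !fs->cleaner_parked;
}

/* Ordering before the fix: completions race in, cleaner parks, then stop. */
static bool unmount_before_fix(struct fs_model *fs)
{
    finish_ordered_extents(fs);      /* step 4: iputs added after the last run */
    fs->cleaner_parked = true;       /* steps 6 and 7 */
    return stop_reclaim_worker(fs);  /* step 8: false models the hang */
}

/* Ordering after the fix: drain completions and iputs before stopping. */
static bool unmount_after_fix(struct fs_model *fs)
{
    fs->cleaner_parked = true;       /* cleaner is parked as before */
    finish_ordered_extents(fs);      /* wait for the completion work to drain */
    run_delayed_iputs(fs);           /* run the iputs from the unmount context */
    return stop_reclaim_worker(fs);  /* nothing pending, so this returns */
}
```

| |
| With three ordered extents in flight, unmount_before_fix() leaves three |
| delayed iputs pending and reports the hang, while unmount_after_fix() |
| reaches the stop with the counter at zero. |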
| |
| The Linux kernel CVE team has assigned CVE-2022-48664 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.10.147 with commit 6ac5b52e3f352f9cb270c89e6e1d4dadb564ddb8 |
| Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.15.71 with commit d8a76a2e514fbbb315a6dfff2d342de2de833994 |
| Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.19.12 with commit c338bea1fec5504290dc0acf026c9e7dba25004b |
| Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 6.0 with commit a362bb864b8db4861977d00bd2c3222503ccc34b |
| Issue introduced in 4.14.120 with commit 1ec2bf44c3770b9c3d510b1e78d50cd7fd19e8c5 |
| Issue introduced in 4.19.12 with commit b4c7c826709b7d882ec9b264d5032e887e6bd720 |
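| |
| As a rough aid for reading the list above, the following C sketch checks |
| whether an upstream version number is at or past the fixed release on its |
| own stable branch. struct kver and listed_as_fixed() are hypothetical |
| names. The helper only encodes the four fixed releases listed here; it |
| does not decide whether an older kernel is affected, and it knows nothing |
| about distribution kernels that carry their own backports. |
| |

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper type: an upstream kernel version such as 5.15.71. */
struct kver {
    int major;
    int minor;
    int patch;
};

/* Fixed stable releases copied from the list above, one per branch. */
static const struct kver fixed_in[] = {
    { 5, 10, 147 },
    { 5, 15, 71 },
    { 5, 19, 12 },
};

/*
 * Returns true if v is listed as containing the fix: either it is 6.0 or
 * newer (the mainline fix), or it sits on a listed stable branch at or
 * after that branch's fixed release. Any other version returns false,
 * which only means "no fix listed for this branch here".
 */
static bool listed_as_fixed(struct kver v)
{
    if (v.major >= 6)
        return true;

    for (size_t i = 0; i < sizeof(fixed_in) / sizeof(fixed_in[0]); i++) {
        if (v.major == fixed_in[i].major && v.minor == fixed_in[i].minor)
            return v.patch >= fixed_in[i].patch;
    }
    return false;
}
```

| |
| For example, 5.10.146 is reported as not fixed and 5.10.147 as fixed, |
| while a 5.4.y version returns false because no fix for that branch |
| appears in the list. |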
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2022-48664 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/disk-io.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/6ac5b52e3f352f9cb270c89e6e1d4dadb564ddb8 |
| https://git.kernel.org/stable/c/d8a76a2e514fbbb315a6dfff2d342de2de833994 |
| https://git.kernel.org/stable/c/c338bea1fec5504290dc0acf026c9e7dba25004b |
| https://git.kernel.org/stable/c/a362bb864b8db4861977d00bd2c3222503ccc34b |