 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2022-48664: btrfs: fix hang during unmount when stopping a space reclaim worker

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 btrfs: fix hang during unmount when stopping a space reclaim worker

 Often when running generic/562 from fstests we can hang during unmount,
 resulting in a trace like this:

   Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
   Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
   Sep 07 11:55:32 debian9 kernel:       Not tainted 6.0.0-rc2-btrfs-next-122 #1
   Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   Sep 07 11:55:32 debian9 kernel: task:umount          state:D stack:    0 pid:49438 ppid: 25683 flags:0x00004000
   Sep 07 11:55:32 debian9 kernel: Call Trace:
   Sep 07 11:55:32 debian9 kernel:  <TASK>
   Sep 07 11:55:32 debian9 kernel:  __schedule+0x3c8/0xec0
   Sep 07 11:55:32 debian9 kernel:  ? rcu_read_lock_sched_held+0x12/0x70
   Sep 07 11:55:32 debian9 kernel:  schedule+0x5d/0xf0
   Sep 07 11:55:32 debian9 kernel:  schedule_timeout+0xf1/0x130
   Sep 07 11:55:32 debian9 kernel:  ? lock_release+0x224/0x4a0
   Sep 07 11:55:32 debian9 kernel:  ? lock_acquired+0x1a0/0x420
   Sep 07 11:55:32 debian9 kernel:  ? trace_hardirqs_on+0x2c/0xd0
   Sep 07 11:55:32 debian9 kernel:  __wait_for_common+0xac/0x200
   Sep 07 11:55:32 debian9 kernel:  ? usleep_range_state+0xb0/0xb0
   Sep 07 11:55:32 debian9 kernel:  __flush_work+0x26d/0x530
   Sep 07 11:55:32 debian9 kernel:  ? flush_workqueue_prep_pwqs+0x140/0x140
   Sep 07 11:55:32 debian9 kernel:  ? trace_clock_local+0xc/0x30
   Sep 07 11:55:32 debian9 kernel:  __cancel_work_timer+0x11f/0x1b0
   Sep 07 11:55:32 debian9 kernel:  ? close_ctree+0x12b/0x5b3 [btrfs]
   Sep 07 11:55:32 debian9 kernel:  ? __trace_bputs+0x10b/0x170
   Sep 07 11:55:32 debian9 kernel:  close_ctree+0x152/0x5b3 [btrfs]
   Sep 07 11:55:32 debian9 kernel:  ? evict_inodes+0x166/0x1c0
   Sep 07 11:55:32 debian9 kernel:  generic_shutdown_super+0x71/0x120
   Sep 07 11:55:32 debian9 kernel:  kill_anon_super+0x14/0x30
   Sep 07 11:55:32 debian9 kernel:  btrfs_kill_super+0x12/0x20 [btrfs]
   Sep 07 11:55:32 debian9 kernel:  deactivate_locked_super+0x2e/0xa0
   Sep 07 11:55:32 debian9 kernel:  cleanup_mnt+0x100/0x160
   Sep 07 11:55:32 debian9 kernel:  task_work_run+0x59/0xa0
   Sep 07 11:55:32 debian9 kernel:  exit_to_user_mode_prepare+0x1a6/0x1b0
   Sep 07 11:55:32 debian9 kernel:  syscall_exit_to_user_mode+0x16/0x40
   Sep 07 11:55:32 debian9 kernel:  do_syscall_64+0x48/0x90
   Sep 07 11:55:32 debian9 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
   Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
   Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
   Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
   Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
   Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
   Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
   Sep 07 11:55:32 debian9 kernel:  </TASK>
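
A note on reading the trace above: frames prefixed with "?" are addresses
found on the stack that the unwinder could not confirm as part of the call
chain, so the reliable chain is made up of the unmarked frames. The Python
sketch below is not part of the original announcement; the function name
reliable_frames() and the regular expression are illustrative assumptions.
It filters a few of the lines above down to that chain.

```python
import re

# Matches "kernel: [? ]symbol+0xOFFSET/0xSIZE"; group 1 is set for "?" frames.
FRAME = re.compile(r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(lines):
    """Return the symbol names of frames not marked with '?', innermost first."""
    names = []
    for line in lines:
        match = FRAME.search(line)
        if match and match.group(1) is None:
            names.append(match.group(2))
    return names


SAMPLE = [
    "Sep 07 11:55:32 debian9 kernel:  __flush_work+0x26d/0x530",
    "Sep 07 11:55:32 debian9 kernel:  ? flush_workqueue_prep_pwqs+0x140/0x140",
    "Sep 07 11:55:32 debian9 kernel:  __cancel_work_timer+0x11f/0x1b0",
    "Sep 07 11:55:32 debian9 kernel:  ? close_ctree+0x12b/0x5b3 [btrfs]",
    "Sep 07 11:55:32 debian9 kernel:  close_ctree+0x152/0x5b3 [btrfs]",
    "Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7",
]

print(reliable_frames(SAMPLE))
```

For these sample lines the remaining frames are __flush_work,
__cancel_work_timer and close_ctree, which places the blocked umount task
in a work-cancel path inside close_ctree().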

 What happens is the following:

 1) The cleaner kthread tries to start a transaction to delete an unused
    block group, but the metadata reservation cannot be satisfied right
    away, so a reservation ticket is created and it starts the async
    metadata reclaim task (fs_info->async_reclaim_work);

 2) Writeback for all the filler inodes with an i_size of 2K starts
    (generic/562 creates a lot of 2K files with the goal of filling
    metadata space). We try to create an inline extent for them, but we
    fail when trying to insert the inline extent with -ENOSPC (at
    cow_file_range_inline()) - since this is not critical, we fall back
    to non-inline mode (back to cow_file_range()), reserve extents, create
    extent maps and create the ordered extents;

 3) An unmount starts, enters close_ctree();

 4) The async reclaim task is flushing stuff, entering the flush states one
    by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
    delayed iputs.

    After running the delayed iputs and before calling
    btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
    and btrfs_add_delayed_iput() is called for each one through
    btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
    in bumping fs_info->nr_delayed_iputs from 0 to some positive value.

    So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
    for fs_info->nr_delayed_iputs to become 0;

 5) The current transaction is committed by the transaction kthread, we then
    start unpinning extents and end up calling btrfs_try_granting_tickets()
    through unpin_extent_range(), since we released some space.
    This results in satisfying the ticket created by the cleaner kthread at
    step 1, waking up the cleaner kthread;

 6) At close_ctree() we ask the cleaner kthread to park;

 7) The cleaner kthread starts the transaction, deletes the unused block
    group, and then calls kthread_should_park(), which returns true, so it
    parks. And at this point we have the delayed iputs added by the
    completion of the ordered extents still pending;

 8) Then later at close_ctree(), when we call:

        cancel_work_sync(&fs_info->async_reclaim_work);

    we hang forever, since the cleaner was parked and no one else can run
    delayed iputs after that, while the reclaim task is waiting for the
    remaining delayed iputs to be completed.
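
The sequence above can be replayed with a small single-threaded model. This
is a sketch in Python, not kernel code: the class FsModel, its fields and
the function replay_steps() are hypothetical names introduced here, the
starting counts are arbitrary, and the model assumes (as the description
states) that once the cleaner is parked nothing else runs delayed iputs.

```python
from dataclasses import dataclass


@dataclass
class FsModel:
    """Toy stand-in for the state named in the steps above (not kernel code)."""
    nr_delayed_iputs: int = 0    # mirrors fs_info->nr_delayed_iputs
    in_flight_ordered: int = 0   # ordered extents that have not completed yet
    cleaner_parked: bool = False

    def run_delayed_iputs(self):
        # Drains everything that is queued right now.
        self.nr_delayed_iputs = 0

    def complete_ordered_extents(self):
        # Each completion queues one delayed iput on its final put.
        self.nr_delayed_iputs += self.in_flight_ordered
        self.in_flight_ordered = 0

    def reclaim_is_blocked(self):
        # Models btrfs_wait_on_delayed_iputs(): it returns only at zero.
        return self.nr_delayed_iputs > 0


def replay_steps():
    fs = FsModel(nr_delayed_iputs=2, in_flight_ordered=3)
    fs.run_delayed_iputs()          # step 4: RUN_DELAYED_IPUTS drains the list
    fs.complete_ordered_extents()   # step 4: completions land in the window
    fs.cleaner_parked = True        # steps 6-7: the cleaner parks
    # Step 8: cancel_work_sync() waits for the reclaim worker. In this model
    # only the cleaner could drain the list, and it is parked.
    hangs = fs.reclaim_is_blocked() and fs.cleaner_parked
    return fs.nr_delayed_iputs, hangs


print(replay_steps())
```

In this model the three completions that land after the drain leave the
counter at 3, and with the cleaner parked nothing brings it back to 0, so
the wait behind cancel_work_sync() never finishes.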

 Fix this by waiting for all ordered extents to complete and running the
 delayed iputs before attempting to stop the async reclaim tasks. Note that
 we cannot wait for ordered extents with btrfs_wait_ordered_roots() (or
 other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
 flag to be set on an ordered extent, but the delayed iput is added after
 that, when doing the final btrfs_put_ordered_extent(). So instead wait for
 the work queues used for executing ordered extent completion to be empty,
 which works because we do the final put on an ordered extent at
 btrfs_finish_ordered_io() (while we are in the unmount context).
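
To illustrate why the choice of wait matters, here is a minimal Python
sketch of a worst-case interleaving. It is a model under stated
assumptions, not the actual patch: the function unmount_outcome() and the
strategy names are invented for this example, and each ordered extent is
assumed to queue exactly one delayed iput on its final put.

```python
def unmount_outcome(strategy, ordered_extents=2):
    """Return 'returns' or 'hangs' for a worst-case interleaving.

    strategy is one of:
      'none'            - nothing is done before cancelling the reclaim work
      'wait_flag'       - wait only until each ordered extent is flagged complete
      'flush_workqueue' - wait until the completion work items have fully run
    """
    if strategy not in ("none", "wait_flag", "flush_workqueue"):
        raise ValueError(f"unknown strategy: {strategy}")

    nr_delayed_iputs = 0
    final_puts_pending = ordered_extents  # completions that have not done the final put

    if strategy == "flush_workqueue":
        # The completion work has run to the end, so every final put has
        # already queued its delayed iput.
        nr_delayed_iputs += final_puts_pending
        final_puts_pending = 0
    # With 'wait_flag' the worst case is that the flags are set but the
    # final puts have not happened yet, so they stay pending here.

    if strategy != "none":
        nr_delayed_iputs = 0  # the unmount task runs the delayed iputs itself

    # Worst case: any remaining final puts land now, after the cleaner parked.
    nr_delayed_iputs += final_puts_pending

    # The reclaim worker can finish only once the counter is back at zero.
    return "returns" if nr_delayed_iputs == 0 else "hangs"


for name in ("none", "wait_flag", "flush_workqueue"):
    print(name, unmount_outcome(name))
```

Only the 'flush_workqueue' strategy leaves the counter at zero in this
model, which matches the reasoning above: waiting on the completion flag
can return before the final put has queued its delayed iput.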

 The Linux kernel CVE team has assigned CVE-2022-48664 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.10.147 with commit 6ac5b52e3f352f9cb270c89e6e1d4dadb564ddb8
 	Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.15.71 with commit d8a76a2e514fbbb315a6dfff2d342de2de833994
 	Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 5.19.12 with commit c338bea1fec5504290dc0acf026c9e7dba25004b
 	Issue introduced in 4.20 with commit d6fd0ae25c6495674dc5a41a8d16bc8e0073276d and fixed in 6.0 with commit a362bb864b8db4861977d00bd2c3222503ccc34b
 	Issue introduced in 4.14.120 with commit 1ec2bf44c3770b9c3d510b1e78d50cd7fd19e8c5
 	Issue introduced in 4.19.12 with commit b4c7c826709b7d882ec9b264d5032e887e6bd720
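
The entries above can be turned into a rough version check. The Python
sketch below encodes only the version numbers listed in this announcement;
the function names are hypothetical, "-rc" suffixes are ignored (so a 6.0
release candidate is reported like 6.0 itself), and vendor kernels that
carry their own backports cannot be judged by version number alone.

```python
import re

INTRODUCED_MAINLINE = (4, 20)                        # first affected mainline release
FIXED_MAINLINE = (6, 0)                              # first mainline release with the fix
STABLE_INTRODUCED = {(4, 14): 120, (4, 19): 12}      # backports that introduced the issue
STABLE_FIXED = {(5, 10): 147, (5, 15): 71, (5, 19): 12}  # stable releases with the fix


def parse_version(text):
    """Parse 'X.Y' or 'X.Y.Z' (trailing text is ignored) into (X, Y, Z)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"not a kernel version: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def listed_as_affected(text):
    """True if the version falls in a range this announcement lists as affected."""
    major, minor, patch = parse_version(text)
    series = (major, minor)
    if series >= FIXED_MAINLINE:
        return False
    if series in STABLE_FIXED:
        return patch < STABLE_FIXED[series]
    if series >= INTRODUCED_MAINLINE:
        return True   # affected series with no fixed release listed here
    if series in STABLE_INTRODUCED:
        return patch >= STABLE_INTRODUCED[series]
    return False


for version in ("4.19.11", "4.19.12", "5.10.146", "5.10.147", "6.1.5"):
    print(version, listed_as_affected(version))
```

A series such as 5.4 is reported as affected because no fixed release is
listed for it above; the CVE record referenced below remains the
authoritative source if that changes.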

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2022-48664
 will be updated if fixes are backported; please check that for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	fs/btrfs/disk-io.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this and many other bug fixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/6ac5b52e3f352f9cb270c89e6e1d4dadb564ddb8
 	https://git.kernel.org/stable/c/d8a76a2e514fbbb315a6dfff2d342de2de833994
 	https://git.kernel.org/stable/c/c338bea1fec5504290dc0acf026c9e7dba25004b
 	https://git.kernel.org/stable/c/a362bb864b8db4861977d00bd2c3222503ccc34b