| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2022-49547: btrfs: fix deadlock between concurrent dio writes when low on free data space |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: fix deadlock between concurrent dio writes when low on free data space |
| |
| When reserving data space for a direct IO write we can end up deadlocking |
| if we have multiple tasks attempting a write to the same file range, there |
| are multiple extents covered by that file range, we are low on available |
| space for data and the writes don't expand the inode's i_size. |
| |
| The deadlock can happen like this: |
| |
| 1) We have a file with an i_size of 1M, at offset 0 it has an extent with |
| a size of 128K and at offset 128K it has another extent also with a |
| size of 128K; |
| |
| 2) Task A does a direct IO write against file range [0, 256K), and because |
| the write is within the i_size boundary, it takes the inode's lock (VFS |
| level) in shared mode; |
| |
| 3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and |
| then gets the extent map for the extent covering the range [0, 128K). |
| At btrfs_get_blocks_direct_write(), it creates an ordered extent for |
| that file range ([0, 128K)); |
| |
| 4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file |
| range [0, 256K); |
| |
| 5) Task A executes btrfs_dio_iomap_begin() again, this time for the file |
| range [128K, 256K), and locks the file range [128K, 256K); |
| |
| 6) Task B starts a direct IO write against file range [0, 256K) as well. |
| It also locks the inode in shared mode, as it's within the i_size limit, |
| and then tries to lock file range [0, 256K). It is able to lock the |
| subrange [0, 128K) but then blocks waiting for the range [128K, 256K), |
| as it is currently locked by task A; |
| |
| 7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data |
| space. Because we are low on available free space, it triggers the |
| async data reclaim task, and waits for it to reserve data space; |
| |
| 8) The async reclaim task decides to wait for all existing ordered extents |
| to complete (through btrfs_wait_ordered_roots()). |
| It finds the ordered extent previously created by task A for the file |
| range [0, 128K) and waits for it to complete; |
| |
| 9) The ordered extent for the file range [0, 128K) can not complete |
| because it blocks at btrfs_finish_ordered_io() when trying to lock the |
| file range [0, 128K). |
| |
| This results in a deadlock, because: |
| |
| - task B is holding the file range [0, 128K) locked, waiting for the |
| range [128K, 256K) to be unlocked by task A; |
| |
| - task A is holding the file range [128K, 256K) locked and it's waiting |
| for the async data reclaim task to satisfy its space reservation |
| request; |
| |
| - the async data reclaim task is waiting for ordered extent [0, 128K) |
| to complete, but the ordered extent can not complete because the |
| file range [0, 128K) is currently locked by task B, which is waiting |
| on task A to unlock file range [128K, 256K) and task A waiting |
| on the async data reclaim task. |
| |
| This results in a deadlock between 4 task: task A, task B, the async |
| data reclaim task and the task doing ordered extent completion (a work |
| queue task). |
| |
| This type of deadlock can sporadically be triggered by the test case |
| generic/300 from fstests, and results in a stack trace like the following: |
| |
| [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds. |
| [12084.034877] Not tainted 5.18.0-rc2-btrfs-next-115 #1 |
| [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| [12084.036548] task:kworker/u16:7 state:D stack: 0 pid:123749 ppid: 2 flags:0x00004000 |
| [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] |
| [12084.036599] Call Trace: |
| [12084.036601] <TASK> |
| [12084.036606] __schedule+0x3cb/0xed0 |
| [12084.036616] schedule+0x4e/0xb0 |
| [12084.036620] btrfs_start_ordered_extent+0x109/0x1c0 [btrfs] |
| [12084.036651] ? prepare_to_wait_exclusive+0xc0/0xc0 |
| [12084.036659] btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs] |
| [12084.036688] btrfs_work_helper+0xf8/0x400 [btrfs] |
| [12084.036719] ? lock_is_held_type+0xe8/0x140 |
| [12084.036727] process_one_work+0x252/0x5a0 |
| [12084.036736] ? process_one_work+0x5a0/0x5a0 |
| [12084.036738] worker_thread+0x52/0x3b0 |
| [12084.036743] ? process_one_work+0x5a0/0x5a0 |
| [12084.036745] kthread+0xf2/0x120 |
| [12084.036747] ? kthread_complete_and_exit+0x20/0x20 |
| [12084.036751] ret_from_fork+0x22/0x30 |
| [12084.036765] </TASK> |
| [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds. |
| [12084.037702] Not tainted 5.18.0-rc2-btrfs-next-115 #1 |
| [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| [12084.039506] task:kworker/u16:11 state:D stack: 0 pid:153787 ppid: 2 flags:0x00004000 |
| [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] |
| [12084.039551] Call Trace: |
| [12084.039553] <TASK> |
| [12084.039557] __schedule+0x3cb/0xed0 |
| [12084.039566] schedule+0x4e/0xb0 |
| [12084.039569] schedule_timeout+0xed/0x130 |
| [12084.039573] ? mark_held_locks+0x50/0x80 |
| [12084.039578] ? _raw_spin_unlock_irq+0x24/0x50 |
| [12084.039580] ? lockdep_hardirqs_on+0x7d/0x100 |
| [12084.039585] __wait_for_common+0xaf/0x1f0 |
| [12084.039587] ? usleep_range_state+0xb0/0xb0 |
| [12084.039596] btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs] |
| [12084.039636] btrfs_wait_ordered_roots+0x175/0x240 [btrfs] |
| [12084.039670] flush_space+0x25b/0x630 [btrfs] |
| [12084.039712] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs] |
| [12084.039747] process_one_work+0x252/0x5a0 |
| [12084.039756] ? process_one_work+0x5a0/0x5a0 |
| [12084.039758] worker_thread+0x52/0x3b0 |
| [12084.039762] ? process_one_work+0x5a0/0x5a0 |
| [12084.039765] kthread+0xf2/0x120 |
| [12084.039766] ? kthread_complete_and_exit+0x20/0x20 |
| [12084.039770] ret_from_fork+0x22/0x30 |
| [12084.039783] </TASK> |
| [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds. |
| [12084.040709] Not tainted 5.18.0-rc2-btrfs-next-115 #1 |
| [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| [12084.042404] task:kworker/u16:17 state:D stack: 0 pid:217907 ppid: 2 flags:0x00004000 |
| [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] |
| [12084.042461] Call Trace: |
| [12084.042463] <TASK> |
| [12084.042471] __schedule+0x3cb/0xed0 |
| [12084.042485] schedule+0x4e/0xb0 |
| [12084.042490] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] |
| [12084.042539] ? prepare_to_wait_exclusive+0xc0/0xc0 |
| [12084.042551] lock_extent_bits+0x37/0x90 [btrfs] |
| [12084.042601] btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs] |
| [12084.042656] ? lock_is_held_type+0xe8/0x140 |
| [12084.042667] btrfs_work_helper+0xf8/0x400 [btrfs] |
| [12084.042716] ? lock_is_held_type+0xe8/0x140 |
| [12084.042727] process_one_work+0x252/0x5a0 |
| [12084.042742] worker_thread+0x52/0x3b0 |
| [12084.042750] ? process_one_work+0x5a0/0x5a0 |
| [12084.042754] kthread+0xf2/0x120 |
| [12084.042757] ? kthread_complete_and_exit+0x20/0x20 |
| [12084.042763] ret_from_fork+0x22/0x30 |
| [12084.042783] </TASK> |
| [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds. |
| [12084.043598] Not tainted 5.18.0-rc2-btrfs-next-115 #1 |
| [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| [12084.045244] task:fio state:D stack: 0 pid:234517 ppid:234515 flags:0x00004000 |
| [12084.045248] Call Trace: |
| [12084.045250] <TASK> |
| [12084.045254] __schedule+0x3cb/0xed0 |
| [12084.045263] schedule+0x4e/0xb0 |
| [12084.045266] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs] |
| [12084.045298] ? prepare_to_wait_exclusive+0xc0/0xc0 |
| [12084.045306] lock_extent_bits+0x37/0x90 [btrfs] |
| [12084.045336] btrfs_dio_iomap_begin+0x336/0xc60 [btrfs] |
| [12084.045370] ? lock_is_held_type+0xe8/0x140 |
| [12084.045378] iomap_iter+0x184/0x4c0 |
| [12084.045383] __iomap_dio_rw+0x2c6/0x8a0 |
| [12084.045406] iomap_dio_rw+0xa/0x30 |
| [12084.045408] btrfs_do_write_iter+0x370/0x5e0 [btrfs] |
| [12084.045440] aio_write+0xfa/0x2c0 |
| [12084.045448] ? __might_fault+0x2a/0x70 |
| [12084.045451] ? kvm_sched_clock_read+0x14/0x40 |
| [12084.045455] ? lock_release+0x153/0x4a0 |
| [12084.045463] io_submit_one+0x615/0x9f0 |
| [12084.045467] ? __might_fault+0x2a/0x70 |
| [12084.045469] ? kvm_sched_clock_read+0x14/0x40 |
| [12084.045478] __x64_sys_io_submit+0x83/0x160 |
| [12084.045483] ? syscall_enter_from_user_mode+0x1d/0x50 |
| [12084.045489] do_syscall_64+0x3b/0x90 |
| [12084.045517] entry_SYSCALL_64_after_hwframe+0x44/0xae |
| [12084.045521] RIP: 0033:0x7fa76511af79 |
| [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1 |
| [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79 |
| [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000 |
| [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330 |
| [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 |
| [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0 |
| [12084.045561] </TASK> |
| |
| Fix this issue by always reserving data space before locking a file range |
| at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't |
| error out immediately - instead after locking the file range, check if we |
| can do a NOCOW write, and if we can we don't error out since we don't need |
| to allocate a data extent, however if we can't NOCOW then error out with |
| -ENOSPC. This also implies that we may end up reserving space when it's |
| not needed because the write will end up being done in NOCOW mode - in that |
| case we just release the space after we noticed we did a NOCOW write - this |
| is the same type of logic that is done in the path for buffered IO writes. |
| |
| The Linux kernel CVE team has assigned CVE-2022-49547 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.17 with commit f0bfa76a11e93d0fe2c896fcb566568c5e8b5d3f and fixed in 5.18.3 with commit cfae6f765b3c40882ee90dae8fbf9325c8de9c35 |
| Issue introduced in 5.17 with commit f0bfa76a11e93d0fe2c896fcb566568c5e8b5d3f and fixed in 5.19 with commit f5585f4f0ef5b17026bbd60fbff6fcc91b99d5bf |
| Issue introduced in 5.15.27 with commit 5afd80c393f4e87451f14eefb7f2f24daf434e06 |
| Issue introduced in 5.16.13 with commit 96f1be29492d9e2fb97bb27f824478ab8cd3ab86 |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2022-49547 |
| will be updated if fixes are backported, please check that for the most |
| up to date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/inode.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/cfae6f765b3c40882ee90dae8fbf9325c8de9c35 |
| https://git.kernel.org/stable/c/f5585f4f0ef5b17026bbd60fbff6fcc91b99d5bf |