| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-58089: btrfs: fix double accounting race when btrfs_run_delalloc_range() failed |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: fix double accounting race when btrfs_run_delalloc_range() failed |
| |
| [BUG] |
| When running btrfs with a block size (4K) smaller than the page size (64K, |
| aarch64), there is a very high chance of crashing the kernel at |
| generic/750, with the following messages: |
| (3 extra debug messages were added before the call traces) |
| |
| BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental |
| BTRFS info (device dm-3): checking UUID tree |
| hrtimer: interrupt took 5451385 ns |
| BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28 |
| BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28 |
| BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28 |
| ------------[ cut here ]------------ |
| WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs] |
| CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89 |
| Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE |
| Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 |
| Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] |
| pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs] |
| lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] |
| Call trace: |
| can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P) |
| can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L) |
| btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs] |
| extent_writepage+0x10c/0x3b8 [btrfs] |
| extent_write_cache_pages+0x21c/0x4e8 [btrfs] |
| btrfs_writepages+0x94/0x160 [btrfs] |
| do_writepages+0x74/0x190 |
| filemap_fdatawrite_wbc+0x74/0xa0 |
| start_delalloc_inodes+0x17c/0x3b0 [btrfs] |
| btrfs_start_delalloc_roots+0x17c/0x288 [btrfs] |
| shrink_delalloc+0x11c/0x280 [btrfs] |
| flush_space+0x288/0x328 [btrfs] |
| btrfs_async_reclaim_data_space+0x180/0x228 [btrfs] |
| process_one_work+0x228/0x680 |
| worker_thread+0x1bc/0x360 |
| kthread+0x100/0x118 |
| ret_from_fork+0x10/0x20 |
| ---[ end trace 0000000000000000 ]--- |
| BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0 |
| BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0 |
| Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 |
| BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0 |
| CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89 |
| Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022 |
| Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write) |
| pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) |
| pc : process_one_work+0x110/0x680 |
| lr : worker_thread+0x1bc/0x360 |
| Call trace: |
| process_one_work+0x110/0x680 (P) |
| worker_thread+0x1bc/0x360 (L) |
| worker_thread+0x1bc/0x360 |
| kthread+0x100/0x118 |
| ret_from_fork+0x10/0x20 |
| Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661) |
| ---[ end trace 0000000000000000 ]--- |
| Kernel panic - not syncing: Oops: Fatal exception |
| SMP: stopping secondary CPUs |
| SMP: failed to stop secondary CPUs 2-3 |
| Dumping ftrace buffer: |
| (ftrace buffer empty) |
| Kernel Offset: 0x275bb9540000 from 0xffff800080000000 |
| PHYS_OFFSET: 0xffff8fbba0000000 |
| CPU features: 0x100,00000070,00801250,8201720b |
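
The offsets in the messages above are in bytes, while the analysis that
follows uses K units and 64K page boundaries. The helpers below are a
minimal sketch of that conversion. The names to_kib(), folio_start() and
block_in_folio() are hypothetical and are not kernel functions; the page
and block sizes are the ones from this report.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes from the report: 64K pages, 4K blocks. */
#define PAGE_SIZE_BYTES   65536u
#define BLOCK_SIZE_BYTES  4096u

/* Convert a byte offset to K units; offsets in the log are 1K aligned. */
static inline uint64_t to_kib(uint64_t bytes)
{
    assert(bytes % 1024 == 0);
    return bytes / 1024;
}

/* Start offset of the page-sized folio containing @offset. */
static inline uint64_t folio_start(uint64_t offset)
{
    return offset - (offset % PAGE_SIZE_BYTES);
}

/* Index of the block containing @offset inside its folio (0..15 here). */
static inline unsigned int block_in_folio(uint64_t offset)
{
    return (unsigned int)((offset % PAGE_SIZE_BYTES) / BLOCK_SIZE_BYTES);
}
```

With these, start=1605632 and len=69632 map to [1568K, 1636K), and
folio=1572864 maps to 1536K. The first dirty block has index 8 inside
that folio, which matches submit_bitmap=8-15 if each bit stands for one
4K block (an assumption about the debug output, not stated in the log).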
| |
| [CAUSE] |
| The above warning is triggered immediately after the delalloc range |
| failure. It happens in the following sequence: |
| |
| - Range [1568K, 1636K) is dirty |
| |
| 1536K    1568K     1600K    1636K    1664K |
| |        |/////////|////////|        | |
| |
| Where 1536K, 1600K and 1664K are page boundaries (64K page size) |
| |
| - Enter extent_writepage() for page 1536K |
| |
| - Enter run_delalloc_nocow() with locked page 1536K and range |
| [1568K, 1636K) |
| This is due to the inode having preallocated extents. |
| |
| - Enter cow_file_range() with locked page 1536K and range |
| [1568K, 1636K) |
| |
| - btrfs_reserve_extent() only reserved two extents |
| The main loop of cow_file_range() only reserved two data extents. |
| |
| Now we have: |
| |
| 1536K    1568K        1600K   1636K    1664K |
| |        |<-->|<--->|/|///////|        | |
|               1584K 1596K |
| Range [1568K, 1596K) has an ordered extent reserved. |
| |
| - btrfs_reserve_extent() failed inside cow_file_range() for file offset |
| 1596K |
| This is already a bug in our space reservation code, but for now let's |
| focus on the error handling path. |
| |
| Now cow_file_range() returned -ENOSPC. |
| |
| - btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE |
| Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range |
| [1568K, 1636K) |
| |
| Function btrfs_cleanup_ordered_extents() normally needs to skip the |
| ranges inside the folio, as they will normally be cleaned up by |
| extent_writepage(). |
| |
| Such split error handling is already problematic in the first place. |
| |
| What's worse is the folio range skipping itself, which does not take |
| subpage cases into consideration at all: it will only skip the range |
| if the page start >= the range start. |
| In our case, the page start < the range start, since for subpage cases |
| we can have delalloc ranges inside the folio that do not cover the |
| whole folio. |
| |
| So it doesn't skip the page range at all. |
| This means all the ordered extents, both [1568K, 1584K) and |
| [1584K, 1596K) will be marked as IOERR. |
| |
| And since these two ordered extents have no more pending ios, they are |
| marked finished and *QUEUED* to be deleted from the io tree. |
| |
| - extent_writepage() does error cleanup |
| Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K). |
| |
| Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the |
| deletion from the io tree is async, so it may or may not have happened |
| at this time. |
| |
| If the ranges have not yet been removed, we will do double cleaning on |
| those ranges, triggering the above ordered extent warnings. |
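
The faulty skip in the sequence above can be sketched with a small
range model. This is a simplified illustration of the rule as described
in the text, not the kernel implementation: struct byte_range,
range_after_folio_skip() and overlap_bytes() are hypothetical names, and
the model assumes the locked folio is the folio where cleanup begins.

```c
#include <assert.h>
#include <stdint.h>

#define KIB(x) ((uint64_t)(x) * 1024)

/* Half-open byte range [start, end). */
struct byte_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Hypothetical model of the old skip rule: the locked folio is skipped
 * only when it starts at or after the start of the delalloc range.
 * Returns the range that the first cleanup pass will process.
 */
static inline struct byte_range
range_after_folio_skip(struct byte_range delalloc, struct byte_range folio)
{
    struct byte_range out = delalloc;

    assert(delalloc.start <= delalloc.end && folio.start <= folio.end);
    if (folio.start >= delalloc.start)
        out.start = folio.end < delalloc.end ? folio.end : delalloc.end;
    return out;
}

/* Number of bytes shared by two half-open ranges. */
static inline uint64_t overlap_bytes(struct byte_range a, struct byte_range b)
{
    uint64_t start = a.start > b.start ? a.start : b.start;
    uint64_t end = a.end < b.end ? a.end : b.end;

    return end > start ? end - start : 0;
}
```

For folio [1536K, 1600K) and delalloc range [1568K, 1636K), the folio
starts before the range, so nothing is skipped and the first pass keeps
[1568K, 1636K). That leaves 32K, [1568K, 1600K), shared with the later
extent_writepage() pass over the folio. If the range started at 1536K
instead, the same rule would skip the folio and the overlap would be 0.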
| |
| In theory there are other bugs; for instance, the cleanup in |
| extent_writepage() can cause double accounting on ranges that are |
| submitted asynchronously (e.g. compression). |
| |
| But that's much harder to trigger because normally we do not mix regular |
| and compression delalloc ranges. |
| |
| [FIX] |
| The folio range split is already buggy and not subpage compatible; it |
| was introduced a long time ago, when subpage support was not even considered. |
| |
| So instead of splitting the ordered extents cleanup into the folio range |
| and out of folio range, do all the cleanup inside writepage_delalloc(). |
| |
| - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in |
| btrfs_run_delalloc_range() |
| |
| - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc() |
| failed |
| |
| So all ordered extents are only cleaned up by |
| btrfs_run_delalloc_range(). |
| |
| - Handle the ranges that already have ordered extents allocated |
| If part of the folio already has an ordered extent allocated, and |
| btrfs_run_delalloc_range() failed, we also need to clean up that range. |
| |
| Now we have concentrated error handling for ordered extents during |
| btrfs_run_delalloc_range(). |
| |
| The Linux kernel CVE team has assigned CVE-2024-58089 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.12.17 with commit 21333148b5c9e52f41fafcedec3810b56a5e0e40 |
| Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.13.5 with commit 0283ee1912c8e243c931f4ee5b3672e954fe0384 |
| Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.14 with commit 72dad8e377afa50435940adfb697e070d3556670 |
| Issue introduced in 4.19.73 with commit eb124aaa2e85e9dceac37be5b7166a04b9b26735 |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-58089 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/extent_io.c |
| fs/btrfs/inode.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/21333148b5c9e52f41fafcedec3810b56a5e0e40 |
| https://git.kernel.org/stable/c/0283ee1912c8e243c931f4ee5b3672e954fe0384 |
| https://git.kernel.org/stable/c/72dad8e377afa50435940adfb697e070d3556670 |