| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-37354: btrfs: fix crash on racing fsync and size-extending write into prealloc |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: fix crash on racing fsync and size-extending write into prealloc |
| |
| We have been seeing crashes on duplicate keys in |
| btrfs_set_item_key_safe(): |
| |
| BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) |
| ------------[ cut here ]------------ |
| kernel BUG at fs/btrfs/ctree.c:2620! |
| invalid opcode: 0000 [#1] PREEMPT SMP PTI |
| CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 |
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 |
| RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] |
| |
| With the following stack trace: |
| |
| #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) |
| #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) |
| #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) |
| #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) |
| #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) |
| #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) |
| #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) |
| #7 btrfs_sync_file (fs/btrfs/file.c:1933:8) |
| #8 vfs_fsync_range (fs/sync.c:188:9) |
| #9 vfs_fsync (fs/sync.c:202:9) |
| #10 do_fsync (fs/sync.c:212:9) |
| #11 __do_sys_fdatasync (fs/sync.c:225:9) |
| #12 __se_sys_fdatasync (fs/sync.c:223:1) |
| #13 __x64_sys_fdatasync (fs/sync.c:223:1) |
| #14 do_syscall_x64 (arch/x86/entry/common.c:52:14) |
| #15 do_syscall_64 (arch/x86/entry/common.c:83:7) |
| #16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) |
| |
| So we're logging a changed extent from fsync, which is splitting an |
| extent in the log tree. But this split part already exists in the tree, |
| triggering the BUG(). |
| |
| This is the state of the log tree at the time of the crash, dumped with |
| drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) |
| to get more details than btrfs_print_leaf() gives us: |
| |
| >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) |
| leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 |
| leaf 33439744 flags 0x100000000000000 |
| fs uuid e5bd3946-400c-4223-8923-190ef1f18677 |
| chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da |
| item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 |
| generation 7 transid 9 size 8192 nbytes 8473563889606862198 |
| block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 |
| sequence 204 flags 0x10(PREALLOC) |
| atime 1716417703.220000000 (2024-05-22 15:41:43) |
| ctime 1716417704.983333333 (2024-05-22 15:41:44) |
| mtime 1716417704.983333333 (2024-05-22 15:41:44) |
| otime 17592186044416.000000000 (559444-03-08 01:40:16) |
| item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 |
| index 195 namelen 3 name: 193 |
| item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 |
| location key (0 UNKNOWN.0 0) type XATTR |
| transid 7 data_len 1 name_len 6 |
| name: user.a |
| data a |
| item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 |
| generation 9 type 1 (regular) |
| extent data disk byte 303144960 nr 12288 |
| extent data offset 0 nr 4096 ram 12288 |
| extent compression 0 (none) |
| item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 |
| generation 9 type 2 (prealloc) |
| prealloc data disk byte 303144960 nr 12288 |
| prealloc data offset 4096 nr 8192 |
| item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 |
| generation 9 type 2 (prealloc) |
| prealloc data disk byte 303144960 nr 12288 |
| prealloc data offset 8192 nr 4096 |
| ... |
| |
| So the real problem happened earlier: notice that items 4 (4k-12k) and 5 |
| (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and |
| item 5 starts at i_size. |
| |
| Here is the state of the filesystem tree at the time of the crash: |
| |
| >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root |
| >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) |
| >>> print_extent_buffer(nodes[0]) |
| leaf 30425088 level 0 items 184 generation 9 owner 5 |
| leaf 30425088 flags 0x100000000000000 |
| fs uuid e5bd3946-400c-4223-8923-190ef1f18677 |
| chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da |
| ... |
| item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 |
| generation 7 transid 7 size 4096 nbytes 12288 |
| block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 |
| sequence 6 flags 0x10(PREALLOC) |
| atime 1716417703.220000000 (2024-05-22 15:41:43) |
| ctime 1716417703.220000000 (2024-05-22 15:41:43) |
| mtime 1716417703.220000000 (2024-05-22 15:41:43) |
| otime 1716417703.220000000 (2024-05-22 15:41:43) |
| item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 |
| index 195 namelen 3 name: 193 |
| item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 |
| location key (0 UNKNOWN.0 0) type XATTR |
| transid 7 data_len 1 name_len 6 |
| name: user.a |
| data a |
| item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 |
| generation 9 type 1 (regular) |
| extent data disk byte 303144960 nr 12288 |
| extent data offset 0 nr 8192 ram 12288 |
| extent compression 0 (none) |
| item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 |
| generation 9 type 2 (prealloc) |
| prealloc data disk byte 303144960 nr 12288 |
| prealloc data offset 8192 nr 4096 |
| |
| Item 5 in the log tree corresponds to item 183 in the filesystem tree, |
| but nothing matches item 4. Furthermore, item 183 is the last item in |
| the leaf. |
| |
| btrfs_log_prealloc_extents() is responsible for logging prealloc extents |
| beyond i_size. It first truncates any previously logged prealloc extents |
| that start beyond i_size. Then, it walks the filesystem tree and copies |
| the prealloc extent items to the log tree. |
| |
| If it hits the end of a leaf, then it calls btrfs_next_leaf(), which |
| unlocks the tree and does another search. However, while the filesystem |
| tree is unlocked, an ordered extent completion may modify the tree. In |
| particular, it may insert an extent item that overlaps with an extent |
| item that was already copied to the log tree. |
| |
| This may manifest in several ways depending on the exact scenario, |
| including an EEXIST error that is silently translated to a full sync, |
| overlapping items in the log tree, or this crash. This particular crash |
| is triggered by the following sequence of events: |
| |
| - Initially, the file has i_size=4k, a regular extent from 0-4k, and a |
| prealloc extent beyond i_size from 4k-12k. The prealloc extent item is |
| the last item in its B-tree leaf. |
| - The file is fsync'd, which copies its inode item and both extent items |
| to the log tree. |
| - An xattr is set on the file, which sets the |
| BTRFS_INODE_COPY_EVERYTHING flag. |
| - The range 4k-8k in the file is written using direct I/O. i_size is |
| extended to 8k, but the ordered extent is still in flight. |
| - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this |
| calls copy_inode_items_to_log(), which calls |
| btrfs_log_prealloc_extents(). |
| - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the |
| filesystem tree. Since it starts before i_size, it skips it. Since it |
| is the last item in its B-tree leaf, it calls btrfs_next_leaf(). |
| - btrfs_next_leaf() unlocks the path. |
| - The ordered extent completion runs, which converts the 4k-8k part of |
| the prealloc extent to written and inserts the remaining prealloc part |
| from 8k-12k. |
| - btrfs_next_leaf() does a search and finds the new prealloc extent |
| 8k-12k. |
| - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into |
| the log tree. Note that it overlaps with the 4k-12k prealloc extent |
| that was copied to the log tree by the first fsync. |
| - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k |
| extent that was written. |
| - This tries to drop the range 4k-8k in the log tree, which requires |
| adjusting the start of the 4k-12k prealloc extent in the log tree to |
| 8k. |
| - btrfs_set_item_key_safe() sees that there is already an extent |
| starting at 8k in the log tree and calls BUG(). |
| |
| Fix this by detecting when we're about to insert an overlapping file |
| extent item in the log tree and truncating the part that would overlap. |
| |
| The Linux kernel CVE team has assigned CVE-2024-37354 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Fixed in 6.1.94 with commit 1ff2bd566fbcefcb892be85c493bdb92b911c428 |
| Fixed in 6.6.34 with commit 3d08c52ba1887a1ff9c179d4b6a18b427bcb2097 |
| Fixed in 6.9.5 with commit f4e5ed974876c14d3623e04dc43d3e3281bc6011 |
| Fixed in 6.10 with commit 9d274c19a71b3a276949933859610721a453946b |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-37354 |
| will be updated if fixes are backported, please check that for the most |
| up to date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/tree-log.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/1ff2bd566fbcefcb892be85c493bdb92b911c428 |
| https://git.kernel.org/stable/c/3d08c52ba1887a1ff9c179d4b6a18b427bcb2097 |
| https://git.kernel.org/stable/c/f4e5ed974876c14d3623e04dc43d3e3281bc6011 |
| https://git.kernel.org/stable/c/9d274c19a71b3a276949933859610721a453946b |