| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-44972: btrfs: do not clear page dirty inside extent_write_locked_range() |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: do not clear page dirty inside extent_write_locked_range() |
| |
| [BUG] |
| For the subpage + zoned case, the following workload can lead to a |
| reserved (rsv) data space leak that is detected at unmount time: |
| |
| # mkfs.btrfs -f -s 4k $dev |
| # mount $dev $mnt |
| # fsstress -w -n 8 -d $mnt -s 1709539240 |
| 0/0: fiemap - no filename |
| 0/1: copyrange read - no filename |
| 0/2: write - no filename |
| 0/3: rename - no source filename |
| 0/4: creat f0 x:0 0 0 |
| 0/4: creat add id=0,parent=-1 |
| 0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0 |
| 0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1 |
| 0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat() |
| 0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0 |
| # umount $mnt |
| |
| The dmesg includes the following rsv leak detection warning (all call |
| trace skipped): |
| |
| ------------[ cut here ]------------ |
| WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs] |
| ---[ end trace 0000000000000000 ]--- |
| ------------[ cut here ]------------ |
| WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs] |
| ---[ end trace 0000000000000000 ]--- |
| ------------[ cut here ]------------ |
| WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs] |
| ---[ end trace 0000000000000000 ]--- |
| BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6 |
| ------------[ cut here ]------------ |
| WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs] |
| ---[ end trace 0000000000000000 ]--- |
| BTRFS info (device sda): space_info DATA has 268218368 free, is not full |
| BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0 |
| BTRFS info (device sda): global_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): trans_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0 |
| ------------[ cut here ]------------ |
| WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs] |
| ---[ end trace 0000000000000000 ]--- |
| BTRFS info (device sda): space_info METADATA has 267796480 free, is not full |
| BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760 |
| BTRFS info (device sda): global_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): trans_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0 |
| BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0 |
| |
| The $dev above is a tcmu-runner emulated zoned HDD with a max zone |
| append size of 64K, and the system has a 64K page size. |
| |
| [CAUSE] |
| I have added several trace_printk() to show the events (header skipped): |
| |
| > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864 |
| |
| The above lines show our buffered write has dirtied 3 pages of inode |
| 259 of root 5: |
| |
| 704K              768K              832K              896K |
| I            |////I/////////////////I///////////|     I |
|              756K                               868K |
| |
| |///| is the dirtied range tracked by the subpage bitmaps, and 'I' is a |
| page boundary. |
| |
| Meanwhile all three pages (704K, 768K, 832K) have their PageDirty |
| flag set. |
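| |
| The per-page split in the trace lines above is plain offset arithmetic. |
| A minimal Python sketch, assuming the 64K page size of the test system |
| (split_by_page is a hypothetical helper, not a kernel symbol): |
| |
```python
PAGE_SIZE = 64 * 1024  # 64K pages, as stated for the test system


def split_by_page(start, length, page_size=PAGE_SIZE):
    """Split byte range [start, start + length) into per-page pieces.

    Returns (page_start, off_in_page, len_in_page) tuples, mirroring the
    fields printed by the btrfs_dirty_pages trace lines.
    """
    pieces = []
    cur = start
    end = start + length
    while cur < end:
        page_start = cur - (cur % page_size)
        piece_end = min(end, page_start + page_size)
        pieces.append((page_start, cur - page_start, piece_end - cur))
        cur = piece_end
    return pieces


# Buffered write from the trace: dirty start=774144 len=114688
pieces = split_by_page(774144, 114688)
for page_start, off, length in pieces:
    print(f"page={page_start} off_in_page={off} len_in_page={length}")
```
| |
| Each printed line corresponds to one "dirty part of page" trace line. |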
| |
| > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400 |
| |
| Then the direct IO write starts. Since the range [680K, 780K) covers the |
| beginning of the above dirty range, we need to writeback the two pages |
| at 704K and 768K. |
| |
| > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536 |
| > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536 |
| |
| The above 2 lines show that we're writing back the dirty range |
| [756K, 756K + 64K), i.e. [756K, 820K). |
| Only 64K is written back because the zoned device has a max zone append |
| size of 64K. |
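| |
| The end of this first writeback range follows from capping the dirty |
| length at the max zone append size. A rough sketch of that arithmetic |
| (first_writeback_range is an illustrative name; the real allocation |
| path in the kernel is more involved than a single min()): |
| |
```python
K = 1024
MAX_ZONE_APPEND = 64 * K  # max zone append size of the emulated device


def first_writeback_range(dirty_start, dirty_len, max_len=MAX_ZONE_APPEND):
    """Return the [start, end) byte range covered by the first ordered extent."""
    length = min(dirty_len, max_len)
    return dirty_start, dirty_start + length


# Dirty range from the trace: start=774144 len=114688
start, end = first_writeback_range(774144, 114688)
print(f"[{start // K}K, {end // K}K)")  # the range ends inside the page at 768K
```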
| |
| > extent_write_locked_range: r/i=5/259 clear dirty for page=786432 |
| |
| !!! The above line shows the root cause. !!! |
| |
| We're calling clear_page_dirty_for_io() inside extent_write_locked_range() |
| for the page at 768K. |
| This is because extent_write_locked_range() can go beyond the current |
| locked page; here it hits the page at 768K and clears its page dirty flag. |
| |
| This leads to a desync between the subpage dirty bitmap and the page |
| dirty flag. The page dirty flag is cleared, but the subpage range |
| [820K, 832K) is still dirty. |
| |
| After the writeback of range [756K, 820K), the dirty flags look like |
| this, as page 768K no longer has dirty flag set. |
| |
| 704K              768K              832K              896K |
| I                 I            |    I/////////////|   I |
|                                820K               868K |
| |
| This means we will no longer writeback range [820K, 832K), thus the |
| reserved data/metadata space would never be properly released. |
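| |
| The desync can be illustrated with a toy model that keeps a per-page |
| dirty flag next to a per-sector dirty set. This is only a sketch under |
| assumed simplifications (4K sectors, 64K pages, hypothetical names such |
| as ToyInode); it does not mirror the kernel's data structures, locking, |
| or the exact code paths involved: |
| |
```python
K = 1024
PAGE = 64 * K
SECTOR = 4 * K


class ToyInode:
    """Toy model: a page dirty flag per page plus a set of dirty sectors."""

    def __init__(self):
        self.page_dirty = set()    # page start offsets with the page flag set
        self.sector_dirty = set()  # sector start offsets marked dirty

    def dirty(self, start, length):
        for off in range(start, start + length, SECTOR):
            self.sector_dirty.add(off)
            self.page_dirty.add(off - off % PAGE)

    def write_locked_range(self, start, length, clear_whole_page):
        """Write back [start, start + length) and return the bytes written."""
        end = start + length
        written = {s for s in self.sector_dirty if start <= s < end}
        self.sector_dirty -= written
        for page in {s - s % PAGE for s in written}:
            still_dirty = any(page <= s < page + PAGE for s in self.sector_dirty)
            # Buggy variant: drop the page flag even if sectors remain dirty.
            if clear_whole_page or not still_dirty:
                self.page_dirty.discard(page)
        return len(written) * SECTOR

    def stranded_bytes(self):
        """Dirty sectors whose page flag is clear; writeback would skip them."""
        return SECTOR * sum(1 for s in self.sector_dirty
                            if s - s % PAGE not in self.page_dirty)


def run(clear_whole_page):
    inode = ToyInode()
    inode.dirty(774144, 114688)                                 # [756K, 868K)
    inode.write_locked_range(774144, 64 * K, clear_whole_page)  # [756K, 820K)
    return inode.stranded_bytes()


print("buggy:", run(True))   # sectors in [820K, 832K) lose their page flag
print("fixed:", run(False))  # page at 768K stays dirty, nothing is stranded
```
| |
| In this model the buggy variant strands 12288 bytes, the size of the |
| range [820K, 832K), which is consistent with the may_use=12288 value in |
| the DATA space_info dump above. |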
| |
| > extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432 |
| |
| Now even though we try to start writeback for page 768K, since the |
| page is not dirty, we completely skip it at extent_write_cache_pages() |
| time. |
| |
| > btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0 |
| |
| Now the direct IO finished. |
| |
| > cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864 |
| > extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864 |
| |
| Now we writeback the remaining dirty range, which is [832K, 868K). |
| As a result, the range [820K, 832K) is never submitted, leaking the |
| reserved space. |
| |
| This bug only affects the subpage + zoned case. For the non-subpage + |
| zoned case, we have exactly one sector for each page, so no page can be |
| partially dirty. |
| |
| For subpage and non-zoned case, we never go into run_delalloc_cow(), and |
| normally all the dirty subpage ranges would be properly submitted inside |
| __extent_writepage_io(). |
| |
| [FIX] |
| Do not clear the page dirty flag at all inside extent_write_locked_range(), |
| since __extent_writepage_io() already does a more accurate, subpage |
| compatible clear of the page and subpage dirty flags. |
| |
| Now the correct trace would look like this: |
| |
| > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536 |
| > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864 |
| |
| The page dirty part is still the same 3 pages. |
| |
| > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400 |
| > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536 |
| > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536 |
| |
| And the writeback for the first 64K is still correct. |
| |
| > cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152 |
| > extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152 |
| |
| Now with the fix, we can properly writeback the range [820K, 832K), and |
| properly release the reserved data/metadata space. |
| |
| The Linux kernel CVE team has assigned CVE-2024-44972 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Fixed in 6.6.46 with commit ba4dedb71356638d8284e34724daca944be70368 |
| Fixed in 6.10.5 with commit d3b403209f767e5857c1b9fda66726e6e6ffc99f |
| Fixed in 6.11 with commit 97713b1a2ced1e4a2a6c40045903797ebd44d7e0 |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-44972 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/extent_io.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/ba4dedb71356638d8284e34724daca944be70368 |
| https://git.kernel.org/stable/c/d3b403209f767e5857c1b9fda66726e6e6ffc99f |
| https://git.kernel.org/stable/c/97713b1a2ced1e4a2a6c40045903797ebd44d7e0 |