 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2024-44972: btrfs: do not clear page dirty inside extent_write_locked_range()

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 btrfs: do not clear page dirty inside extent_write_locked_range()

 [BUG]
 In the subpage + zoned case, the following workload can lead to a
 reserved (rsv) data space leak at unmount time:

   # mkfs.btrfs -f -s 4k $dev
   # mount $dev $mnt
   # fsstress -w -n 8 -d $mnt -s 1709539240
   0/0: fiemap - no filename
   0/1: copyrange read - no filename
   0/2: write - no filename
   0/3: rename - no source filename
   0/4: creat f0 x:0 0 0
   0/4: creat add id=0,parent=-1
   0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
   0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
   0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
   0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
   # umount $mnt

 The dmesg output includes the following rsv leak detection warnings (all
 call traces skipped):

   ------------[ cut here ]------------
   WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
   ---[ end trace 0000000000000000 ]---
   ------------[ cut here ]------------
   WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
   ---[ end trace 0000000000000000 ]---
   ------------[ cut here ]------------
   WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
   ---[ end trace 0000000000000000 ]---
   BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
   ------------[ cut here ]------------
   WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
   ---[ end trace 0000000000000000 ]---
   BTRFS info (device sda): space_info DATA has 268218368 free, is not full
   BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
   BTRFS info (device sda): global_block_rsv: size 0 reserved 0
   BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
   BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
   BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
   BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
   ------------[ cut here ]------------
   WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
   ---[ end trace 0000000000000000 ]---
   BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
   BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
   BTRFS info (device sda): global_block_rsv: size 0 reserved 0
   BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
   BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
   BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
   BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0

 Here $dev is a tcmu-runner emulated zoned HDD with a max zone append
 size of 64K, and the system has a 64K page size.

 [CAUSE]
 I have added several trace_printk() to show the events (header skipped):

   > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864

 The above lines show our buffered write has dirtied 3 pages of inode
 259 of root 5:

   704K             768K              832K              896K
   I           |////I/////////////////I///////////|     I
               756K                               868K

   |///| is the dirtied range using subpage bitmaps, and 'I' is the page
   boundary.

   Meanwhile all three pages (704K, 768K, 832K) have their PageDirty
   flag set.
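
 The per-page split shown in the trace can be reproduced with plain
 arithmetic. The following is a minimal sketch, not kernel code: the
 helper name split_by_page() is hypothetical, and the only inputs are the
 64K page size and the start/len values taken from the trace above.

```python
PAGE_SIZE = 64 * 1024  # 64K pages, as stated for the test system


def split_by_page(start, length, page_size=PAGE_SIZE):
    """Yield (page_start, off_in_page, len_in_page) for each page touched."""
    end = start + length
    pos = start
    while pos < end:
        page_start = pos - pos % page_size
        chunk_end = min(end, page_start + page_size)
        yield page_start, pos - page_start, chunk_end - pos
        pos = chunk_end


# Buffered write from the trace: start=774144 (756K), len=114688 (112K).
parts = list(split_by_page(774144, 114688))
for page, off, length in parts:
    print(f"page={page} off_in_page={off} len_in_page={length}")
```

 The three tuples match the three "dirty part of page" trace lines, that
 is, the pages at 704K, 768K and 832K.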

   > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400

 Then the direct IO write starts. Since its range [680K, 780K) covers the
 beginning of the above dirty range, we need to write back the
 two pages at 704K and 768K.
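
 Why exactly those two pages are involved can be checked by intersecting
 the pages touched by each range. This is an illustrative sketch only;
 pages_touched() is a hypothetical helper, not a btrfs function.

```python
PAGE_SIZE = 64 * 1024  # 64K pages, as stated for the test system


def pages_touched(start, length, page_size=PAGE_SIZE):
    """Return the set of page start offsets overlapped by [start, start+length)."""
    first = start // page_size
    last = (start + length - 1) // page_size
    return {index * page_size for index in range(first, last + 1)}


dirty_pages = pages_touched(774144, 114688)  # buffered write, [756K, 868K)
dio_pages = pages_touched(696320, 102400)    # direct IO write, [680K, 780K)

# Dirty pages that the direct IO range overlaps must be written back first.
need_writeback = sorted(dirty_pages & dio_pages)
print([offset // 1024 for offset in need_writeback])
```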

   > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
   > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536

 The above 2 lines show that we're writing back the dirty range
 [756K, 756K + 64K).
 We only write back 64K because the zoned device has a max zone append
 size of 64K.

   > extent_write_locked_range: r/i=5/259 clear dirty for page=786432

 !!! The above line shows the root cause. !!!

 We're calling clear_page_dirty_for_io() inside extent_write_locked_range()
 for the page at 768K.
 This is because extent_write_locked_range() can go beyond the current
 locked page; here we hit the page at 768K and clear its page dirty flag.

 This leads to a desync between the subpage dirty and page dirty
 flags.  We have the page dirty flag cleared, but the subpage range
 [820K, 832K) is still dirty.

 After the writeback of range [756K, 820K), the dirty flags look like
 this, as page 768K no longer has dirty flag set.

   704K             768K              832K              896K
   I                I      |          I/////////////|   I
                           820K                     868K

 This means we will no longer writeback range [820K, 832K), thus the
 reserved data/metadata space would never be properly released.
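
 The desync can be illustrated with a toy model of one 64K page holding
 sixteen 4K sectors. This is only a sketch under simplifying assumptions:
 ToyPage and leftover() are hypothetical names, and the model does not
 mirror the kernel's actual data structures. The flag
 clear_whole_page_flag=True stands in for the old behaviour of clearing
 the page dirty flag unconditionally.

```python
SECTOR_SIZE = 4 * 1024   # from "mkfs.btrfs -s 4k"
PAGE_SIZE = 64 * 1024    # 64K pages, as stated for the test system


class ToyPage:
    """One page with a page-level dirty flag and a per-sector dirty set."""

    def __init__(self, start):
        self.start = start
        self.page_dirty = False
        self.dirty_sectors = set()

    def _sectors(self, start, length):
        first = (start - self.start) // SECTOR_SIZE
        return range(first, first + length // SECTOR_SIZE)

    def mark_dirty(self, start, length):
        self.dirty_sectors.update(self._sectors(start, length))
        self.page_dirty = True

    def written_back(self, start, length, clear_whole_page_flag):
        self.dirty_sectors.difference_update(self._sectors(start, length))
        # Old behaviour: drop the page flag no matter what is left.
        # Subpage-aware behaviour: drop it only when no sector is dirty.
        if clear_whole_page_flag or not self.dirty_sectors:
            self.page_dirty = False


def leftover(clear_whole_page_flag):
    page = ToyPage(768 * 1024)
    page.mark_dirty(768 * 1024, PAGE_SIZE)
    # Only [768K, 820K) of this page belongs to the first 64K writeback.
    page.written_back(768 * 1024, 52 * 1024, clear_whole_page_flag)
    skipped = (not page.page_dirty) and bool(page.dirty_sectors)
    return skipped, len(page.dirty_sectors) * SECTOR_SIZE


buggy = leftover(clear_whole_page_flag=True)
fixed = leftover(clear_whole_page_flag=False)
print(buggy, fixed)
```

 In the buggy variant 12K of dirty sectors remain on a page that a
 writeback scan would skip as non-dirty; in the other variant the same
 12K is still reachable because the page flag stays set.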

   > extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432

 Now even though we try to start writeback for page 768K, since the
 page is not dirty, we completely skip it at extent_write_cache_pages()
 time.

   > btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0

 Now the direct IO finished.

   > cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
   > extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864

 Now we write back the remaining dirty range, which is [832K, 868K).
 As a result, the range [820K, 832K) is never submitted, leaking the
 reserved space.
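
 The size of the gap can be derived from the two ordered extents in the
 buggy trace. The sketch below is a plain interval computation;
 uncovered() is a hypothetical helper. The result, 12288 bytes, is
 consistent with the may_use=12288 value in the DATA space_info dump
 above, although that correspondence is an observation made here rather
 than something the commit message states.

```python
def uncovered(dirty, submitted):
    """Return the (start, end) parts of `dirty` not covered by `submitted`."""
    start, end = dirty
    gaps = []
    pos = start
    for sub_start, sub_end in sorted(submitted):
        if sub_start > pos:
            gaps.append((pos, min(sub_start, end)))
        pos = max(pos, sub_end)
    if pos < end:
        gaps.append((pos, end))
    return gaps


dirty = (774144, 774144 + 114688)          # [756K, 868K)
submitted = [
    (774144, 774144 + 65536),              # first ordered extent
    (851968, 851968 + 36864),              # second ordered extent (buggy trace)
]

gaps = uncovered(dirty, submitted)
leaked = sum(end - start for start, end in gaps)
print(gaps, leaked)
```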

 This bug only affects the subpage and zoned case.  For the non-subpage and
 zoned case, we have exactly one sector per page, thus no such partially
 dirty pages.

 For subpage and non-zoned case, we never go into run_delalloc_cow(), and
 normally all the dirty subpage ranges would be properly submitted inside
 __extent_writepage_io().

 [FIX]
 Just do not clear the page dirty flag at all inside
 extent_write_locked_range(), as __extent_writepage_io() would do a more
 accurate, subpage compatible clear of the page and subpage dirty flags
 anyway.

 Now the correct trace would look like this:

   > btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
   > btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864

 The page dirty part is still the same 3 pages.

   > btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
   > cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
   > extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536

 And the writeback for the first 64K is still correct.

   > cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
   > extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152

 Now with the fix, we can properly writeback the range [820K, 832K), and
 properly release the reserved data/metadata space.
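
 The two ordered extents in the fixed trace are what one gets by cutting
 the dirty range into pieces no larger than the 64K max zone append size.
 The following is a simplified model of that splitting, not the kernel's
 allocation logic; split_for_zone_append() is a hypothetical name.

```python
MAX_ZONE_APPEND = 64 * 1024  # max zone append size of the emulated device


def split_for_zone_append(start, length, max_append=MAX_ZONE_APPEND):
    """Cut [start, start+length) into (start, len) chunks of at most max_append."""
    chunks = []
    while length > 0:
        size = min(length, max_append)
        chunks.append((start, size))
        start += size
        length -= size
    return chunks


# Dirty range from the trace: start=774144 (756K), len=114688 (112K).
chunks = split_for_zone_append(774144, 114688)
print(chunks)
```

 The chunks correspond to filepos=774144 len=65536 and filepos=839680
 len=49152 in the fixed trace, so together they cover the whole dirty
 range, including [820K, 832K).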

 The Linux kernel CVE team has assigned CVE-2024-44972 to this issue.


 Affected and fixed versions
 ===========================

 	Fixed in 6.6.46 with commit ba4dedb71356638d8284e34724daca944be70368
 	Fixed in 6.10.5 with commit d3b403209f767e5857c1b9fda66726e6e6ffc99f
 	Fixed in 6.11 with commit 97713b1a2ced1e4a2a6c40045903797ebd44d7e0
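
 As a rough aid, the list above can be turned into a version check. This
 is a sketch under stated assumptions: has_fix() is a hypothetical helper,
 it only understands plain "X.Y" or "X.Y.Z" upstream version strings, and
 it returns None for branches that this announcement does not mention,
 since nothing here says whether they are affected or fixed. Vendor
 kernels may carry backports that this check cannot see.

```python
# Stable branches named in this announcement: (major, minor) -> first fixed patch.
FIXED_IN_BRANCH = {(6, 6): 46, (6, 10): 5}
FIRST_FIXED_MAINLINE = (6, 11)


def has_fix(version):
    """Return True/False for listed branches, None when the list cannot tell."""
    parts = [int(piece) for piece in version.split(".")]
    major, minor = parts[0], parts[1]
    patch = parts[2] if len(parts) > 2 else 0
    if (major, minor) >= FIRST_FIXED_MAINLINE:
        return True
    if (major, minor) in FIXED_IN_BRANCH:
        return patch >= FIXED_IN_BRANCH[(major, minor)]
    return None  # not covered by the "Fixed in" list


for candidate in ("6.6.45", "6.6.46", "6.10.5", "6.11", "6.1.100"):
    print(candidate, has_fix(candidate))
```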

 Please see https://www.kernel.org for a full list of currently supported
 kernel versions by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2024-44972
 will be updated if fixes are backported, please check that for the most
 up to date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	fs/btrfs/extent_io.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this, and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/ba4dedb71356638d8284e34724daca944be70368
 	https://git.kernel.org/stable/c/d3b403209f767e5857c1b9fda66726e6e6ffc99f
 	https://git.kernel.org/stable/c/97713b1a2ced1e4a2a6c40045903797ebd44d7e0