 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2024-58089: btrfs: fix double accounting race when btrfs_run_delalloc_range() failed

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 btrfs: fix double accounting race when btrfs_run_delalloc_range() failed

 [BUG]
 When running btrfs with a block size (4K) smaller than the page size (64K,
 aarch64), there is a very high chance of crashing the kernel when running
 generic/750, with the following messages
 (three extra debug messages were added before the call traces):

   BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
   BTRFS info (device dm-3): checking UUID tree
   hrtimer: interrupt took 5451385 ns
   BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
   BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
   BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
   ------------[ cut here ]------------
   WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
   CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G           OE      6.13.0-rc1-custom+ #89
   Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
   Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
   pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
   lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
   Call trace:
    can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
    can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
    btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
    extent_writepage+0x10c/0x3b8 [btrfs]
    extent_write_cache_pages+0x21c/0x4e8 [btrfs]
    btrfs_writepages+0x94/0x160 [btrfs]
    do_writepages+0x74/0x190
    filemap_fdatawrite_wbc+0x74/0xa0
    start_delalloc_inodes+0x17c/0x3b0 [btrfs]
    btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
    shrink_delalloc+0x11c/0x280 [btrfs]
    flush_space+0x288/0x328 [btrfs]
    btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
    process_one_work+0x228/0x680
    worker_thread+0x1bc/0x360
    kthread+0x100/0x118
    ret_from_fork+0x10/0x20
   ---[ end trace 0000000000000000 ]---
   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
   Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
   BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
   CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G        W  OE      6.13.0-rc1-custom+ #89
   Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
   Workqueue:  btrfs_work_helper [btrfs] (btrfs-endio-write)
   pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
   pc : process_one_work+0x110/0x680
   lr : worker_thread+0x1bc/0x360
   Call trace:
    process_one_work+0x110/0x680 (P)
    worker_thread+0x1bc/0x360 (L)
    worker_thread+0x1bc/0x360
    kthread+0x100/0x118
    ret_from_fork+0x10/0x20
   Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
   ---[ end trace 0000000000000000 ]---
   Kernel panic - not syncing: Oops: Fatal exception
   SMP: stopping secondary CPUs
   SMP: failed to stop secondary CPUs 2-3
   Dumping ftrace buffer:
      (ftrace buffer empty)
   Kernel Offset: 0x275bb9540000 from 0xffff800080000000
   PHYS_OFFSET: 0xffff8fbba0000000
   CPU features: 0x100,00000070,00801250,8201720b
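
 The byte offsets in the messages above map onto the K-based ranges used
 in the rest of this description. A minimal sketch of that arithmetic
 follows; the helper name to_k_range is hypothetical, and reading the
 first bit of submit_bitmap=8-15 as the index of the first dirty block in
 the folio is an assumption based on the numbers lining up:

```python
K = 1024
BLOCK_SIZE = 4096  # "sector size 4096" in the log


def to_k_range(start, length):
    """Format a (start, length) byte pair as a half-open range in K."""
    return f"[{start // K}K, {(start + length) // K}K)"


# All values below are copied from the log messages.
delalloc_range = to_k_range(1605632, 69632)  # start=1605632 len=69632
first_oe = to_k_range(1605632, 16384)        # OE offset=1605632 OE len=16384
second_oe = to_k_range(1622016, 12288)       # OE offset=1622016 OE len=12288
folio_start_k = 1572864 // K                 # folio=1572864

# Block index of the delalloc range start inside the folio.
first_dirty_block = (1605632 - 1572864) // BLOCK_SIZE

print(delalloc_range, first_oe, second_oe, f"{folio_start_k}K", first_dirty_block)
```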

 [CAUSE]
 The above warning is triggered immediately after the delalloc range
 failure. It happens in the following sequence:

 - Range [1568K, 1636K) is dirty

    1536K  1568K     1600K    1636K  1664K
    |      |/////////|////////|      |

   Where 1536K, 1600K and 1664K are page boundaries (64K page size)

 - Enter extent_writepage() for page 1536K

 - Enter run_delalloc_nocow() with locked page 1536K and range
   [1568K, 1636K)
   This is due to the inode having preallocated extents.

 - Enter cow_file_range() with locked page 1536K and range
   [1568K, 1636K)

 - btrfs_reserve_extent() only reserved two extents
   The main loop of cow_file_range() only reserved two data extents.

   Now we have:

    1536K  1568K        1600K    1636K  1664K
    |      |<-->|<--->|/|///////|      |
                1584K  1596K
   Range [1568K, 1596K) has ordered extents reserved.

 - btrfs_reserve_extent() failed inside cow_file_range() for file offset
   1596K
   This is already a bug in our space reservation code, but for now let's
   focus on the error handling path.

   Now cow_file_range() returned -ENOSPC.

 - btrfs_run_delalloc_range() does error cleanup <<< ROOT CAUSE
   It calls btrfs_cleanup_ordered_extents() with locked folio 1536K and range
   [1568K, 1636K)

   Function btrfs_cleanup_ordered_extents() normally needs to skip the
   ranges inside the folio, as they will normally be cleaned up by
   extent_writepage().

   Such split error handling is already problematic in the first place.

   What's worse is the folio range skipping itself, which does not take
   subpage cases into consideration at all: it will only skip the range
   if the page start >= the range start.
   In our case, the page start < the range start, since for subpage cases
   we can have delalloc ranges inside the folio but not covering the
   folio.

   So it doesn't skip the page range at all.
   This means all the ordered extents, both [1568K, 1584K) and
   [1584K, 1596K), will be marked as IOERR.

   Since these two ordered extents have no more pending IOs, they are
   marked finished and *QUEUED* to be deleted from the io tree.

 - extent_writepage() does error cleanup
   It calls btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).

   Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
   deletion from the io tree is asynchronous, so it may or may not have
   happened by this time.

   If the ranges have not yet been removed, we will do double cleaning on
   those ranges, triggering the above ordered extent warnings.
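
 The skip decision in the root cause step can be sketched as below. This
 is a toy model of the condition as described in this message (skip only
 if the page start >= the range start), not the kernel code;
 range_to_clean is a hypothetical helper, and it assumes the locked folio
 holds the start of the range:

```python
K = 1024


def range_to_clean(folio_start, folio_size, start, length):
    """Return the half-open byte range (start, end) that would be handed to
    the ordered extent cleanup, or None if nothing is left after skipping.

    Hypothetical helper modelling only the condition described in the text.
    """
    end = start + length
    folio_end = folio_start + folio_size
    if folio_start >= start:
        # The part inside the locked folio is left to extent_writepage().
        start = folio_end
    return (start, end) if start < end else None


# Block size == page size (4K): the locked folio sits at the range start,
# so its 4K are skipped and left for the caller to clean up.
regular = range_to_clean(1568 * K, 4 * K, 1568 * K, 68 * K)

# Subpage case from this report: folio starts at 1536K, range at 1568K.
# The page start is below the range start, so nothing is skipped, even
# though [1568K, 1600K) lies inside the locked folio.
subpage = range_to_clean(1536 * K, 64 * K, 1568 * K, 68 * K)

print(regular, subpage)
```

 In the subpage case the whole range [1568K, 1636K) is cleaned here, and
 extent_writepage() later cleans [1536K, 1600K) again, so the two passes
 overlap on [1568K, 1600K).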

 In theory there are other bugs: the cleanup in extent_writepage() can,
 for instance, cause double accounting on ranges that are submitted
 asynchronously (such as compression).

 But that's much harder to trigger because normally we do not mix regular
 and compression delalloc ranges.
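
 The double accounting itself can be illustrated with a toy model of the
 pending-bytes counter. OrderedExtent and mark_finished are simplified,
 hypothetical stand-ins rather than the kernel structures, and the model
 assumes both ordered extents are still present when the second pass
 runs (the losing side of the race):

```python
K = 1024


class OrderedExtent:
    """Toy stand-in: only tracks how many bytes are still pending."""

    def __init__(self, offset, length):
        self.offset = offset
        self.length = length
        self.bytes_left = length


def mark_finished(extents, start, end):
    """Account [start, end) as finished; return over-accounting warnings."""
    warnings = []
    for oe in extents:
        lo = max(start, oe.offset)
        hi = min(end, oe.offset + oe.length)
        to_dec = hi - lo
        if to_dec <= 0:
            continue  # no overlap with this ordered extent
        if to_dec > oe.bytes_left:
            warnings.append(
                "bad ordered extent accounting, "
                f"OE offset={oe.offset} OE len={oe.length} "
                f"to_dec={to_dec} left={oe.bytes_left}"
            )
            continue
        oe.bytes_left -= to_dec
    return warnings


extents = [OrderedExtent(1568 * K, 16 * K), OrderedExtent(1584 * K, 12 * K)]

# First cleanup pass over the full delalloc range [1568K, 1636K).
first_pass = mark_finished(extents, 1568 * K, 1636 * K)
# Second cleanup pass over the folio range [1536K, 1600K).
second_pass = mark_finished(extents, 1536 * K, 1600 * K)

for line in second_pass:
    print(line)
```

 The first pass consumes all pending bytes without complaint. The second
 pass then tries to subtract the same bytes again, which is the
 to_dec=16384 left=0 and to_dec=12288 left=0 pattern seen in the log.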

 [FIX]
 The folio range split is already buggy and not subpage compatible. It
 was introduced a long time ago, when subpage support was not even considered.

 So instead of splitting the ordered extents cleanup into the folio range
 and out of folio range, do all the cleanup inside writepage_delalloc().

 - Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
   btrfs_run_delalloc_range()

 - Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
   failed

   So all ordered extents are only cleaned up by
   btrfs_run_delalloc_range().

 - Handle the ranges that already have ordered extents allocated
   If part of the folio already has ordered extents allocated, and
   btrfs_run_delalloc_range() failed, we also need to clean up that range.

 Now we have concentrated error handling for ordered extents during
 btrfs_run_delalloc_range().

 The Linux kernel CVE team has assigned CVE-2024-58089 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.12.17 with commit 21333148b5c9e52f41fafcedec3810b56a5e0e40
 	Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.13.5 with commit 0283ee1912c8e243c931f4ee5b3672e954fe0384
 	Issue introduced in 5.0 with commit d1051d6ebf8ef3517a5a3cf82bba8436d190f1c2 and fixed in 6.14 with commit 72dad8e377afa50435940adfb697e070d3556670
 	Issue introduced in 4.19.73 with commit eb124aaa2e85e9dceac37be5b7166a04b9b26735

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2024-58089
 will be updated if fixes are backported; please check it for the most
 up-to-date information about this issue.
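
 As a rough aid for reading the version list above, the sketch below
 checks an upstream version string against the ranges listed at the time
 of this announcement. The function name is hypothetical, the treatment
 of series without a listed fix as affected is an assumption, and vendor
 kernels with their own backports are not covered:

```python
# Stable series with a listed fix: (major, minor) -> first fixed patch level.
FIXED_IN_SERIES = {(6, 12): 17, (6, 13): 5}


def is_listed_as_affected(version):
    """Return True if an upstream version falls in a listed affected range.

    Hypothetical helper based only on the list above; it does not know
    about later backports or distribution kernels.
    """
    parts = tuple(int(p) for p in version.split("."))
    parts = parts + (0,) * (3 - len(parts))  # "6.14" -> (6, 14, 0)
    series, patch = parts[:2], parts[2]

    if series == (4, 19):
        return patch >= 73        # introduced in 4.19.73, no fix listed
    if parts < (5, 0, 0):
        return False              # older than the introducing release
    if parts >= (6, 14, 0):
        return False              # fixed in 6.14
    if series in FIXED_IN_SERIES:
        return patch < FIXED_IN_SERIES[series]
    return True                   # 5.0 up to 6.13 with no fix listed


for v in ("4.19.72", "4.19.73", "5.0", "6.12.16", "6.12.17", "6.13.5", "6.14"):
    print(v, is_listed_as_affected(v))
```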


 Affected files
 ==============

 The file(s) affected by this issue are:
 	fs/btrfs/extent_io.c
 	fs/btrfs/inode.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this, and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/21333148b5c9e52f41fafcedec3810b56a5e0e40
 	https://git.kernel.org/stable/c/0283ee1912c8e243c931f4ee5b3672e954fe0384
 	https://git.kernel.org/stable/c/72dad8e377afa50435940adfb697e070d3556670