 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2023-52737: btrfs: lock the inode in shared mode before starting fiemap

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 btrfs: lock the inode in shared mode before starting fiemap

 Currently fiemap does not take the inode's lock (VFS lock); it only locks
 a file range in the inode's io tree. This, however, can lead to a deadlock
 if we have a concurrent fsync on the file and fiemap code triggers a fault
 when accessing the user space buffer with fiemap_fill_next_extent(). The
 deadlock happens on the inode's i_mmap_lock semaphore, which is taken both
 by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by
 syzbot and triggers a trace like the following:

    task:syz-executor361 state:D stack:20264 pid:5668  ppid:5119   flags:0x00004004
    Call Trace:
     <TASK>
     context_switch kernel/sched/core.c:5293 [inline]
     __schedule+0x995/0xe20 kernel/sched/core.c:6606
     schedule+0xcb/0x190 kernel/sched/core.c:6682
     wait_on_state fs/btrfs/extent-io-tree.c:707 [inline]
     wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751
     lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742
     find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488
     writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863
     __extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174
     extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091
     extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211
     do_writepages+0x3c3/0x680 mm/page-writeback.c:2581
     filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388
     __filemap_fdatawrite_range mm/filemap.c:421 [inline]
     filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439
     btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline]
     start_ordered_ops fs/btrfs/file.c:1737 [inline]
     btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839
     generic_write_sync include/linux/fs.h:2885 [inline]
     btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684
     call_write_iter include/linux/fs.h:2189 [inline]
     new_sync_write fs/read_write.c:491 [inline]
     vfs_write+0x7dc/0xc50 fs/read_write.c:584
     ksys_write+0x177/0x2a0 fs/read_write.c:637
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f7d4054e9b9
    RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9
    RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006
    RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69
    R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8
     </TASK>
    INFO: task syz-executor361:5697 blocked for more than 145 seconds.
          Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:syz-executor361 state:D stack:21216 pid:5697  ppid:5119   flags:0x00004004
    Call Trace:
     <TASK>
     context_switch kernel/sched/core.c:5293 [inline]
     __schedule+0x995/0xe20 kernel/sched/core.c:6606
     schedule+0xcb/0x190 kernel/sched/core.c:6682
     rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095
     __down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260
     btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526
     do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947
     wp_page_shared+0x15e/0x380 mm/memory.c:3295
     handle_pte_fault mm/memory.c:4949 [inline]
     __handle_mm_fault mm/memory.c:5073 [inline]
     handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219
     do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428
     handle_page_fault arch/x86/mm/fault.c:1519 [inline]
     exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575
     asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
    RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233
    Code: 74 0a 89 (...)
    RSP: 0018:ffffc9000570f330 EFLAGS: 00050202
    RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007
    RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120
    RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83
    R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038
    R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0
     copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline]
     raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline]
     _copy_to_user+0xe9/0x130 lib/usercopy.c:34
     copy_to_user include/linux/uaccess.h:169 [inline]
     fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144
     emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458
     fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716
     extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922
     btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209
     ioctl_fiemap fs/ioctl.c:219 [inline]
     do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810
     __do_sys_ioctl fs/ioctl.c:868 [inline]
     __se_sys_ioctl+0x83/0x170 fs/ioctl.c:856
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f7d4054e9b9
    RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9
    RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005
    RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000
    R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69
    R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8
     </TASK>
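
 As an aid to reading the second trace: on x86-64 the ioctl request number
 is passed in RSI, and the value shown there (0xc020660b) matches
 FS_IOC_FIEMAP. The Python sketch below rebuilds that number from the uapi
 struct layout. The helper name iowr() and the Python constant names are
 illustrative only, and the encoding is the asm-generic one used on x86
 (some architectures encode the direction bits differently).

```python
import struct

# Fixed header of struct fiemap (include/uapi/linux/fiemap.h):
# fm_start, fm_length (u64), then fm_flags, fm_mapped_extents,
# fm_extent_count, fm_reserved (u32 each).
FIEMAP_HEADER = struct.Struct("=QQIIII")

# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2] (u64 each), then fe_flags, fe_reserved[3] (u32 each).
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")


def iowr(type_char, nr, size):
    """Illustrative helper: asm-generic _IOWR() encoding as used on x86."""
    ioc_read_write = 3  # _IOC_READ | _IOC_WRITE
    return (ioc_read_write << 30) | (size << 16) | (ord(type_char) << 8) | nr


# FS_IOC_FIEMAP is defined as _IOWR('f', 11, struct fiemap).
FS_IOC_FIEMAP = iowr("f", 11, FIEMAP_HEADER.size)

print(hex(FS_IOC_FIEMAP), FIEMAP_HEADER.size, FIEMAP_EXTENT.size)
```

 In the same trace RDX (0x20000100) is the user pointer passed to the ioctl.
 Adding the 32-byte header gives 0x20000120, the RDI value in the faulting
 copy frame, which suggests the fault hit the first extent slot of the user
 buffer.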

 What happens is the following:

 1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc
    before locking the inode and the i_mmap_lock semaphore, that is, before
    calling btrfs_inode_lock();

 2) After task A flushes delalloc and before it calls btrfs_inode_lock(),
    another task dirties a page;

 3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied
    at step 2 remains dirty and unflushed. It then enters extent_fiemap()
    and locks a file range that includes the range of the page dirtied in
    step 2;

 4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the
    inode's i_mmap_lock semaphore in write mode. Then it tries to flush
    delalloc by calling start_ordered_ops(), which will block, at
    find_lock_delalloc_range(), when trying to lock the range of the page
    dirtied at step 2, since this range was locked by the fiemap task (at
    step 3);

 5) Task B generates a page fault when accessing the user space fiemap
    buffer with a call to fiemap_fill_next_extent().

    The fault handler needs to call btrfs_page_mkwrite() for some other
    page of our inode, and there we deadlock when trying to lock the
    inode's i_mmap_lock semaphore in read mode, since the fsync task locked
    it in write mode (step 4) and the fsync task cannot progress because
    it's waiting to lock a file range that is currently locked by us (the
    fiemap task, step 3).
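
 The end state of steps 1 to 5 can be pictured as a wait-for graph: each
 task waits on a lock the other one holds. The Python sketch below is a
 user-space illustration only, not kernel code. The lock and task labels
 are shorthand for the objects named above, lock modes are ignored, and
 build_wait_for() and find_cycle() are hypothetical helpers.

```python
# Locks each task holds once step 5 is reached (modes omitted for brevity).
HOLDS = {
    "A (fsync)": ["inode lock", "i_mmap_lock"],
    "B (fiemap)": ["file range"],
}

# The lock each task is blocked on.
WANTS = {
    "A (fsync)": "file range",    # find_lock_delalloc_range()
    "B (fiemap)": "i_mmap_lock",  # btrfs_page_mkwrite() from the page fault
}


def build_wait_for(holds, wants):
    """Map each blocked task to the task that owns the lock it wants."""
    owner = {lock: task for task, locks in holds.items() for lock in locks}
    return {task: owner[lock] for task, lock in wants.items() if lock in owner}


def find_cycle(waits_for):
    """Return the tasks on a wait-for cycle, or None if there is none."""
    for start in waits_for:
        path, seen = [start], {start}
        node = waits_for.get(start)
        while node is not None:
            if node == start:
                return path
            if node in seen:
                break
            seen.add(node)
            path.append(node)
            node = waits_for.get(node)
    return None


WAITS_FOR = build_wait_for(HOLDS, WANTS)
print(find_cycle(WAITS_FOR))
```

 A cycle containing both tasks means neither can ever release what the
 other needs, which is the hang reported by the hung task detector.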

 Fix this by taking the inode's lock (VFS lock) in shared mode when
 entering fiemap. This effectively serializes fiemap with fsync (except the
 most expensive part of fsync, the log sync), preventing this deadlock.
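
 Why the shared inode lock closes the window can be checked with a small
 model. The Python sketch below is a simplification under stated
 assumptions, not the kernel change itself: each task is reduced to the
 ordered list of locks it takes, it keeps every lock until it finishes,
 "S" and "X" stand for shared and exclusive, and rwsem fairness and the
 early delalloc flush of step 1 are not modelled. All names are
 hypothetical.

```python
# Lock acquisition order per task: (lock name, mode).
FSYNC = [("inode", "X"), ("i_mmap", "X"), ("range", "X")]
FIEMAP_OLD = [("range", "X"), ("i_mmap", "S")]                  # before the fix
FIEMAP_FIXED = [("inode", "S"), ("range", "X"), ("i_mmap", "S")]  # after the fix


def held(ops, pc):
    """Locks a task holds: everything acquired so far, nothing once done."""
    return ops[:pc] if pc < len(ops) else []


def can_acquire(op, other_held):
    """A lock is available unless the other task holds it incompatibly."""
    name, mode = op
    return all(n != name or (m == "S" and mode == "S") for n, m in other_held)


def can_deadlock(a_ops, b_ops):
    """Explore every interleaving; True if some state leaves both stuck."""
    seen, stack = set(), [(0, 0)]
    while stack:
        pa, pb = stack.pop()
        if (pa, pb) in seen:
            continue
        seen.add((pa, pb))
        a_done, b_done = pa == len(a_ops), pb == len(b_ops)
        a_ok = not a_done and can_acquire(a_ops[pa], held(b_ops, pb))
        b_ok = not b_done and can_acquire(b_ops[pb], held(a_ops, pa))
        if not a_done and not b_done and not a_ok and not b_ok:
            return True
        if a_ok:
            stack.append((pa + 1, pb))
        if b_ok:
            stack.append((pa, pb + 1))
    return False


print(can_deadlock(FSYNC, FIEMAP_OLD), can_deadlock(FSYNC, FIEMAP_FIXED))
```

 In this model the old ordering reaches a state where fsync holds i_mmap
 and wants the range while fiemap holds the range and wants i_mmap. With
 the shared inode lock taken first, whichever task gets the inode lock
 runs to completion before the other can take any of the remaining locks.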

 The Linux kernel CVE team has assigned CVE-2023-52737 to this issue.


 Affected and fixed versions
 ===========================

 	Fixed in 6.1.13 with commit d8c594da79bc0244e610a70594e824a401802be1
 	Fixed in 6.2 with commit 519b7e13b5ae8dd38da1e52275705343be6bb508
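
 The two entries above can be turned into a rough check of a mainline or
 stable release string. The Python sketch below is an illustration only:
 fix_status() is a hypothetical helper, it knows nothing about vendor
 backports, and it reports "unknown" for anything this announcement does
 not cover (other stable series, and 6.2 release candidates such as the
 6.2.0-rc3 kernel in the trace).

```python
import re


def fix_status(release):
    """Classify a kernel release string against the versions listed above."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?(-rc\d+)?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    is_rc = m.group(4) is not None
    series = (major, minor)

    if series == (6, 1):
        # Stable 6.1.y: the fix is listed for 6.1.13.
        return "fixed" if patch >= 13 else "unfixed"
    if series > (6, 2) or (series == (6, 2) and not is_rc):
        # 6.2 final and everything released after it.
        return "fixed"
    # Older series and 6.2 release candidates are not covered here.
    return "unknown"


for rel in ("6.1.12", "6.1.13", "6.2.0-rc3-syzkaller", "6.2.1", "5.15.100"):
    print(rel, fix_status(rel))
```

 On a running system the string to test would typically come from
 "uname -r". Distribution kernels may carry the fix under a different
 version number, so treat the result as a hint rather than a verdict.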

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2023-52737
 will be updated if fixes are backported; please check it for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	fs/btrfs/extent_io.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/d8c594da79bc0244e610a70594e824a401802be1
 	https://git.kernel.org/stable/c/519b7e13b5ae8dd38da1e52275705343be6bb508