| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2023-52737: btrfs: lock the inode in shared mode before starting fiemap |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: lock the inode in shared mode before starting fiemap |
| |
| Currently fiemap does not take the inode's lock (VFS lock); it only locks |
| a file range in the inode's io tree. This, however, can lead to a deadlock |
| if we have a concurrent fsync on the file and fiemap code triggers a fault |
| when accessing the user space buffer with fiemap_fill_next_extent(). The |
| deadlock happens on the inode's i_mmap_lock semaphore, which is taken both |
| by fsync and btrfs_page_mkwrite(). This deadlock was recently reported by |
| syzbot and triggers a trace like the following: |
| |
| task:syz-executor361 state:D stack:20264 pid:5668 ppid:5119 flags:0x00004004 |
| Call Trace: |
| <TASK> |
| context_switch kernel/sched/core.c:5293 [inline] |
| __schedule+0x995/0xe20 kernel/sched/core.c:6606 |
| schedule+0xcb/0x190 kernel/sched/core.c:6682 |
| wait_on_state fs/btrfs/extent-io-tree.c:707 [inline] |
| wait_extent_bit+0x577/0x6f0 fs/btrfs/extent-io-tree.c:751 |
| lock_extent+0x1c2/0x280 fs/btrfs/extent-io-tree.c:1742 |
| find_lock_delalloc_range+0x4e6/0x9c0 fs/btrfs/extent_io.c:488 |
| writepage_delalloc+0x1ef/0x540 fs/btrfs/extent_io.c:1863 |
| __extent_writepage+0x736/0x14e0 fs/btrfs/extent_io.c:2174 |
| extent_write_cache_pages+0x983/0x1220 fs/btrfs/extent_io.c:3091 |
| extent_writepages+0x219/0x540 fs/btrfs/extent_io.c:3211 |
| do_writepages+0x3c3/0x680 mm/page-writeback.c:2581 |
| filemap_fdatawrite_wbc+0x11e/0x170 mm/filemap.c:388 |
| __filemap_fdatawrite_range mm/filemap.c:421 [inline] |
| filemap_fdatawrite_range+0x175/0x200 mm/filemap.c:439 |
| btrfs_fdatawrite_range fs/btrfs/file.c:3850 [inline] |
| start_ordered_ops fs/btrfs/file.c:1737 [inline] |
| btrfs_sync_file+0x4ff/0x1190 fs/btrfs/file.c:1839 |
| generic_write_sync include/linux/fs.h:2885 [inline] |
| btrfs_do_write_iter+0xcd3/0x1280 fs/btrfs/file.c:1684 |
| call_write_iter include/linux/fs.h:2189 [inline] |
| new_sync_write fs/read_write.c:491 [inline] |
| vfs_write+0x7dc/0xc50 fs/read_write.c:584 |
| ksys_write+0x177/0x2a0 fs/read_write.c:637 |
| do_syscall_x64 arch/x86/entry/common.c:50 [inline] |
| do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 |
| entry_SYSCALL_64_after_hwframe+0x63/0xcd |
| RIP: 0033:0x7f7d4054e9b9 |
| RSP: 002b:00007f7d404fa2f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 |
| RAX: ffffffffffffffda RBX: 00007f7d405d87a0 RCX: 00007f7d4054e9b9 |
| RDX: 0000000000000090 RSI: 0000000020000000 RDI: 0000000000000006 |
| RBP: 00007f7d405a51d0 R08: 0000000000000000 R09: 0000000000000000 |
| R10: 0000000000000000 R11: 0000000000000246 R12: 61635f65646f6e69 |
| R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87a8 |
| </TASK> |
| INFO: task syz-executor361:5697 blocked for more than 145 seconds. |
| Not tainted 6.2.0-rc3-syzkaller-00376-g7c6984405241 #0 |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| task:syz-executor361 state:D stack:21216 pid:5697 ppid:5119 flags:0x00004004 |
| Call Trace: |
| <TASK> |
| context_switch kernel/sched/core.c:5293 [inline] |
| __schedule+0x995/0xe20 kernel/sched/core.c:6606 |
| schedule+0xcb/0x190 kernel/sched/core.c:6682 |
| rwsem_down_read_slowpath+0x5f9/0x930 kernel/locking/rwsem.c:1095 |
| __down_read_common+0x54/0x2a0 kernel/locking/rwsem.c:1260 |
| btrfs_page_mkwrite+0x417/0xc80 fs/btrfs/inode.c:8526 |
| do_page_mkwrite+0x19e/0x5e0 mm/memory.c:2947 |
| wp_page_shared+0x15e/0x380 mm/memory.c:3295 |
| handle_pte_fault mm/memory.c:4949 [inline] |
| __handle_mm_fault mm/memory.c:5073 [inline] |
| handle_mm_fault+0x1b79/0x26b0 mm/memory.c:5219 |
| do_user_addr_fault+0x69b/0xcb0 arch/x86/mm/fault.c:1428 |
| handle_page_fault arch/x86/mm/fault.c:1519 [inline] |
| exc_page_fault+0x7a/0x110 arch/x86/mm/fault.c:1575 |
| asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570 |
| RIP: 0010:copy_user_short_string+0xd/0x40 arch/x86/lib/copy_user_64.S:233 |
| Code: 74 0a 89 (...) |
| RSP: 0018:ffffc9000570f330 EFLAGS: 00050202 |
| RAX: ffffffff843e6601 RBX: 00007fffffffefc8 RCX: 0000000000000007 |
| RDX: 0000000000000000 RSI: ffffc9000570f3e0 RDI: 0000000020000120 |
| RBP: ffffc9000570f490 R08: 0000000000000000 R09: fffff52000ae1e83 |
| R10: fffff52000ae1e83 R11: 1ffff92000ae1e7c R12: 0000000000000038 |
| R13: ffffc9000570f3e0 R14: 0000000020000120 R15: ffffc9000570f3e0 |
| copy_user_generic arch/x86/include/asm/uaccess_64.h:37 [inline] |
| raw_copy_to_user arch/x86/include/asm/uaccess_64.h:58 [inline] |
| _copy_to_user+0xe9/0x130 lib/usercopy.c:34 |
| copy_to_user include/linux/uaccess.h:169 [inline] |
| fiemap_fill_next_extent+0x22e/0x410 fs/ioctl.c:144 |
| emit_fiemap_extent+0x22d/0x3c0 fs/btrfs/extent_io.c:3458 |
| fiemap_process_hole+0xa00/0xad0 fs/btrfs/extent_io.c:3716 |
| extent_fiemap+0xe27/0x2100 fs/btrfs/extent_io.c:3922 |
| btrfs_fiemap+0x172/0x1e0 fs/btrfs/inode.c:8209 |
| ioctl_fiemap fs/ioctl.c:219 [inline] |
| do_vfs_ioctl+0x185b/0x2980 fs/ioctl.c:810 |
| __do_sys_ioctl fs/ioctl.c:868 [inline] |
| __se_sys_ioctl+0x83/0x170 fs/ioctl.c:856 |
| do_syscall_x64 arch/x86/entry/common.c:50 [inline] |
| do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 |
| entry_SYSCALL_64_after_hwframe+0x63/0xcd |
| RIP: 0033:0x7f7d4054e9b9 |
| RSP: 002b:00007f7d390d92f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 |
| RAX: ffffffffffffffda RBX: 00007f7d405d87b0 RCX: 00007f7d4054e9b9 |
| RDX: 0000000020000100 RSI: 00000000c020660b RDI: 0000000000000005 |
| RBP: 00007f7d405a51d0 R08: 00007f7d390d9700 R09: 0000000000000000 |
| R10: 00007f7d390d9700 R11: 0000000000000246 R12: 61635f65646f6e69 |
| R13: 65646f7475616f6e R14: 7261637369646f6e R15: 00007f7d405d87b8 |
| </TASK> |
| |
| What happens is the following: |
| |
| 1) Task A is doing an fsync, enters btrfs_sync_file() and flushes delalloc |
| before locking the inode and the i_mmap_lock semaphore, that is, before |
| calling btrfs_inode_lock(); |
| |
| 2) After task A flushes delalloc and before it calls btrfs_inode_lock(), |
| another task dirties a page; |
| |
| 3) Task B starts a fiemap without FIEMAP_FLAG_SYNC, so the page dirtied |
| at step 2 remains dirty and unflushed. Then it enters |
| extent_fiemap() and locks a file range that includes the range of |
| the page dirtied in step 2; |
| |
| 4) Task A calls btrfs_inode_lock() and locks the inode (VFS lock) and the |
| inode's i_mmap_lock semaphore in write mode. Then it tries to flush |
| delalloc by calling start_ordered_ops(), which will block, at |
| find_lock_delalloc_range(), when trying to lock the range of the page |
| dirtied at step 2, since this range was locked by the fiemap task (at |
| step 3); |
| |
| 5) Task B generates a page fault when accessing the user space fiemap |
| buffer with a call to fiemap_fill_next_extent(). |
| |
| The fault handler needs to call btrfs_page_mkwrite() for some other |
| page of our inode, and there we deadlock when trying to lock the |
| inode's i_mmap_lock semaphore in read mode, since the fsync task locked |
| it in write mode (step 4) and the fsync task cannot progress because |
| it's waiting to lock a file range that is currently locked by us (the |
| fiemap task, step 3). |
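The circular wait described in steps 3 to 5 can be sketched as a small wait-for graph. This is only an illustrative userspace model, not kernel code: the enum labels, `struct lock_state`, and the helper functions are hypothetical names introduced for this sketch, and each resource is simplified to having a single owner.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for this sketch; these are not kernel identifiers. */
enum task { TASK_A_FSYNC, TASK_B_FIEMAP, NUM_TASKS };
enum resource { RES_I_MMAP_LOCK, RES_EXTENT_RANGE, NUM_RESOURCES };

#define NO_TASK     (-1)
#define NO_RESOURCE (-1)

struct lock_state {
    int owner[NUM_RESOURCES]; /* task holding each resource, or NO_TASK */
    int waits_for[NUM_TASKS]; /* resource each task is blocked on, or NO_RESOURCE */
};

/*
 * Follow the chain "task -> resource it waits for -> owner of that resource"
 * and report whether it leads back to the starting task.
 */
static bool has_deadlock(const struct lock_state *s, int start)
{
    int t = start;

    assert(start >= 0 && start < NUM_TASKS);
    for (int hops = 0; hops < NUM_TASKS; hops++) {
        int r = s->waits_for[t];

        if (r == NO_RESOURCE)
            return false;   /* this task is not blocked */
        t = s->owner[r];
        if (t == NO_TASK)
            return false;   /* the resource is free */
        if (t == start)
            return true;    /* the chain closed into a cycle */
    }
    return false;
}

/* State once steps 3, 4 and 5 have all happened. */
static struct lock_state state_after_step5(void)
{
    struct lock_state s;

    s.owner[RES_EXTENT_RANGE]  = TASK_B_FIEMAP;    /* step 3: fiemap locks the range */
    s.owner[RES_I_MMAP_LOCK]   = TASK_A_FSYNC;     /* step 4: fsync holds it in write mode */
    s.waits_for[TASK_A_FSYNC]  = RES_EXTENT_RANGE; /* step 4: blocked in find_lock_delalloc_range() */
    s.waits_for[TASK_B_FIEMAP] = RES_I_MMAP_LOCK;  /* step 5: blocked in btrfs_page_mkwrite() */
    return s;
}

/* State after step 3 only: nobody is blocked yet. */
static struct lock_state state_after_step3(void)
{
    struct lock_state s;

    s.owner[RES_EXTENT_RANGE]  = TASK_B_FIEMAP;
    s.owner[RES_I_MMAP_LOCK]   = NO_TASK;
    s.waits_for[TASK_A_FSYNC]  = NO_RESOURCE;
    s.waits_for[TASK_B_FIEMAP] = NO_RESOURCE;
    return s;
}
```

In this model, `has_deadlock()` reports a cycle for either task in the state built by `state_after_step5()`, and no cycle in the state built by `state_after_step3()`, which matches the two hung tasks in the trace above.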
| |
| Fix this by taking the inode's lock (VFS lock) in shared mode when |
| entering fiemap. This effectively serializes fiemap with fsync (except the |
| most expensive part of fsync, the log sync), preventing this deadlock. |
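Why a shared VFS lock breaks the cycle can be sketched with a toy reader/writer lock in userspace C. This is a simplified model under stated assumptions, not the kernel patch: `struct toy_rwsem`, `struct toy_inode`, and every function below are hypothetical names for this sketch, and "try" operations stand in for calls that would block in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: many shared holders or one exclusive holder. Not the kernel's rw_semaphore. */
struct toy_rwsem {
    int readers;
    bool writer;
};

static bool toy_try_read(struct toy_rwsem *l)
{
    if (l->writer)
        return false;       /* a real reader would block here */
    l->readers++;
    return true;
}

static bool toy_try_write(struct toy_rwsem *l)
{
    if (l->writer || l->readers > 0)
        return false;       /* a real writer would block here */
    l->writer = true;
    return true;
}

static void toy_read_unlock(struct toy_rwsem *l)
{
    assert(l->readers > 0);
    l->readers--;
}

static void toy_write_unlock(struct toy_rwsem *l)
{
    assert(l->writer);
    l->writer = false;
}

struct toy_inode {
    struct toy_rwsem vfs_lock;    /* stands in for the inode's VFS lock */
    struct toy_rwsem i_mmap_lock; /* stands in for the i_mmap_lock semaphore */
};

/* fsync side: VFS lock first, then i_mmap_lock, both in write mode. */
static bool fsync_try_lock_inode(struct toy_inode *ino)
{
    if (!toy_try_write(&ino->vfs_lock))
        return false;       /* would wait here until fiemap finishes */
    if (!toy_try_write(&ino->i_mmap_lock)) {
        toy_write_unlock(&ino->vfs_lock);
        return false;
    }
    return true;
}

/* fiemap side: the page fault needs i_mmap_lock in read mode. */
static bool fiemap_fault_can_proceed(struct toy_inode *ino)
{
    if (!toy_try_read(&ino->i_mmap_lock))
        return false;
    toy_read_unlock(&ino->i_mmap_lock);
    return true;
}

/* Can fiemap's page fault progress after a concurrent fsync tries to lock the inode? */
static bool fault_progresses(bool with_fix)
{
    struct toy_inode ino = { { 0, false }, { 0, false } };

    if (with_fix)
        toy_try_read(&ino.vfs_lock);  /* the fix: shared VFS lock on entering fiemap */
    fsync_try_lock_inode(&ino);       /* concurrent fsync attempts its locking */
    return fiemap_fault_can_proceed(&ino);
}
```

In this model, `fault_progresses(false)` is false because fsync already holds `i_mmap_lock` in write mode when the fault arrives, while `fault_progresses(true)` is true because fsync stops at the VFS lock and never reaches `i_mmap_lock` while fiemap is running.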
| |
| The Linux kernel CVE team has assigned CVE-2023-52737 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Fixed in 6.1.13 with commit d8c594da79bc0244e610a70594e824a401802be1 |
| Fixed in 6.2 with commit 519b7e13b5ae8dd38da1e52275705343be6bb508 |
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2023-52737 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/extent_io.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/d8c594da79bc0244e610a70594e824a401802be1 |
| https://git.kernel.org/stable/c/519b7e13b5ae8dd38da1e52275705343be6bb508 |