| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-46734: btrfs: fix race between direct IO write and fsync when using same fd |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| btrfs: fix race between direct IO write and fsync when using same fd |
| |
| If two threads use the same file descriptor, one doing direct IO writes |
| while the other does fsync, there is a race that can result in either: |
| |
| 1) An fsync attempted without holding the inode's lock, triggering an |
| assertion failure when assertions are enabled; |
| |
| 2) An invalid memory access from the fsync task, because the file's private |
| points to memory allocated on the stack of the direct IO task and may be |
| used by the fsync task after that stack frame is gone. |
| |
| The race happens like this: |
| |
| 1) A user space program opens a file descriptor with O_DIRECT; |
| |
| 2) The program spawns 2 threads using libpthread for example; |
| |
| 3) One of the threads uses the file descriptor to do direct IO writes, |
| while the other calls fsync using the same file descriptor. |
| |
| 4) Call task A the thread doing direct IO writes and task B the thread |
| doing fsyncs; |
| |
| 5) Task A does a direct IO write, and at btrfs_direct_write() sets the |
| file's private to an on stack allocated private with the member |
| 'fsync_skip_inode_lock' set to true; |
| |
| 6) Task B enters btrfs_sync_file() and sees that there's a private |
| structure associated to the file which has 'fsync_skip_inode_lock' set |
| to true, so it skips locking the inode's VFS lock; |
| |
| 7) Task A completes the direct IO write, and resets the file's private to |
| NULL since it had no prior private and our private was stack allocated. |
| Then it unlocks the inode's VFS lock; |
| |
| 8) Task B enters btrfs_get_ordered_extents_for_logging(), then the |
| assertion that checks the inode's VFS lock is held fails, since task B |
| never locked it and task A has already unlocked it. |
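| |
| Steps 5 to 8 can be replayed in a small single-threaded user space model. |
| This is only a sketch: the struct and function names are hypothetical |
| stand-ins, not the btrfs definitions, and a plain bool stands in for the |
| inode's VFS lock, so contention on the lock is not modelled. |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (not the btrfs ones). */
struct model_private {
    bool fsync_skip_inode_lock;
};

struct model_file {
    struct model_private *private_data; /* visible to every user of the fd */
    bool inode_locked;                  /* models the inode's VFS lock */
};

/* Step 5: task A locks the inode and publishes its on-stack private. */
static void task_a_begin_write(struct model_file *f,
                               struct model_private *on_stack)
{
    assert(f->private_data == NULL);    /* the file had no prior private */
    f->inode_locked = true;
    on_stack->fsync_skip_inode_lock = true;
    f->private_data = on_stack;
}

/* Step 6: task B decides whether to lock based on the shared private.
 * Returns true if it skipped taking the lock. */
static bool task_b_enter_fsync(struct model_file *f)
{
    bool skip = f->private_data && f->private_data->fsync_skip_inode_lock;

    if (!skip)
        f->inode_locked = true;
    return skip;
}

/* Step 7: task A resets the private and unlocks the inode. */
static void task_a_end_write(struct model_file *f)
{
    f->private_data = NULL;
    f->inode_locked = false;
}

/* Step 8: the condition checked on task B's logging path. */
static bool task_b_lock_assertion_holds(const struct model_file *f)
{
    return f->inode_locked;
}

/* Replays steps 5 to 8 in order; returns true if the assertion would fail. */
static bool replay_race(void)
{
    struct model_file f = { .private_data = NULL, .inode_locked = false };
    struct model_private priv = { .fsync_skip_inode_lock = false };

    task_a_begin_write(&f, &priv);
    bool skipped = task_b_enter_fsync(&f);
    task_a_end_write(&f);
    return skipped && !task_b_lock_assertion_holds(&f);
}
```

| In this model the decision made in step 6 depends on state that task A |
| owns and later tears down, which is why the check in step 8 sees an |
| unlocked inode. |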
| |
| The stack trace produced is the following: |
| |
| assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983 |
| ------------[ cut here ]------------ |
| kernel BUG at fs/btrfs/ordered-data.c:983! |
| Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI |
| CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8 |
| Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020 |
| RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs] |
| Code: 50 d6 86 c0 e8 (...) |
| RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246 |
| RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000 |
| RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800 |
| RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38 |
| R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800 |
| R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000 |
| FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000 |
| CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 |
| CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0 |
| Call Trace: |
| <TASK> |
| ? __die_body.cold+0x14/0x24 |
| ? die+0x2e/0x50 |
| ? do_trap+0xca/0x110 |
| ? do_error_trap+0x6a/0x90 |
| ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] |
| ? exc_invalid_op+0x50/0x70 |
| ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] |
| ? asm_exc_invalid_op+0x1a/0x20 |
| ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] |
| ? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] |
| btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a] |
| ? __seccomp_filter+0x31d/0x4f0 |
| __x64_sys_fdatasync+0x4f/0x90 |
| do_syscall_64+0x82/0x160 |
| ? do_futex+0xcb/0x190 |
| ? __x64_sys_futex+0x10e/0x1d0 |
| ? switch_fpu_return+0x4f/0xd0 |
| ? syscall_exit_to_user_mode+0x72/0x220 |
| ? do_syscall_64+0x8e/0x160 |
| ? syscall_exit_to_user_mode+0x72/0x220 |
| ? do_syscall_64+0x8e/0x160 |
| ? syscall_exit_to_user_mode+0x72/0x220 |
| ? do_syscall_64+0x8e/0x160 |
| ? syscall_exit_to_user_mode+0x72/0x220 |
| ? do_syscall_64+0x8e/0x160 |
| entry_SYSCALL_64_after_hwframe+0x76/0x7e |
| |
| Another problem arises if task B grabs the private pointer and then uses |
| it after task A has finished: since the private was allocated on the stack |
| of task A, this results in an invalid memory access with a hard to predict |
| outcome. |
| |
| This issue, triggering the assertion, was observed with QEMU workloads by |
| two users, as reported in the Link tags of the fix commit. |
| |
| Fix this by not relying on a file's private to pass information to fsync |
| that it should skip locking the inode and instead pass this information |
| through a special value stored in current->journal_info. This is safe |
| because in the relevant section of the direct IO write path we are not |
| holding a transaction handle, so current->journal_info is NULL. |
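| |
| The idea can be sketched in user space, assuming the fsync in question is |
| the one invoked from within the write path itself. Here a thread-local |
| pointer stands in for current->journal_info, and DIO_WRITE_STUB is a |
| hypothetical name for the special value; neither the names nor the |
| functions below are taken from the kernel sources. |

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker: a non-NULL pointer value that is never dereferenced. */
#define DIO_WRITE_STUB ((void *)1)

/* Stand-in for current->journal_info: one slot per thread, so a value set
 * by one thread is never visible to another. */
static _Thread_local void *journal_info;

static bool inode_locked;   /* models the inode's VFS lock */
static int lock_calls;      /* how many times model_fsync() took the lock */

static void model_fsync(void)
{
    /* Only the thread that stored the marker can observe it here. */
    const bool skip_lock = (journal_info == DIO_WRITE_STUB);

    if (!skip_lock) {
        inode_locked = true;
        lock_calls++;
    }

    /* The sync work would run here; the lock is held on both paths. */
    assert(inode_locked);

    if (!skip_lock)
        inode_locked = false;
}

static void model_direct_write(void)
{
    inode_locked = true;

    /* No transaction handle is held in this section, so the slot is free. */
    assert(journal_info == NULL);
    journal_info = DIO_WRITE_STUB;
    model_fsync();          /* fsync issued by the writer itself */
    journal_info = NULL;

    inode_locked = false;
}
```

| Unlike a pointer stored in the file, the marker never leaves the task that |
| set it, so a concurrent fsync from another thread on the same file |
| descriptor takes the lock as usual. |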
| |
| The following C program triggers the issue: |
| |
| $ cat repro.c |
| /* Get the O_DIRECT definition. */ |
| #ifndef _GNU_SOURCE |
| #define _GNU_SOURCE |
| #endif |
| |
| #include <stdio.h> |
| #include <stdlib.h> |
| #include <unistd.h> |
| #include <stdint.h> |
| #include <fcntl.h> |
| #include <errno.h> |
| #include <string.h> |
| #include <pthread.h> |
| |
| static int fd; |
| |
| static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset) |
| { |
| const char *p = buf; |
| |
| while (count > 0) { |
| ssize_t ret; |
| |
| ret = pwrite(fd, p, count, offset); |
| if (ret < 0) { |
| if (errno == EINTR) |
| continue; |
| return ret; |
| } |
| /* Advance both the buffer and the file offset after a short write. */ |
| count -= ret; |
| p += ret; |
| offset += ret; |
| } |
| return 0; |
| } |
| |
| static void *fsync_loop(void *arg) |
| { |
| while (1) { |
| int ret; |
| |
| ret = fsync(fd); |
| if (ret != 0) { |
| perror("Fsync failed"); |
| exit(6); |
| } |
| } |
| } |
| |
| int main(int argc, char *argv[]) |
| { |
| long pagesize; |
| void *write_buf; |
| pthread_t fsyncer; |
| int ret; |
| |
| if (argc != 2) { |
| fprintf(stderr, "Use: %s <file path>\n", argv[0]); |
| return 1; |
| } |
| |
| fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666); |
| if (fd == -1) { |
| perror("Failed to open/create file"); |
| return 1; |
| } |
| |
| pagesize = sysconf(_SC_PAGE_SIZE); |
| if (pagesize == -1) { |
| perror("Failed to get page size"); |
| return 2; |
| } |
| |
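| ret = posix_memalign(&write_buf, pagesize, pagesize); |
| if (ret) { |
| /* posix_memalign() returns the error number and does not set errno. */ |
| fprintf(stderr, "Failed to allocate buffer: %s\n", strerror(ret)); |
| return 3; |
| } |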
| |
| ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL); |
| if (ret != 0) { |
| fprintf(stderr, "Failed to create fsync thread: %d\n", ret); |
| return 4; |
| } |
| |
| while (1) { |
| ret = do_write(fd, write_buf, pagesize, 0); |
| if (ret != 0) { |
| perror("Write failed"); |
| exit(5); |
| } |
| } |
| |
| return 0; |
| } |
| |
| $ mkfs.btrfs -f /dev/sdi |
| $ mount /dev/sdi /mnt/sdi |
| $ timeout 10 ./repro /mnt/sdi/foo |
| |
| Usually the race is triggered in less than 1 second. A test case for |
| fstests will follow soon. |
| |
| The Linux kernel CVE team has assigned CVE-2024-46734 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.15.165 with commit 4e17707035a65f6e5b2a4d987a308cf8ed8c5ad1 and fixed in 5.15.167 with commit d116a0b0e02f395cedfb8c725bd67480aa7c428c |
| Issue introduced in 6.1.105 with commit 6cae8d04d8b3d1ecfadcaa989e673f6f73349ed5 and fixed in 6.1.110 with commit cd3087582e4fa36e89be4e6f859e75a4400292b4 |
| Issue introduced in 6.6.46 with commit 0a108bde616a7017653385b5a12111015051a294 and fixed in 6.6.51 with commit 7b5595f33c3c273613b590892a578d78186bb400 |
| Issue introduced in 6.10.5 with commit 3831170f740685fddc8f6aa57a83ad0fef4711bf and fixed in 6.10.10 with commit 01681aa609b5f110502f56c4e3b2938efcf4a5bc |
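| |
| As a rough aid for reading the list above, the ranges can be encoded in a |
| small lookup. This is a sketch under the assumption that the kernel is an |
| unmodified stable release from one of the four listed series; the function |
| and type names are hypothetical, and distribution kernels with their own |
| backports are not covered. |

```c
#include <assert.h>
#include <stddef.h>

struct affected_range {
    int major, minor;
    int first_bad;    /* patch level that introduced the issue */
    int first_fixed;  /* patch level that carries the fix */
};

/* Taken from the "introduced in ... fixed in ..." lines above. */
static const struct affected_range ranges[] = {
    { 5, 15, 165, 167 },
    { 6, 1, 105, 110 },
    { 6, 6, 46, 51 },
    { 6, 10, 5, 10 },
};

/* Returns 1 if the release is in an affected range, 0 if it belongs to a
 * listed series but lies outside the range, and -1 if the series is not
 * listed (the table says nothing about it). */
static int stable_release_status(int major, int minor, int patch)
{
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        const struct affected_range *r = &ranges[i];

        if (r->major == major && r->minor == minor)
            return patch >= r->first_bad && patch < r->first_fixed;
    }
    return -1;
}

/* A few spot checks against the list above. */
static void spot_check(void)
{
    assert(stable_release_status(6, 10, 5) == 1);   /* first affected */
    assert(stable_release_status(6, 10, 10) == 0);  /* first fixed */
    assert(stable_release_status(6, 6, 45) == 0);   /* before the issue */
}
```

| A result of -1 means only that the series does not appear in the list, not |
| that it is safe; the official CVE entry remains the reference. |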
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-46734 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| fs/btrfs/ctree.h |
| fs/btrfs/direct-io.c |
| fs/btrfs/file.c |
| fs/btrfs/transaction.h |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/d116a0b0e02f395cedfb8c725bd67480aa7c428c |
| https://git.kernel.org/stable/c/cd3087582e4fa36e89be4e6f859e75a4400292b4 |
| https://git.kernel.org/stable/c/7b5595f33c3c273613b590892a578d78186bb400 |
| https://git.kernel.org/stable/c/01681aa609b5f110502f56c4e3b2938efcf4a5bc |
| https://git.kernel.org/stable/c/cd9253c23aedd61eb5ff11f37a36247cd46faf86 |