refs/heads/generic_647_dio_deadlock_fix - pub/scm/linux/kernel/git/fdmanana/linux

commit	de657e3fdce0a0b84d150f7e01aa815fcae4e365	[log] [tgz]
author	Filipe Manana <fdmanana@suse.com>	Wed Sep 08 11:01:07 2021 +0100
committer	Filipe Manana <fdmanana@suse.com>	Mon Oct 25 10:24:20 2021 +0100
tree	299b9475259733aa86972c08efd913dfbb1cb281
parent	226168f86064d8b08cd79aee00bde8e7faa886be [diff]
btrfs: fix deadlock due to page faults during direct IO reads and writes

If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO, we end up ending
in a deadlock. This is triggered by the new test case generic/647 from
fstests.

For a direct IO read we get a trace like this:

[  967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[  967.874161]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[  967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  967.875983] task:mmap-rw-fault   state:D stack:    0 pid:12176 ppid: 11884 flags:0x00000000
[  967.875992] Call Trace:
[  967.875999]  __schedule+0x3ca/0xe10
[  967.876015]  schedule+0x43/0xe0
[  967.876020]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[  967.876109]  ? do_wait_intr_irq+0xb0/0xb0
[  967.876118]  lock_extent_bits+0x37/0x90 [btrfs]
[  967.876150]  btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[  967.876184]  ? extent_readahead+0xa7/0x530 [btrfs]
[  967.876214]  extent_readahead+0x32d/0x530 [btrfs]
[  967.876253]  ? lru_cache_add+0x104/0x220
[  967.876255]  ? kvm_sched_clock_read+0x14/0x40
[  967.876258]  ? sched_clock_cpu+0xd/0x110
[  967.876263]  ? lock_release+0x155/0x4a0
[  967.876271]  read_pages+0x86/0x270
[  967.876274]  ? lru_cache_add+0x125/0x220
[  967.876281]  page_cache_ra_unbounded+0x1a3/0x220
[  967.876291]  filemap_fault+0x626/0xa20
[  967.876303]  __do_fault+0x36/0xf0
[  967.876308]  __handle_mm_fault+0x83f/0x15f0
[  967.876322]  handle_mm_fault+0x9e/0x260
[  967.876327]  __get_user_pages+0x204/0x620
[  967.876332]  ? get_user_pages_unlocked+0x69/0x340
[  967.876340]  get_user_pages_unlocked+0xd3/0x340
[  967.876349]  internal_get_user_pages_fast+0xbca/0xdc0
[  967.876366]  iov_iter_get_pages+0x8d/0x3a0
[  967.876374]  bio_iov_iter_get_pages+0x82/0x4a0
[  967.876379]  ? lock_release+0x155/0x4a0
[  967.876387]  iomap_dio_bio_actor+0x232/0x410
[  967.876396]  iomap_apply+0x12a/0x4a0
[  967.876398]  ? iomap_dio_rw+0x30/0x30
[  967.876414]  __iomap_dio_rw+0x29f/0x5e0
[  967.876415]  ? iomap_dio_rw+0x30/0x30
[  967.876420]  ? lock_acquired+0xf3/0x420
[  967.876429]  iomap_dio_rw+0xa/0x30
[  967.876431]  btrfs_file_read_iter+0x10b/0x140 [btrfs]
[  967.876460]  new_sync_read+0x118/0x1a0
[  967.876472]  vfs_read+0x128/0x1b0
[  967.876477]  __x64_sys_pread64+0x90/0xc0
[  967.876483]  do_syscall_64+0x3b/0xc0
[  967.876487]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  967.876490] RIP: 0033:0x7fb6f2c038d6
[  967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[  967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[  967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[  967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[  967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[  967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000

This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then after
iomap called the btrfs_dio_iomap_begin() callback, it triggers the page
faults that resulting in reading the pages, through the readahead callback
btrfs_readahead(), and through there we end to attempt to lock again the
same extent range (or a subrange of what we locked before), resulting in
the deadlock.

For a direct IO write, the scenario is a bit different, and it results in
trace like this:

[ 1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[ 1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[ 1330.350540]       Not tainted 5.14.0-rc7-btrfs-next-95 #1
[ 1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.351900] task:mmap-rw-fault   state:D stack:    0 pid:184017 ppid:183725 flags:0x00000000
[ 1330.351906] Call Trace:
[ 1330.351913]  __schedule+0x3ca/0xe10
[ 1330.351930]  schedule+0x43/0xe0
[ 1330.351935]  btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[ 1330.352020]  ? do_wait_intr_irq+0xb0/0xb0
[ 1330.352028]  btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[ 1330.352064]  ? extent_readahead+0xa7/0x530 [btrfs]
[ 1330.352094]  extent_readahead+0x32d/0x530 [btrfs]
[ 1330.352133]  ? lru_cache_add+0x104/0x220
[ 1330.352135]  ? kvm_sched_clock_read+0x14/0x40
[ 1330.352138]  ? sched_clock_cpu+0xd/0x110
[ 1330.352143]  ? lock_release+0x155/0x4a0
[ 1330.352151]  read_pages+0x86/0x270
[ 1330.352155]  ? lru_cache_add+0x125/0x220
[ 1330.352162]  page_cache_ra_unbounded+0x1a3/0x220
[ 1330.352172]  filemap_fault+0x626/0xa20
[ 1330.352176]  ? filemap_map_pages+0x18b/0x660
[ 1330.352184]  __do_fault+0x36/0xf0
[ 1330.352189]  __handle_mm_fault+0x1253/0x15f0
[ 1330.352203]  handle_mm_fault+0x9e/0x260
[ 1330.352208]  __get_user_pages+0x204/0x620
[ 1330.352212]  ? get_user_pages_unlocked+0x69/0x340
[ 1330.352220]  get_user_pages_unlocked+0xd3/0x340
[ 1330.352229]  internal_get_user_pages_fast+0xbca/0xdc0
[ 1330.352246]  iov_iter_get_pages+0x8d/0x3a0
[ 1330.352254]  bio_iov_iter_get_pages+0x82/0x4a0
[ 1330.352259]  ? lock_release+0x155/0x4a0
[ 1330.352266]  iomap_dio_bio_actor+0x232/0x410
[ 1330.352275]  iomap_apply+0x12a/0x4a0
[ 1330.352278]  ? iomap_dio_rw+0x30/0x30
[ 1330.352292]  __iomap_dio_rw+0x29f/0x5e0
[ 1330.352294]  ? iomap_dio_rw+0x30/0x30
[ 1330.352306]  btrfs_file_write_iter+0x238/0x480 [btrfs]
[ 1330.352339]  new_sync_write+0x11f/0x1b0
[ 1330.352344]  ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[ 1330.352354]  vfs_write+0x292/0x3c0
[ 1330.352359]  __x64_sys_pwrite64+0x90/0xc0
[ 1330.352365]  do_syscall_64+0x3b/0xc0
[ 1330.352369]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1330.352372] RIP: 0033:0x7f4b0a580986
[ 1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[ 1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[ 1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[ 1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[ 1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[ 1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).

Fix this by setting the nofault attribute of the given iov_iter and retry
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely, this is because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the momemnt, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.

This depends on the iov_iter and iomap changes done by a recent patchset
from Andreas Gruenbacher, which is not yet merged to Linus' tree at the
moment of this writing. The cover letter has the following subject:

   "[PATCH v8 00/19] gfs2: Fix mmap + page fault deadlocks"

The thread can be found at:

https://lore.kernel.org/linux-fsdevel/20211019134204.3382645-1-agruenba@redhat.com/

Fixing these issues could be done without the iov_iter and iomap changes
introduced in that patchset, however it would be much more complex due to
the need of reordering some operations for writes and having to be able
to pass some state through nested and deep call chains, which would be
particularly cumbersome for reads - for example make the readahead and
the endio handlers for page reads be aware we are in a direct IO read
context and know which inode and extent range we locked before.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
fs/btrfs/file.c[diff]
1 file changed
tree: 299b9475259733aa86972c08efd913dfbb1cb281