| From: Dave Hansen <dave.hansen@linux.intel.com> |
| Subject: filemap: move prefaulting out of hot write path |
| Date: Fri, 28 Feb 2025 12:37:22 -0800 |
| |
| There is a generic anti-pattern that shows up in the VFS and several |
| filesystems where the hot write paths touch userspace twice when they |
| could get away with doing it once. |
| |
| Dave Chinner suggested that they should all be fixed up[1]. I agree[2]. |
| But, the series to do that fixup spans a bunch of filesystems and a lot of |
| people. This patch fixes common code that absolutely everyone uses. It |
| has measurable performance benefits[3]. |
| |
| I think this patch can go in and not be held up by the others. |
| |
| I will post them separately to their separate maintainers for |
| consideration. But, honestly, I'm not going to lose any sleep if |
| the maintainers don't pick those up. |
| |
| 1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/ |
| 2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/ |
| 3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ |
| |
| |
| This patch: |
| |
| There is a bit of a sordid history here. I originally wrote |
| 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages") |
| to fix a performance issue that showed up on early SMAP hardware. |
| But that was reverted with 00a3d660cbac because it exposed an |
| underlying filesystem bug. |
| |
| This is a reimplementation of the original commit along with some |
| simplification and comment improvements. |
| |
| The basic problem is that the generic write path has two userspace |
| accesses: one to prefault the write source buffer and then another to |
| perform the actual write. On x86, this means an extra STAC/CLAC pair. |
| These are relatively expensive instructions because they function as |
| barriers. |
| |
| Keep the prefaulting behavior but move it into the slow path that gets |
| run when the write did not make any progress. This avoids livelocks |
| that can happen when the write's source and destination target the |
| same folio. Contrary to the existing comments, the fault-in does not |
| prevent deadlocks. That's accomplished by using an "atomic" usercopy |
| that disables page faults. |
| |
| The end result is that the generic write fast path now touches |
| userspace once instead of twice. |
| |
| 0day has shown some improvements on a couple of microbenchmarks: |
| |
| https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ |
| |
| Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com |
| Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> |
| Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/ |
| Cc: Ted Ts'o <tytso@mit.edu> |
| Cc: Matthew Wilcox <willy@infradead.org> |
| Cc: Mateusz Guzik <mjguzik@gmail.com> |
| Cc: Dave Chinner <david@fromorbit.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/filemap.c | 27 ++++++++++++++++----------- |
| 1 file changed, 16 insertions(+), 11 deletions(-) |
| |
| --- a/mm/filemap.c~filemap-move-prefaulting-out-of-hot-write-path |
| +++ a/mm/filemap.c |
| @@ -4169,17 +4169,6 @@ retry: |
| bytes = min(chunk - offset, bytes); |
| balance_dirty_pages_ratelimited(mapping); |
| |
| - /* |
| - * Bring in the user page that we will copy from _first_. |
| - * Otherwise there's a nasty deadlock on copying from the |
| - * same page as we're writing to, without it being marked |
| - * up-to-date. |
| - */ |
| - if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) { |
| - status = -EFAULT; |
| - break; |
| - } |
| - |
| if (fatal_signal_pending(current)) { |
| status = -EINTR; |
| break; |
| @@ -4197,6 +4186,12 @@ retry: |
| if (mapping_writably_mapped(mapping)) |
| flush_dcache_folio(folio); |
| |
| + /* |
| + * Faults here on mmap()s can recurse into arbitrary |
| + * filesystem code. Lots of locks are held that can |
| + * deadlock. Use an atomic copy to avoid deadlocking |
| + * in page fault handling. |
| + */ |
| copied = copy_folio_from_iter_atomic(folio, offset, bytes, i); |
| flush_dcache_folio(folio); |
| |
| @@ -4222,6 +4217,16 @@ retry: |
| bytes = copied; |
| goto retry; |
| } |
| + |
| + /* |
| + * 'folio' is now unlocked and faults on it can be |
| + * handled. Ensure forward progress by trying to |
| + * fault it in now. |
| + */ |
| + if (fault_in_iov_iter_readable(i, bytes) == bytes) { |
| + status = -EFAULT; |
| + break; |
| + } |
| } else { |
| pos += status; |
| written += status; |
| _ |