patches/old/filemap-move-prefaulting-out-of-hot-write-path.patch - pub/scm/linux/kernel/git/akpm/25-new - Git at Google

 From: Dave Hansen <dave.hansen@linux.intel.com>
 Subject: filemap: move prefaulting out of hot write path
 Date: Fri, 28 Feb 2025 12:37:22 -0800

 There is a generic anti-pattern that shows up in the VFS and several
 filesystems where the hot write paths touch userspace twice when they
 could get away with doing it once.

 Dave Chinner suggested that they should all be fixed up[1].  I agree[2].
 But, the series to do that fixup spans a bunch of filesystems and a lot of
 people.  This patch fixes common code that absolutely everyone uses.  It
 has measurable performance benefits[3].

 I think this patch can go in and not be held up by the others.

 I will post them separately to their separate maintainers for
 consideration. But, honestly, I'm not going to lose any sleep if
 the maintainers don't pick those up.

 1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/
 2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/
 3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/


 This patch:

 There is a bit of a sordid history here. I originally wrote
 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages")
 to fix a performance issue that showed up on early SMAP hardware.
 But that was reverted with 00a3d660cbac because it exposed an
 underlying filesystem bug.

 This is a reimplementation of the original commit along with some
 simplification and comment improvements.

 The basic problem is that the generic write path has two userspace
 accesses: one to prefault the write source buffer and then another to
 perform the actual write. On x86, this means an extra STAC/CLAC pair.
 These are relatively expensive instructions because they function as
 barriers.

 Keep the prefaulting behavior but move it into the slow path that gets
 run when the write did not make any progress. This avoids livelocks
 that can happen when the write's source and destination target the
 same folio. Contrary to the existing comments, the fault-in does not
 prevent deadlocks. That's accomplished by using an "atomic" usercopy
 that disables page faults.

 The end result is that the generic write fast path now touches
 userspace once instead of twice.

 0day has shown some improvements on a couple of microbenchmarks:

 	https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/

 Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com
 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
 Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/
 Cc: Ted Ts'o <tytso@mit.edu>
 Cc: Matthew Wilcox <willy@infradead.org>
 Cc: Mateusz Guzik <mjguzik@gmail.com>
 Cc: Dave Chinner <david@fromorbit.com>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---

  mm/filemap.c |   27 ++++++++++++++++-----------
  1 file changed, 16 insertions(+), 11 deletions(-)

 --- a/mm/filemap.c~filemap-move-prefaulting-out-of-hot-write-path
 +++ a/mm/filemap.c
 @@ -4169,17 +4169,6 @@ retry:
  		bytes = min(chunk - offset, bytes);
  		balance_dirty_pages_ratelimited(mapping);

 -		/*
 -		 * Bring in the user page that we will copy from _first_.
 -		 * Otherwise there's a nasty deadlock on copying from the
 -		 * same page as we're writing to, without it being marked
 -		 * up-to-date.
 -		 */
 -		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
 -			status = -EFAULT;
 -			break;
 -		}
 -
  		if (fatal_signal_pending(current)) {
  			status = -EINTR;
  			break;
 @@ -4197,6 +4186,12 @@ retry:
  		if (mapping_writably_mapped(mapping))
  			flush_dcache_folio(folio);

 +		/*
 +		 * Faults here on mmap()s can recurse into arbitrary
 +		 * filesystem code. Lots of locks are held that can
 +		 * deadlock. Use an atomic copy to avoid deadlocking
 +		 * in page fault handling.
 +		 */
  		copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
  		flush_dcache_folio(folio);

 @@ -4222,6 +4217,16 @@ retry:
  				bytes = copied;
  				goto retry;
  			}
 +
 +			/*
 +			 * 'folio' is now unlocked and faults on it can be
 +			 * handled. Ensure forward progress by trying to
 +			 * fault it in now.
 +			 */
 +			if (fault_in_iov_iter_readable(i, bytes) == bytes) {
 +				status = -EFAULT;
 +				break;
 +			}
  		} else {
  			pos += status;
  			written += status;
 _
	From: Dave Hansen <dave.hansen@linux.intel.com>
	Subject: filemap: move prefaulting out of hot write path
	Date: Fri, 28 Feb 2025 12:37:22 -0800

	There is a generic anti-pattern that shows up in the VFS and several
	filesystems where the hot write paths touch userspace twice when they
	could get away with doing it once.

	Dave Chinner suggested that they should all be fixed up[1]. I agree[2].
	But, the series to do that fixup spans a bunch of filesystems and a lot of
	people. This patch fixes common code that absolutely everyone uses. It
	has measurable performance benefits[3].

	I think this patch can go in and not be held up by the others.

	I will post them separately to their separate maintainers for
	consideration. But, honestly, I'm not going to lose any sleep if
	the maintainers don't pick those up.

	1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/
	2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/
	3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/


	This patch:

	There is a bit of a sordid history here. I originally wrote
	998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages")
	to fix a performance issue that showed up on early SMAP hardware.
	But that was reverted with 00a3d660cbac because it exposed an
	underlying filesystem bug.

	This is a reimplementation of the original commit along with some
	simplification and comment improvements.

	The basic problem is that the generic write path has two userspace
	accesses: one to prefault the write source buffer and then another to
	perform the actual write. On x86, this means an extra STAC/CLAC pair.
	These are relatively expensive instructions because they function as
	barriers.

	Keep the prefaulting behavior but move it into the slow path that gets
	run when the write did not make any progress. This avoids livelocks
	that can happen when the write's source and destination target the
	same folio. Contrary to the existing comments, the fault-in does not
	prevent deadlocks. That's accomplished by using an "atomic" usercopy
	that disables page faults.

	The end result is that the generic write fast path now touches
	userspace once instead of twice.

	0day has shown some improvements on a couple of microbenchmarks:

	https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/

	Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com
	Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
	Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/
	Cc: Ted Ts'o <tytso@mit.edu>
	Cc: Matthew Wilcox <willy@infradead.org>
	Cc: Mateusz Guzik <mjguzik@gmail.com>
	Cc: Dave Chinner <david@fromorbit.com>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	---

	mm/filemap.c \| 27 ++++++++++++++++-----------
	1 file changed, 16 insertions(+), 11 deletions(-)

	--- a/mm/filemap.c~filemap-move-prefaulting-out-of-hot-write-path
	+++ a/mm/filemap.c
	@@ -4169,17 +4169,6 @@ retry:
	bytes = min(chunk - offset, bytes);
	balance_dirty_pages_ratelimited(mapping);

	- /*
	- * Bring in the user page that we will copy from _first_.
	- * Otherwise there's a nasty deadlock on copying from the
	- * same page as we're writing to, without it being marked
	- * up-to-date.
	- */
	- if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
	- status = -EFAULT;
	- break;
	- }
	-
	if (fatal_signal_pending(current)) {
	status = -EINTR;
	break;
	@@ -4197,6 +4186,12 @@ retry:
	if (mapping_writably_mapped(mapping))
	flush_dcache_folio(folio);

	+ /*
	+ * Faults here on mmap()s can recurse into arbitrary
	+ * filesystem code. Lots of locks are held that can
	+ * deadlock. Use an atomic copy to avoid deadlocking
	+ * in page fault handling.
	+ */
	copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
	flush_dcache_folio(folio);

	@@ -4222,6 +4217,16 @@ retry:
	bytes = copied;
	goto retry;
	}
	+
	+ /*
	+ * 'folio' is now unlocked and faults on it can be
	+ * handled. Ensure forward progress by trying to
	+ * fault it in now.
	+ */
	+ if (fault_in_iov_iter_readable(i, bytes) == bytes) {
	+ status = -EFAULT;
	+ break;
	+ }
	} else {
	pos += status;
	written += status;
	_