 From: David Hildenbrand <david@redhat.com>
 Subject: mm/userfaultfd: don't place zeropages when zeropages are disallowed
 Date: Thu, 21 Mar 2024 22:59:53 +0100

 Patch series "s390/mm: shared zeropage + KVM fix and optimization".

 This series fixes one issue with uffd + shared zeropages on s390x and
 optimizes "ordinary" KVM guests to make use of shared zeropages again.

 userfaultfd could currently end up mapping shared zeropages into processes
 that forbid shared zeropages.  This only applies to s390x, relevant for
 handling PV guests and guests that use storage keys correctly.  Fix it by
 placing a zeroed folio instead of the shared zeropage during
 UFFDIO_ZEROPAGE.

 I stumbled over this issue while looking into a customer scenario that
 is using:

 (1) Memory ballooning for dynamic resizing. Start a VM with, say, 100 GiB
     and inflate the balloon during boot to 60 GiB. The VM has ~40 GiB
     available and additional memory can be "fake hotplugged" to the VM
     later on demand by deflating the balloon. Actual memory overcommit is
     not desired, so physical memory would only be moved between VMs.

 (2) Live migration of VMs between sites to evacuate servers in case of
     emergency.

 Without the shared zeropage, during (2), the VM would suddenly consume 100
 GiB on the migration source and destination.  On the migration source,
 where we don't expect memory overcommit, we could easily end up crashing
 the VM during migration.

 Independent of that, memory handed back to the hypervisor using "free page
 reporting" would end up consuming actual memory after the migration on the
 destination, not getting freed up until reused+freed again.

 While there might be ways to optimize parts of this in QEMU, we really
 should just support the shared zeropage again for ordinary VMs.

 We only expect legacy guests to make use of storage keys, so let's handle
 zeropages again when enabling storage keys or when enabling PV.  To not
 break userfaultfd like we did in the past, don't zap the shared zeropages,
 but instead trigger unsharing faults, just like we do for unsharing KSM
 pages in break_ksm().

 Unsharing faults will simply replace the shared zeropage by a zeroed
 anonymous folio.  We can already trigger the same fault path using GUP,
 when trying to long-term pin a shared zeropage, but also when unmerging
 KSM-placed zeropages, so this is nothing new.


 This patch (of 2):

 s390x must disable shared zeropages for processes running VMs, because the
 VMs could end up making use of "storage keys" or protected virtualization,
 which are incompatible with shared zeropages.

 Yet, with userfaultfd it is possible to insert shared zeropages into such
 processes.  Let's fall back to simply allocating a fresh zeroed anonymous
 folio and inserting that instead.
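
 The fallback decision can be illustrated with a small user-space model.
 This is only a sketch of the control flow, not kernel code: every model_*
 name and MODEL_PAGE_SIZE is hypothetical, a static zero-filled buffer
 stands in for the shared zeropage, and calloc() stands in for allocating
 a zeroed anonymous folio.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096	/* hypothetical page size for the model */

/* Hypothetical stand-in for the kernel's single shared zeropage. */
static const unsigned char model_shared_zeropage[MODEL_PAGE_SIZE];

/* Hypothetical stand-in for an mm; models only the "forbids zeropage" state. */
struct model_mm {
	bool forbids_zeropage;
};

/* Models what ends up backing one page after a zeropage request. */
struct model_mapping {
	const unsigned char *page;	/* zero-filled memory behind the address */
	bool is_private;		/* true if allocated just for this mapping */
};

/*
 * Models the dispatch: if the mm forbids the shared zeropage, hand out a
 * freshly zeroed private page; otherwise map the shared zeropage. Either
 * way the request succeeds and the caller reads zeroes.
 */
static int model_fill_zeropage(const struct model_mm *mm,
			       struct model_mapping *out)
{
	assert(mm && out);

	if (mm->forbids_zeropage) {
		unsigned char *page = calloc(1, MODEL_PAGE_SIZE);

		if (!page)
			return -ENOMEM;
		out->page = page;
		out->is_private = true;
		return 0;
	}

	out->page = model_shared_zeropage;
	out->is_private = false;
	return 0;
}

/* Releases a mapping; only private pages own memory that must be freed. */
static void model_unmap(struct model_mapping *m)
{
	if (m->is_private)
		free((void *)m->page);
	m->page = NULL;
	m->is_private = false;
}
```

 In this model, a caller cannot tell the two cases apart by reading the
 page; only the backing memory differs, which is the property the fallback
 relies on.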

 mm_forbids_zeropage() was introduced in commit 593befa6ab74 ("mm:
 introduce mm_forbids_zeropage function"), briefly before userfaultfd went
 upstream.

 Note that we don't want to fail the UFFDIO_ZEROPAGE request like we do for
 hugetlb; that would be rather unexpected.  Further, we also cannot really
 indicate "not supported" to user space ahead of time: it could be that
 the MM disallows zeropages after userfaultfd was already registered.

 Link: https://lkml.kernel.org/r/20240321215954.177730-1-david@redhat.com
 Link: https://lkml.kernel.org/r/20240321215954.177730-2-david@redhat.com
 Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
 Signed-off-by: David Hildenbrand <david@redhat.com>
 Reviewed-by: Peter Xu <peterx@redhat.com>
 Cc: Alexander Gordeev <agordeev@linux.ibm.com>
 Cc: Andrea Arcangeli <aarcange@redhat.com>
 Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
 Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
 Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
 Cc: Heiko Carstens <hca@linux.ibm.com>
 Cc: Janosch Frank <frankja@linux.ibm.com>
 Cc: Sven Schnelle <svens@linux.ibm.com>
 Cc: Vasily Gorbik <gor@linux.ibm.com>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---

  mm/userfaultfd.c |   35 +++++++++++++++++++++++++++++++++++
  1 file changed, 35 insertions(+)

 --- a/mm/userfaultfd.c~mm-userfaultfd-dont-place-zeropages-when-zeropages-are-disallowed
 +++ a/mm/userfaultfd.c
 @@ -316,6 +316,38 @@ out_release:
  	goto out;
  }

 +static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
 +		 struct vm_area_struct *dst_vma, unsigned long dst_addr)
 +{
 +	struct folio *folio;
 +	int ret;
 +
 +	folio = vma_alloc_zeroed_movable_folio(dst_vma, dst_addr);
 +	if (!folio)
 +		return -ENOMEM;
 +
 +	ret = -ENOMEM;
 +	if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
 +		goto out_put;
 +
 +	/*
 +	 * The memory barrier inside __folio_mark_uptodate makes sure that
 +	 * preceding stores to the page contents become visible before
 +	 * the set_pte_at() write.
 +	 */
 +	__folio_mark_uptodate(folio);
 +
 +	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
 +				       &folio->page, true, 0);
 +	if (ret)
 +		goto out_put;
 +
 +	return 0;
 +out_put:
 +	folio_put(folio);
 +	return ret;
 +}
 +
  static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
  				     struct vm_area_struct *dst_vma,
  				     unsigned long dst_addr)
 @@ -324,6 +356,9 @@ static int mfill_atomic_pte_zeropage(pmd
  	spinlock_t *ptl;
  	int ret;

 +	if (mm_forbids_zeropage(dst_vma->vm_mm))
 +		return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr);
 +
  	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
  					 dst_vma->vm_page_prot));
  	ret = -EAGAIN;
 _