releases/4.9.85/mm-introduce-get_user_pages_longterm.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From foo@baz Mon Feb 26 20:55:53 CET 2018
 From: Dan Williams <dan.j.williams@intel.com>
 Date: Fri, 23 Feb 2018 14:05:49 -0800
 Subject: mm: introduce get_user_pages_longterm
 To: gregkh@linuxfoundation.org
 Cc: Jan Kara <jack@suse.cz>, Joonyoung Shim <jy0922.shim@samsung.com>, linux-kernel@vger.kernel.org, Seung-Woo Kim <sw0312.kim@samsung.com>, Doug Ledford <dledford@redhat.com>, stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Inki Dae <inki.dae@samsung.com>, Jeff Moyer <jmoyer@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>, Mel Gorman <mgorman@suse.de>, Andrew Morton <akpm@linux-foundation.org>, Ross Zwisler <ross.zwisler@linux.intel.com>, Kyungmin Park <kyungmin.park@samsung.com>, Sean Hefty <sean.hefty@intel.com>, Mauro Carvalho Chehab <mchehab@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Hal Rosenstock <hal.rosenstock@gmail.com>, Vlastimil Babka <vbabka@suse.cz>
 Message-ID: <151942354920.21775.1595898555475851190.stgit@dwillia2-desk3.amr.corp.intel.com>

 From: Dan Williams <dan.j.williams@intel.com>

 commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

 Patch series "introduce get_user_pages_longterm()", v2.

 Here is a new get_user_pages api for cases where a driver intends to
 keep an elevated page count indefinitely.  This is distinct from usages
 like iov_iter_get_pages where the elevated page counts are transient.
 The iov_iter_get_pages cases immediately turn around and submit the
 pages to a device driver which will put_page when the i/o operation
 completes (under kernel control).

 In the longterm case userspace is responsible for dropping the page
 reference at some undefined point in the future.  This is untenable for
 the filesystem-dax case where the filesystem is in control of the lifetime
 of the block / page and needs reasonable limits on how long it can wait
 for pages in a mapping to become idle.

 Fixing filesystems to actually wait for dax pages to be idle before
 blocks from a truncate/hole-punch operation are repurposed is saved for
 a later patch series.

 Also, allowing longterm registration of dax mappings is a future patch
 series that introduces a "map with lease" semantic where the kernel can
 revoke a lease and force userspace to drop its page references.

 I have also tagged these for -stable to purposely break cases that might
 assume that longterm memory registrations for filesystem-dax mappings
 were supported by the kernel.  The behavior regression this policy
 change implies is one of the reasons we maintain the "dax enabled.
 Warning: EXPERIMENTAL, use at your own risk" notification when mounting
 a filesystem in dax mode.

 It is worth noting the device-dax interface does not suffer the same
 constraints since it does not support file space management operations
 like hole-punch.

 This patch (of 4):

 Until there is a solution to the dma-to-dax vs truncate problem it is
 not safe to allow long standing memory registrations against
 filesystem-dax vmas.  Device-dax vmas do not have this problem and are
 explicitly allowed.

 This is temporary until a "memory registration with layout-lease"
 mechanism can be implemented for the affected sub-systems (RDMA and
 V4L2).

 [akpm@linux-foundation.org: use kcalloc()]
 Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
 Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
 Signed-off-by: Dan Williams <dan.j.williams@intel.com>
 Suggested-by: Christoph Hellwig <hch@lst.de>
 Cc: Doug Ledford <dledford@redhat.com>
 Cc: Hal Rosenstock <hal.rosenstock@gmail.com>
 Cc: Inki Dae <inki.dae@samsung.com>
 Cc: Jan Kara <jack@suse.cz>
 Cc: Jason Gunthorpe <jgg@mellanox.com>
 Cc: Jeff Moyer <jmoyer@redhat.com>
 Cc: Joonyoung Shim <jy0922.shim@samsung.com>
 Cc: Kyungmin Park <kyungmin.park@samsung.com>
 Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
 Cc: Mel Gorman <mgorman@suse.de>
 Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
 Cc: Sean Hefty <sean.hefty@intel.com>
 Cc: Seung-Woo Kim <sw0312.kim@samsung.com>
 Cc: Vlastimil Babka <vbabka@suse.cz>
 Cc: <stable@vger.kernel.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 ---
  include/linux/dax.h |    5 ----
  include/linux/fs.h  |   20 ++++++++++++++++
  include/linux/mm.h  |   13 ++++++++++
  mm/gup.c            |   64 ++++++++++++++++++++++++++++++++++++++++++++++++++++
  4 files changed, 97 insertions(+), 5 deletions(-)

 --- a/include/linux/dax.h
 +++ b/include/linux/dax.h
 @@ -61,11 +61,6 @@ static inline int dax_pmd_fault(struct v
  int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
  #define dax_mkwrite(vma, vmf, gb)	dax_fault(vma, vmf, gb)

 -static inline bool vma_is_dax(struct vm_area_struct *vma)
 -{
 -	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 -}
 -
  static inline bool dax_mapping(struct address_space *mapping)
  {
  	return mapping->host && IS_DAX(mapping->host);
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -18,6 +18,7 @@
  #include <linux/bug.h>
  #include <linux/mutex.h>
  #include <linux/rwsem.h>
 +#include <linux/mm_types.h>
  #include <linux/capability.h>
  #include <linux/semaphore.h>
  #include <linux/fiemap.h>
 @@ -3033,6 +3034,25 @@ static inline bool io_is_direct(struct f
  	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
  }

 +static inline bool vma_is_dax(struct vm_area_struct *vma)
 +{
 +	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 +}
 +
 +static inline bool vma_is_fsdax(struct vm_area_struct *vma)
 +{
 +	struct inode *inode;
 +
 +	if (!vma->vm_file)
 +		return false;
 +	if (!vma_is_dax(vma))
 +		return false;
 +	inode = file_inode(vma->vm_file);
 +	if (S_ISCHR(inode->i_mode))
 +		return false; /* device-dax */
 +	return true;
 +}
 +
  static inline int iocb_flags(struct file *file)
  {
  	int res = 0;
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1288,6 +1288,19 @@ long __get_user_pages_unlocked(struct ta
  			       struct page **pages, unsigned int gup_flags);
  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
  		    struct page **pages, unsigned int gup_flags);
 +#ifdef CONFIG_FS_DAX
 +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
 +			    unsigned int gup_flags, struct page **pages,
 +			    struct vm_area_struct **vmas);
 +#else
 +static inline long get_user_pages_longterm(unsigned long start,
 +		unsigned long nr_pages, unsigned int gup_flags,
 +		struct page **pages, struct vm_area_struct **vmas)
 +{
 +	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
 +}
 +#endif /* CONFIG_FS_DAX */
 +
  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
  			struct page **pages);

 --- a/mm/gup.c
 +++ b/mm/gup.c
 @@ -982,6 +982,70 @@ long get_user_pages(unsigned long start,
  }
  EXPORT_SYMBOL(get_user_pages);

 +#ifdef CONFIG_FS_DAX
 +/*
 + * This is the same as get_user_pages() in that it assumes we are
 + * operating on the current task's mm, but it goes further to validate
 + * that the vmas associated with the address range are suitable for
 + * longterm elevated page reference counts. For example, filesystem-dax
 + * mappings are subject to the lifetime enforced by the filesystem and
 + * we need guarantees that longterm users like RDMA and V4L2 only
 + * establish mappings that have a kernel enforced revocation mechanism.
 + *
 + * "longterm" == userspace controlled elevated page count lifetime.
 + * Contrast this to iov_iter_get_pages() usages which are transient.
 + */
 +long get_user_pages_longterm(unsigned long start, unsigned long nr_pages,
 +		unsigned int gup_flags, struct page **pages,
 +		struct vm_area_struct **vmas_arg)
 +{
 +	struct vm_area_struct **vmas = vmas_arg;
 +	struct vm_area_struct *vma_prev = NULL;
 +	long rc, i;
 +
 +	if (!pages)
 +		return -EINVAL;
 +
 +	if (!vmas) {
 +		vmas = kcalloc(nr_pages, sizeof(struct vm_area_struct *),
 +			       GFP_KERNEL);
 +		if (!vmas)
 +			return -ENOMEM;
 +	}
 +
 +	rc = get_user_pages(start, nr_pages, gup_flags, pages, vmas);
 +
 +	for (i = 0; i < rc; i++) {
 +		struct vm_area_struct *vma = vmas[i];
 +
 +		if (vma == vma_prev)
 +			continue;
 +
 +		vma_prev = vma;
 +
 +		if (vma_is_fsdax(vma))
 +			break;
 +	}
 +
 +	/*
 +	 * Either get_user_pages() failed, or the vma validation
 +	 * succeeded, in either case we don't need to put_page() before
 +	 * returning.
 +	 */
 +	if (i >= rc)
 +		goto out;
 +
 +	for (i = 0; i < rc; i++)
 +		put_page(pages[i]);
 +	rc = -EOPNOTSUPP;
 +out:
 +	if (vmas != vmas_arg)
 +		kfree(vmas);
 +	return rc;
 +}
 +EXPORT_SYMBOL(get_user_pages_longterm);
 +#endif /* CONFIG_FS_DAX */
 +
  /**
   * populate_vma_page_range() -  populate a range of pages in the vma.
   * @vma:   target vma
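
Reviewer note (not part of the patch): the control flow that
get_user_pages_longterm() adds in mm/gup.c can be modelled in userspace.
This is a rough sketch under stated assumptions, not kernel code:
struct model_vma, struct model_page, model_gup(), model_gup_longterm() and
model_demo() are hypothetical stand-ins, reference counts are plain
integers, and the model always allocates its own vma array.  Note that
calloc() may return NULL for a zero-sized request, so the model is only
exercised with nr_pages > 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct model_vma { bool fsdax; };
struct model_page { int refcount; struct model_vma *vma; };

/* Model of get_user_pages(): take a reference on each page, report its vma. */
static long model_gup(struct model_page *range, long nr_pages,
		      struct model_page **pages, struct model_vma **vmas)
{
	for (long i = 0; i < nr_pages; i++) {
		range[i].refcount++;
		pages[i] = &range[i];
		vmas[i] = range[i].vma;
	}
	return nr_pages;
}

/* Model of the longterm wrapper: pin, scan the vmas, unpin all on fsdax. */
static long model_gup_longterm(struct model_page *range, long nr_pages,
			       struct model_page **pages)
{
	struct model_vma **vmas, *vma_prev = NULL;
	long rc, i;

	if (!pages)
		return -EINVAL;
	vmas = calloc((size_t)nr_pages, sizeof(*vmas));
	if (!vmas)
		return -ENOMEM;

	rc = model_gup(range, nr_pages, pages, vmas);

	for (i = 0; i < rc; i++) {
		if (vmas[i] == vma_prev)
			continue;	/* same vma as the previous page */
		vma_prev = vmas[i];
		if (vma_prev->fsdax)
			break;		/* not suitable for a longterm pin */
	}

	if (i < rc) {
		/* A filesystem-dax vma was found: drop every reference taken. */
		for (i = 0; i < rc; i++)
			pages[i]->refcount--;
		rc = -EOPNOTSUPP;
	}
	free(vmas);
	return rc;
}

/* Usage: two pages in an ordinary vma, a third in a filesystem-dax vma. */
static void model_demo(void)
{
	struct model_vma anon = { .fsdax = false }, dax = { .fsdax = true };
	struct model_page range[3] = { { 0, &anon }, { 0, &anon }, { 0, &dax } };
	struct model_page *pages[3];

	assert(model_gup_longterm(range, 2, pages) == 2);
	assert(model_gup_longterm(range, 3, pages) == -EOPNOTSUPP);
	/* The failed call left no extra references behind. */
	assert(range[0].refcount == 1 && range[2].refcount == 0);
}
```

The point of the model is the all-or-nothing behaviour: one filesystem-dax
vma anywhere in the range fails the whole request and releases every page
that was pinned for it.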