From: Dan Williams <dan.j.williams@intel.com>
Subject: fsdax: wait on @page not @page->_refcount
Date: Fri, 14 Oct 2022 16:57:02 -0700

Patch series "Fix the DAX-gup mistake", v3.

ZONE_DEVICE was created to allow for get_user_pages() on DAX mappings. It
has since grown other users, but the main motivation was gup and
synchronizing device shutdown events with pinned pages. Memory device
shutdown triggers driver ->remove(), and ->remove() is not allowed to
fail, so ZONE_DEVICE needed a mechanism to stop new page references from
being taken and to wait for existing references to drain before allowing
device removal to proceed. This is the origin of 'struct dev_pagemap' and
its percpu_ref.

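For readers unfamiliar with that mechanism, here is a minimal sketch of
the kill-then-drain handshake, assuming the 'ref' and 'done' members of
'struct dev_pagemap'; the helper names are illustrative, the real logic
lives in mm/memremap.c:

#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
#include <linux/completion.h>

/* illustrative release callback: runs once all page references drain */
static void pgmap_ref_release(struct percpu_ref *ref)
{
	struct dev_pagemap *pgmap = container_of(ref, struct dev_pagemap, ref);

	complete(&pgmap->done);
}

/* illustrative shutdown path, following the memunmap_pages() pattern */
static void pgmap_shutdown(struct dev_pagemap *pgmap)
{
	/* flip the ref to "dying": percpu_ref_tryget_live() now fails */
	percpu_ref_kill(&pgmap->ref);
	/* sleep until the last outstanding page reference is put */
	wait_for_completion(&pgmap->done);
}
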
The original ZONE_DEVICE implementation noticed that 'struct page'
initialization, for typical page allocator pages, starts pages at a
refcount of 1. Those pages are later 'onlined' by freeing them to the
page allocator via put_page(), which drops that initial reference and
populates the free page lists. ZONE_DEVICE abused that "initialized but
never freed" state both to avoid leaking ZONE_DEVICE pages into places
that were not ready for them, and to stash some metadata in the unused
(because the refcount never reached 0) page->lru space.

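That convention is why "idle" for a DAX page has meant a refcount of 1,
not 0, as seen in the wait conditions in the hunks below. In sketch form,
with an illustrative helper name rather than a mainline symbol:

#include <linux/page_ref.h>

/*
 * Sketch of the legacy convention this series retires: a ZONE_DEVICE
 * page kept its initial reference for its whole lifetime, so "no other
 * users" meant a refcount of 1 rather than 0.
 */
static inline bool dax_page_idle_legacy(struct page *page)
{
	return page_ref_count(page) == 1;
}
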
As more users of ZONE_DEVICE arrived, that special-casing became more and
more unnecessary, and more and more expensive. The 'struct page'
modernization eliminated the need for the ->lru hack. The folio work had
to remember to sprinkle special-case ZONE_DEVICE accounting in the right
places. The MEMORY_DEVICE_PRIVATE use case spearheaded much of the work
to support typical reference counting for ZONE_DEVICE pages and allow
them to be used in more kernel code paths. All the while the DAX case
kept its tech debt in place, until now.

However, while fixing the DAX page refcount semantics and arranging for
free_zone_device_page() to be the common end-of-life path for all
ZONE_DEVICE pages, it became apparent that the mitigation for truncate()
vs pinned DAX pages was incomplete. Unlike typical pages, which can
surprisingly remain pinned for DMA after they have been truncated from a
file, the DAX core must enforce that nobody retains access to a page
after truncate() has disconnected it from inode->i_pages, because a DAX
page is identical to a file block: continued access to the page is
continued access to the file's storage. The existing mitigation handled
explicit truncate while the inode was alive, but not the implicit
truncate right before the inode is freed.

So, in addition to moving DAX pages to be refcount-0 when idle and adding
'break_layouts' wakeups to free_zone_device_page(), this series also
introduces another occurrence of 'break_layouts' in the inode-freeing
path. Recall that 'break_layouts' for DAX is the mechanism to wait for
code paths that were previously granted page access to drop their
interest / page pins. This new synchronization point is implemented by
special-casing dax_delete_mapping_entry(), called by
truncate_inode_pages_final(), to await page pins when mapping_exiting()
is true; a sketch of the shape of that check follows.

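As a rough sketch of where that new synchronization point sits, with a
hypothetical helper name (the actual change is to
dax_delete_mapping_entry() itself):

#include <linux/pagemap.h>
#include <linux/page_ref.h>
#include <linux/wait_bit.h>

/* hypothetical helper showing the mapping_exiting() special case */
static void dax_wait_page_idle_final(struct address_space *mapping,
				     struct page *page)
{
	/*
	 * Explicit truncate of a live inode already went through the
	 * filesystem's break_layouts path; only the implicit truncate
	 * at inode death needs this wait.
	 */
	if (!mapping_exiting(mapping))
		return;

	/* pairs with the wake_up_var(page) added to free_zone_device_page() */
	wait_var_event(page, page_ref_count(page) == 0);
}
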
Thanks to Jason for the nudge to get this fixed up properly and the
review on v1, Dave, Jan, and Jason for the discussion on what to do about
the inode end-of-life-truncate problem, and Alistair for cleaning up the
last of the refcount-1 assumptions in the MEMORY_DEVICE_PRIVATE users.


This patch (of 25):

The __wait_var_event facility calculates a wait queue from a hash of the
address of the variable being passed. Use the @page argument directly as
it is less to type and is the object that is being waited upon.

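The only invariant is that the wait and wake sides hash the same address.
A minimal sketch of the pairing after this patch, with illustrative
function names:

#include <linux/wait_bit.h>
#include <linux/page_ref.h>

/* waiter: sleeps on the hashed waitqueue selected from @page */
static int dax_wait_page_unused(struct page *page)
{
	return wait_var_event_interruptible(page,
			page_ref_count(page) == 1);
}

/* waker: must pass the same @page so the same waitqueue is selected */
static void dax_wake_page_unused(struct page *page)
{
	wake_up_var(page);
}
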
Link: https://lkml.kernel.org/r/166579181584.2236710.17813547487183983273.stgit@dwillia2-xfh.jf.intel.com
Link: https://lkml.kernel.org/r/166579182271.2236710.15120970389485390592.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Chinner <david@fromorbit.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ext4/inode.c   |    8 ++++----
 fs/fuse/dax.c     |    6 +++---
 fs/xfs/xfs_file.c |    6 +++---
 mm/memremap.c     |    2 +-
 4 files changed, 11 insertions(+), 11 deletions(-)

--- a/fs/ext4/inode.c~fsdax-wait-on-page-not-page-_refcount
+++ a/fs/ext4/inode.c
@@ -3968,10 +3968,10 @@ int ext4_break_layouts(struct inode *ino
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(inode));
+		error = ___wait_var_event(page,
+				atomic_read(&page->_refcount) == 1,
+				TASK_INTERRUPTIBLE, 0, 0,
+				ext4_wait_dax_page(inode));
 	} while (error == 0);
 
 	return error;
--- a/fs/fuse/dax.c~fsdax-wait-on-page-not-page-_refcount
+++ a/fs/fuse/dax.c
@@ -676,9 +676,9 @@ static int __fuse_dax_break_layouts(stru
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, fuse_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+			TASK_INTERRUPTIBLE, 0, 0,
+			fuse_wait_dax_page(inode));
 }
 
 /* dmap_end == 0 leads to unmapping of whole file */
--- a/fs/xfs/xfs_file.c~fsdax-wait-on-page-not-page-_refcount
+++ a/fs/xfs/xfs_file.c
@@ -827,9 +827,9 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return ___wait_var_event(page, atomic_read(&page->_refcount) == 1,
+			TASK_INTERRUPTIBLE, 0, 0,
+			xfs_wait_dax_page(inode));
 }
 
 int
--- a/mm/memremap.c~fsdax-wait-on-page-not-page-_refcount
+++ a/mm/memremap.c
@@ -543,7 +543,7 @@ bool __put_devmap_managed_page_refs(stru
 	 * stable because nobody holds a reference on the page.
 	 */
 	if (page_ref_sub_return(page, refs) == 1)
-		wake_up_var(&page->_refcount);
+		wake_up_var(page);
 	return true;
 }
 EXPORT_SYMBOL(__put_devmap_managed_page_refs);
_