| From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Subject: mm: introduce new .mmap_prepare() file callback |
| Date: Fri, 9 May 2025 13:13:34 +0100 |
| |
| Patch series "eliminate mmap() retry merge, add .mmap_prepare hook", v2. |
| |
| During the mmap() of a file-backed mapping, we invoke the underlying |
| driver file's mmap() callback in order to perform driver/file system |
| initialisation of the underlying VMA. |
| |
| This has been a source of issues in the past, including a significant |
| security concern relating to unwinding of error state discovered by Jann |
| Horn, as fixed in commit 5de195060b2e ("mm: resolve faulty mmap_region() |
| error path behaviour") which performed the recent, significant, rework of |
| mmap() as a whole. |
| |
| However, one fly in the ointment remains - drivers have a great deal of |
| freedom in the .mmap() hook to manipulate VMA state (as well as page |
| table state). |
| |
| This can be problematic, as we can no longer reason sensibly about VMA |
| state once the call is complete (the ability to do - anything - here does |
| rather interfere with that). |
| |
| In addition, a callback may choose to do odd or unusual things which |
| might interfere with subsequent steps in the mmap() process, and may then |
| raise an error, requiring very careful unwinding of state about which we |
| can make no assumptions. |
| |
| Rather than providing such an open-ended interface, this series provides |
| an alternative, far more restrictive one - we expose a whitelist of fields |
| which can be adjusted by the driver, along with immutable state upon which |
| the driver can make such decisions: |
| |
| struct vm_area_desc { |
| /* Immutable state. */ |
| struct mm_struct *mm; |
| unsigned long start; |
| unsigned long end; |
| |
| /* Mutable fields. Populated with initial state. */ |
| pgoff_t pgoff; |
| struct file *file; |
| vm_flags_t vm_flags; |
| pgprot_t page_prot; |
| |
| /* Write-only fields. */ |
| const struct vm_operations_struct *vm_ops; |
| void *private_data; |
| }; |
| |
| The mmap logic then uses this descriptor to update the state used to |
| either merge with an existing VMA or to establish a new one. |
| |
| This is achieved via new file hook .mmap_prepare(), which is, importantly, |
| invoked very early on in the mmap() process. |
| |
| If an error arises, we can very simply abort the operation with very |
| little unwinding of state required. |
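| |
| To illustrate, a driver's hook might look something like the below. This |
| is a minimal sketch only - the mydrv_* names and the MYDRV_MAX_SIZE limit |
| are entirely hypothetical: |
| |
| static int mydrv_mmap_prepare(struct vm_area_desc *desc) |
| { |
| 	/* Reject ranges the (hypothetical) device cannot back. */ |
| 	if (desc->end - desc->start > MYDRV_MAX_SIZE) |
| 		return -EINVAL; |
| |
| 	/* Only whitelisted fields may be updated. */ |
| 	desc->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; |
| 	desc->vm_ops = &mydrv_vm_ops; |
| |
| 	return 0; |
| } |
| |
| On failure, no VMA has yet been created and no merge has been attempted, |
| so there is nothing to unwind. |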
| |
| The existing logic contains another, related, peccadillo - since the |
| .mmap() callback might do anything, it may also cause a previously |
| unmergeable VMA to become mergeable with adjacent VMAs. |
| |
| Right now the logic will retry a merge like this only if the driver |
| changes VMA flags, and changes them in such a way that a merge might |
| succeed (that is, the flags are not 'special', i.e. they do not contain |
| any of the flags specified in VM_SPECIAL). |
| |
| This has also been the source of a great deal of pain - it's hard to |
| reason about an .mmap() callback that might do - anything - but it's also |
| hard to reason about setting up a VMA and writing to the maple tree, only |
| to do it again utilising a great deal of shared state. |
| |
| Since .mmap_prepare() sets fields before the first merge is even |
| attempted, the use of this callback obviates the need for this retry merge |
| logic. |
| |
| A driver may specify either .mmap_prepare() or the deprecated .mmap() |
| callback, but not both. In future we may add further callbacks beyond |
| .mmap_prepare() to facilitate all use cases as we convert drivers. |
| |
| In researching this change, I examined every .mmap() callback, and |
| discovered only a very few that set VMA state in such a way that (a) the |
| VMA flags changed, and (b) the resulting VMA would be mergeable. |
| |
| In the majority of cases, it turns out that drivers are mapping kernel |
| memory and thus ultimately set VM_PFNMAP, VM_MIXEDMAP, or other |
| unmergeable VM_SPECIAL flags. |
| |
| Of those that remain, I identified a number of cases which are only |
| applicable to DAX, setting the VM_HUGEPAGE flag: |
| |
| * dax_mmap() |
| * erofs_file_mmap() |
| * ext4_file_mmap() |
| * xfs_file_mmap() |
| |
| For the removal of this retry merge logic to impact users, each of these |
| cases would require a user to mmap() files using DAX, in parts, |
| immediately adjacent to one another. |
| |
| This is a very unlikely use case, so it does not appear worthwhile to |
| adjust this functionality accordingly. |
| |
| We can, however, very quickly do so if needed by simply adding an |
| .mmap_prepare() callback to these as required. |
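| |
| Such a conversion might, roughly, look like the below - a sketch only, |
| with the example_* names standing in for whichever file system is being |
| converted, not code taken from any of those listed above: |
| |
| static int example_file_mmap_prepare(struct vm_area_desc *desc) |
| { |
| 	struct inode *inode = file_inode(desc->file); |
| |
| 	/* Express the DAX-only VM_HUGEPAGE hint via the descriptor. */ |
| 	if (IS_DAX(inode)) |
| 		desc->vm_flags |= VM_HUGEPAGE; |
| |
| 	desc->vm_ops = &example_file_vm_ops; |
| 	return 0; |
| } |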
| |
| There are two further non-DAX cases I identified: |
| |
| * orangefs_file_mmap() - Clears VM_RAND_READ if set, replacing with |
| VM_SEQ_READ. |
| * usb_stream_hwdep_mmap() - Sets VM_DONTDUMP. |
| |
| Both of these cases again seem very unlikely to be mmap()'d immediately |
| adjacent to one another in a fashion that would result in a merge. |
| |
| Finally, we are left with a viable case: |
| |
| * secretmem_mmap() - Sets VM_LOCKED, VM_DONTDUMP. |
| |
| This is viable enough that the mm selftests trigger the logic as a matter |
| of course. Therefore, this series replaces the secretmem_mmap() hook with |
| secretmem_mmap_prepare(). |
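| |
| The replacement hook translates the existing logic directly - the below |
| is a sketch based on the current secretmem_mmap() implementation, not the |
| verbatim patch: |
| |
| static int secretmem_mmap_prepare(struct vm_area_desc *desc) |
| { |
| 	const unsigned long len = desc->end - desc->start; |
| |
| 	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0) |
| 		return -EINVAL; |
| |
| 	if (!mlock_future_ok(desc->mm, desc->vm_flags | VM_LOCKED, len)) |
| 		return -EAGAIN; |
| |
| 	desc->vm_flags |= VM_LOCKED | VM_DONTDUMP; |
| 	desc->vm_ops = &secretmem_vm_ops; |
| |
| 	return 0; |
| } |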
| |
| |
| This patch (of 3): |
| |
| Provide a means by which drivers can specify which of the fields they are |
| permitted to change should be altered prior to mmap()'ing a range (which |
| may either result from a merge or from mapping an entirely new VMA). |
| |
| Doing so is substantially safer than the existing .mmap() callback, which |
| provides unrestricted access to the partially-constructed VMA and permits |
| drivers and file systems to do 'creative' things which make it hard to |
| reason about the state of the VMA after the function returns. |
| |
| The existing .mmap() callback's freedom has caused a great deal of issues, |
| especially in error handling, as unwinding the mmap() state has proven to |
| be non-trivial and caused significant issues in the past, for instance |
| those addressed in commit 5de195060b2e ("mm: resolve faulty mmap_region() |
| error path behaviour"). |
| |
| It also necessitates a second attempt at merge once the .mmap() callback |
| has completed, which has caused issues in the past, is awkward, adds |
| overhead and is difficult to reason about. |
| |
| The .mmap_prepare() callback eliminates this requirement, as we can update |
| fields prior to even attempting the first merge. It is safer, as we |
| heavily restrict what can actually be modified, and because it is invoked |
| very early in the mmap() process, error handling can be performed safely |
| with very little unwinding of state required. |
| |
| The .mmap_prepare() and deprecated .mmap() callbacks are mutually |
| exclusive, so we permit a driver to set only one of them. |
| |
| Update the VMA userland test stubs to account for these changes. |
| |
| Link: https://lkml.kernel.org/r/cover.1746792520.git.lorenzo.stoakes@oracle.com |
| Link: https://lkml.kernel.org/r/adb36a7c4affd7393b2fc4b54cc5cfe211e41f71.1746792520.git.lorenzo.stoakes@oracle.com |
| Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Reviewed-by: Vlastimil Babka <vbabka@suse.cz> |
| Cc: Al Viro <viro@zeniv.linux.org.uk> |
| Cc: Christian Brauner <brauner@kernel.org> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Jan Kara <jack@suse.cz> |
| Cc: Jann Horn <jannh@google.com> |
| Cc: Liam Howlett <liam.howlett@oracle.com> |
| Cc: Matthew Wilcox (Oracle) <willy@infradead.org> |
| Cc: Michal Hocko <mhocko@suse.com> |
| Cc: Mike Rapoport <rppt@kernel.org> |
| Cc: Suren Baghdasaryan <surenb@google.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| include/linux/fs.h | 25 ++++++++++ |
| include/linux/mm_types.h | 24 ++++++++++ |
| mm/memory.c | 3 - |
| mm/mmap.c | 2 |
| mm/vma.c | 68 ++++++++++++++++++++++++++++- |
| tools/testing/vma/vma_internal.h | 66 ++++++++++++++++++++++++++-- |
| 6 files changed, 180 insertions(+), 8 deletions(-) |
| |
| --- a/include/linux/fs.h~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/include/linux/fs.h |
| @@ -2169,6 +2169,7 @@ struct file_operations { |
| int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags); |
| int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *, |
| unsigned int poll_flags); |
| + int (*mmap_prepare)(struct vm_area_desc *); |
| } __randomize_layout; |
| |
| /* Supports async buffered reads */ |
| @@ -2238,11 +2239,35 @@ struct inode_operations { |
| struct offset_ctx *(*get_offset_ctx)(struct inode *inode); |
| } ____cacheline_aligned; |
| |
| +/* Did the driver provide valid mmap hook configuration? */ |
| +static inline bool file_has_valid_mmap_hooks(struct file *file) |
| +{ |
| + bool has_mmap = file->f_op->mmap; |
| + bool has_mmap_prepare = file->f_op->mmap_prepare; |
| + |
| + /* Hooks are mutually exclusive. */ |
| + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare)) |
| + return false; |
| + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare)) |
| + return false; |
| + |
| + return true; |
| +} |
| + |
| static inline int call_mmap(struct file *file, struct vm_area_struct *vma) |
| { |
| + if (WARN_ON_ONCE(file->f_op->mmap_prepare)) |
| + return -EINVAL; |
| + |
| return file->f_op->mmap(file, vma); |
| } |
| |
| +static inline int __call_mmap_prepare(struct file *file, |
| + struct vm_area_desc *desc) |
| +{ |
| + return file->f_op->mmap_prepare(desc); |
| +} |
| + |
| extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); |
| extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); |
| extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, |
| --- a/include/linux/mm_types.h~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/include/linux/mm_types.h |
| @@ -764,6 +764,30 @@ struct vma_numab_state { |
| }; |
| |
| /* |
| + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to |
| + * manipulate mutable fields which will cause those fields to be updated in the |
| + * resultant VMA. |
| + * |
| + * Helper functions are not required for manipulating any field. |
| + */ |
| +struct vm_area_desc { |
| + /* Immutable state. */ |
| + struct mm_struct *mm; |
| + unsigned long start; |
| + unsigned long end; |
| + |
| + /* Mutable fields. Populated with initial state. */ |
| + pgoff_t pgoff; |
| + struct file *file; |
| + vm_flags_t vm_flags; |
| + pgprot_t page_prot; |
| + |
| + /* Write-only fields. */ |
| + const struct vm_operations_struct *vm_ops; |
| + void *private_data; |
| +}; |
| + |
| +/* |
| * This struct describes a virtual memory area. There is one of these |
| * per VM-area/task. A VM area is any part of the process virtual memory |
| * space that has a special rule for the page-fault handlers (ie a shared |
| --- a/mm/memory.c~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/mm/memory.c |
| @@ -527,10 +527,11 @@ static void print_bad_pte(struct vm_area |
| dump_page(page, "bad pte"); |
| pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n", |
| (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); |
| - pr_alert("file:%pD fault:%ps mmap:%ps read_folio:%ps\n", |
| + pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n", |
| vma->vm_file, |
| vma->vm_ops ? vma->vm_ops->fault : NULL, |
| vma->vm_file ? vma->vm_file->f_op->mmap : NULL, |
| + vma->vm_file ? vma->vm_file->f_op->mmap_prepare : NULL, |
| mapping ? mapping->a_ops->read_folio : NULL); |
| dump_stack(); |
| add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); |
| --- a/mm/mmap.c~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/mm/mmap.c |
| @@ -475,7 +475,7 @@ unsigned long do_mmap(struct file *file, |
| vm_flags &= ~VM_MAYEXEC; |
| } |
| |
| - if (!file->f_op->mmap) |
| + if (!file_has_valid_mmap_hooks(file)) |
| return -ENODEV; |
| if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) |
| return -EINVAL; |
| --- a/mm/vma.c~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/mm/vma.c |
| @@ -17,6 +17,11 @@ struct mmap_state { |
| unsigned long pglen; |
| unsigned long flags; |
| struct file *file; |
| + pgprot_t page_prot; |
| + |
| + /* User-defined fields, perhaps updated by .mmap_prepare(). */ |
| + const struct vm_operations_struct *vm_ops; |
| + void *vm_private_data; |
| |
| unsigned long charged; |
| bool retry_merge; |
| @@ -40,6 +45,7 @@ struct mmap_state { |
| .pglen = PHYS_PFN(len_), \ |
| .flags = flags_, \ |
| .file = file_, \ |
| + .page_prot = vm_get_page_prot(flags_), \ |
| } |
| |
| #define VMG_MMAP_STATE(name, map_, vma_) \ |
| @@ -2385,6 +2391,10 @@ static int __mmap_new_file_vma(struct mm |
| int error; |
| |
| vma->vm_file = get_file(map->file); |
| + |
| + if (!map->file->f_op->mmap) |
| + return 0; |
| + |
| error = mmap_file(vma->vm_file, vma); |
| if (error) { |
| fput(vma->vm_file); |
| @@ -2441,7 +2451,7 @@ static int __mmap_new_vma(struct mmap_st |
| vma_iter_config(vmi, map->addr, map->end); |
| vma_set_range(vma, map->addr, map->end, map->pgoff); |
| vm_flags_init(vma, map->flags); |
| - vma->vm_page_prot = vm_get_page_prot(map->flags); |
| + vma->vm_page_prot = map->page_prot; |
| |
| if (vma_iter_prealloc(vmi, vma)) { |
| error = -ENOMEM; |
| @@ -2528,6 +2538,56 @@ static void __mmap_complete(struct mmap_ |
| vma_set_page_prot(vma); |
| } |
| |
| +/* |
| + * Invoke the f_op->mmap_prepare() callback for a file-backed mapping that |
| + * specifies it. |
| + * |
| + * This is called prior to any merge attempt, and updates whitelisted fields |
| + * that are permitted to be updated by the caller. |
| + * |
| + * All but user-defined fields will be pre-populated with original values. |
| + * |
| + * Returns 0 on success, or an error code otherwise. |
| + */ |
| +static int call_mmap_prepare(struct mmap_state *map) |
| +{ |
| + int err; |
| + struct vm_area_desc desc = { |
| + .mm = map->mm, |
| + .start = map->addr, |
| + .end = map->end, |
| + |
| + .pgoff = map->pgoff, |
| + .file = map->file, |
| + .vm_flags = map->flags, |
| + .page_prot = map->page_prot, |
| + }; |
| + |
| + /* Invoke the hook. */ |
| + err = __call_mmap_prepare(map->file, &desc); |
| + if (err) |
| + return err; |
| + |
| + /* Update fields permitted to be changed. */ |
| + map->pgoff = desc.pgoff; |
| + map->file = desc.file; |
| + map->flags = desc.vm_flags; |
| + map->page_prot = desc.page_prot; |
| + /* User-defined fields. */ |
| + map->vm_ops = desc.vm_ops; |
| + map->vm_private_data = desc.private_data; |
| + |
| + return 0; |
| +} |
| + |
| +static void set_vma_user_defined_fields(struct vm_area_struct *vma, |
| + struct mmap_state *map) |
| +{ |
| + if (map->vm_ops) |
| + vma->vm_ops = map->vm_ops; |
| + vma->vm_private_data = map->vm_private_data; |
| +} |
| + |
| static unsigned long __mmap_region(struct file *file, unsigned long addr, |
| unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, |
| struct list_head *uf) |
| @@ -2535,10 +2595,13 @@ static unsigned long __mmap_region(struc |
| struct mm_struct *mm = current->mm; |
| struct vm_area_struct *vma = NULL; |
| int error; |
| + bool have_mmap_prepare = file && file->f_op->mmap_prepare; |
| VMA_ITERATOR(vmi, mm, addr); |
| MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file); |
| |
| error = __mmap_prepare(&map, uf); |
| + if (!error && have_mmap_prepare) |
| + error = call_mmap_prepare(&map); |
| if (error) |
| goto abort_munmap; |
| |
| @@ -2556,6 +2619,9 @@ static unsigned long __mmap_region(struc |
| goto unacct_error; |
| } |
| |
| + if (have_mmap_prepare) |
| + set_vma_user_defined_fields(vma, &map); |
| + |
| /* If flags changed, we might be able to merge, so try again. */ |
| if (map.retry_merge) { |
| struct vm_area_struct *merged; |
| --- a/tools/testing/vma/vma_internal.h~mm-introduce-new-mmap_prepare-file-callback |
| +++ a/tools/testing/vma/vma_internal.h |
| @@ -253,8 +253,40 @@ struct mm_struct { |
| unsigned long flags; /* Must use atomic bitops to access */ |
| }; |
| |
| +struct vm_area_struct; |
| + |
| +/* |
| + * Describes a VMA that is about to be mmap()'ed. Drivers may choose to |
| + * manipulate mutable fields which will cause those fields to be updated in the |
| + * resultant VMA. |
| + * |
| + * Helper functions are not required for manipulating any field. |
| + */ |
| +struct vm_area_desc { |
| + /* Immutable state. */ |
| + struct mm_struct *mm; |
| + unsigned long start; |
| + unsigned long end; |
| + |
| + /* Mutable fields. Populated with initial state. */ |
| + pgoff_t pgoff; |
| + struct file *file; |
| + vm_flags_t vm_flags; |
| + pgprot_t page_prot; |
| + |
| + /* Write-only fields. */ |
| + const struct vm_operations_struct *vm_ops; |
| + void *private_data; |
| +}; |
| + |
| +struct file_operations { |
| + int (*mmap)(struct file *, struct vm_area_struct *); |
| + int (*mmap_prepare)(struct vm_area_desc *); |
| +}; |
| + |
| struct file { |
| struct address_space *f_mapping; |
| + const struct file_operations *f_op; |
| }; |
| |
| #define VMA_LOCK_OFFSET 0x40000000 |
| @@ -1125,11 +1157,6 @@ static inline void vm_flags_clear(struct |
| vma->__vm_flags &= ~flags; |
| } |
| |
| -static inline int call_mmap(struct file *, struct vm_area_struct *) |
| -{ |
| - return 0; |
| -} |
| - |
| static inline int shmem_zero_setup(struct vm_area_struct *) |
| { |
| return 0; |
| @@ -1405,4 +1432,33 @@ static inline void free_anon_vma_name(st |
| (void)vma; |
| } |
| |
| +/* Did the driver provide valid mmap hook configuration? */ |
| +static inline bool file_has_valid_mmap_hooks(struct file *file) |
| +{ |
| + bool has_mmap = file->f_op->mmap; |
| + bool has_mmap_prepare = file->f_op->mmap_prepare; |
| + |
| + /* Hooks are mutually exclusive. */ |
| + if (WARN_ON_ONCE(has_mmap && has_mmap_prepare)) |
| + return false; |
| + if (WARN_ON_ONCE(!has_mmap && !has_mmap_prepare)) |
| + return false; |
| + |
| + return true; |
| +} |
| + |
| +static inline int call_mmap(struct file *file, struct vm_area_struct *vma) |
| +{ |
| + if (WARN_ON_ONCE(file->f_op->mmap_prepare)) |
| + return -EINVAL; |
| + |
| + return file->f_op->mmap(file, vma); |
| +} |
| + |
| +static inline int __call_mmap_prepare(struct file *file, |
| + struct vm_area_desc *desc) |
| +{ |
| + return file->f_op->mmap_prepare(desc); |
| +} |
| + |
| #endif /* __MM_VMA_INTERNAL_H */ |
| _ |