| From: Dev Jain <dev.jain@arm.com> |
| Subject: mm: optimize mprotect() by PTE batching |
| Date: Fri, 18 Jul 2025 14:32:43 +0530 |
| |
| Use folio_pte_batch to batch process a large folio. Note that PTE |
| batching here saves a few function calls, and in certain cases (though |
| not this one) this strategy also batches atomic operations, so we get |
| a performance win for all arches. This patch paves the way for patch |
| 7, which will help us elide the TLBI per contig block on arm64. |
| |
| The correctness of this patch rests on it being correct to set the new |
| ptes based only on information from the first pte of the batch (which |
| may also have accumulated a/d bits via modify_prot_start_ptes()). |
| |
| Observe that the flag combination we pass to mprotect_folio_pte_batch() |
| guarantees that the batch is uniform w.r.t. the soft-dirty bit and the |
| writable bit. Therefore, the only bits which may differ are the a/d |
| bits, so we only need to worry about code which is concerned with the |
| a/d bits of the PTEs. |
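| |
| For reference, the flag combination referred to here is the one this |
| patch passes in change_pte_range() (quoted from the hunk below): |
| |
| 	const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; |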
| |
| Setting extra a/d bits on the new ptes where they were previously |
| clear is fine: setting the access bit when it was not set is not a |
| correctness problem, but may only delay reclaim of the page mapped by |
| the pte (which is in fact intended, since the kernel just operated on |
| this region via mprotect()!). Setting the dirty bit when it was not |
| set is likewise not a correctness problem, but may only force an |
| unnecessary writeback. |
| |
| So now we need to reason about whether something can go wrong via |
| can_change_pte_writable(). The pte_protnone, pte_needs_soft_dirty_wp, |
| and userfaultfd_pte_wp cases are handled by the uniformity of the |
| corresponding bits guaranteed by the flag combination. The ptes all |
| belong to the same VMA (since callers guarantee that [start, end) lies |
| within the VMA); therefore the conditional based on the VMA is also |
| safe to batch around. |
| |
| The dirty bit on a PTE is really just an indication that the folio got |
| written to, so even if this PTE is not actually dirty but one of the |
| PTEs in the batch is, the wp-fault optimization can still be made. |
| Therefore, it is safe to batch around pte_dirty() in |
| can_change_shared_pte_writable() (in fact this is better: without |
| batching, it may happen that some ptes are not changed to writable |
| just because they are not dirty, even though other ptes mapping the |
| same large folio are). |
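| |
| As a toy illustration (a self-contained userspace sketch with made-up |
| names, not the kernel code), deciding "dirty" at batch granularity can |
| only allow the wp-fault optimization for more ptes, never fewer: |
| |
| 	#include <stdbool.h> |
| 	#include <stdio.h> |
| |
| 	/* Toy model: the folio was written to if any pte in the batch is dirty. */ |
| 	static bool batch_dirty(const bool *pte_dirty, int nr_ptes) |
| 	{ |
| 		for (int i = 0; i < nr_ptes; i++) |
| 			if (pte_dirty[i]) |
| 				return true; |
| 		return false; |
| 	} |
| |
| 	int main(void) |
| 	{ |
| 		bool dirty[4] = { false, true, false, false }; |
| |
| 		/* |
| 		 * Per-pte, only index 1 passes the dirty test; batched, all |
| 		 * four do. Either way the underlying folio was written to, |
| 		 * which is what the optimization actually cares about. |
| 		 */ |
| 		printf("batch dirty: %d\n", batch_dirty(dirty, 4)); |
| 		return 0; |
| 	} |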
| |
| To batch around the PageAnonExclusive case, we must check the |
| corresponding condition for every single page. Therefore, within the |
| large folio batch, we find a sub-batch of ptes mapping pages with the |
| same PageAnonExclusive state, process that sub-batch, then determine |
| and process the next sub-batch, and so on. Note that this does not |
| cause any extra overhead: if, say, the folio batch size is 512, the |
| sub-batch processing will take 512 iterations in total, which is the |
| same as what we would have done before. |
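| |
| A minimal sketch of the sub-batching idea (self-contained userspace C |
| with toy types, mirroring the shape of page_anon_exclusive_sub_batch() |
| and commit_anon_folio_batch() below): splitting the batch into runs of |
| pages with the same boolean state still visits each entry exactly |
| once, so the total work stays linear in the batch size. |
| |
| 	#include <stdbool.h> |
| 	#include <stdio.h> |
| |
| 	/* Toy stand-in for checking PageAnonExclusive() on consecutive pages. */ |
| 	static int sub_batch_len(const bool *anon_excl, int start, int max_len, |
| 				 bool expected) |
| 	{ |
| 		int idx; |
| |
| 		for (idx = start + 1; idx < start + max_len; idx++) |
| 			if (anon_excl[idx] != expected) |
| 				break; |
| 		return idx - start; |
| 	} |
| |
| 	int main(void) |
| 	{ |
| 		bool anon_excl[8] = { true, true, false, false, false, true, true, true }; |
| 		int nr_ptes = 8, idx = 0; |
| |
| 		while (nr_ptes) { |
| 			bool expected = anon_excl[idx]; |
| 			int len = sub_batch_len(anon_excl, idx, nr_ptes, expected); |
| |
| 			printf("sub-batch at %d, len %d, anon-exclusive=%d\n", |
| 			       idx, len, expected); |
| 			idx += len;	/* each entry is visited exactly once overall */ |
| 			nr_ptes -= len; |
| 		} |
| 		return 0; |
| 	} |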
| |
| For pte_needs_flush(): |
| |
| ppc does not care about the a/d bits. |
| |
| For x86, PAGE_SAVED_DIRTY is ignored. We will flush only when a/d bits |
| get cleared; since we can only have extra a/d bits due to batching, we |
| will only have an extra flush, not a case where we elide a flush due to |
| batching when we shouldn't have. |
| |
| Link: https://lkml.kernel.org/r/20250718090244.21092-7-dev.jain@arm.com |
| Signed-off-by: Dev Jain <dev.jain@arm.com> |
| Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Reviewed-by: Zi Yan <ziy@nvidia.com> |
| Cc: Anshuman Khandual <anshuman.khandual@arm.com> |
| Cc: Barry Song <baohua@kernel.org> |
| Cc: Catalin Marinas <catalin.marinas@arm.com> |
| Cc: Christophe Leroy <christophe.leroy@csgroup.eu> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Hugh Dickins <hughd@google.com> |
| Cc: Jann Horn <jannh@google.com> |
| Cc: Joey Gouly <joey.gouly@arm.com> |
| Cc: Kevin Brodsky <kevin.brodsky@arm.com> |
| Cc: Lance Yang <ioworker0@gmail.com> |
| Cc: Liam Howlett <liam.howlett@oracle.com> |
| Cc: Matthew Wilcox (Oracle) <willy@infradead.org> |
| Cc: Peter Xu <peterx@redhat.com> |
| Cc: Ryan Roberts <ryan.roberts@arm.com> |
| Cc: Vlastimil Babka <vbabka@suse.cz> |
| Cc: Will Deacon <will@kernel.org> |
| Cc: Yang Shi <yang@os.amperecomputing.com> |
| Cc: Yicong Yang <yangyicong@hisilicon.com> |
| Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/mprotect.c | 125 +++++++++++++++++++++++++++++++++++++++++++----- |
| 1 file changed, 113 insertions(+), 12 deletions(-) |
| |
| --- a/mm/mprotect.c~mm-optimize-mprotect-by-pte-batching |
| +++ a/mm/mprotect.c |
| @@ -106,7 +106,7 @@ bool can_change_pte_writable(struct vm_a |
| } |
| |
| static int mprotect_folio_pte_batch(struct folio *folio, pte_t *ptep, |
| - pte_t pte, int max_nr_ptes) |
| + pte_t pte, int max_nr_ptes, fpb_t flags) |
| { |
| /* No underlying folio, so cannot batch */ |
| if (!folio) |
| @@ -115,7 +115,7 @@ static int mprotect_folio_pte_batch(stru |
| if (!folio_test_large(folio)) |
| return 1; |
| |
| - return folio_pte_batch(folio, ptep, pte, max_nr_ptes); |
| + return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr_ptes, flags); |
| } |
| |
| static bool prot_numa_skip(struct vm_area_struct *vma, unsigned long addr, |
| @@ -177,6 +177,102 @@ skip: |
| return ret; |
| } |
| |
| +/* Set nr_ptes number of ptes, starting from idx */ |
| +static void prot_commit_flush_ptes(struct vm_area_struct *vma, unsigned long addr, |
| + pte_t *ptep, pte_t oldpte, pte_t ptent, int nr_ptes, |
| + int idx, bool set_write, struct mmu_gather *tlb) |
| +{ |
| + /* |
| + * Advance the position in the batch by idx; note that if idx > 0, |
| + * then the nr_ptes passed here is <= batch size - idx. |
| + */ |
| + addr += idx * PAGE_SIZE; |
| + ptep += idx; |
| + oldpte = pte_advance_pfn(oldpte, idx); |
| + ptent = pte_advance_pfn(ptent, idx); |
| + |
| + if (set_write) |
| + ptent = pte_mkwrite(ptent, vma); |
| + |
| + modify_prot_commit_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes); |
| + if (pte_needs_flush(oldpte, ptent)) |
| + tlb_flush_pte_range(tlb, addr, nr_ptes * PAGE_SIZE); |
| +} |
| + |
| +/* |
| + * Get max length of consecutive ptes pointing to PageAnonExclusive() pages or |
| + * !PageAnonExclusive() pages, starting from start_idx. Caller must enforce |
| + * that the ptes point to consecutive pages of the same anon large folio. |
| + */ |
| +static int page_anon_exclusive_sub_batch(int start_idx, int max_len, |
| + struct page *first_page, bool expected_anon_exclusive) |
| +{ |
| + int idx; |
| + |
| + for (idx = start_idx + 1; idx < start_idx + max_len; ++idx) { |
| + if (expected_anon_exclusive != PageAnonExclusive(first_page + idx)) |
| + break; |
| + } |
| + return idx - start_idx; |
| +} |
| + |
| +/* |
| + * This function is a result of trying our very best to retain the |
| + * "avoid the write-fault handler" optimization. In can_change_pte_writable(), |
| + * if the vma is a private vma, and we cannot determine whether to change |
| + * the pte to writable just from the vma and the pte, we then need to look |
| + * at the actual page pointed to by the pte. Unfortunately, if we have a |
| + * batch of ptes pointing to consecutive pages of the same anon large folio, |
| + * the anon-exclusivity (or the negation) of the first page does not guarantee |
| + * the anon-exclusivity (or the negation) of the other pages corresponding to |
| + * the pte batch; hence in this case it is incorrect to decide to change or |
| + * not change the ptes to writable just by using information from the first |
| + * pte of the batch. Therefore, we must individually check all pages and |
| + * retrieve sub-batches. |
| + */ |
| +static void commit_anon_folio_batch(struct vm_area_struct *vma, |
| + struct folio *folio, unsigned long addr, pte_t *ptep, |
| + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) |
| +{ |
| + struct page *first_page = folio_page(folio, 0); |
| + bool expected_anon_exclusive; |
| + int sub_batch_idx = 0; |
| + int len; |
| + |
| + while (nr_ptes) { |
| + expected_anon_exclusive = PageAnonExclusive(first_page + sub_batch_idx); |
| + len = page_anon_exclusive_sub_batch(sub_batch_idx, nr_ptes, |
| + first_page, expected_anon_exclusive); |
| + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, len, |
| + sub_batch_idx, expected_anon_exclusive, tlb); |
| + sub_batch_idx += len; |
| + nr_ptes -= len; |
| + } |
| +} |
| + |
| +static void set_write_prot_commit_flush_ptes(struct vm_area_struct *vma, |
| + struct folio *folio, unsigned long addr, pte_t *ptep, |
| + pte_t oldpte, pte_t ptent, int nr_ptes, struct mmu_gather *tlb) |
| +{ |
| + bool set_write; |
| + |
| + if (vma->vm_flags & VM_SHARED) { |
| + set_write = can_change_shared_pte_writable(vma, ptent); |
| + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, |
| + /* idx = */ 0, set_write, tlb); |
| + return; |
| + } |
| + |
| + set_write = maybe_change_pte_writable(vma, ptent) && |
| + (folio && folio_test_anon(folio)); |
| + if (!set_write) { |
| + prot_commit_flush_ptes(vma, addr, ptep, oldpte, ptent, nr_ptes, |
| + /* idx = */ 0, set_write, tlb); |
| + return; |
| + } |
| + commit_anon_folio_batch(vma, folio, addr, ptep, oldpte, ptent, nr_ptes, tlb); |
| +} |
| + |
| static long change_pte_range(struct mmu_gather *tlb, |
| struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, |
| unsigned long end, pgprot_t newprot, unsigned long cp_flags) |
| @@ -206,8 +302,9 @@ static long change_pte_range(struct mmu_ |
| nr_ptes = 1; |
| oldpte = ptep_get(pte); |
| if (pte_present(oldpte)) { |
| + const fpb_t flags = FPB_RESPECT_SOFT_DIRTY | FPB_RESPECT_WRITE; |
| int max_nr_ptes = (end - addr) >> PAGE_SHIFT; |
| - struct folio *folio; |
| + struct folio *folio = NULL; |
| pte_t ptent; |
| |
| /* |
| @@ -221,11 +318,16 @@ static long change_pte_range(struct mmu_ |
| |
| /* determine batch to skip */ |
| nr_ptes = mprotect_folio_pte_batch(folio, |
| - pte, oldpte, max_nr_ptes); |
| + pte, oldpte, max_nr_ptes, /* flags = */ 0); |
| continue; |
| } |
| } |
| |
| + if (!folio) |
| + folio = vm_normal_folio(vma, addr, oldpte); |
| + |
| + nr_ptes = mprotect_folio_pte_batch(folio, pte, oldpte, max_nr_ptes, flags); |
| + |
| oldpte = modify_prot_start_ptes(vma, addr, pte, nr_ptes); |
| ptent = pte_modify(oldpte, newprot); |
| |
| @@ -248,14 +350,13 @@ static long change_pte_range(struct mmu_ |
| * COW or special handling is required. |
| */ |
| if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && |
| - !pte_write(ptent) && |
| - can_change_pte_writable(vma, addr, ptent)) |
| - ptent = pte_mkwrite(ptent, vma); |
| - |
| - modify_prot_commit_ptes(vma, addr, pte, oldpte, ptent, nr_ptes); |
| - if (pte_needs_flush(oldpte, ptent)) |
| - tlb_flush_pte_range(tlb, addr, PAGE_SIZE); |
| - pages++; |
| + !pte_write(ptent)) |
| + set_write_prot_commit_flush_ptes(vma, folio, |
| + addr, pte, oldpte, ptent, nr_ptes, tlb); |
| + else |
| + prot_commit_flush_ptes(vma, addr, pte, oldpte, ptent, |
| + nr_ptes, /* idx = */ 0, /* set_write = */ false, tlb); |
| + pages += nr_ptes; |
| } else if (is_swap_pte(oldpte)) { |
| swp_entry_t entry = pte_to_swp_entry(oldpte); |
| pte_t newpte; |
| _ |