| From: Harry Yoo <harry.yoo@oracle.com> |
| Subject: mm: move page table sync declarations to linux/pgtable.h |
| Date: Mon, 18 Aug 2025 11:02:04 +0900 |
| |
| During our internal testing, we started observing intermittent boot |
| failures when the machine uses 4-level paging and has a large amount of |
| persistent memory: |
| |
| BUG: unable to handle page fault for address: ffffe70000000034 |
| #PF: supervisor write access in kernel mode |
| #PF: error_code(0x0002) - not-present page |
| PGD 0 P4D 0 |
| Oops: 0002 [#1] SMP NOPTI |
| RIP: 0010:__init_single_page+0x9/0x6d |
| Call Trace: |
| <TASK> |
| __init_zone_device_page+0x17/0x5d |
| memmap_init_zone_device+0x154/0x1bb |
| pagemap_range+0x2e0/0x40f |
| memremap_pages+0x10b/0x2f0 |
| devm_memremap_pages+0x1e/0x60 |
| dev_dax_probe+0xce/0x2ec [device_dax] |
| dax_bus_probe+0x6d/0xc9 |
| [... snip ...] |
| </TASK> |
| |
| It turns out that the kernel panics while initializing vmemmap (struct |
| page array) when the vmemmap region spans two PGD entries, because the new |
| PGD entry is only installed in init_mm.pgd, but not in the page tables of |
| other tasks. |
| |
| And looking at __populate_section_memmap(): |
| if (vmemmap_can_optimize(altmap, pgmap)) |
| // does not sync top level page tables |
| r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap); |
| else |
| // sync top level page tables in x86 |
| r = vmemmap_populate(start, end, nid, altmap); |
| |
| In the normal path, vmemmap_populate() in arch/x86/mm/init_64.c |
| synchronizes the top level page table (See commit 9b861528a801 ("x86-64, |
| mem: Update all PGDs for direct mapping and vmemmap mapping changes")) so |
| that all tasks in the system can see the new vmemmap area. |
| |
| However, when vmemmap_can_optimize() returns true, the optimized path |
| skips synchronization of top-level page tables. This is because |
| vmemmap_populate_compound_pages() is implemented in core MM code, which |
| does not handle synchronization of the top-level page tables. Instead, |
| the core MM has historically relied on each architecture to perform this |
| synchronization manually. |
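| The failure mode can be modeled in a few lines of userspace C (not |
| kernel code; the names, sizes and values are toy stand-ins for |
| init_mm, sync_global_pgds() and PTRS_PER_PGD): each task snapshots the |
| kernel's top-level entries, so an entry installed only in init_mm is |
| invisible to other tasks until an explicit sync: |
| |
| ```c |
| /* Toy model: per-task copies of the kernel top-level (PGD) entries. */ |
| #include <assert.h> |
| #include <string.h> |
| |
| #define PTRS_PER_PGD 8                 /* real x86 uses 512 */ |
| |
| struct mm { unsigned long pgd[PTRS_PER_PGD]; }; |
| |
| static struct mm init_mm;              /* reference kernel page table */ |
| static struct mm task;                 /* some other task */ |
| |
| /* At fork, a task snapshots init_mm's kernel entries. */ |
| static void task_fork(struct mm *mm) |
| { |
|         memcpy(mm->pgd, init_mm.pgd, sizeof(init_mm.pgd)); |
| } |
| |
| /* Stand-in for sync_global_pgds(): propagate init_mm's entries. */ |
| static void sync_global_pgds(void) |
| { |
|         memcpy(task.pgd, init_mm.pgd, sizeof(init_mm.pgd)); |
| } |
| |
| int main(void) |
| { |
|         task_fork(&task); |
| |
|         /* vmemmap grows into a previously empty PGD slot; only |
|          * init_mm is updated, as on the optimized vmemmap path. */ |
|         init_mm.pgd[5] = 0xabcd; |
|         assert(task.pgd[5] == 0);      /* stale entry: the page fault */ |
| |
|         sync_global_pgds();            /* what vmemmap_populate() does */ |
|         assert(task.pgd[5] == 0xabcd); /* synced: no fault */ |
|         return 0; |
| } |
| ``` |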
| |
| We're not the first to encounter a crash caused by unsynchronized |
| top-level page tables: earlier this year, Gwan-gyeong Mun attempted to |
| address the issue [1] [2] after hitting a kernel panic when x86 code |
| accessed the vmemmap area before the corresponding top-level entries |
| were synced. At the time, the issue was believed to trigger only when |
| struct page was enlarged for debugging purposes, and the patch did not |
| receive further updates. |
| |
| The current approach of relying on each architecture to handle the |
| page table sync manually is fragile: 1) it is easy to forget to sync |
| the top level page table, and 2) it is easy to overlook that the |
| kernel must not access the vmemmap and direct mapping areas before the |
| sync. |
| |
| # The solution: Make page table sync code more robust and harder to miss |
| |
| To address this, Dave Hansen suggested [3] [4] introducing |
| {pgd,p4d}_populate_kernel() for updating the kernel portion of the |
| page tables, allowing each architecture to explicitly perform |
| synchronization when installing top-level entries. With this approach, |
| we no longer need to worry about missing the sync step, reducing the |
| risk of future regressions. |
| |
| The new interface reuses the existing ARCH_PAGE_TABLE_SYNC_MASK, |
| PGTBL_P*D_MODIFIED and arch_sync_kernel_mappings() facilities that |
| vmalloc and ioremap already use to synchronize page tables. |
| |
| pgd_populate_kernel() looks like this: |
| static inline void pgd_populate_kernel(unsigned long addr, pgd_t *pgd, |
| p4d_t *p4d) |
| { |
| pgd_populate(&init_mm, pgd, p4d); |
| if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) |
| arch_sync_kernel_mappings(addr, addr); |
| } |
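| p4d_populate_kernel() for 5-level paging follows the same shape. The |
| gating on ARCH_PAGE_TABLE_SYNC_MASK can be sketched as a standalone |
| userspace model (toy mask values and a stub sync function, not the |
| real kernel definitions): when a level's PGTBL_*_MODIFIED bit is not |
| in the arch mask, the compile-time-constant condition lets the |
| compiler drop the call entirely: |
| |
| ```c |
| #include <assert.h> |
| |
| #define PGTBL_PGD_MODIFIED (1U << 0) |
| #define PGTBL_P4D_MODIFIED (1U << 1) |
| |
| /* Pretend this arch only needs syncing for PGD-level changes. */ |
| #define ARCH_PAGE_TABLE_SYNC_MASK PGTBL_PGD_MODIFIED |
| |
| static int sync_calls; |
| |
| static void arch_sync_kernel_mappings(unsigned long start, |
|                                       unsigned long end) |
| { |
|         (void)start; (void)end; |
|         sync_calls++; |
| } |
| |
| static void pgd_populate_kernel(unsigned long addr) |
| { |
|         /* pgd_populate(&init_mm, pgd, p4d) would go here */ |
|         if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_PGD_MODIFIED) |
|                 arch_sync_kernel_mappings(addr, addr); |
| } |
| |
| static void p4d_populate_kernel(unsigned long addr) |
| { |
|         /* p4d_populate(&init_mm, p4d, pud) would go here */ |
|         if (ARCH_PAGE_TABLE_SYNC_MASK & PGTBL_P4D_MODIFIED) |
|                 arch_sync_kernel_mappings(addr, addr); |
| } |
| |
| int main(void) |
| { |
|         pgd_populate_kernel(0); |
|         assert(sync_calls == 1);  /* PGD bit is in the mask: synced */ |
|         p4d_populate_kernel(0); |
|         assert(sync_calls == 1);  /* P4D bit is not: call dropped */ |
|         return 0; |
| } |
| ``` |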
| |
| It is worth noting that vmalloc() and apply_to_range() already |
| synchronize page tables carefully by calling p*d_alloc_track() and |
| arch_sync_kernel_mappings(), so they are not affected by this patch |
| series. |
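| Their pattern accumulates the modified levels in a mask during the |
| page table walk and performs a single sync afterwards; a userspace |
| sketch of that pattern (toy values and stand-in functions, not kernel |
| code): |
| |
| ```c |
| #include <assert.h> |
| |
| #define PGTBL_PGD_MODIFIED (1U << 0) |
| #define PGTBL_P4D_MODIFIED (1U << 1) |
| |
| /* Pretend the arch wants to hear about both levels. */ |
| #define ARCH_PAGE_TABLE_SYNC_MASK \ |
|         (PGTBL_PGD_MODIFIED | PGTBL_P4D_MODIFIED) |
| |
| static int sync_calls; |
| |
| static void arch_sync_kernel_mappings(unsigned long start, |
|                                       unsigned long end) |
| { |
|         (void)start; (void)end; |
|         sync_calls++; |
| } |
| |
| /* Stand-in for p*d_alloc_track(): note which level was populated. */ |
| static void pgd_alloc_track(unsigned long *mask) |
| { |
|         *mask |= PGTBL_PGD_MODIFIED; |
| } |
| |
| static void vmap_toy_range(unsigned long start, unsigned long end) |
| { |
|         unsigned long mask = 0; |
| |
|         pgd_alloc_track(&mask);   /* several entries may be installed */ |
|         pgd_alloc_track(&mask); |
| |
|         /* ...but the sync happens once, after the whole walk. */ |
|         if (mask & ARCH_PAGE_TABLE_SYNC_MASK) |
|                 arch_sync_kernel_mappings(start, end); |
| } |
| |
| int main(void) |
| { |
|         vmap_toy_range(0x1000, 0x2000); |
|         assert(sync_calls == 1);  /* one sync for the whole range */ |
|         return 0; |
| } |
| ``` |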
| |
| This series was hugely inspired by Dave Hansen's suggestion and hence |
| added Suggested-by: Dave Hansen. |
| |
| Cc stable because the lack of this series opens the door to |
| intermittent boot failures. |
| |
| |
| This patch (of 3): |
| |
| Move ARCH_PAGE_TABLE_SYNC_MASK and arch_sync_kernel_mappings() to |
| linux/pgtable.h so that they can be used outside of vmalloc and ioremap. |
| |
| Link: https://lkml.kernel.org/r/20250818020206.4517-1-harry.yoo@oracle.com |
| Link: https://lkml.kernel.org/r/20250818020206.4517-2-harry.yoo@oracle.com |
| Link: https://lore.kernel.org/linux-mm/20250220064105.808339-1-gwan-gyeong.mun@intel.com [1] |
| Link: https://lore.kernel.org/linux-mm/20250311114420.240341-1-gwan-gyeong.mun@intel.com [2] |
| Link: https://lore.kernel.org/linux-mm/d1da214c-53d3-45ac-a8b6-51821c5416e4@intel.com [3] |
| Link: https://lore.kernel.org/linux-mm/4d800744-7b88-41aa-9979-b245e8bf794b@intel.com [4] |
| Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges") |
| Signed-off-by: Harry Yoo <harry.yoo@oracle.com> |
| Acked-by: Kiryl Shutsemau <kas@kernel.org> |
| Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> |
| Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> |
| Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Acked-by: David Hildenbrand <david@redhat.com> |
| Cc: Alexander Potapenko <glider@google.com> |
| Cc: Alistair Popple <apopple@nvidia.com> |
| Cc: Andrey Konovalov <andreyknvl@gmail.com> |
| Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> |
| Cc: Andy Lutomirski <luto@kernel.org> |
| Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> |
| Cc: Anshuman Khandual <anshuman.khandual@arm.com> |
| Cc: Ard Biesheuvel <ardb@kernel.org> |
| Cc: Arnd Bergmann <arnd@arndb.de> |
| Cc: bibo mao <maobibo@loongson.cn> |
| Cc: Borislav Petkov <bp@alien8.de> |
| Cc: Christoph Lameter (Ampere) <cl@gentwo.org> |
| Cc: Dennis Zhou <dennis@kernel.org> |
| Cc: Dev Jain <dev.jain@arm.com> |
| Cc: Dmitry Vyukov <dvyukov@google.com> |
| Cc: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> |
| Cc: Ingo Molnar <mingo@redhat.com> |
| Cc: Jane Chu <jane.chu@oracle.com> |
| Cc: Joao Martins <joao.m.martins@oracle.com> |
| Cc: Joerg Roedel <joro@8bytes.org> |
| Cc: John Hubbard <jhubbard@nvidia.com> |
| Cc: Kevin Brodsky <kevin.brodsky@arm.com> |
| Cc: Liam Howlett <liam.howlett@oracle.com> |
| Cc: Michal Hocko <mhocko@suse.com> |
| Cc: Oscar Salvador <osalvador@suse.de> |
| Cc: Peter Xu <peterx@redhat.com> |
| Cc: Peter Zijlstra <peterz@infradead.org> |
| Cc: Qi Zheng <zhengqi.arch@bytedance.com> |
| Cc: Ryan Roberts <ryan.roberts@arm.com> |
| Cc: Suren Baghdasaryan <surenb@google.com> |
| Cc: Tejun Heo <tj@kernel.org> |
| Cc: Thomas Gleixner <tglx@linutronix.de> |
| Cc: Thomas Huth <thuth@redhat.com> |
| Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> |
| Cc: Vlastimil Babka <vbabka@suse.cz> |
| Cc: Dave Hansen <dave.hansen@linux.intel.com> |
| Cc: <stable@vger.kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| include/linux/pgtable.h | 16 ++++++++++++++++ |
| include/linux/vmalloc.h | 16 ---------------- |
| 2 files changed, 16 insertions(+), 16 deletions(-) |
| |
| --- a/include/linux/pgtable.h~mm-move-page-table-sync-declarations-to-linux-pgtableh |
| +++ a/include/linux/pgtable.h |
| @@ -1467,6 +1467,22 @@ static inline void modify_prot_commit_pt |
| } |
| #endif |
| |
| +/* |
| + * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values |
| + * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() |
| + * needs to be called. |
| + */ |
| +#ifndef ARCH_PAGE_TABLE_SYNC_MASK |
| +#define ARCH_PAGE_TABLE_SYNC_MASK 0 |
| +#endif |
| + |
| +/* |
| + * There is no default implementation for arch_sync_kernel_mappings(). It is |
| + * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK |
| + * is 0. |
| + */ |
| +void arch_sync_kernel_mappings(unsigned long start, unsigned long end); |
| + |
| #endif /* CONFIG_MMU */ |
| |
| /* |
| --- a/include/linux/vmalloc.h~mm-move-page-table-sync-declarations-to-linux-pgtableh |
| +++ a/include/linux/vmalloc.h |
| @@ -220,22 +220,6 @@ int vmap_pages_range(unsigned long addr, |
| struct page **pages, unsigned int page_shift); |
| |
| /* |
| - * Architectures can set this mask to a combination of PGTBL_P?D_MODIFIED values |
| - * and let generic vmalloc and ioremap code know when arch_sync_kernel_mappings() |
| - * needs to be called. |
| - */ |
| -#ifndef ARCH_PAGE_TABLE_SYNC_MASK |
| -#define ARCH_PAGE_TABLE_SYNC_MASK 0 |
| -#endif |
| - |
| -/* |
| - * There is no default implementation for arch_sync_kernel_mappings(). It is |
| - * relied upon the compiler to optimize calls out if ARCH_PAGE_TABLE_SYNC_MASK |
| - * is 0. |
| - */ |
| -void arch_sync_kernel_mappings(unsigned long start, unsigned long end); |
| - |
| -/* |
| * Lowlevel-APIs (not for driver use!) |
| */ |
| |
| _ |