| From: Jann Horn <jannh@google.com> |
| Subject: userfaultfd: fix checks for huge PMDs |
| Date: Tue, 13 Aug 2024 22:25:21 +0200 |
| |
| Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2. |
| |
| The pmd_trans_huge() code in mfill_atomic() is wrong in three different |
| ways depending on kernel version: |
| |
| 1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit |
| the right two race windows) - I've tested this in a kernel build with |
| some extra mdelay() calls. See the commit message for a description |
| of the race scenario. |
| On older kernels (before 6.5), I think the same bug can even |
| theoretically lead to accessing transhuge page contents as a page table |
| if you hit the right 5 narrow race windows (I haven't tested this case). |
| 2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for |
| detecting PMDs that don't point to page tables. |
| On older kernels (before 6.5), you'd just have to win a single fairly |
| wide race to hit this. |
| I've tested this on 6.1 stable by racing migration (with a mdelay() |
| patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86 |
| VM, that causes a kernel oops in ptlock_ptr(). |
| 3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed |
| to yank page tables out from under us (though I haven't tested that), |
| so I think the BUG_ON() checks in mfill_atomic() are just wrong. |
| |
| I decided to write two separate fixes for these (one fix for bugs 1+2, one |
| fix for bug 3), so that the first fix can be backported to kernels |
| affected by bugs 1+2. |
| |
| |
| This patch (of 2): |
| |
| This fixes two issues. |
| |
| I discovered that the following race can occur: |
| |
| mfill_atomic other thread |
| ============ ============ |
| <zap PMD> |
| pmdp_get_lockless() [reads none pmd] |
| <bail if trans_huge> |
| <if none:> |
| <pagefault creates transhuge zeropage> |
| __pte_alloc [no-op] |
| <zap PMD> |
| <bail if pmd_trans_huge(*dst_pmd)> |
| BUG_ON(pmd_none(*dst_pmd)) |
| |
| I have experimentally verified this in a kernel with extra mdelay() calls; |
| the BUG_ON(pmd_none(*dst_pmd)) triggers. |
| |
| On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow |
| pte_offset_map[_lock]() to fail"), this can't lead to anything worse than |
| a BUG_ON(), since the page table access helpers are actually designed to |
| deal with page tables concurrently disappearing; but on older kernels |
| (<=6.4), I think we could probably theoretically race past the two |
| BUG_ON() checks and end up treating a hugepage as a page table. |
| |
| The second issue is that, as Qi Zheng pointed out, there are other types |
| of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs |
| (in particular, migration PMDs). |
| |
| On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a |
| PMD that contains a migration entry (which just requires winning a single, |
| fairly wide race), it will pass the PMD to pte_offset_map_lock(), which |
| assumes that the PMD points to a page table. |
| |
| Breakage follows: First, the kernel tries to take the PTE lock (which will |
| crash or maybe worse if there is no "struct page" for the address bits in |
| the migration entry PMD - I think at least on X86 there usually is no |
| corresponding "struct page" thanks to the PTE inversion mitigation, amd64 |
| looks different). |
| |
| If that didn't crash, the kernel would next try to write a PTE into what |
| it wrongly thinks is a page table. |
| |
| As part of fixing these issues, get rid of the check for pmd_trans_huge() |
| before __pte_alloc() - that's redundant, we're going to have to check for |
| that after the __pte_alloc() anyway. |
| |
| Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels. |
| |
| Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-0-5efa61078a41@google.com |
| Link: https://lkml.kernel.org/r/20240813-uffd-thp-flip-fix-v2-1-5efa61078a41@google.com |
| Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") |
| Signed-off-by: Jann Horn <jannh@google.com> |
| Acked-by: David Hildenbrand <david@redhat.com> |
| Cc: Andrea Arcangeli <aarcange@redhat.com> |
| Cc: Hugh Dickins <hughd@google.com> |
| Cc: Jann Horn <jannh@google.com> |
| Cc: Pavel Emelyanov <xemul@virtuozzo.com> |
| Cc: Qi Zheng <zhengqi.arch@bytedance.com> |
| Cc: <stable@vger.kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/userfaultfd.c | 22 ++++++++++++---------- |
| 1 file changed, 12 insertions(+), 10 deletions(-) |
| |
| --- a/mm/userfaultfd.c~userfaultfd-fix-checks-for-huge-pmds |
| +++ a/mm/userfaultfd.c |
| @@ -787,21 +787,23 @@ retry: |
| } |
| |
| dst_pmdval = pmdp_get_lockless(dst_pmd); |
| - /* |
| - * If the dst_pmd is mapped as THP don't |
| - * override it and just be strict. |
| - */ |
| - if (unlikely(pmd_trans_huge(dst_pmdval))) { |
| - err = -EEXIST; |
| - break; |
| - } |
| if (unlikely(pmd_none(dst_pmdval)) && |
| unlikely(__pte_alloc(dst_mm, dst_pmd))) { |
| err = -ENOMEM; |
| break; |
| } |
| - /* If an huge pmd materialized from under us fail */ |
| - if (unlikely(pmd_trans_huge(*dst_pmd))) { |
| + dst_pmdval = pmdp_get_lockless(dst_pmd); |
| + /* |
| + * If the dst_pmd is THP don't override it and just be strict. |
| + * (This includes the case where the PMD used to be THP and |
| + * changed back to none after __pte_alloc().) |
| + */ |
| + if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) || |
| + pmd_devmap(dst_pmdval))) { |
| + err = -EEXIST; |
| + break; |
| + } |
| + if (unlikely(pmd_bad(dst_pmdval))) { |
| err = -EFAULT; |
| break; |
| } |
| _ |