| From: Suren Baghdasaryan <surenb@google.com> |
| Subject: userfaultfd: do not block on locking a large folio with raised refcount |
| Date: Wed, 26 Feb 2025 10:55:08 -0800 |
| |
| Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock |
| state when it goes into split_folio() with raised folio refcount. |
| split_folio() expects the reference count to be exactly mapcount + |
| num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN |
| otherwise. |
| |
| If multiple processes are trying to move the same large folio, they raise |
| the refcount (all tasks succeed in that) then one of them succeeds in |
| locking the folio, while others will block in folio_lock() while keeping |
| the refcount raised. The winner of this race will proceed with calling |
| split_folio() and will fail returning EAGAIN to the caller and unlocking |
| the folio. The next competing process will get the folio locked and will |
| go through the same flow. In the meantime the original winner will be |
| retried and will block in folio_lock(), getting into the queue of waiting |
| processes only to repeat the same path. All this results in a livelock. |
| |
| An easy fix would be to avoid waiting for the folio lock while holding |
| folio refcount, similar to madvise_free_huge_pmd() where folio lock is |
| acquired before raising the folio refcount. Since we lock and take a |
| refcount of the folio while holding the PTE lock, changing the order of |
| these operations should not break anything. |
| |
| Modify move_pages_pte() to try locking the folio first and if that fails |
| and the folio is large then return EAGAIN without touching the folio |
| refcount. If the folio is single-page then split_folio() is not called, |
| so we don't have this issue. Lokesh has a reproducer [1] and I verified |
| that this change fixes the issue. |
| |
| [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock |
| |
| [akpm@linux-foundation.org: reflow comment to 80 cols, s/end/end up/] |
| Link: https://lkml.kernel.org/r/20250226185510.2732648-2-surenb@google.com |
| Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") |
| Signed-off-by: Suren Baghdasaryan <surenb@google.com> |
| Reported-by: Lokesh Gidra <lokeshgidra@google.com> |
| Reviewed-by: Peter Xu <peterx@redhat.com> |
| Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> |
| Cc: Andrea Arcangeli <aarcange@redhat.com> |
| Cc: Barry Song <21cnbao@gmail.com> |
| Cc: Barry Song <v-songbaohua@oppo.com> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Hugh Dickins <hughd@google.com> |
| Cc: Jann Horn <jannh@google.com> |
| Cc: Kalesh Singh <kaleshsingh@google.com> |
| Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Cc: Matthew Wilcow (Oracle) <willy@infradead.org> |
| Cc: <stable@vger.kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/userfaultfd.c | 17 ++++++++++++++++- |
| 1 file changed, 16 insertions(+), 1 deletion(-) |
| |
| --- a/mm/userfaultfd.c~userfaultfd-do-not-block-on-locking-a-large-folio-with-raised-refcount |
| +++ a/mm/userfaultfd.c |
| @@ -1250,6 +1250,7 @@ retry: |
| */ |
| if (!src_folio) { |
| struct folio *folio; |
| + bool locked; |
| |
| /* |
| * Pin the page while holding the lock to be sure the |
| @@ -1269,12 +1270,26 @@ retry: |
| goto out; |
| } |
| |
| + locked = folio_trylock(folio); |
| + /* |
| + * We avoid waiting for folio lock with a raised |
| + * refcount for large folios because extra refcounts |
| + * will result in split_folio() failing later and |
| + * retrying. If multiple tasks are trying to move a |
| + * large folio we can end up livelocking. |
| + */ |
| + if (!locked && folio_test_large(folio)) { |
| + spin_unlock(src_ptl); |
| + err = -EAGAIN; |
| + goto out; |
| + } |
| + |
| folio_get(folio); |
| src_folio = folio; |
| src_folio_pte = orig_src_pte; |
| spin_unlock(src_ptl); |
| |
| - if (!folio_trylock(src_folio)) { |
| + if (!locked) { |
| pte_unmap(&orig_src_pte); |
| pte_unmap(&orig_dst_pte); |
| src_pte = dst_pte = NULL; |
| _ |