| From: Barry Song <v-songbaohua@oppo.com> |
| Subject: mm: hold PTL from the first PTE while reclaiming a large folio |
| Date: Wed, 6 Mar 2024 22:52:19 +1300 |
| |
| Within try_to_unmap_one(), page_vma_mapped_walk() races with other PTE |
| modifications that first clear the PTE. While iterating over the PTEs of a |
| large folio, it only starts acquiring the PTL from the first valid |
| (present) PTE. Such modifications can temporarily set PTEs to pte_none, |
| so the initial PTEs of a large folio might be skipped in |
| try_to_unmap_one(). |
| |
| For example, for an anon folio, if we skip PTE0, PTE0 may still be |
| present while PTE1 ~ PTE(nr_pages - 1) have become swap entries after |
| try_to_unmap_one(). |
| |
| So the folio remains mapped, fails to be reclaimed, and is put back on |
| the LRU in this round. |
| |
| This also breaks up PTE optimizations such as CONT-PTE on this large folio |
| and may lead to an accidental folio_split() afterwards. And since some of |
| the PTEs are now swap entries, accessing those parts incurs the overhead |
| of do_swap_page(). Although the kernel can withstand all of the above |
| issues, the situation is still awkward and worth improving. |
| |
| The same race also occurs with small folios, but they have only one PTE, |
| so they cannot be partially unmapped. |
| |
| This patch holds PTL from PTE0, allowing us to avoid reading PTE values |
| that are in the process of being transformed. With stable PTE values, we |
| can ensure that this large folio is either completely reclaimed or that |
| all PTEs remain untouched in this round. |
| |
| A corner case is that if we hold PTL from PTE0 and most initial PTEs have |
| already been unmapped before that, we may increase the duration of holding |
| PTL. Thus we only apply this optimization to folios which are still |
| entirely mapped (not on the deferred_split list). |
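
The effect of locking from PTE0 can be illustrated with a small userspace
model. This is only a sketch, not kernel code: enum pte_state, toy_unmap(),
toy_run() and toy_selfcheck() are hypothetical names, and the race is
modeled as a fixed lockless snapshot in which PTE0 transiently reads as
none while the remaining PTEs are present.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for real PTE contents; not kernel types. */
enum pte_state { PTE_NONE, PTE_PRESENT, PTE_SWAP };

#define NR_PAGES 4

/*
 * Toy walk over one large folio. racy[] is what a lockless reader sees
 * (a writer has cleared PTE0 and not yet written the new value);
 * pte[] holds the settled values seen once the lock is held.
 * Returns the number of PTEs still present afterwards.
 */
int toy_unmap(const enum pte_state *racy, enum pte_state *pte,
	      int nr, bool sync)
{
	bool locked = sync;	/* sync: take the lock before PTE0 */
	int still_present = 0;

	for (int i = 0; i < nr; i++) {
		if (!locked) {
			if (racy[i] == PTE_NONE)
				continue;	/* skipped without the lock */
			locked = true;	/* first present PTE: lock from here */
		}
		if (pte[i] == PTE_PRESENT)
			pte[i] = PTE_SWAP;	/* unmap under the lock */
	}
	for (int i = 0; i < nr; i++)
		if (pte[i] == PTE_PRESENT)
			still_present++;
	return still_present;
}

/* Run the model once; PTE0 is transiently none in the lockless view. */
int toy_run(bool sync)
{
	const enum pte_state racy[NR_PAGES] = {
		PTE_NONE, PTE_PRESENT, PTE_PRESENT, PTE_PRESENT
	};
	enum pte_state pte[NR_PAGES] = {
		PTE_PRESENT, PTE_PRESENT, PTE_PRESENT, PTE_PRESENT
	};

	return toy_unmap(racy, pte, NR_PAGES, sync);
}

/* Self-check of the two outcomes described in the text. */
void toy_selfcheck(void)
{
	assert(toy_run(false) == 1);	/* PTE0 skipped: folio still mapped */
	assert(toy_run(true) == 0);	/* locked from PTE0: fully unmapped */
}
```

In this model toy_run(false) leaves one PTE present, so the folio stays
mapped, whereas toy_run(true) converts every PTE, mirroring the
all-or-nothing outcome this patch aims for.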
| |
| [akpm@linux-foundation.org: rewrap comment, per Matthew] |
| Link: https://lkml.kernel.org/r/20240306095219.71086-1-21cnbao@gmail.com |
| Signed-off-by: Barry Song <v-songbaohua@oppo.com> |
| Acked-by: David Hildenbrand <david@redhat.com> |
| Cc: Hugh Dickins <hughd@google.com> |
| Cc: Chris Li <chrisl@kernel.org> |
| Cc: Chuanhua Han <hanchuanhua@oppo.com> |
| Cc: Gao Xiang <xiang@kernel.org> |
| Cc: Huang, Ying <ying.huang@intel.com> |
| Cc: Kefeng Wang <wangkefeng.wang@huawei.com> |
| Cc: Matthew Wilcox (Oracle) <willy@infradead.org> |
| Cc: Michal Hocko <mhocko@suse.com> |
| Cc: Ryan Roberts <ryan.roberts@arm.com> |
| Cc: Yang Shi <shy828301@gmail.com> |
| Cc: Yu Zhao <yuzhao@google.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/vmscan.c | 14 ++++++++++++++ |
| 1 file changed, 14 insertions(+) |
| |
| --- a/mm/vmscan.c~mm-hold-ptl-from-the-first-pte-while-reclaiming-a-large-folio |
| +++ a/mm/vmscan.c |
| @@ -1257,6 +1257,20 @@ retry: |
| |
| if (folio_test_pmd_mappable(folio)) |
| flags |= TTU_SPLIT_HUGE_PMD; |
| + /* |
| + * Without TTU_SYNC, try_to_unmap will only begin to |
| + * hold PTL from the first present PTE within a large |
| + * folio. Some initial PTEs might be skipped due to |
| + * races with parallel PTE writes in which PTEs can be |
| + * cleared temporarily before being written new present |
| + * values. This can leave a large folio still mapped |
| + * while some of its subpages have already been |
| + * unmapped after try_to_unmap; TTU_SYNC helps |
| + * try_to_unmap acquire PTL from the first PTE, |
| + * eliminating the influence of temporary PTE values. |
| + */ |
| + if (folio_test_large(folio) && list_empty(&folio->_deferred_list)) |
| + flags |= TTU_SYNC; |
| |
| try_to_unmap(folio, flags); |
| if (folio_mapped(folio)) { |
| _ |