| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-53079: mm/thp: fix deferred split unqueue naming and locking |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| mm/thp: fix deferred split unqueue naming and locking |
| |
| Recent changes are putting more pressure on THP deferred split queues: |
| under load revealing long-standing races, causing list_del corruptions, |
| "Bad page state"s and worse (I keep BUGs in both of those, so usually |
| don't get to see how badly they end up without). The relevant recent |
| changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, |
| improved swap allocation, and underused THP splitting. |
| |
| Before fixing locking: rename misleading folio_undo_large_rmappable(), |
| which does not undo large_rmappable, to folio_unqueue_deferred_split(), |
| which is what it does. But that and its out-of-line __callee are mm |
| internals of very limited usability: add comment and WARN_ON_ONCEs to |
| check usage; and return a bool to say if a deferred split was unqueued, |
| which can then be used in WARN_ON_ONCEs around safety checks (sparing |
| callers the arcane conditionals in __folio_unqueue_deferred_split()). |
| |
| Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all |
| of whose callers now call it beforehand (and if any forget then bad_page() |
| will tell) - except for its caller put_pages_list(), which itself no |
| longer has any callers (and will be deleted separately). |
| |
| Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 |
| without checking and unqueueing a THP folio from deferred split list; |
| which is unfortunate, since the split_queue_lock depends on the memcg |
| (when memcg is enabled); so swapout has been unqueueing such THPs later, |
| when freeing the folio, using the pgdat's lock instead: potentially |
| corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 |
| here, so no problem with calling folio_unqueue_deferred_split() before |
| resetting memcg_data. |
| |
| That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split |
| shrinker memcg aware"): which included a check on swapcache before adding |
| to deferred queue, but no check on deferred queue before adding THP to |
| swapcache. That worked fine with the usual sequence of events in reclaim |
| (though there were a couple of rare ways in which a THP on deferred queue |
| could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split |
| underused THPs") avoids splitting underused THPs in reclaim, which makes |
| swapcache THPs on deferred queue commonplace. |
| |
| Keep the check on swapcache before adding to deferred queue? Yes: it is |
| no longer essential, but preserves the existing behaviour, and is likely |
| to be a worthwhile optimization (vmstat showed much more traffic on the |
| queue under swapping load if the check was removed); update its comment. |
| |
| Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing |
| folio->memcg_data without checking and unqueueing a THP folio from the |
| deferred list, sometimes corrupting "from" memcg's list, like swapout. |
| Refcount is non-zero here, so folio_unqueue_deferred_split() can only be |
| used in a WARN_ON_ONCE to validate the fix, which must be done earlier: |
| mem_cgroup_move_charge_pte_range() first try to split the THP (splitting |
| of course unqueues), or skip it if that fails. Not ideal, but moving |
| charge has been requested, and khugepaged should repair the THP later: |
| nobody wants new custom unqueueing code just for this deprecated case. |
| |
| The 87eaceb3faa5 commit did have the code to move from one deferred list |
| to another (but was not conscious of its unsafety while refcount non-0); |
| but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care |
| deferred split queue in memcg charge move path"), which argued that the |
| existence of a PMD mapping guarantees that the THP cannot be on a deferred |
| list. As above, false in rare cases, and now commonly false. |
| |
| Backport to 6.11 should be straightforward. Earlier backports must take |
| care that other _deferred_list fixes and dependencies are included. There |
| is not a strong case for backports, but they can fix cornercases. |
| |
| The Linux kernel CVE team has assigned CVE-2024-53079 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.4 with commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe and fixed in 6.6.62 with commit fc4951c3e3358dd82ea508e893695b916c813f17 |
| Issue introduced in 5.4 with commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe and fixed in 6.11.8 with commit afb1352d06b1b6b2cfd1f901c766a430c87078b3 |
| Issue introduced in 5.4 with commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe and fixed in 6.12 with commit f8f931bba0f92052cf842b7e30917b1afcc77d5a |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-53079 |
| will be updated if fixes are backported, please check that for the most |
| up to date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| mm/huge_memory.c |
| mm/internal.h |
| mm/memcontrol-v1.c |
| mm/memcontrol.c |
| mm/migrate.c |
| mm/page_alloc.c |
| mm/swap.c |
| mm/vmscan.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/fc4951c3e3358dd82ea508e893695b916c813f17 |
| https://git.kernel.org/stable/c/afb1352d06b1b6b2cfd1f901c766a430c87078b3 |
| https://git.kernel.org/stable/c/f8f931bba0f92052cf842b7e30917b1afcc77d5a |