| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2021-47209: sched/fair: Prevent dead task groups from regaining cfs_rq's |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| sched/fair: Prevent dead task groups from regaining cfs_rq's |
| |
| Kevin is reporting crashes which point to a use-after-free of a cfs_rq |
| in update_blocked_averages(). Initial debugging revealed that there are |
| live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in |
| free_fair_sched_group(). However, it was unclear how that can happen. |
| |
| His kernel config happened to lead to a layout of struct sched_entity |
| that put the 'my_q' member directly into the middle of the object |
| which makes it incidentally overlap with SLUB's freelist pointer. |
| That, in combination with SLAB_FREELIST_HARDENED's freelist pointer |
| mangling, leads to a reliable access violation in the form of a #GP which |
| made the UAF fail fast. |
| |
| Michal seems to have run into the same issue[1]. He already correctly |
| diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert |
| cfs_rq's to list on unthrottle") is causing the preconditions for the |
| UAF to happen by re-adding cfs_rq's also to task groups that have no |
| more running tasks, i.e. also to dead ones. His analysis, however, |
| misses the real root cause and it cannot be seen from the crash |
| backtrace only, as the real offender is tg_unthrottle_up() getting |
| called via sched_cfs_period_timer() via the timer interrupt at an |
| inconvenient time. |
| |
| When unregister_fair_sched_group() unlinks all cfs_rq's from the dying |
| task group, it doesn't protect itself from getting interrupted. If the |
| timer interrupt triggers while we iterate over all CPUs or after |
| unregister_fair_sched_group() has finished but prior to unlinking the |
| task group, sched_cfs_period_timer() will execute and walk the list of |
| task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the |
| dying task group. These will later -- in free_fair_sched_group() -- be |
| kfree()'d while still being linked, leading to the fireworks Kevin |
| and Michal are seeing. |
| |
| To fix this race, ensure the dying task group gets unlinked first. |
| However, simply switching the order of unregistering and unlinking the |
| task group isn't sufficient, as concurrent RCU walkers might still see |
| it, as can be seen below: |
| |
| CPU1: CPU2: |
| : timer IRQ: |
| : do_sched_cfs_period_timer(): |
| : : |
| : distribute_cfs_runtime(): |
| : rcu_read_lock(); |
| : : |
| : unthrottle_cfs_rq(): |
| sched_offline_group(): : |
| : walk_tg_tree_from(…,tg_unthrottle_up,…): |
| list_del_rcu(&tg->list); : |
| (1) : list_for_each_entry_rcu(child, &parent->children, siblings) |
| : : |
| (2) list_del_rcu(&tg->siblings); : |
| : tg_unthrottle_up(): |
| unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)]; |
| : : |
| list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); : |
| : : |
| : if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running) |
| (3) : list_add_leaf_cfs_rq(cfs_rq); |
| : : |
| : : |
| : : |
| : : |
| : : |
| (4) : rcu_read_unlock(); |
| |
| CPU 2 walks the task group list in parallel to sched_offline_group(), |
| specifically, it'll read the soon to be unlinked task group entry at |
| (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from |
| still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink |
| all cfs_rq's via list_del_leaf_cfs_rq() in |
| unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of |
| these at (3), which is the cause of the UAF later on. |
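
The interleaving above can be replayed in a single thread to make the ordering problem concrete. The following is only a toy sketch, not kernel code: the `toy_` types and functions are hypothetical stand-ins that keep just an `on_list` flag and a simplified form of the condition shown at step (3).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a cfs_rq; only the fields this sketch needs. */
struct toy_cfs_rq {
    bool on_list;    /* mirrors the on_list=1 state from the report */
    bool decayed;    /* stand-in for cfs_rq_is_decayed() */
    int  nr_running;
};

/* CPU1 side: what list_del_leaf_cfs_rq() achieves, reduced to a flag. */
static void toy_list_del_leaf(struct toy_cfs_rq *rq)
{
    rq->on_list = false;
}

/* CPU2 side: simplified form of the check made in tg_unthrottle_up(). */
static void toy_tg_unthrottle_up(struct toy_cfs_rq *rq)
{
    if (!rq->decayed || rq->nr_running)
        rq->on_list = true;          /* step (3): re-add to the leaf list */
}

/*
 * Replays the interleaving sequentially. Returns true if the cfs_rq is
 * still linked at the point where the task group would be freed.
 */
bool toy_replay_race(bool timer_fires)
{
    struct toy_cfs_rq rq = { .on_list = true, .decayed = false, .nr_running = 0 };

    toy_list_del_leaf(&rq);          /* unregister_fair_sched_group() */
    assert(!rq.on_list);

    if (timer_fires)
        toy_tg_unthrottle_up(&rq);   /* timer IRQ lands after the unlink */

    return rq.on_list;               /* free_fair_sched_group() would run next */
}
```

Calling `toy_replay_race(true)` reports a cfs_rq that is still linked at free time, which corresponds to the precondition for the use-after-free; `toy_replay_race(false)` reports a clean teardown.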
| |
| To prevent this additional race from happening, we need to wait until |
| walk_tg_tree_from() has finished traversing the task groups, i.e. |
| after the RCU read critical section ends in (4). Afterwards we're safe |
| to call unregister_fair_sched_group(), as each new walk won't see the |
| dying task group any more. |
| |
| On top of that, we need to wait yet another RCU grace period after |
| unregister_fair_sched_group() to ensure print_cfs_stats(), which might |
| run concurrently, always sees valid objects, i.e. not already freed |
| ones. |
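
The resulting order (unlink the task group, wait for a grace period, unregister its cfs_rq's, wait for another grace period, free) can be sketched as a chain of deferred callbacks. This is a single-threaded toy model under loud assumptions: `toy_call_rcu()` and `toy_grace_period()` are hypothetical stand-ins for RCU, and the text above does not say which mechanism the actual patch uses to wait for the grace periods.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CB 4

typedef void (*toy_cb_fn)(void *);
struct toy_cb { toy_cb_fn fn; void *arg; };

/* Deferred callbacks: a crude stand-in for RCU callbacks (assumption). */
static struct toy_cb toy_cbs[TOY_MAX_CB];
static size_t toy_ncbs;

static void toy_call_rcu(toy_cb_fn fn, void *arg)
{
    assert(toy_ncbs < TOY_MAX_CB);
    toy_cbs[toy_ncbs].fn = fn;
    toy_cbs[toy_ncbs].arg = arg;
    toy_ncbs++;
}

/* Pretend all earlier readers finished: run what was queued before now. */
static void toy_grace_period(void)
{
    struct toy_cb batch[TOY_MAX_CB];
    size_t n = toy_ncbs;

    for (size_t i = 0; i < n; i++)
        batch[i] = toy_cbs[i];
    toy_ncbs = 0;                 /* callbacks queued below wait one more period */
    for (size_t i = 0; i < n; i++)
        batch[i].fn(batch[i].arg);
}

struct toy_tg { bool visible; bool rq_on_list; bool freed; };

static void toy_free_group(void *p)
{
    struct toy_tg *tg = p;
    assert(!tg->rq_on_list);      /* nothing may still be linked here */
    tg->freed = true;
}

static void toy_unregister_group(void *p)
{
    struct toy_tg *tg = p;
    tg->rq_on_list = false;       /* unlink the cfs_rq's */
    toy_call_rcu(toy_free_group, tg);
}

static void toy_release_group(struct toy_tg *tg)
{
    tg->visible = false;          /* unlink the task group first */
    toy_call_rcu(toy_unregister_group, tg);
}

/* A new tree walk only touches groups it can still see. */
static void toy_walk(struct toy_tg *tg)
{
    if (tg->visible)
        tg->rq_on_list = true;
}

bool toy_fixed_teardown(void)
{
    struct toy_tg tg = { .visible = true, .rq_on_list = true, .freed = false };
    bool linked_before_free;

    toy_ncbs = 0;
    toy_release_group(&tg);
    tg.rq_on_list = true;         /* in-flight walker from before the unlink re-adds */
    toy_grace_period();           /* walker done; unregister runs now */
    toy_walk(&tg);                /* later walks no longer see the group */
    linked_before_free = tg.rq_on_list;
    toy_grace_period();           /* second period; free runs now */
    return tg.freed && !linked_before_free;
}
```

In this model `toy_fixed_teardown()` returns true: the re-add by the in-flight walker is undone by the unregister step, and no later walk can link the cfs_rq again before the free.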
| |
| This patch survives Michal's reproducer[2] for 8h+ now, which used to |
| trigger within minutes before. |
| |
| [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/ |
| [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/ |
| |
| [peterz: shuffle code around a bit] |
| |
| The Linux kernel CVE team has assigned CVE-2021-47209 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.13 with commit a7b359fc6a37faaf472125867c8dc5a068c90982 and fixed in 5.15.5 with commit 512e21c150c1c3ee298852660f3a796e267e62ec |
| Issue introduced in 5.13 with commit a7b359fc6a37faaf472125867c8dc5a068c90982 and fixed in 5.16 with commit b027789e5e50494c2325cc70c8642e7fd6059479 |
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2021-47209 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| kernel/sched/autogroup.c |
| kernel/sched/core.c |
| kernel/sched/fair.c |
| kernel/sched/rt.c |
| kernel/sched/sched.h |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/512e21c150c1c3ee298852660f3a796e267e62ec |
| https://git.kernel.org/stable/c/b027789e5e50494c2325cc70c8642e7fd6059479 |