 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2021-47209: sched/fair: Prevent dead task groups from regaining cfs_rq's

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 sched/fair: Prevent dead task groups from regaining cfs_rq's

 Kevin is reporting crashes which point to a use-after-free of a cfs_rq
 in update_blocked_averages(). Initial debugging revealed live cfs_rq's
 (on_list=1) in an about-to-be-kfree()'d task group in
 free_fair_sched_group(). However, it was unclear how that could happen.

 His kernel config happened to lead to a layout of struct sched_entity
 that put the 'my_q' member directly into the middle of the object
 which makes it incidentally overlap with SLUB's freelist pointer.
 That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
 mangling, leads to a reliable access violation in the form of a #GP which
 made the UAF fail fast.
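
 As a rough illustration of why the overlap made the bug fail fast, the
 following is a minimal sketch, assuming a 64-bit machine and assuming
 the hardened freelist pointer is stored as the real pointer XORed with
 a per-cache secret and the byte-swapped storage address. All toy_*
 names and the constants in the usage note are hypothetical, and
 __builtin_bswap64() is a GCC/Clang builtin:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of hardened freelist pointer mangling (assumption: stored
 * value = ptr ^ secret ^ byte-swapped slot address). XOR makes the
 * operation its own inverse.
 */
static uint64_t toy_freelist_mangle(uint64_t ptr, uint64_t secret,
                                    uint64_t slot_addr)
{
    return ptr ^ secret ^ __builtin_bswap64(slot_addr);
}

/*
 * Canonical-address check for x86-64 with 48-bit virtual addresses
 * (assumption): bits 63..47 must be all zeros or all ones.
 */
static bool toy_is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;

    return top == 0 || top == 0x1ffff;
}
```

 A stale reader that interprets the mangled slot as 'my_q' sees, for
 example, toy_freelist_mangle(0xffff888012345680, 0x5a5a5a5a5a5a5a5a,
 0xffff888012345600), which toy_is_canonical() rejects. Dereferencing
 such a value faults right away instead of silently corrupting memory.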

 Michal seems to have run into the same issue[1]. He already correctly
 diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
 cfs_rq's to list on unthrottle") is causing the preconditions for the
 UAF to happen by re-adding cfs_rq's also to task groups that have no
 more running tasks, i.e. also to dead ones. His analysis, however,
 misses the real root cause and it cannot be seen from the crash
 backtrace only, as the real offender is tg_unthrottle_up() getting
 called via sched_cfs_period_timer() via the timer interrupt at an
 inconvenient time.

 When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
 task group, it doesn't protect itself from getting interrupted. If the
 timer interrupt triggers while we iterate over all CPUs or after
 unregister_fair_sched_group() has finished but prior to unlinking the
 task group, sched_cfs_period_timer() will execute and walk the list of
 task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
 dying task group. These will later -- in free_fair_sched_group() -- be
 kfree()'ed while still being linked, leading to the fireworks Kevin
 and Michal are seeing.
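
 The window described above can be modelled with a small,
 single-threaded sketch. It is not kernel code: the toy_* types and
 functions are hypothetical stand-ins, only one CPU's cfs_rq is
 tracked, and the timer re-adds unconditionally, whereas the real
 tg_unthrottle_up() only does so for non-decayed or running cfs_rq's:

```c
#include <stdbool.h>

/* Toy stand-ins, not the kernel's definitions. */
struct toy_cfs_rq {
    bool on_list;             /* linked into the leaf cfs_rq list */
};

struct toy_tg {
    bool linked;              /* still on the task group list */
    struct toy_cfs_rq cfs_rq; /* a single CPU's cfs_rq, for brevity */
};

/* Models the period timer: re-adds the cfs_rq of any group it can find. */
static void toy_period_timer(struct toy_tg *tg)
{
    if (tg->linked)
        tg->cfs_rq.on_list = true;
}

/*
 * Runs the pre-fix teardown order, optionally with the timer firing
 * between unregistering the cfs_rq and unlinking the group. Returns
 * true if the cfs_rq is still on the leaf list when the group would
 * be freed.
 */
static bool toy_old_teardown_leaves_cfs_rq_linked(bool timer_fires)
{
    struct toy_tg tg = { .linked = true, .cfs_rq = { .on_list = true } };

    tg.cfs_rq.on_list = false;     /* unregister_fair_sched_group() */
    if (timer_fires)
        toy_period_timer(&tg);     /* timer interrupt in the window */
    tg.linked = false;             /* group gets unlinked too late */

    return tg.cfs_rq.on_list;      /* state at free_fair_sched_group() */
}
```

 Calling toy_old_teardown_leaves_cfs_rq_linked(true) reports a cfs_rq
 that is still linked at free time, while passing false does not,
 which mirrors why the crash depends on the timer firing at an
 inconvenient moment.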

 To fix this race, ensure the dying task group gets unlinked first.
 However, simply switching the order of unregistering and unlinking the
 task group isn't sufficient, as concurrent RCU walkers might still see
 it, as can be seen below:

     CPU1:                                      CPU2:
       :                                        timer IRQ:
       :                                          do_sched_cfs_period_timer():
       :                                            :
       :                                            distribute_cfs_runtime():
       :                                              rcu_read_lock();
       :                                              :
       :                                              unthrottle_cfs_rq():
     sched_offline_group():                             :
       :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
       list_del_rcu(&tg->list);                           :
  (1)  :                                                  list_for_each_entry_rcu(child, &parent->children, siblings)
       :                                                    :
  (2)  list_del_rcu(&tg->siblings);                         :
       :                                                    tg_unthrottle_up():
       unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
         :                                                    :
         list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
         :                                                    :
         :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
  (3)    :                                                        list_add_leaf_cfs_rq(cfs_rq);
       :                                                      :
       :                                                    :
       :                                                  :
       :                                                :
       :                                              :
  (4)  :                                              rcu_read_unlock();

 CPU 2 walks the task group list in parallel to sched_offline_group();
 specifically, it'll read the soon-to-be-unlinked task group entry at
 (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
 still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
 all cfs_rq's via list_del_leaf_cfs_rq() in
 unregister_fair_sched_group().  Meanwhile CPU 2 will re-add some of
 these at (3), which is the cause of the UAF later on.

 To prevent this additional race from happening, we need to wait until
 walk_tg_tree_from() has finished traversing the task groups, i.e.
 after the RCU read critical section ends at (4). Afterwards we're safe
 to call unregister_fair_sched_group(), as each new walk won't see the
 dying task group any more.
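
 The effect of waiting for in-flight readers can be sketched with a
 deterministic model of the interleaving in the diagram. This is only
 an illustration under simplifying assumptions: the model_* names are
 hypothetical, there is a single reader, and the grace period is
 modelled by letting that reader run to completion:

```c
#include <stdbool.h>
#include <stddef.h>

struct model_tg {
    bool linked;    /* reachable by new list walks */
    bool on_list;   /* its cfs_rq is on the leaf list */
};

/* A reader that may hold a pointer obtained before the unlink. */
struct model_reader {
    struct model_tg *seen;  /* NULL once the read-side section ended */
};

/* rcu_read_lock() plus list walk: only linked groups get picked up. */
static void model_reader_begin(struct model_reader *r, struct model_tg *tg)
{
    r->seen = tg->linked ? tg : NULL;
}

/* tg_unthrottle_up() plus rcu_read_unlock(): re-add, drop the pointer. */
static void model_reader_finish(struct model_reader *r)
{
    if (r->seen)
        r->seen->on_list = true;
    r->seen = NULL;
}

/* Stand-in for a grace period: the in-flight reader runs to completion. */
static void model_wait_grace_period(struct model_reader *r)
{
    model_reader_finish(r);
}

/*
 * Tears down a group while a reader is in flight. Returns true if no
 * cfs_rq is left on the leaf list when the group would be freed.
 */
static bool model_teardown_is_safe(bool wait_for_readers)
{
    struct model_tg tg = { .linked = true, .on_list = true };
    struct model_reader r;

    model_reader_begin(&r, &tg);      /* (1) reader picks up the group */
    tg.linked = false;                /* (2) list_del_rcu() */
    if (wait_for_readers)
        model_wait_grace_period(&r);  /* wait until (4) has passed */
    tg.on_list = false;               /* unregister_fair_sched_group() */
    model_reader_finish(&r);          /* (3) late re-add if not waited for */

    return !tg.on_list;
}
```

 In this model, model_teardown_is_safe(false) reproduces the late
 re-add at (3), whereas model_teardown_is_safe(true) leaves nothing
 linked, because the only reader that could still see the group has
 finished before the cfs_rq is unlinked.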

 On top of that, we need to wait yet another RCU grace period after
 unregister_fair_sched_group() to ensure print_cfs_stats(), which might
 run concurrently, always sees valid objects, i.e. not already freed
 ones.

 This patch survives Michal's reproducer[2] for 8h+ now, which used to
 trigger within minutes before.

   [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
   [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/

 [peterz: shuffle code around a bit]

 The Linux kernel CVE team has assigned CVE-2021-47209 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 5.13 with commit a7b359fc6a37faaf472125867c8dc5a068c90982 and fixed in 5.15.5 with commit 512e21c150c1c3ee298852660f3a796e267e62ec
 	Issue introduced in 5.13 with commit a7b359fc6a37faaf472125867c8dc5a068c90982 and fixed in 5.16 with commit b027789e5e50494c2325cc70c8642e7fd6059479
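
 For scripted checks, the ranges listed above can be expressed as a
 simple version comparison. This is a minimal sketch that assumes
 upstream version numbers only; it knows nothing about distribution
 backports, and the function names are hypothetical:

```c
#include <stdbool.h>

/* Orders two kernel versions; returns <0, 0 or >0 like strcmp(). */
static int kver_cmp(int a_major, int a_minor, int a_patch,
                    int b_major, int b_minor, int b_patch)
{
    if (a_major != b_major)
        return a_major - b_major;
    if (a_minor != b_minor)
        return a_minor - b_minor;
    return a_patch - b_patch;
}

/*
 * True if an upstream release falls into the listed range: introduced
 * in 5.13, fixed in 5.15.5 (stable) and in 5.16 (mainline).
 * Assumption: no vendor backports are applied.
 */
static bool cve_2021_47209_affects(int major, int minor, int patch)
{
    return kver_cmp(major, minor, patch, 5, 13, 0) >= 0 &&
           kver_cmp(major, minor, patch, 5, 15, 5) < 0;
}
```

 For instance, cve_2021_47209_affects(5, 15, 4) is true while
 cve_2021_47209_affects(5, 15, 5) is false. The CVE record remains the
 authoritative source if further backports appear.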

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2021-47209
 will be updated if fixes are backported; please check it for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	kernel/sched/autogroup.c
 	kernel/sched/core.c
 	kernel/sched/fair.c
 	kernel/sched/rt.c
 	kernel/sched/sched.h


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/512e21c150c1c3ee298852660f3a796e267e62ec
 	https://git.kernel.org/stable/c/b027789e5e50494c2325cc70c8642e7fd6059479