| From bippy-8e903de6a542 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-53054: cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction |
| |
The following hung_task problem was found:
| |
| INFO: task kworker/0:0:8 blocked for more than 327 seconds. |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| Workqueue: events cgroup_bpf_release |
| Call Trace: |
| <TASK> |
| __schedule+0x5a2/0x2050 |
| ? find_held_lock+0x33/0x100 |
| ? wq_worker_sleeping+0x9e/0xe0 |
| schedule+0x9f/0x180 |
| schedule_preempt_disabled+0x25/0x50 |
| __mutex_lock+0x512/0x740 |
| ? cgroup_bpf_release+0x1e/0x4d0 |
| ? cgroup_bpf_release+0xcf/0x4d0 |
| ? process_scheduled_works+0x161/0x8a0 |
| ? cgroup_bpf_release+0x1e/0x4d0 |
| ? mutex_lock_nested+0x2b/0x40 |
| ? __pfx_delay_tsc+0x10/0x10 |
| mutex_lock_nested+0x2b/0x40 |
| cgroup_bpf_release+0xcf/0x4d0 |
| ? process_scheduled_works+0x161/0x8a0 |
| ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 |
| ? process_scheduled_works+0x161/0x8a0 |
| process_scheduled_works+0x23a/0x8a0 |
| worker_thread+0x231/0x5b0 |
| ? __pfx_worker_thread+0x10/0x10 |
| kthread+0x14d/0x1c0 |
| ? __pfx_kthread+0x10/0x10 |
| ret_from_fork+0x59/0x70 |
| ? __pfx_kthread+0x10/0x10 |
| ret_from_fork_asm+0x1b/0x30 |
| </TASK> |
| |
This issue can be reproduced by the following pressure test:
1. Delete a large number of cpuset cgroups.
2. Set a CPU online and offline repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.
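
As a rough illustration only (this is not the original scripts; the cgroup
mount point, CPU number, and loop counts are assumptions, and the steps must
run concurrently and repeatedly to hit the window), the pressure test amounts
to something like:

  /*
   * Illustrative reproducer sketch -- not the original scripts. Paths,
   * CPU number, and counts are assumptions; run the loops concurrently
   * (e.g. as separate processes) and repeatedly to hit the race.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static void write_str(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);

      if (fd >= 0) {
          if (write(fd, val, strlen(val)) < 0)
              perror(path);  /* best-effort; keep going */
          close(fd);
      }
  }

  int main(void)
  {
      char path[128];

      /* Step 1: create and delete many cpuset cgroups; each deletion
       * queues a cgroup_bpf_release work on system_wq. */
      for (int i = 0; i < 2000; i++) {
          snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/t%d", i);
          mkdir(path, 0755);
          rmdir(path);
      }

      for (int i = 0; i < 100; i++) {
          /* Step 2: CPU offline/online needs cpu_hotplug_lock.write. */
          write_str("/sys/devices/system/cpu/cpu1/online", "0");
          write_str("/sys/devices/system/cpu/cpu1/online", "1");

          /* Step 3: rewriting watchdog_thresh holds
           * cpu_hotplug_lock.read and queues sscs.work on system_wq. */
          write_str("/proc/sys/kernel/watchdog_thresh", "10");
      }
      return 0;
  }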
| |
This issue occurs because cgroup_mutex and cpu_hotplug_lock are acquired
in different tasks, which can lead to a deadlock through the following
steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into the inactive queue. As illustrated in the diagram below, there are
256 (in the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and puts an
smp_call_on_cpu work into system_wq (see the abridged smp_call_on_cpu()
excerpt after the diagram). However, step 1 has already filled
system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work'
has to wait until the works that were put into the inactive queue
earlier have executed (n cgroup_bpf_release works), so it will be
blocked for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. The cpusets that were deleted at step 1 put cgroup_release works into
cgroup_destroy_wq. These works compete for cgroup_mutex the whole time.
When cgroup_mutex is acquired by the work at css_killed_work_fn, it
calls cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
However, cpuset_css_offline is blocked by step 3.
5. At this moment, all 256 works in the active queue are
cgroup_bpf_release works attempting to acquire cgroup_mutex, so all of
them are blocked. Consequently, sscs.work cannot be executed.
Ultimately, this leaves four processes blocked on one another, forming
a deadlock.
| |
system_wq(step1)         WatchDog(step2)          cpu offline(step3)      cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                         __lockup_detector_reconfigure
                         P(cpu_hotplug_lock.read)
                         put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work waits to be executed
                         waiting for sscs.work to finish
                                                  percpu_down_write
                                                  P(cpu_hotplug_lock.write)
                                                  ...blocking...
                                                                          css_killed_work_fn
                                                                          P(cgroup_mutex)
                                                                          cpuset_css_offline
                                                                          P(cpu_hotplug_lock.read)
                                                                          ...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...
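
Step 2 blocks because smp_call_on_cpu() queues an on-stack work item on
system_wq and then waits for it to complete. Roughly, abridged from
kernel/smp.c (callback and struct details elided):

  int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
  {
      struct smp_call_on_cpu_struct sscs = {
          .done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
          .func = func,
          .data = par,
          .cpu  = phys ? cpu : -1,
      };

      INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);

      if (cpu >= nr_cpu_ids || !cpu_online(cpu))
          return -ENXIO;

      /* The on-stack work lands on system_wq; if all WQ_DFL_ACTIVE(256)
       * slots are already held by cgroup_bpf_release works, 'sscs.work'
       * sits in the inactive queue... */
      queue_work_on(cpu, system_wq, &sscs.work);
      /* ...while the caller sleeps here, still holding cpu_hotplug_lock.read. */
      wait_for_completion(&sscs.done);
      destroy_work_on_stack(&sscs.work);

      return sscs.ret;
  }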
| |
To fix the problem, place the cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop. System wqs are for miscellaneous
things which shouldn't create a large number of concurrent work items.
If something is going to generate more than WQ_DFL_ACTIVE(256)
concurrent work items, it should use its own dedicated workqueue.
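
As a sketch of what such a dedicated workqueue can look like in
kernel/bpf/cgroup.c (modeled on the referenced commits; consult them for
the exact change):

  /* kernel/bpf/cgroup.c (sketch): a dedicated, single-threaded workqueue
   * so cgroup bpf destruction can no longer saturate system_wq. */
  static struct workqueue_struct *cgroup_bpf_destroy_wq;

  static int __init cgroup_bpf_wq_init(void)
  {
      cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
      if (!cgroup_bpf_destroy_wq)
          panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
      return 0;
  }
  core_initcall(cgroup_bpf_wq_init);

  static void cgroup_bpf_release_fn(struct percpu_ref *ref)
  {
      struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

      INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
      /* Previously queue_work(system_wq, ...) -- the source of the pileup. */
      queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
  }

With a max_active of 1, the destruction works drain sequentially on their
own queue and, more importantly, no longer occupy the system_wq slots
that smp_call_on_cpu() and other misc users rely on.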
| |
| The Linux kernel CVE team has assigned CVE-2024-53054 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.1.116 with commit 71f14a9f5c7d |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.6.60 with commit 0d86cd70fc6a |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.11.7 with commit 6dab3331523b |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.12 with commit 117932eea99b |
| |
Please see https://www.kernel.org for a full list of kernel versions
currently supported by the kernel community.
| |
Unaffected versions might change over time as fixes are backported to
older supported kernel versions. The official CVE entry at
https://cve.org/CVERecord/?id=CVE-2024-53054
will be updated if fixes are backported; please check it for the most
up-to-date information about this issue.
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| kernel/bpf/cgroup.c |
| |
| |
| Mitigation |
| ========== |
| |
The Linux kernel CVE team recommends that you update to the latest
stable kernel version for this and many other bugfixes. Individual
changes are never tested alone, but rather are part of a larger kernel
release. Cherry-picking individual commits is not recommended or
supported by the Linux kernel community at all. If, however, updating to
the latest release is impossible, the individual changes to resolve this
issue can be found at these commits:
| https://git.kernel.org/stable/c/71f14a9f5c7db72fdbc56e667d4ed42a1a760494 |
| https://git.kernel.org/stable/c/0d86cd70fc6a7ba18becb52ad8334d5ad3eca530 |
| https://git.kernel.org/stable/c/6dab3331523ba73db1345d19e6f586dcd5f6efb4 |
| https://git.kernel.org/stable/c/117932eea99b729ee5d12783601a4f7f5fd58a23 |