| From bippy-8e903de6a542 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-53054: cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction |
| |
The following hung_task problem was found:
| |
| INFO: task kworker/0:0:8 blocked for more than 327 seconds. |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| Workqueue: events cgroup_bpf_release |
| Call Trace: |
| <TASK> |
| __schedule+0x5a2/0x2050 |
| ? find_held_lock+0x33/0x100 |
| ? wq_worker_sleeping+0x9e/0xe0 |
| schedule+0x9f/0x180 |
| schedule_preempt_disabled+0x25/0x50 |
| __mutex_lock+0x512/0x740 |
| ? cgroup_bpf_release+0x1e/0x4d0 |
| ? cgroup_bpf_release+0xcf/0x4d0 |
| ? process_scheduled_works+0x161/0x8a0 |
| ? cgroup_bpf_release+0x1e/0x4d0 |
| ? mutex_lock_nested+0x2b/0x40 |
| ? __pfx_delay_tsc+0x10/0x10 |
| mutex_lock_nested+0x2b/0x40 |
| cgroup_bpf_release+0xcf/0x4d0 |
| ? process_scheduled_works+0x161/0x8a0 |
| ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 |
| ? process_scheduled_works+0x161/0x8a0 |
| process_scheduled_works+0x23a/0x8a0 |
| worker_thread+0x231/0x5b0 |
| ? __pfx_worker_thread+0x10/0x10 |
| kthread+0x14d/0x1c0 |
| ? __pfx_kthread+0x10/0x10 |
| ret_from_fork+0x59/0x70 |
| ? __pfx_kthread+0x10/0x10 |
| ret_from_fork_asm+0x1b/0x30 |
| </TASK> |
| |
This issue can be reproduced by the following pressure test:
1. Delete a large number of cpuset cgroups.
2. Set a CPU online and offline repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.
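
As a rough illustration only (this is not the original scripts; the cgroup
mount point, CPU number, and loop counts are assumptions, and the steps must
run concurrently and repeatedly to hit the window), the pressure test amounts
to something like:

  /*
   * Illustrative reproducer sketch -- not the original scripts. Paths,
   * CPU number, and counts are assumptions; run the loops concurrently
   * (e.g. as separate processes) and repeatedly to hit the race.
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/stat.h>
  #include <unistd.h>

  static void write_str(const char *path, const char *val)
  {
      int fd = open(path, O_WRONLY);

      if (fd >= 0) {
          if (write(fd, val, strlen(val)) < 0)
              perror(path);  /* best-effort; keep going */
          close(fd);
      }
  }

  int main(void)
  {
      char path[128];

      /* Step 1: create and delete many cpuset cgroups; each deletion
       * queues a cgroup_bpf_release work on system_wq. */
      for (int i = 0; i < 2000; i++) {
          snprintf(path, sizeof(path), "/sys/fs/cgroup/cpuset/t%d", i);
          mkdir(path, 0755);
          rmdir(path);
      }

      for (int i = 0; i < 100; i++) {
          /* Step 2: CPU offline/online needs cpu_hotplug_lock.write. */
          write_str("/sys/devices/system/cpu/cpu1/online", "0");
          write_str("/sys/devices/system/cpu/cpu1/online", "1");

          /* Step 3: rewriting watchdog_thresh holds
           * cpu_hotplug_lock.read and queues sscs.work on system_wq. */
          write_str("/proc/sys/kernel/watchdog_thresh", "10");
      }
      return 0;
  }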
| |
This issue occurs because cgroup_mutex and cpu_hotplug_lock are acquired
in different tasks, which can lead to a deadlock through the following
steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into the inactive queue. As illustrated in the diagram below, there are
256 (in the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh holds cpu_hotplug_lock.read and puts an
smp_call_on_cpu work into system_wq (see the abridged smp_call_on_cpu()
excerpt after the diagram). However, step 1 has already filled
system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work'
has to wait until the works that were put into the inactive queue
earlier have executed (n cgroup_bpf_release works), so it will be
blocked for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. The cpusets that were deleted at step 1 put cgroup_release works into
cgroup_destroy_wq. These works compete for cgroup_mutex the whole time.
When cgroup_mutex is acquired by the work at css_killed_work_fn, it
calls cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
However, cpuset_css_offline is blocked by step 3.
5. At this moment, all 256 works in the active queue are
cgroup_bpf_release works attempting to acquire cgroup_mutex, so all of
them are blocked. Consequently, sscs.work cannot be executed.
Ultimately, this leaves four processes blocked on one another, forming
a deadlock.
| |
system_wq(step1)         WatchDog(step2)          cpu offline(step3)      cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                         __lockup_detector_reconfigure
                         P(cpu_hotplug_lock.read)
                         put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work waits to be executed
                         waiting for sscs.work to finish
                                                  percpu_down_write
                                                  P(cpu_hotplug_lock.write)
                                                  ...blocking...
                                                                          css_killed_work_fn
                                                                          P(cgroup_mutex)
                                                                          cpuset_css_offline
                                                                          P(cpu_hotplug_lock.read)
                                                                          ...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...
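
Step 2 blocks because smp_call_on_cpu() queues an on-stack work item on
system_wq and then waits for it to complete. Roughly, abridged from
kernel/smp.c (callback and struct details elided):

  int smp_call_on_cpu(unsigned int cpu, int (*func)(void *), void *par, bool phys)
  {
      struct smp_call_on_cpu_struct sscs = {
          .done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
          .func = func,
          .data = par,
          .cpu  = phys ? cpu : -1,
      };

      INIT_WORK_ONSTACK(&sscs.work, smp_call_on_cpu_callback);

      if (cpu >= nr_cpu_ids || !cpu_online(cpu))
          return -ENXIO;

      /* The on-stack work lands on system_wq; if all WQ_DFL_ACTIVE(256)
       * slots are already held by cgroup_bpf_release works, 'sscs.work'
       * sits in the inactive queue... */
      queue_work_on(cpu, system_wq, &sscs.work);
      /* ...while the caller sleeps here, still holding cpu_hotplug_lock.read. */
      wait_for_completion(&sscs.done);
      destroy_work_on_stack(&sscs.work);

      return sscs.ret;
  }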
| |
To fix the problem, place the cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop. System wqs are for miscellaneous
things which shouldn't create a large number of concurrent work items.
If something is going to generate more than WQ_DFL_ACTIVE(256)
concurrent work items, it should use its own dedicated workqueue.
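
As a sketch of what such a dedicated workqueue can look like in
kernel/bpf/cgroup.c (modeled on the referenced commits; consult them for
the exact change):

  /* kernel/bpf/cgroup.c (sketch): a dedicated, single-threaded workqueue
   * so cgroup bpf destruction can no longer saturate system_wq. */
  static struct workqueue_struct *cgroup_bpf_destroy_wq;

  static int __init cgroup_bpf_wq_init(void)
  {
      cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
      if (!cgroup_bpf_destroy_wq)
          panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
      return 0;
  }
  core_initcall(cgroup_bpf_wq_init);

  static void cgroup_bpf_release_fn(struct percpu_ref *ref)
  {
      struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

      INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
      /* Previously queue_work(system_wq, ...) -- the source of the pileup. */
      queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);
  }

With a max_active of 1, the destruction works drain sequentially on their
own queue and, more importantly, no longer occupy the system_wq slots
that smp_call_on_cpu() and other misc users rely on.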
| |
| The Linux kernel CVE team has assigned CVE-2024-53054 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.1.116 with commit 71f14a9f5c7d |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.6.60 with commit 0d86cd70fc6a |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.11.7 with commit 6dab3331523b |
| Issue introduced in 5.3 with commit 4bfc0bb2c60e and fixed in 6.12 with commit 117932eea99b |
| |
Please see https://www.kernel.org for a full list of kernel versions
currently supported by the kernel community.
| |
Unaffected versions might change over time as fixes are backported to
older supported kernel versions. The official CVE entry at
https://cve.org/CVERecord/?id=CVE-2024-53054
will be updated if fixes are backported; please check it for the most
up-to-date information about this issue.
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| kernel/bpf/cgroup.c |
| |
| |
| Mitigation |
| ========== |
| |
The Linux kernel CVE team recommends that you update to the latest
stable kernel version for this and many other bugfixes. Individual
changes are never tested alone, but rather are part of a larger kernel
release. Cherry-picking individual commits is not recommended or
supported by the Linux kernel community at all. If, however, updating to
the latest release is impossible, the individual changes to resolve this
issue can be found at these commits:
| https://git.kernel.org/stable/c/71f14a9f5c7db72fdbc56e667d4ed42a1a760494 |
| https://git.kernel.org/stable/c/0d86cd70fc6a7ba18becb52ad8334d5ad3eca530 |
| https://git.kernel.org/stable/c/6dab3331523ba73db1345d19e6f586dcd5f6efb4 |
| https://git.kernel.org/stable/c/117932eea99b729ee5d12783601a4f7f5fd58a23 |