| From bippy-1.2.0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@kernel.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2025-38104: drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV |
| |
| RLCG Register Access is a way for virtual functions to safely access GPU |
| registers in a virtualized environment, including TLB flushes and |
| register reads. When multiple threads or VFs try to access the same |
| registers simultaneously, it can lead to race conditions. By using the |
| RLCG interface, the driver can serialize access to the registers. This |
| means that only one thread can access the registers at a time, |
| preventing conflicts and ensuring that operations are performed |
| correctly. Additionally, when a low-priority task holds a mutex that a |
| high-priority task needs, it can lead to priority inversion, and a |
| thread that already holds a spinlock must not try to acquire a mutex, |
| because taking a mutex may sleep. Register access in |
| amdgpu_virt_rlcg_reg_rw is critical, especially in a fast code path. |
| |
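The serialization described above can be illustrated with a minimal userspace sketch. It is not the amdgpu code: the struct, the fake register file, and every `sketch_*` name are hypothetical, and a C11 `atomic_flag` busy-wait lock stands in for the kernel spinlock (the v2 note in this message says the actual change uses spin_lock_irqsave). The point it shows is that the lock holder never sleeps, so the access is safe to make while another spinning lock is already held.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical size of the fake register file; not taken from the driver. */
#define SKETCH_NUM_REGS 16u

struct sketch_dev {
    atomic_flag reg_lock;            /* busy-wait lock: acquiring it never sleeps */
    uint32_t regs[SKETCH_NUM_REGS];  /* stand-in for the real register window */
};

void sketch_dev_init(struct sketch_dev *dev)
{
    atomic_flag_clear(&dev->reg_lock);
    for (uint32_t i = 0; i < SKETCH_NUM_REGS; i++)
        dev->regs[i] = 0;
}

void sketch_lock(struct sketch_dev *dev)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(&dev->reg_lock,
                                             memory_order_acquire))
        ;
}

void sketch_unlock(struct sketch_dev *dev)
{
    atomic_flag_clear_explicit(&dev->reg_lock, memory_order_release);
}

/*
 * Serialized register read or write. Returns the register content after
 * the operation, or 0 for an out-of-range offset (the lock is not taken
 * in that case).
 */
uint32_t sketch_reg_rw(struct sketch_dev *dev, uint32_t offset,
                       uint32_t value, bool is_write)
{
    uint32_t ret;

    if (offset >= SKETCH_NUM_REGS)
        return 0;

    sketch_lock(dev);
    if (is_write)
        dev->regs[offset] = value;
    ret = dev->regs[offset];
    sketch_unlock(dev);

    return ret;
}
```

A busy-wait lock only suits short critical sections like this one; the trade-off is that waiters burn CPU instead of sleeping.
| |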
| The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being |
| called, which attempts to acquire the mutex. This function is invoked |
| from amdgpu_sriov_wreg, which in turn is called from |
| gmc_v11_0_flush_gpu_tlb. |
| |
| The [ BUG: Invalid wait context ] indicates that a thread is trying to |
| acquire a mutex while it is in a context that does not allow it to sleep |
| (like holding a spinlock). |
| |
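The "invalid wait context" rule can be modelled in a few lines. This is a simplified sketch, not lockdep's implementation: the enum and function names are hypothetical, and the numeric classes are copied from the braces in the log below ({0:0} for the workqueue entries, {2:2} for gmc.invalidate_lock, {3:3} for virt.rlcg_reg_lock). The assumption is that a lock may be taken only if its class does not exceed the strictest class among the locks already held, with class 0 treated as unchecked.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the values mirror the {n:n} classes in the log. */
enum sketch_wait {
    SKETCH_WAIT_UNCHECKED = 0, /* e.g. the workqueue pseudo-locks */
    SKETCH_WAIT_SPIN      = 2, /* spinning lock, holder must not sleep */
    SKETCH_WAIT_SLEEP     = 3  /* sleeping lock such as a mutex */
};

/*
 * Returns true if a lock of class `next` may be acquired while the locks
 * in `held` are owned. A plain task context with no checked locks held is
 * assumed to allow sleeping.
 */
bool sketch_wait_context_ok(const int *held, size_t n, int next)
{
    int strictest = SKETCH_WAIT_SLEEP;

    for (size_t i = 0; i < n; i++) {
        if (held[i] == SKETCH_WAIT_UNCHECKED)
            continue;               /* not part of the check */
        if (held[i] < strictest)
            strictest = held[i];    /* a held spinlock forbids sleeping */
    }
    return next <= strictest;
}
```

With the three held locks from the log (classes 0, 0, 2), requesting class 3 is rejected, while requesting class 2, which is what the spinlock conversion amounts to, is accepted.
| |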
| Fixes the below: |
| |
| [ 253.013423] ============================= |
| [ 253.013434] [ BUG: Invalid wait context ] |
| [ 253.013446] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G U OE |
| [ 253.013464] ----------------------------- |
| [ 253.013475] kworker/0:1/10 is trying to lock: |
| [ 253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] |
| [ 253.013815] other info that might help us debug this: |
| [ 253.013827] context-{4:4} |
| [ 253.013835] 3 locks held by kworker/0:1/10: |
| [ 253.013847] #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 |
| [ 253.013877] #1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 |
| [ 253.013905] #2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu] |
| [ 253.014154] stack backtrace: |
| [ 253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G U OE 6.12.0-amdstaging-drm-next-lol-050225 #14 |
| [ 253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE |
| [ 253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024 |
| [ 253.014224] Workqueue: events work_for_cpu_fn |
| [ 253.014241] Call Trace: |
| [ 253.014250] <TASK> |
| [ 253.014260] dump_stack_lvl+0x9b/0xf0 |
| [ 253.014275] dump_stack+0x10/0x20 |
| [ 253.014287] __lock_acquire+0xa47/0x2810 |
| [ 253.014303] ? srso_alias_return_thunk+0x5/0xfbef5 |
| [ 253.014321] lock_acquire+0xd1/0x300 |
| [ 253.014333] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] |
| [ 253.014562] ? __lock_acquire+0xa6b/0x2810 |
| [ 253.014578] __mutex_lock+0x85/0xe20 |
| [ 253.014591] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] |
| [ 253.014782] ? sched_clock_noinstr+0x9/0x10 |
| [ 253.014795] ? srso_alias_return_thunk+0x5/0xfbef5 |
| [ 253.014808] ? local_clock_noinstr+0xe/0xc0 |
| [ 253.014822] ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] |
| [ 253.015012] ? srso_alias_return_thunk+0x5/0xfbef5 |
| [ 253.015029] mutex_lock_nested+0x1b/0x30 |
| [ 253.015044] ? mutex_lock_nested+0x1b/0x30 |
| [ 253.015057] amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu] |
| [ 253.015249] amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu] |
| [ 253.015435] gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu] |
| [ 253.015667] gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu] |
| [ 253.015901] ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu] |
| [ 253.016159] ? srso_alias_return_thunk+0x5/0xfbef5 |
| [ 253.016173] ? smu_hw_init+0x18d/0x300 [amdgpu] |
| [ 253.016403] amdgpu_device_init+0x29ad/0x36a0 [amdgpu] |
| [ 253.016614] amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu] |
| [ 253.017057] amdgpu_pci_probe+0x1c2/0x660 [amdgpu] |
| [ 253.017493] local_pci_probe+0x4b/0xb0 |
| [ 253.017746] work_for_cpu_fn+0x1a/0x30 |
| [ 253.017995] process_one_work+0x21e/0x680 |
| [ 253.018248] worker_thread+0x190/0x330 |
| [ 253.018500] ? __pfx_worker_thread+0x10/0x10 |
| [ 253.018746] kthread+0xe7/0x120 |
| [ 253.018988] ? __pfx_kthread+0x10/0x10 |
| [ 253.019231] ret_from_fork+0x3c/0x60 |
| [ 253.019468] ? __pfx_kthread+0x10/0x10 |
| [ 253.019701] ret_from_fork_asm+0x1a/0x30 |
| [ 253.019939] </TASK> |
| |
| v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian). |
| |
| The Linux kernel CVE team has assigned CVE-2025-38104 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.13.11 with commit 1c0378830e42c98acd69e0289882c8637d92f285 |
| Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.14.2 with commit 5c1741a0c176ae11675a64cb7f2dd21d72db6b91 |
| Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.15 with commit dc0297f3198bd60108ccbd167ee5d9fa4af31ed0 |
| Issue introduced in 6.1.105 with commit f39a3bc42815a7016a915f6cb35e9a1448788f06 |
| Issue introduced in 6.6.46 with commit 1adb5ebe205e96af77a93512e2d5b8c437548787 |
| Issue introduced in 6.10.5 with commit e1ab38e99d1607f80a1670a399511a56464c0253 |
| |
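The list above can be turned into a quick check for a given upstream version. This is only a sketch under stated assumptions: the function name is hypothetical, it encodes nothing beyond the entries listed in this announcement, it treats branches with no fix listed here (6.1.y, 6.6.y, 6.10.y, 6.11.y, 6.12.y) as still affected, and it knows nothing about later backports or distribution kernels.

```c
#include <stdbool.h>

/*
 * Returns true if upstream kernel major.minor.patch falls inside an
 * affected range according to the list in this announcement only.
 */
bool sketch_cve_2025_38104_affected(int major, int minor, int patch)
{
    if (major != 6)
        return false;

    switch (minor) {
    case 1:
        return patch >= 105;  /* introduced in 6.1.105, no fix listed */
    case 6:
        return patch >= 46;   /* introduced in 6.6.46, no fix listed */
    case 10:
        return patch >= 5;    /* introduced in 6.10.5, no fix listed */
    case 11:
    case 12:
        return true;          /* introduced in 6.11, no fix listed */
    case 13:
        return patch < 11;    /* fixed in 6.13.11 */
    case 14:
        return patch < 2;     /* fixed in 6.14.2 */
    default:
        return false;         /* not listed, or 6.15 and later (fixed) */
    }
}
```

For example, 6.13.10 is reported as affected and 6.13.11 is not. The CVE record linked below remains the authoritative source if the ranges change.
| |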
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2025-38104 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
| drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c |
| drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/1c0378830e42c98acd69e0289882c8637d92f285 |
| https://git.kernel.org/stable/c/5c1741a0c176ae11675a64cb7f2dd21d72db6b91 |
| https://git.kernel.org/stable/c/dc0297f3198bd60108ccbd167ee5d9fa4af31ed0 |