 From bippy-1.2.0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@kernel.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2025-38104: drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 drm/amdgpu: Replace Mutex with Spinlock for RLCG register access to avoid Priority Inversion in SRIOV

 RLCG Register Access is a way for virtual functions to safely access GPU
 registers in a virtualized environment, including TLB flushes and
 register reads. When multiple threads or VFs try to access the same
 registers simultaneously, it can lead to race conditions. By using the
 RLCG interface, the driver can serialize access to the registers. This
 means that only one thread can access the registers at a time,
 preventing conflicts and ensuring that operations are performed
 correctly. Additionally, when a low-priority task holds a mutex that a
 high-priority task needs, i.e., if a thread holding a spinlock tries to
 acquire a mutex, it can lead to priority inversion. Register access in
 amdgpu_virt_rlcg_reg_rw is critical, especially in a fast code path.

 The call stack shows that the function amdgpu_virt_rlcg_reg_rw is being
 called, which attempts to acquire the mutex. This function is invoked
 from amdgpu_sriov_wreg, which in turn is called from
 gmc_v11_0_flush_gpu_tlb.

 The [ BUG: Invalid wait context ] indicates that a thread is trying to
 acquire a mutex while it is in a context that does not allow it to sleep
 (like holding a spinlock).
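
 The rule behind this report can be sketched in a few lines of userspace
 C. This is only a simplified model, not lockdep's actual implementation:
 the two-level wait_kind enum, struct held_locks, may_acquire() and
 rlcg_lock_ok() are all hypothetical names introduced for illustration.
 It assumes, as the trace suggests, that invalidate_lock is a spinning
 lock that is already held when rlcg_reg_lock is taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HELD 8

/* Hypothetical two-level classification; lockdep's real wait types are finer-grained. */
enum wait_kind { WAIT_SPIN, WAIT_SLEEP };

/* Hypothetical record of the locks the current context already holds. */
struct held_locks {
    enum wait_kind kinds[MAX_HELD];
    size_t count;
};

/* A sleeping lock must not be acquired while any spinning lock is held. */
bool may_acquire(const struct held_locks *held, enum wait_kind wanted)
{
    assert(held->count <= MAX_HELD);
    if (wanted == WAIT_SPIN)
        return true;              /* spinning locks never sleep */
    for (size_t i = 0; i < held->count; i++)
        if (held->kinds[i] == WAIT_SPIN)
            return false;         /* would sleep in a non-sleepable context */
    return true;
}

/*
 * Models the reported path: one spinning lock (standing in for
 * invalidate_lock) is held when the RLCG lock is requested.
 */
bool rlcg_lock_ok(enum wait_kind rlcg_lock_kind)
{
    struct held_locks held = { .kinds = { WAIT_SPIN }, .count = 1 };
    return may_acquire(&held, rlcg_lock_kind);
}
```

 In this model rlcg_lock_ok(WAIT_SLEEP) is false, which corresponds to
 the mutex case reported below, while rlcg_lock_ok(WAIT_SPIN) is true,
 which corresponds to the spinlock used by the fix.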

 Fixes the below:

 [  253.013423] =============================
 [  253.013434] [ BUG: Invalid wait context ]
 [  253.013446] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G     U     OE
 [  253.013464] -----------------------------
 [  253.013475] kworker/0:1/10 is trying to lock:
 [  253.013487] ffff9f30542e3cf8 (&adev->virt.rlcg_reg_lock){+.+.}-{3:3}, at: amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
 [  253.013815] other info that might help us debug this:
 [  253.013827] context-{4:4}
 [  253.013835] 3 locks held by kworker/0:1/10:
 [  253.013847]  #0: ffff9f3040050f58 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
 [  253.013877]  #1: ffffb789c008be40 ((work_completion)(&wfc.work)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
 [  253.013905]  #2: ffff9f3054281838 (&adev->gmc.invalidate_lock){+.+.}-{2:2}, at: gmc_v11_0_flush_gpu_tlb+0x198/0x4f0 [amdgpu]
 [  253.014154] stack backtrace:
 [  253.014164] CPU: 0 UID: 0 PID: 10 Comm: kworker/0:1 Tainted: G     U     OE      6.12.0-amdstaging-drm-next-lol-050225 #14
 [  253.014189] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
 [  253.014203] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.1 11/18/2024
 [  253.014224] Workqueue: events work_for_cpu_fn
 [  253.014241] Call Trace:
 [  253.014250]  <TASK>
 [  253.014260]  dump_stack_lvl+0x9b/0xf0
 [  253.014275]  dump_stack+0x10/0x20
 [  253.014287]  __lock_acquire+0xa47/0x2810
 [  253.014303]  ? srso_alias_return_thunk+0x5/0xfbef5
 [  253.014321]  lock_acquire+0xd1/0x300
 [  253.014333]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
 [  253.014562]  ? __lock_acquire+0xa6b/0x2810
 [  253.014578]  __mutex_lock+0x85/0xe20
 [  253.014591]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
 [  253.014782]  ? sched_clock_noinstr+0x9/0x10
 [  253.014795]  ? srso_alias_return_thunk+0x5/0xfbef5
 [  253.014808]  ? local_clock_noinstr+0xe/0xc0
 [  253.014822]  ? amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
 [  253.015012]  ? srso_alias_return_thunk+0x5/0xfbef5
 [  253.015029]  mutex_lock_nested+0x1b/0x30
 [  253.015044]  ? mutex_lock_nested+0x1b/0x30
 [  253.015057]  amdgpu_virt_rlcg_reg_rw+0xf6/0x330 [amdgpu]
 [  253.015249]  amdgpu_sriov_wreg+0xc5/0xd0 [amdgpu]
 [  253.015435]  gmc_v11_0_flush_gpu_tlb+0x44b/0x4f0 [amdgpu]
 [  253.015667]  gfx_v11_0_hw_init+0x499/0x29c0 [amdgpu]
 [  253.015901]  ? __pfx_smu_v13_0_update_pcie_parameters+0x10/0x10 [amdgpu]
 [  253.016159]  ? srso_alias_return_thunk+0x5/0xfbef5
 [  253.016173]  ? smu_hw_init+0x18d/0x300 [amdgpu]
 [  253.016403]  amdgpu_device_init+0x29ad/0x36a0 [amdgpu]
 [  253.016614]  amdgpu_driver_load_kms+0x1a/0xc0 [amdgpu]
 [  253.017057]  amdgpu_pci_probe+0x1c2/0x660 [amdgpu]
 [  253.017493]  local_pci_probe+0x4b/0xb0
 [  253.017746]  work_for_cpu_fn+0x1a/0x30
 [  253.017995]  process_one_work+0x21e/0x680
 [  253.018248]  worker_thread+0x190/0x330
 [  253.018500]  ? __pfx_worker_thread+0x10/0x10
 [  253.018746]  kthread+0xe7/0x120
 [  253.018988]  ? __pfx_kthread+0x10/0x10
 [  253.019231]  ret_from_fork+0x3c/0x60
 [  253.019468]  ? __pfx_kthread+0x10/0x10
 [  253.019701]  ret_from_fork_asm+0x1a/0x30
 [  253.019939]  </TASK>

 v2: s/spin_trylock/spin_lock_irqsave to be safe (Christian).

 The Linux kernel CVE team has assigned CVE-2025-38104 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.13.11 with commit 1c0378830e42c98acd69e0289882c8637d92f285
 	Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.14.2 with commit 5c1741a0c176ae11675a64cb7f2dd21d72db6b91
 	Issue introduced in 6.11 with commit e864180ee49b4d30e640fd1e1d852b86411420c9 and fixed in 6.15 with commit dc0297f3198bd60108ccbd167ee5d9fa4af31ed0
 	Issue introduced in 6.1.105 with commit f39a3bc42815a7016a915f6cb35e9a1448788f06
 	Issue introduced in 6.6.46 with commit 1adb5ebe205e96af77a93512e2d5b8c437548787
 	Issue introduced in 6.10.5 with commit e1ab38e99d1607f80a1670a399511a56464c0253
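
 As a rough aid for reading the list above, the ranges can be written as
 a small C check. This is a sketch under stated assumptions, not an
 official tool: the function name is hypothetical, it encodes only the
 lines listed in this announcement, and it treats a branch with no fix
 listed (6.1.y, 6.6.y, 6.10.y, 6.11.y, 6.12.y) as still affected. The
 official CVE record may list fixes that are not reflected here.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: returns true if this announcement lists the given
 * kernel version as affected. Encodes only the ranges printed above.
 */
bool cve_2025_38104_listed_affected(int major, int minor, int patch)
{
    if (major != 6)
        return false;              /* no other major versions are listed */

    switch (minor) {
    case 1:
        return patch >= 105;       /* introduced in 6.1.105, no fix listed */
    case 6:
        return patch >= 46;        /* introduced in 6.6.46, no fix listed */
    case 10:
        return patch >= 5;         /* introduced in 6.10.5, no fix listed */
    case 11:
    case 12:
        return true;               /* introduced in 6.11, no fix listed here */
    case 13:
        return patch < 11;         /* fixed in 6.13.11 */
    case 14:
        return patch < 2;          /* fixed in 6.14.2 */
    default:
        return false;              /* not listed, or 6.15 and later (fixed) */
    }
}
```

 For example, under these assumptions 6.13.10 is reported as affected,
 while 6.13.11 and 6.15 are not.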

 Please see https://www.kernel.org for a full list of currently supported
 kernel versions by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2025-38104
 will be updated if fixes are backported, please check that for the most
 up to date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
 	drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
 	drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this, and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/1c0378830e42c98acd69e0289882c8637d92f285
 	https://git.kernel.org/stable/c/5c1741a0c176ae11675a64cb7f2dd21d72db6b91
 	https://git.kernel.org/stable/c/dc0297f3198bd60108ccbd167ee5d9fa4af31ed0