 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2024-35931: drm/amdgpu: Skip do PCI error slot reset during RAS recovery

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 drm/amdgpu: Skip do PCI error slot reset during RAS recovery

 Why:
     The PCI error slot reset may be triggered after injecting a UE into the UMC multiple
     times, which causes a system hang:
     [  557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
     [  557.373718] [drm] PCIE GART of 512M enabled.
     [  557.373722] [drm] PTB located at 0x0000031FED700000
     [  557.373788] [drm] VRAM is lost due to GPU reset!
     [  557.373789] [drm] PSP is resuming...
     [  557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
     [  557.547067] [drm] PCI error: detected callback, state(1)!!
     [  557.547069] [drm] No support for XGMI hive yet...
     [  557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
     [  557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
     [  557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
     [  557.610492] [drm] PCI error: slot reset callback!!
     ...
     [  560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
     [  560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
     [  560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
     [  560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G           OE     5.15.0-91-generic #101-Ubuntu
     [  560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
     [  560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
     [  560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
     [  560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
     [  560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
     [  560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
     [  560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
     [  560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
     [  560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
     [  560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
     [  560.803889] FS:  0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
     [  560.812973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     [  560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
     [  560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
     [  560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
     [  560.843444] PKRU: 55555554
     [  560.846480] Call Trace:
     [  560.849225]  <TASK>
     [  560.851580]  ? show_trace_log_lvl+0x1d6/0x2ea
     [  560.856488]  ? show_trace_log_lvl+0x1d6/0x2ea
     [  560.861379]  ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
     [  560.867778]  ? show_regs.part.0+0x23/0x29
     [  560.872293]  ? __die_body.cold+0x8/0xd
     [  560.876502]  ? die_addr+0x3e/0x60
     [  560.880238]  ? exc_general_protection+0x1c5/0x410
     [  560.885532]  ? asm_exc_general_protection+0x27/0x30
     [  560.891025]  ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
     [  560.898323]  amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
     [  560.904520]  process_one_work+0x228/0x3d0
 How:
     In RAS recovery, a mode-1 reset is issued from RAS fatal error handling and is expected
     to reset all the nodes in a hive, so there is no need to issue another mode-1 reset
     during this procedure.
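
 The idea described under "How" can be illustrated with a small user-space
 model. This is only a sketch under stated assumptions, not the actual patch:
 the names gpu_dev, ras_in_recovery, mode1_resets, ras_recovery_begin(),
 ras_recovery_end(), pci_slot_reset() and the ers_result enum are all
 hypothetical stand-ins and do not correspond to real amdgpu or PCI core
 symbols. It shows a slot-reset callback that returns early while RAS
 recovery is in progress instead of issuing a second reset.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCI error recovery result codes. */
enum ers_result {
    ERS_RESULT_RECOVERED,   /* device is considered usable again */
    ERS_RESULT_DISCONNECT   /* recovery failed */
};

/* Hypothetical, simplified device state; not the real amdgpu structures. */
struct gpu_dev {
    atomic_int ras_in_recovery; /* non-zero while RAS recovery is running */
    int mode1_resets;           /* resets issued by the slot-reset path */
};

/* Model of the RAS recovery path marking itself as active. */
void ras_recovery_begin(struct gpu_dev *dev)
{
    assert(dev != NULL);
    atomic_store(&dev->ras_in_recovery, 1);
}

/* Model of the RAS recovery path finishing. */
void ras_recovery_end(struct gpu_dev *dev)
{
    assert(dev != NULL);
    atomic_store(&dev->ras_in_recovery, 0);
}

/* Model of the PCI slot-reset callback with the early-return guard. */
enum ers_result pci_slot_reset(struct gpu_dev *dev)
{
    assert(dev != NULL);

    /* RAS recovery already resets the device: skip a second reset. */
    if (atomic_load(&dev->ras_in_recovery))
        return ERS_RESULT_RECOVERED;

    dev->mode1_resets++;    /* stand-in for issuing the mode-1 reset */
    return ERS_RESULT_RECOVERED;
}
```

 In this model, calling pci_slot_reset() between ras_recovery_begin() and
 ras_recovery_end() leaves mode1_resets unchanged, so the two reset paths
 never run against the same device at once.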

 The Linux kernel CVE team has assigned CVE-2024-35931 to this issue.


 Affected and fixed versions
 ===========================

 	Fixed in 6.8.6 with commit 395ca1031acf89d8ecb26127c544a71688d96f35
 	Fixed in 6.9 with commit 601429cca96b4af3be44172c3b64e4228515dbe1

 Please see https://www.kernel.org for a full list of currently supported
 kernel versions by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2024-35931
 will be updated if fixes are backported; please check it for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	drivers/gpu/drm/amd/amdgpu/amdgpu_device.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this, and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/395ca1031acf89d8ecb26127c544a71688d96f35
 	https://git.kernel.org/stable/c/601429cca96b4af3be44172c3b64e4228515dbe1