| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-35931: drm/amdgpu: Skip do PCI error slot reset during RAS recovery |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| drm/amdgpu: Skip do PCI error slot reset during RAS recovery |
| |
| Why: |
| The PCI error slot reset may be triggered after injecting an uncorrectable error (UE) |
| into the UMC multiple times, which caused a system hang. |
| [ 557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume |
| [ 557.373718] [drm] PCIE GART of 512M enabled. |
| [ 557.373722] [drm] PTB located at 0x0000031FED700000 |
| [ 557.373788] [drm] VRAM is lost due to GPU reset! |
| [ 557.373789] [drm] PSP is resuming... |
| [ 557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset |
| [ 557.547067] [drm] PCI error: detected callback, state(1)!! |
| [ 557.547069] [drm] No support for XGMI hive yet... |
| [ 557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter |
| [ 557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations |
| [ 557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered |
| [ 557.610492] [drm] PCI error: slot reset callback!! |
| ... |
| [ 560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded! |
| [ 560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded! |
| [ 560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI |
| [ 560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G OE 5.15.0-91-generic #101-Ubuntu |
| [ 560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023 |
| [ 560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu] |
| [ 560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu] |
| [ 560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00 |
| [ 560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202 |
| [ 560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0 |
| [ 560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010 |
| [ 560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08 |
| [ 560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000 |
| [ 560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000 |
| [ 560.803889] FS: 0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000 |
| [ 560.812973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 |
| [ 560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0 |
| [ 560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 |
| [ 560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 |
| [ 560.843444] PKRU: 55555554 |
| [ 560.846480] Call Trace: |
| [ 560.849225] <TASK> |
| [ 560.851580] ? show_trace_log_lvl+0x1d6/0x2ea |
| [ 560.856488] ? show_trace_log_lvl+0x1d6/0x2ea |
| [ 560.861379] ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu] |
| [ 560.867778] ? show_regs.part.0+0x23/0x29 |
| [ 560.872293] ? __die_body.cold+0x8/0xd |
| [ 560.876502] ? die_addr+0x3e/0x60 |
| [ 560.880238] ? exc_general_protection+0x1c5/0x410 |
| [ 560.885532] ? asm_exc_general_protection+0x27/0x30 |
| [ 560.891025] ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu] |
| [ 560.898323] amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu] |
| [ 560.904520] process_one_work+0x228/0x3d0 |
| How: |
| In RAS recovery, a mode-1 reset is issued from RAS fatal error handling and is expected |
| to reset all the nodes in a hive, so there is no need to issue another mode-1 reset during this procedure. |
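The guard described under "How" can be illustrated with a minimal, self-contained sketch. It does not reproduce the actual change in drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: every type, field, and function name below is hypothetical, and reporting "recovered" when the reset is skipped is an assumption of this sketch rather than something stated in the announcement.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the real amdgpu structures. */
enum slot_reset_result {
    SLOT_RESET_RECOVERED,   /* device reported as usable */
    SLOT_RESET_DISCONNECT   /* recovery failed */
};

struct gpu_device {
    bool ras_recovery_in_progress; /* true while RAS fatal error handling runs */
    int  mode1_resets_issued;      /* resets issued by the slot-reset path */
};

/* Hypothetical helper: pretend to issue a mode-1 reset and count it. */
static bool do_mode1_reset(struct gpu_device *dev)
{
    dev->mode1_resets_issued++;
    return true;
}

/*
 * Sketch of a PCI slot-reset callback. If RAS recovery is already resetting
 * the device, skip the second mode-1 reset instead of racing with it.
 */
enum slot_reset_result pci_slot_reset(struct gpu_device *dev)
{
    if (dev->ras_recovery_in_progress)
        return SLOT_RESET_RECOVERED; /* assumption: report success, do nothing */

    return do_mode1_reset(dev) ? SLOT_RESET_RECOVERED : SLOT_RESET_DISCONNECT;
}
```

In this sketch the early return is the whole point: the slot-reset path only issues its own reset when no RAS recovery is in flight, so the two reset paths never run against the same device at once.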
| |
| The Linux kernel CVE team has assigned CVE-2024-35931 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Fixed in 6.8.6 with commit 395ca1031acf89d8ecb26127c544a71688d96f35 |
| Fixed in 6.9 with commit 601429cca96b4af3be44172c3b64e4228515dbe1 |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-35931 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/395ca1031acf89d8ecb26127c544a71688d96f35 |
| https://git.kernel.org/stable/c/601429cca96b4af3be44172c3b64e4228515dbe1 |