| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2024-26762: cxl/pci: Skip to handle RAS errors if CXL.mem device is detached |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| cxl/pci: Skip to handle RAS errors if CXL.mem device is detached |
| |
| The PCI AER model is an awkward fit for CXL error handling. While the |
| expectation is that a PCI device can escalate to link reset to recover |
| from an AER event, the same reset on CXL amounts to a surprise memory |
| hotplug of massive amounts of memory. |
| |
| At present, the CXL error handler attempts some optimistic error |
| handling to unbind the device from the cxl_mem driver after reaping some |
| RAS register values. This results in a "hopeful" attempt to unplug the |
| memory, but there is no guarantee that it will succeed. |
| |
| A subsequent AER notification after the memdev unbind event can no |
| longer assume the registers are mapped. Check for memdev bind before |
| reaping status register values to avoid crashes of the form: |
| |
| BUG: unable to handle page fault for address: ffa00000195e9100 |
| #PF: supervisor read access in kernel mode |
| #PF: error_code(0x0000) - not-present page |
| [...] |
| RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core] |
| [...] |
| Call Trace: |
| <TASK> |
| ? __die+0x24/0x70 |
| ? page_fault_oops+0x82/0x160 |
| ? kernelmode_fixup_or_oops+0x84/0x110 |
| ? exc_page_fault+0x113/0x170 |
| ? asm_exc_page_fault+0x26/0x30 |
| ? __pfx_dpc_reset_link+0x10/0x10 |
| ? __cxl_handle_ras+0x30/0x110 [cxl_core] |
| ? find_cxl_port+0x59/0x80 [cxl_core] |
| cxl_handle_rp_ras+0xbc/0xd0 [cxl_core] |
| cxl_error_detected+0x6c/0xf0 [cxl_core] |
| report_error_detected+0xc7/0x1c0 |
| pci_walk_bus+0x73/0x90 |
| pcie_do_recovery+0x23f/0x330 |
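The guard described above (check that the memdev is still bound before touching RAS registers) can be modeled in a small userspace sketch. This is not the kernel patch: `model_memdev`, `model_unbind`, `model_error_detected`, and the `MODEL_*` results are hypothetical stand-ins invented for illustration, and the serialization against a concurrent unbind that real driver code would need is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CXL memdev; not a real cxl_core type. */
struct model_memdev {
	bool bound;                  /* models "a driver is attached" */
	volatile uint32_t *ras_regs; /* only valid while the driver is bound */
};

/* Hypothetical results, loosely mirroring "handled" vs. "disconnect". */
enum model_result { MODEL_HANDLED, MODEL_DISCONNECT };

/* Models the unbind step: the register mapping goes away with the driver. */
static void model_unbind(struct model_memdev *md)
{
	md->bound = false;
	md->ras_regs = NULL;
}

/*
 * Models an error handler entry point. The bind check comes first, so an
 * error notification that arrives after unbind never dereferences the
 * stale register pointer.
 */
static enum model_result model_error_detected(struct model_memdev *md,
					      uint32_t *status_out)
{
	if (!md->bound)
		return MODEL_DISCONNECT; /* skip RAS handling entirely */

	*status_out = md->ras_regs[0];   /* safe: mapping is still present */
	return MODEL_HANDLED;
}
```

In this model, a second notification after `model_unbind()` returns `MODEL_DISCONNECT` without reading any register, which corresponds to the crash-avoidance behavior the description asks for.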
| |
| Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might |
| need to be replaced with a new PCI_ERS_RESULT_PANIC. |
| |
| The Linux kernel CVE team has assigned CVE-2024-26762 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 6.7 with commit 6ac07883dbb5f60f7bc56a13b7a84a382aa9c1ab and fixed in 6.7.7 with commit 21e5e84f3f63fdf44e49642a6e45cd895e921a84 |
| Issue introduced in 6.7 with commit 6ac07883dbb5f60f7bc56a13b7a84a382aa9c1ab and fixed in 6.8 with commit eef5c7b28dbecd6b141987a96db6c54e49828102 |
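As a rough illustration of the range listed above, the following sketch classifies an upstream release string. The helper name `release_is_affected` is hypothetical, and the check assumes plain upstream version numbering; distribution kernels that backport fixes without changing the version number are not covered by it.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns true if an upstream release string such as
 * "6.7.3" falls in the range stated above, i.e. 6.7 up to but not
 * including 6.7.7. Releases from 6.8 onward carry the mainline fix.
 * Unparseable input also returns false; callers should treat that case
 * as "unknown" rather than "safe".
 */
static bool release_is_affected(const char *release)
{
	int major = 0, minor = 0, patch = 0;

	/* Accept "X.Y" or "X.Y.Z"; a missing patch level is treated as 0. */
	if (sscanf(release, "%d.%d.%d", &major, &minor, &patch) < 2)
		return false;

	return major == 6 && minor == 7 && patch < 7;
}
```

For example, `release_is_affected("6.7.6")` is true under these assumptions, while `"6.7.7"`, `"6.8"`, and `"6.6.30"` are all reported as not affected.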
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2024-26762 |
| will be updated if fixes are backported; please check it for the most |
| up-to-date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| drivers/cxl/core/pci.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/21e5e84f3f63fdf44e49642a6e45cd895e921a84 |
| https://git.kernel.org/stable/c/eef5c7b28dbecd6b141987a96db6c54e49828102 |