 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2024-26762: cxl/pci: Skip to handle RAS errors if CXL.mem device is detached

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 cxl/pci: Skip to handle RAS errors if CXL.mem device is detached

 The PCI AER model is an awkward fit for CXL error handling. While the
 expectation is that a PCI device can escalate to link reset to recover
 from an AER event, the same reset on CXL amounts to a surprise memory
 hotplug of massive amounts of memory.

 At present, the CXL error handler attempts some optimistic error
 handling to unbind the device from the cxl_mem driver after reaping some
 RAS register values. This results in a "hopeful" attempt to unplug the
 memory, but there is no guarantee that will succeed.

 A subsequent AER notification after the memdev unbind event can no
 longer assume the registers are mapped. Check for memdev bind before
 reaping status register values to avoid crashes of the form:

  BUG: unable to handle page fault for address: ffa00000195e9100
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  [...]
  RIP: 0010:__cxl_handle_ras+0x30/0x110 [cxl_core]
  [...]
  Call Trace:
   <TASK>
   ? __die+0x24/0x70
   ? page_fault_oops+0x82/0x160
   ? kernelmode_fixup_or_oops+0x84/0x110
   ? exc_page_fault+0x113/0x170
   ? asm_exc_page_fault+0x26/0x30
   ? __pfx_dpc_reset_link+0x10/0x10
   ? __cxl_handle_ras+0x30/0x110 [cxl_core]
   ? find_cxl_port+0x59/0x80 [cxl_core]
   cxl_handle_rp_ras+0xbc/0xd0 [cxl_core]
   cxl_error_detected+0x6c/0xf0 [cxl_core]
   report_error_detected+0xc7/0x1c0
   pci_walk_bus+0x73/0x90
   pcie_do_recovery+0x23f/0x330
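
 The guard described above can be illustrated with a minimal user-space
 sketch. It is not the kernel patch: struct fake_memdev, enum ras_result
 and handle_ras() are hypothetical stand-ins invented for this example,
 and the "bound" flag only loosely models the driver-bind check that the
 fix performs before touching the RAS registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CXL memdev; not a kernel structure. */
struct fake_memdev {
	bool bound;               /* true while a driver is attached */
	const uint32_t *ras_regs; /* NULL once the mapping is torn down */
};

enum ras_result { RAS_SKIPPED, RAS_CLEAN, RAS_ERROR };

/*
 * Read the (simulated) RAS status register only if the device is still
 * bound. Without the check, the read below would dereference a mapping
 * that no longer exists, which is the page fault shown in the trace.
 */
static enum ras_result handle_ras(const struct fake_memdev *md,
				  uint32_t *status_out)
{
	assert(status_out != NULL);

	if (md == NULL || !md->bound || md->ras_regs == NULL)
		return RAS_SKIPPED; /* device detached: do not touch registers */

	*status_out = md->ras_regs[0];
	return *status_out ? RAS_ERROR : RAS_CLEAN;
}
```

 In this sketch, a caller that sees RAS_SKIPPED would report the device
 as disconnected instead of trying to read its status, so a second error
 notification after the unbind is harmless.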

 Longer term, the unbind and PCI_ERS_RESULT_DISCONNECT behavior might
 need to be replaced with a new PCI_ERS_RESULT_PANIC.

 The Linux kernel CVE team has assigned CVE-2024-26762 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 6.7 with commit 6ac07883dbb5f60f7bc56a13b7a84a382aa9c1ab and fixed in 6.7.7 with commit 21e5e84f3f63fdf44e49642a6e45cd895e921a84
 	Issue introduced in 6.7 with commit 6ac07883dbb5f60f7bc56a13b7a84a382aa9c1ab and fixed in 6.8 with commit eef5c7b28dbecd6b141987a96db6c54e49828102

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2024-26762
 will be updated if fixes are backported; please check it for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	drivers/cxl/core/pci.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/21e5e84f3f63fdf44e49642a6e45cd895e921a84
 	https://git.kernel.org/stable/c/eef5c7b28dbecd6b141987a96db6c54e49828102