queue-6.6/scsi-aacraid-stop-using-pci_irq_affinity.patch (pub/scm/linux/kernel/git/stable/stable-queue)

 From bbf26ca1820b36392ccafff272ef375c03224af0 Mon Sep 17 00:00:00 2001
 From: Sasha Levin <sashal@kernel.org>
 Date: Tue, 15 Jul 2025 11:15:35 +0000
 Subject: scsi: aacraid: Stop using PCI_IRQ_AFFINITY

 From: John Garry <john.g.garry@oracle.com>

 [ Upstream commit dafeaf2c03e71255438ffe5a341d94d180e6c88e ]

 When PCI_IRQ_AFFINITY is set for calling pci_alloc_irq_vectors(), it
 means interrupts are spread around the available CPUs. It also means that
 the interrupts become managed, which means that an interrupt is shut down
 when all the CPUs in the interrupt affinity mask go offline.

 Using managed interrupts in this way means that we must ensure that
 completions do not occur on HW queues where the associated interrupt
 is shut down. This is typically achieved by ensuring only CPUs which are
 online can generate IO completion traffic to the HW queue which they are
 mapped to (so that they can also serve completion interrupts for that HW
 queue).

 The problem in the driver is that a CPU can generate completions to a HW
 queue whose interrupt may be shut down, as the CPUs in the HW queue
 interrupt affinity mask may be offline. This can cause IOs to never
 complete and hang the system. The driver maintains its own CPU <-> HW
 queue mapping for submissions, see aac_fib_vector_assign(), but this does
 not reflect the CPU <-> HW queue interrupt affinity mapping.

 Commit 9dc704dcc09e ("scsi: aacraid: Reply queue mapping to CPUs based on
 IRQ affinity") tried to remedy this issue by mapping CPUs properly to HW
 queue interrupts. However, this was later reverted in commit c5becf57dd56
 ("Revert "scsi: aacraid: Reply queue mapping to CPUs based on IRQ
 affinity"") - it seems that there were other reports of hangs. I guess
 that this was due to some implementation issue in the original commit or
 maybe a HW issue.

 Fix the original hang by not setting PCI_IRQ_AFFINITY, so that managed
 interrupts are no longer used. In this way, all CPUs will be in each HW
 queue affinity mask, which should avoid completion problems if any CPUs
 go offline.

 Signed-off-by: John Garry <john.g.garry@oracle.com>
 Link: https://lore.kernel.org/r/20250715111535.499853-1-john.g.garry@oracle.com
 Closes: https://lore.kernel.org/linux-scsi/20250618192427.3845724-1-jmeneghi@redhat.com/
 Reviewed-by: John Meneghini <jmeneghi@redhat.com>
 Tested-by: John Meneghini <jmeneghi@redhat.com>
 Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
 Signed-off-by: Sasha Levin <sashal@kernel.org>
 ---
  drivers/scsi/aacraid/comminit.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/drivers/scsi/aacraid/comminit.c b/drivers/scsi/aacraid/comminit.c
 index 0f64b0244303..31b95e6c96c5 100644
 --- a/drivers/scsi/aacraid/comminit.c
 +++ b/drivers/scsi/aacraid/comminit.c
 @@ -481,8 +481,7 @@ void aac_define_int_mode(struct aac_dev *dev)
  	    pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
  		min_msix = 2;
  		i = pci_alloc_irq_vectors(dev->pdev,
 -					  min_msix, msi_count,
 -					  PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
 +					  min_msix, msi_count, PCI_IRQ_MSIX);
  		if (i > 0) {
  			dev->msi_enabled = 1;
  			msi_count = i;
 --
 2.39.5