| From foo@baz Thu Dec 21 09:02:40 CET 2017 |
| From: Gabriele Paoloni <gabriele.paoloni@huawei.com> |
| Date: Thu, 28 Sep 2017 15:33:05 +0100 |
| Subject: PCI/AER: Report non-fatal errors only to the affected endpoint |
| |
| From: Gabriele Paoloni <gabriele.paoloni@huawei.com> |
| |
| |
| [ Upstream commit 86acc790717fb60fb51ea3095084e331d8711c74 ] |
| |
| Previously, if an non-fatal error was reported by an endpoint, we |
| called report_error_detected() for the endpoint, every sibling on the |
| bus, and their descendents. If any of them did not implement the |
| .error_detected() method, do_recovery() failed, leaving all these |
| devices unrecovered. |
| |
| For example, the system described in the bugzilla below has two devices: |
| |
| 0000:74:02.0 [19e5:a230] SAS controller, driver has .error_detected() |
| 0000:74:03.0 [19e5:a235] SATA controller, driver lacks .error_detected() |
| |
| When a device such as 74:02.0 reported a non-fatal error, do_recovery() |
| failed because 74:03.0 lacked an .error_detected() method. But per PCIe |
| r3.1, sec 6.2.2.2.2, such an error does not compromise the Link and |
| does not affect 74:03.0: |
| |
| Non-fatal errors are uncorrectable errors which cause a particular |
| transaction to be unreliable but the Link is otherwise fully functional. |
| Isolating Non-fatal from Fatal errors provides Requester/Receiver logic |
| in a device or system management software the opportunity to recover from |
| the error without resetting the components on the Link and disturbing |
| other transactions in progress. Devices not associated with the |
| transaction in error are not impacted by the error. |
| |
| Report non-fatal errors only to the endpoint that reported them. We really |
| want to check for AER_NONFATAL here, but the current code structure doesn't |
| allow that. Looking for pci_channel_io_normal is the best we can do now. |
| |
| Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055 |
| Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver") |
| Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com> |
| Signed-off-by: Dongdong Liu <liudongdong3@huawei.com> |
| [bhelgaas: changelog] |
| Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> |
| |
| Signed-off-by: Sasha Levin <alexander.levin@verizon.com> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| --- |
| drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++++++- |
| 1 file changed, 8 insertions(+), 1 deletion(-) |
| |
| --- a/drivers/pci/pcie/aer/aerdrv_core.c |
| +++ b/drivers/pci/pcie/aer/aerdrv_core.c |
| @@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_ |
| * If the error is reported by an end point, we think this |
| * error is related to the upstream link of the end point. |
| */ |
| - pci_walk_bus(dev->bus, cb, &result_data); |
| + if (state == pci_channel_io_normal) |
| + /* |
| + * the error is non fatal so the bus is ok, just invoke |
| + * the callback for the function that logged the error. |
| + */ |
| + cb(dev, &result_data); |
| + else |
| + pci_walk_bus(dev->bus, cb, &result_data); |
| } |
| |
| return result_data.result; |