| From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2025-21892: RDMA/mlx5: Fix the recovery flow of the UMR QP |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| RDMA/mlx5: Fix the recovery flow of the UMR QP |
| |
| This patch addresses an issue in the recovery flow of the UMR QP, |
| ensuring tasks do not get stuck, as highlighted by the call trace [1]. |
| |
| During recovery, before transitioning the QP to the RESET state, the |
| software must wait for all outstanding WRs to complete. |
| |
| Failing to do so can cause the firmware to skip sending some flushed |
| CQEs with errors and simply discard them upon the RESET, as per the IB |
| specification. |
| |
| This race condition can result in lost CQEs and tasks becoming stuck. |
| |
| To resolve this, the patch sends a final WR which serves only as a |
| barrier before moving the QP state to RESET. |
| |
| Once a CQE is received for that final WR, it guarantees that no |
| outstanding WRs remain, making it safe to transition the QP to RESET and |
| subsequently back to RTS, restoring proper functionality. |
| |
| Note: |
| For the barrier WR, we simply reuse the failed and ready WR. |
| Since the QP is in an error state, it will only receive |
| IB_WC_WR_FLUSH_ERR. However, as it serves only as a barrier we don't |
| care about its status. |
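
The barrier idea described above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the mlx5 driver code: every name in it (toy_qp, toy_post_send, toy_poll_cq, toy_reset, toy_recover) is hypothetical, and it assumes that completions on a single send queue are delivered in posting order, so that seeing the barrier's CQE implies every earlier WR has already completed.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SQ_DEPTH 16

/* Hypothetical toy states; they only mirror the RTS/ERR/RESET names. */
enum toy_qp_state { TOY_QP_RTS, TOY_QP_ERR, TOY_QP_RESET };

struct toy_qp {
	enum toy_qp_state state;
	unsigned long pending[TOY_SQ_DEPTH]; /* posted WR ids awaiting a CQE (FIFO) */
	size_t head, tail;                   /* head: next to complete, tail: next free */
	size_t completed;                    /* CQEs actually delivered */
	size_t lost;                         /* WRs discarded by a reset without a CQE */
};

/* Queue a WR; fails if the QP is in RESET or the toy queue is full. */
static bool toy_post_send(struct toy_qp *qp, unsigned long wr_id)
{
	if (qp->state == TOY_QP_RESET || qp->tail - qp->head == TOY_SQ_DEPTH)
		return false;
	qp->pending[qp->tail++ % TOY_SQ_DEPTH] = wr_id;
	return true;
}

/* Deliver one CQE in posting order; returns false if nothing is pending. */
static bool toy_poll_cq(struct toy_qp *qp, unsigned long *wr_id)
{
	if (qp->head == qp->tail)
		return false;
	*wr_id = qp->pending[qp->head++ % TOY_SQ_DEPTH];
	qp->completed++;
	return true;
}

/* Moving to RESET silently drops whatever is still pending. */
static void toy_reset(struct toy_qp *qp)
{
	qp->lost += qp->tail - qp->head;
	qp->head = qp->tail = 0;
	qp->state = TOY_QP_RESET;
}

/*
 * Recovery with a barrier: post one final WR, consume CQEs until the
 * barrier's own CQE shows up, and only then go RESET -> RTS.
 * Returns how many WRs the reset discarded (0 when the barrier works).
 */
static size_t toy_recover(struct toy_qp *qp, unsigned long barrier_id)
{
	size_t lost_before = qp->lost;
	unsigned long id;

	qp->state = TOY_QP_ERR;

	/* If the toy queue is full, drain one completion and retry. */
	while (!toy_post_send(qp, barrier_id))
		toy_poll_cq(qp, &id);

	/* The barrier's status is irrelevant; only its arrival matters. */
	while (toy_poll_cq(qp, &id) && id != barrier_id)
		;

	toy_reset(qp);
	qp->state = TOY_QP_RTS;
	return qp->lost - lost_before;
}
```

In this model, calling toy_reset() directly while WRs are still pending increments the lost counter, which corresponds to the lost CQEs and stuck waiters described above, whereas toy_recover() always reaches the reset with an empty queue.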
| |
| [1] |
| INFO: task rdma_resource_l:1922 blocked for more than 120 seconds. |
| Tainted: G W 6.12.0-rc7+ #1626 |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| task:rdma_resource_l state:D stack:0 pid:1922 tgid:1922 ppid:1369 flags:0x00004004 |
| Call Trace: |
| <TASK> |
| __schedule+0x420/0xd30 |
| schedule+0x47/0x130 |
| schedule_timeout+0x280/0x300 |
| ? mark_held_locks+0x48/0x80 |
| ? lockdep_hardirqs_on_prepare+0xe5/0x1a0 |
| wait_for_completion+0x75/0x130 |
| mlx5r_umr_post_send_wait+0x3c2/0x5b0 [mlx5_ib] |
| ? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib] |
| mlx5r_umr_revoke_mr+0x93/0xc0 [mlx5_ib] |
| __mlx5_ib_dereg_mr+0x299/0x520 [mlx5_ib] |
| ? _raw_spin_unlock_irq+0x24/0x40 |
| ? wait_for_completion+0xfe/0x130 |
| ? rdma_restrack_put+0x63/0xe0 [ib_core] |
| ib_dereg_mr_user+0x5f/0x120 [ib_core] |
| ? lock_release+0xc6/0x280 |
| destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] |
| uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] |
| uobj_destroy+0x3f/0x70 [ib_uverbs] |
| ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] |
| ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] |
| ? __lock_acquire+0x64e/0x2080 |
| ? mark_held_locks+0x48/0x80 |
| ? find_held_lock+0x2d/0xa0 |
| ? lock_acquire+0xc1/0x2f0 |
| ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] |
| ? __fget_files+0xc3/0x1b0 |
| ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] |
| ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] |
| __x64_sys_ioctl+0x1b0/0xa70 |
| do_syscall_64+0x6b/0x140 |
| entry_SYSCALL_64_after_hwframe+0x76/0x7e |
| RIP: 0033:0x7f99c918b17b |
| RSP: 002b:00007ffc766d0468 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 |
| RAX: ffffffffffffffda RBX: 00007ffc766d0578 RCX: 00007f99c918b17b |
| RDX: 00007ffc766d0560 RSI: 00000000c0181b01 RDI: 0000000000000003 |
| RBP: 00007ffc766d0540 R08: 00007f99c8f99010 R09: 000000000000bd7e |
| R10: 00007f99c94c1c70 R11: 0000000000000246 R12: 00007ffc766d0530 |
| R13: 000000000000001c R14: 0000000040246a80 R15: 0000000000000000 |
| </TASK> |
| |
| The Linux kernel CVE team has assigned CVE-2025-21892 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 6.0 with commit 158e71bb69e368b8b33e8b7c4ac8c111da0c1ae2 and fixed in 6.12.18 with commit 3e3bf255992cc02404e9d209b127c1c9944239cf |
| Issue introduced in 6.0 with commit 158e71bb69e368b8b33e8b7c4ac8c111da0c1ae2 and fixed in 6.13.6 with commit 1d2b84d8d054313deed2b2fcafe1168bbcb9e99f |
| Issue introduced in 6.0 with commit 158e71bb69e368b8b33e8b7c4ac8c111da0c1ae2 and fixed in 6.14 with commit d97505baea64d93538b16baf14ce7b8c1fbad746 |
| Issue introduced in 5.19.10 with commit d8f7bff9a42627d37f4ecffeb01e44db42167175 |
| |
| Please see https://www.kernel.org for a full list of kernel versions |
| currently supported by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2025-21892 |
| will be updated if fixes are backported; please check that for the most |
| up to date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| drivers/infiniband/hw/mlx5/umr.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If, however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/3e3bf255992cc02404e9d209b127c1c9944239cf |
| https://git.kernel.org/stable/c/1d2b84d8d054313deed2b2fcafe1168bbcb9e99f |
| https://git.kernel.org/stable/c/d97505baea64d93538b16baf14ce7b8c1fbad746 |