path: root/drivers/infiniband
Age | Commit message | Author
11 days | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma | Linus Torvalds
Pull rdma updates from Jason Gunthorpe:

"Many AI driven bug fixes, and several big driver API cleanups

- Driver bug fixes and minor cleanups in mlx5, hns, rxe, efa, siw, rtrs, mana, irdma, mlx4. Commonly error path flows, integer arithmetic overflows on unsafe data, out of bounds access, and use after free issues under races.
- Second half of the new udata API for drivers focusing on uAPI response
- bnxt_re supports more options for QP creation that will allow a dv path in rdma-core
- Untangle the module dependencies so drivers don't link to ib_uverbs.ko as was originally intended
- Provide a new way to handle umems with a consistent simplified uAPI and update several drivers to use it. This brings dmabuf support to more places and more drivers
- Support for mlx5 rate limit and packet pacing for UD and UC
- A batch of fixes for the new shared FRMR pools infrastructure"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (148 commits)
  RDMA/irdma: Replace waitqueue and flag with completion
  RDMA/hns: Fix memory leak of bonding resources
  RDMA/rtrs-srv: Bound RDMA-Write length to chunk size in rdma_write_sg
  docs: infiniband: correct name of option to enable the ib_uverbs module
  RDMA/bnxt_re: Reject GET_TOGGLE_MEM when toggle page was not allocated
  RDMA/bnxt_re: Fail DBR related page allocation UAPIs if the feature is disabled
  RDMA/bnxt_re: Avoid repeated requests to allocate WC pages
  RDMA/bnxt_re: Proper rollback if the ioremap fails
  RDMA/bnxt_re: Add a max slot check for SQ
  RDMA/bnxt_re: Avoid displaying the kernel pointer
  RDMA/bnxt_re: Free CQ toggle page after firmware teardown
  RDMA/bnxt_re: Free SRQ toggle page after firmware teardown
  RDMA/bnxt_re: Initialize dpi variable to zero
  ABI: sysfs-class-infiniband: minor cleanup
  RDMA/mlx5: Release the HW‑provided UAR index rather than the SW one
  RDMA/mlx5: Fix undefined shift of user RQ WQE size
  RDMA/mlx5: Remove raw RSS QP restrack tracking
  RDMA/mlx5: Remove DCT restrack tracking
  RDMA/mlx5: Drop FRMR pool handle on UMR revoke failure
  RDMA/core: Add ib_frmr_pool_drop for unrecoverable handles
  ...
13 days | RDMA/irdma: Replace waitqueue and flag with completion | Jacob Moroni
The driver previously used a waitqueue along with an explicit request_done flag, but without proper barriers around request_done. An earlier patch by Gui-Dong Han <hanguidong02@gmail.com> attempted to fix this by adding the missing memory barriers. Rather than adding the barriers, this patch replaces the waitqueue+flag with a completion, which is designed for this exact purpose. Fixes: 44d9e52977a1 ("RDMA/irdma: Implement device initialization definitions") Fixes: 915cc7ac0f8e ("RDMA/irdma: Add miscellaneous utility definitions") Link: https://patch.msgid.link/r/20260616155601.1081448-1-jmoroni@google.com Signed-off-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/hns: Fix memory leak of bonding resources | Junxian Huang
In a corner case of concurrent driver removal and driver reset, bonding resource is first released in hns_roce_hw_v2_exit() during driver removal, and then is allocated again in hns_roce_register_device() during driver reset. This leads to memory leak because the release timing has already passed.

This may also lead to a kernel panic as below because of the leaked notifier callback:

Call trace:
 0xffffa20fccc04978 (P)
 raw_notifier_call_chain+0x20/0x38
 call_netdevice_notifiers_info+0x60/0xb8
 netdev_lower_state_changed+0x4c/0xb8

As Sashiko suggested, the teardown order of bonding resources should be inverted to make sure the resources are released when the driver is removed.

Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources") Link: https://patch.msgid.link/r/20260613102045.811623-1-huangjunxian6@hisilicon.com Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/rtrs-srv: Bound RDMA-Write length to chunk size in rdma_write_sg | Zhenhao Wan
When the server answers an RTRS READ, rdma_write_sg() builds the source scatter/gather entry for the IB_WR_RDMA_WRITE that returns data to the peer. Its length is taken directly from the wire descriptor: plist->length = le32_to_cpu(id->rd_msg->desc[0].len); rd_msg points into the chunk buffer that the remote peer filled via RDMA-WRITE-WITH-IMM (rtrs_srv_rdma_done() -> process_io_req() -> process_read()), so desc[0].len is attacker-controlled and, before this change, was only rejected when zero. The source address is the fixed chunk start (dma_addr[msg_id]) and the source lkey is the PD-wide local_dma_lkey, which is not tied to the chunk's MR mapping, so the verbs layer does not constrain the transfer length to max_chunk_size. msg_id and off are bounded against queue_depth and max_chunk_size in rtrs_srv_rdma_done(), but desc[0].len is a separate field that was not checked against the chunk size. A peer that advertises desc[0].len larger than max_chunk_size can make the posted RDMA write read past the chunk's mapped region. The resulting behaviour depends on the IOMMU configuration: with no IOMMU or in passthrough mode the read may extend into memory adjacent to the chunk and be returned to the peer, which can disclose host memory; with a translating IOMMU the out-of-range access is expected to fault and abort the connection. In either case the transfer exceeds what the protocol permits and is driven by a remote peer. Reject a descriptor length above max_chunk_size, mirroring the existing off >= max_chunk_size bound in rtrs_srv_rdma_done(). Legitimate clients do not exceed it: the client sets desc[0].len to its MR length, which is capped at the negotiated max_io_size (max_chunk_size - MAX_HDR_SIZE). 
Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality") Link: https://patch.msgid.link/r/20260612-master-v1-1-70cde5c6fdc9@gmail.com Reported-by: Yuhao Jiang <danisjiang@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Zhenhao Wan <whi4ed0g@gmail.com> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
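The bound described in this commit message can be sketched in plain userspace C. This is an illustration only, not the driver's code: `demo_desc_len_ok()` and its parameter names are hypothetical, and the zero check stands in for the rejection the message says already existed before the fix.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true when a peer-supplied descriptor
 * length is acceptable, i.e. non-zero and no larger than the chunk size.
 */
static bool demo_desc_len_ok(uint32_t desc_len, uint32_t max_chunk_size)
{
	if (desc_len == 0)
		return false;	/* zero length was already rejected */
	if (desc_len > max_chunk_size)
		return false;	/* upper bound added by the fix */
	return true;
}
```

A length equal to `max_chunk_size` is accepted in this sketch because the message says lengths "above max_chunk_size" are rejected.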
13 days | RDMA/bnxt_re: Reject GET_TOGGLE_MEM when toggle page was not allocated | Selvin Xavier
If a user calls BNXT_RE_METHOD_GET_TOGGLE_MEM on a device that does not support the CQ/SRQ toggle feature, uctx_cq_page or uctx_srq_page will be NULL. Add an explicit -EOPNOTSUPP return after capturing the address from uctx_cq_page / uctx_srq_page if the address is zero. Fixes: e275919d9669 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace") Fixes: 181028a0d84c ("RDMA/bnxt_re: Share a page to expose per SRQ info with userspace") Link: https://patch.msgid.link/r/20260615224751.232802-16-selvin.xavier@broadcom.com Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Fail DBR related page allocation UAPIs if the feature is disabled | Selvin Xavier
No need to support the DBR related page allocations if the pacing feature is disabled. Fail the request if pacing is disabled. Fixes: ea2224857882 ("RDMA/bnxt_re: Update alloc_page uapi for pacing") Link: https://patch.msgid.link/r/20260615224751.232802-15-selvin.xavier@broadcom.com Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Avoid repeated requests to allocate WC pages | Selvin Xavier
Applications can request multiple WC pages for the same ucontext. As of now, only 1 WC page per ucontext is supported. Add a lock to avoid concurrent access and a check to fail repeated requests. Also, if the mmap entry insert fails for the WC, free the Doorbell page index mapped for the WC page. Fixes: eee6268421a2 ("RDMA/bnxt_re: Move the UAPI methods to a dedicated file") Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push") Link: https://patch.msgid.link/r/20260615224751.232802-12-selvin.xavier@broadcom.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Proper rollback if the ioremap fails | Selvin Xavier
bnxt_qplib_alloc_dpi returns success even if ioremap fails. Add the proper rollback when the ioremap fails and return -ENOMEM status. Fixes: 0ac20faf5d83 ("RDMA/bnxt_re: Reorg the bar mapping") Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push") Link: https://patch.msgid.link/r/20260615224751.232802-11-selvin.xavier@broadcom.com Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Add a max slot check for SQ | Selvin Xavier
The variable WQE mode must be validated against the maximum slots supported by HW. The max supported value is 64K. Add max and min checks and fail if the user-supplied value is more than the supported maximum or is zero. Fixes: d8ea645d6984 ("RDMA/bnxt_re: Handle variable WQE support for user applications") Link: https://patch.msgid.link/r/20260615224751.232802-10-selvin.xavier@broadcom.com Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Avoid displaying the kernel pointer | Selvin Xavier
While dumping the info on MR using the rdma tool, we dump the mr_hwq which is a kernel pointer. There is no need to expose this value for end user. So avoid it. Fixes: 7363eb76b7f3 ("RDMA/bnxt_re: Support driver specific data collection using rdma tool") Link: https://patch.msgid.link/r/20260615224751.232802-9-selvin.xavier@broadcom.com Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Free CQ toggle page after firmware teardown | Selvin Xavier
Free the toggle page only after firmware teardown completes so that an NQ interrupt arriving during bnxt_qplib_destroy_cq() won't write the toggle value to an already-freed page. Move free_page() after bnxt_qplib_destroy_cq. Fixes: e275919d9669 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace") Link: https://patch.msgid.link/r/20260615224751.232802-4-selvin.xavier@broadcom.com Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Free SRQ toggle page after firmware teardown | Selvin Xavier
Free the toggle page only after firmware teardown completes so that an NQ interrupt arriving during bnxt_qplib_destroy_srq() won't write the toggle values to an already-freed page. Move free_page() after bnxt_qplib_destroy_srq(). Fixes: 181028a0d84c ("RDMA/bnxt_re: Share a page to expose per SRQ info with userspace") Link: https://patch.msgid.link/r/20260615224751.232802-3-selvin.xavier@broadcom.com Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
13 days | RDMA/bnxt_re: Initialize dpi variable to zero | Selvin Xavier
dpi is initialized only for BNXT_RE_ALLOC_WC_PAGE, but copied for all the cases. So initialize the dpi to 0. Fixes: eee6268421a2 ("RDMA/bnxt_re: Move the UAPI methods to a dedicated file") Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push") Link: https://patch.msgid.link/r/20260615224751.232802-2-selvin.xavier@broadcom.com Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.1-rc8). Conflicts: drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c f67aead16e85 ("net: txgbe: rework service event handling") 57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices") net/rds/info.c 512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()") 6e94eeb2a2a6 ("rds: convert to getsockopt_iter") Adjacent changes: include/net/sock.h 1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs") f0de88303d5e ("net: make is_skb_wmem() available to modules") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-11 | RDMA/mlx5: Release the HW‑provided UAR index rather than the SW one | Leon Romanovsky
Free the UAR index returned by the hardware. Fixes: 4ed131d0bb15 ("IB/mlx5: Expose dynamic mmap allocation") Link: https://patch.msgid.link/r/20260611-fix-uar-release-v1-1-f5464d845dbf@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/mlx5: Fix undefined shift of user RQ WQE size | Maher Sanalla
set_rq_size() computes the RQ WQE size as "1 << rq_wqe_shift" based on the user-provided rq_wqe_shift, which is only rejected when it is greater than 32, so a shift of 32 is still accepted. A shift of 31 also overflows a signed integer, leading to undefined behavior. Use check_shl_overflow() to compute the RQ WQE size and reject any invalid values. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Link: https://patch.msgid.link/r/20260611-maher-sec-fixes-v1-1-cd8eb2542869@nvidia.com Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
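To illustrate why shifts of 31 and 32 are a problem for a signed `int`, here is a minimal userspace sketch of a checked shift. It is not the kernel's check_shl_overflow() macro; `demo_shl_overflows()` is a hypothetical helper that only borrows the convention of returning true when the result is not representable.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical checked "1 << shift" for a signed int.
 * Returns true (and leaves *out untouched) when the shift would reach
 * the sign bit or exceed the width of int; otherwise stores the result.
 */
static bool demo_shl_overflows(unsigned int shift, int *out)
{
	if (shift >= sizeof(int) * CHAR_BIT - 1)
		return true;
	*out = 1 << shift;
	return false;
}
```

With a 32-bit `int`, a shift of 30 is the largest value this sketch accepts; 31 and 32 are both rejected instead of invoking undefined behavior.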
2026-06-11 | RDMA/mlx5: Remove raw RSS QP restrack tracking | Patrisious Haddad
Raw RSS QP restrack tracking wasn't working to begin with as it was only tracking the first raw RSS QP which was added, since at creation the raw RSS QP number is reserved so the QP number for this qp type was always zero. The following raw RSS QP additions were always failing silently. Since the fix isn't trivial and there were no users that required or complained about this issue we are dropping this for now instead of fixing. Fixes: 968f0b6f9c01 ("RDMA/mlx5: Consolidate into special function all create QP calls") Link: https://patch.msgid.link/r/20260607-restrack-uaf-fix-v1-2-d72e45eb76c2@nvidia.com Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/mlx5: Remove DCT restrack tracking | Patrisious Haddad
DCT restrack tracking wasn't working to begin with as it was only tracking the first DCT which was added, since at creation the DCT number isn't yet initialized because the DCT FW object is only created during modify. The following DCT additions were failing silently. Since the fix isn't trivial and there were no users that required or complained about this issue we are dropping this for now instead of fixing. Fixes: fd3af5e21866 ("RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.") Link: https://patch.msgid.link/r/20260607-restrack-uaf-fix-v1-1-d72e45eb76c2@nvidia.com Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/mlx5: Drop FRMR pool handle on UMR revoke failure | Michael Guralnik
When UMR revoke fails during MR cleanup, the handle is left in an unknown state and cannot be returned to the pool. The driver already destroys the mkey via the fallback path, but the pool's in_use counter is never decremented, drifting upward over time. Call ib_frmr_pool_drop on the revoke-failure path so the pool's accounting stays consistent with the handles it has handed out. Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-10-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/core: Add ib_frmr_pool_drop for unrecoverable handles | Michael Guralnik
A driver that has popped a handle from an FRMR pool can hit failures that leave the handle in a state where it can't safely be returned for reuse. The driver destroys the handle itself, but the pool has no way to learn about it, so the in_use counter drifts upward. Add ib_frmr_pool_drop to balance the pool's accounting in this case. Every pop is now balanced by exactly one push or drop. Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-9-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
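The accounting rule "every pop is balanced by exactly one push or drop" can be sketched with a toy pool. The struct and function names below are hypothetical and do not reflect the real FRMR pool layout; the sketch only shows how a drop keeps `in_use` from drifting when a handle cannot be returned.

```c
#include <stdbool.h>

/* Toy pool: counts only, no real handles. */
struct demo_pool {
	unsigned int free_handles;	/* handles available for pop */
	unsigned int in_use;		/* handles currently handed out */
};

/* Hand out one handle if any is available. */
static bool demo_pool_pop(struct demo_pool *p)
{
	if (p->free_handles == 0)
		return false;
	p->free_handles--;
	p->in_use++;
	return true;
}

/* Return a healthy handle to the pool for reuse. */
static void demo_pool_push(struct demo_pool *p)
{
	p->in_use--;
	p->free_handles++;
}

/* The caller destroyed the handle itself; only fix the accounting. */
static void demo_pool_drop(struct demo_pool *p)
{
	p->in_use--;
}
```

After two pops, one push and one drop, `in_use` is back to zero and the pool holds one fewer free handle than it started with.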
2026-06-11 | RDMA/core: Fix FRMR handle leak on push failure | Michael Guralnik
A failure to push a handle to the pool, caused by ENOMEM on queue page allocation, skips the in_use counter update and skews the pool state indefinitely. Fix that by moving the handle destruction for this case into the FRMR code, ensuring the handle is either pushed to the pool or destroyed inside the same function. Adjust the mlx5_ib call site accordingly. Fixes: ce5df0b891ed ("IB/core: Introduce FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-8-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/core: Avoid NULL dereference on FRMR bad usage | Michael Guralnik
In case a driver calls FRMR pop operation without a successful init, return after triggering a warning to avoid the NULL dereference. Fixes: ce5df0b891ed ("IB/core: Introduce FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-7-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/core: Fix FRMR set pinned push error path | Michael Guralnik
Add destruction of FRMR handles in case the push to the pool fails. This prevents a resource leak in case pool page allocation fails. Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-6-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Tao Cui <cuitao@kylinos.cn> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/core: Fix FRMR aging push to queue error flow | Michael Guralnik
Aging pools with pinned handles requires moving handles from the active queue to a non-empty inactive queue, which might fail on new page allocation. We currently do not handle the failure and leak any mkey that fails the push. Fix by introducing push_queue_to_queue_locked(), which fills the destination's partial tail page from the source and then splices the remaining source pages onto the destination, performing no allocation. Replace the per-handle move loop in age_pinned_pool() and the open-coded splice in pool_aging_work() with calls to the helper. The helper cannot fail under memory pressure, which removes a class of GFP_ATOMIC allocations under the pool lock and simplifies the error flow. Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-5-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/core: Fix skipped usage for driver built FRMR key | Michael Guralnik
When creating FRMR handles following a netlink command to pin handles, use the key after driver callback instead of using the key passed directly from user. Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-4-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/mlx5: Fix TPH extraction in FRMR pool key | Michael Guralnik
Fix reading the PH value from the FRMR pool key by shifting the pool key to the relevant bits. Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-3-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | RDMA/mlx5: Fix mkey creation error flow rollback | Michael Guralnik
Fix the indices of mkeys destroyed in case of an error in batch mkey creation. Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools") Link: https://patch.msgid.link/r/20260610000145.820592-2-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-11 | IB/core: Delegate IB_QP_RATE_LIMIT validation to drivers | Maher Sanalla
Remove IB_QP_RATE_LIMIT from the qp_state_table and instead pass it through ib_modify_qp_is_ok() unconditionally. This delegates rate limit attribute validation to the individual drivers that support it. As rate limit support expands to additional QP types and transitions across different vendors, centralizing this policy in the core becomes impractical. Each driver is better positioned to enforce its own supported QP types and transitions over non-standard attributes. Future support for non-standard attributes will be handled per vendor driver instead of in generic IB core qp_state_table. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-8-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | RDMA/ionic: Validate rate limit attribute in modify QP | Maher Sanalla
Rate limit transition validation for RC QPs currently relies on the IB core qp_state_table. Add a driver-level helper to validate the rate limit attribute directly during QP modify, ensuring it is only accepted for RC QPs in INIT->RTR, RTR->RTS and RTS->RTS transitions. This makes the driver responsible for rate limit validation and prepares for a follow-up IB core change that delegates IB_QP_RATE_LIMIT and all future non-standard modify attributes handling to individual vendor drivers. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-7-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
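A driver-level check of the kind described here might look roughly like the sketch below. The enums and `demo_rate_limit_ok()` are hypothetical stand-ins, not the kernel's `enum ib_qp_type` or `enum ib_qp_state`, and the real helper in the driver may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the QP type and state enums. */
enum demo_qp_type { DEMO_QPT_RC, DEMO_QPT_UD };
enum demo_qp_state { DEMO_QPS_INIT, DEMO_QPS_RTR, DEMO_QPS_RTS };

/*
 * Accept a rate limit attribute only for RC QPs and only on the
 * INIT->RTR, RTR->RTS and RTS->RTS transitions named in the message.
 */
static bool demo_rate_limit_ok(enum demo_qp_type type,
			       enum demo_qp_state cur,
			       enum demo_qp_state next)
{
	if (type != DEMO_QPT_RC)
		return false;

	return (cur == DEMO_QPS_INIT && next == DEMO_QPS_RTR) ||
	       (cur == DEMO_QPS_RTR && next == DEMO_QPS_RTS) ||
	       (cur == DEMO_QPS_RTS && next == DEMO_QPS_RTS);
}
```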
2026-06-11 | RDMA/bnxt_re: Validate rate limit attribute in modify QP | Maher Sanalla
Rate limit transition validation for RC QPs currently relies on the IB core qp_state_table. Add a driver-level helper to validate the rate limit attribute directly during QP modify, ensuring it is only accepted for RC QPs in INIT->RTR, RTR->RTS and RTS->RTS transitions. This makes the driver responsible for rate limit validation and prepares for a follow-up IB core change that delegates IB_QP_RATE_LIMIT and all future non-standard modify attributes handling to individual vendor drivers. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-6-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | RDMA/mlx5: Report packet pacing capabilities when querying device | Maher Sanalla
When querying device, report packet pacing capabilities for UD and UC QPs when device supports it. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-5-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | RDMA/mlx5: Support deferred rate limit configuration | Maher Sanalla
Allow passing a rate limit attribute in modify QP flows even when the QP is in a state that does not support packet pacing programming in the lower layers. When the user sets a rate limit during a QP transition that is not to RTS, store the value in the mlx5 QP struct and program it to FW when the QP later transitions to RTS, which is the state that allows configuring the rate limit index in the QP context. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-4-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
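The deferral described above amounts to "remember now, program on the transition to RTS". A toy model, with hypothetical names (`demo_qp`, `demo_modify_qp()`) and a plain field standing in for the firmware state, could look like this:

```c
#include <stdbool.h>

enum demo_state { DEMO_INIT, DEMO_RTR, DEMO_RTS };

struct demo_qp {
	unsigned int stored_rate;	/* requested, possibly not yet programmed */
	unsigned int hw_rate;		/* what the simulated hardware uses */
};

/*
 * Record a requested rate on any transition, but only program it into
 * the simulated hardware when the QP moves to RTS.
 */
static void demo_modify_qp(struct demo_qp *qp, enum demo_state next,
			   bool has_rate, unsigned int rate)
{
	if (has_rate)
		qp->stored_rate = rate;
	if (next == DEMO_RTS)
		qp->hw_rate = qp->stored_rate;
}
```

In this sketch a rate passed during a transition to RTR stays pending and takes effect on the later transition to RTS, even if that later call carries no rate attribute.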
2026-06-11 | RDMA/mlx5: Add support for rate limit in UD and UC QPs | Maher Sanalla
Rate limiting is currently supported only for raw packet QPs, where the packet pacing index is programmed into the SQC during SQ modify. Extend rate limit support to UD and UC QPs by setting the pacing index in the QPC during RTR2RTS and RTS2RTS transitions. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-3-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | RDMA/mlx5: Refactor raw packet QP rate limit handling | Maher Sanalla
Refactor the raw packet QP modify path to extract rate limit configuration into a qp_rl_parse() helper that parses user attributes, and a qp_rl_prepare() helper that handles FW rate limit table adjustments before the SQ modify itself. Use qp_rl_commit() to commit changes to QP once FW call succeeds, and qp_rl_rollback() to rollback changes done to the FW rate limit table in the prepare stage, in case the modify operation fails. These helpers will be reused for extending rate limit support to additional QP types in the following patch. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-2-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
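The prepare / commit / rollback split can be illustrated with a toy model. Only the pattern is taken from the message: the `demo_*` names, the reference counter standing in for the FW rate limit table, and the `fw_ok` flag simulating the firmware call are all assumptions made for this sketch.

```c
#include <stdbool.h>

struct demo_rl_table { int refs; };	/* stands in for the FW rate table */
struct demo_qp { unsigned int rate; };	/* rate currently owned by the QP */

/* Reserve a table entry before the modify is attempted. */
static void demo_rl_prepare(struct demo_rl_table *t) { t->refs++; }

/* The modify succeeded: make the new rate visible on the QP. */
static void demo_rl_commit(struct demo_qp *qp, unsigned int rate)
{
	qp->rate = rate;
}

/* The modify failed: undo what prepare did to the table. */
static void demo_rl_rollback(struct demo_rl_table *t) { t->refs--; }

static bool demo_modify(struct demo_rl_table *t, struct demo_qp *qp,
			unsigned int rate, bool fw_ok)
{
	demo_rl_prepare(t);
	if (!fw_ok) {		/* simulated firmware failure */
		demo_rl_rollback(t);
		return false;
	}
	demo_rl_commit(qp, rate);
	return true;
}
```

The point of the split is that a failed modify leaves both the table and the QP exactly as they were before the call.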
2026-06-10 | RDMA/core: Fix broadcast address falsely detected as local | Maher Sanalla
When rdma_resolve_addr() is invoked with a broadcast destination on an IPoIB interface, is_dst_local() inspects the resolved route and incorrectly concludes that the address is local. As a result, the resolution fails with -ENODEV. The issue stems from using '&' to compare rt_type with RTN_LOCAL. The RTN_* values form a sequential enum, not a bitmask (RTN_LOCAL=2, RTN_BROADCAST=3). Thus, "rt_type & RTN_LOCAL" yields a non-zero result for a broadcast route as well. Replace '&' with '==' when comparing rt_type against RTN_LOCAL. Link: https://patch.msgid.link/r/20260609-fix-rdma-resolve-addr-v1-1-449b8b4e6c09@nvidia.com Cc: stable@vger.kernel.org Fixes: c31e4038c97f ("RDMA/core: Use route entry flag to decide on loopback traffic") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
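The effect of using '&' on a sequential enum is easy to reproduce in isolation. The sketch below defines its own constants with the two values quoted in the message (2 and 3) rather than including the kernel headers, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Local copies of the two values quoted in the commit message. */
enum demo_rtn {
	DEMO_RTN_LOCAL = 2,
	DEMO_RTN_BROADCAST = 3,
};

/* Buggy form: treats a sequential enum as if it were a bitmask. */
static bool demo_is_local_buggy(int rt_type)
{
	return (rt_type & DEMO_RTN_LOCAL) != 0;
}

/* Fixed form: compares for equality. */
static bool demo_is_local_fixed(int rt_type)
{
	return rt_type == DEMO_RTN_LOCAL;
}
```

Because 3 & 2 equals 2, the buggy form reports a broadcast route as local, while the equality test does not.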
2026-06-10 | RDMA/bnxt_re: Check debugfs parameter allocation for failure | Ruoyu Wang
bnxt_re_debugfs_add_pdev() allocates per-file private data for the CC configuration debugfs entries. The loop that initializes those entries uses rdev->cc_config_params immediately, so allocation failure would lead to NULL pointer dereferences while setting up debugfs. Debugfs is best-effort. If the CC configuration private data cannot be allocated just stop. Link: https://patch.msgid.link/r/20260606040644.13-1-ruoyuw560@gmail.com Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-09 | net: change ndo_set_rx_mode_async return type to int | Stanislav Fomichev
Change the return type of ndo_set_rx_mode_async from void to int to allow drivers to report failures back to the core stack. This is a prerequisite for adding retry logic in the core when drivers fail to program RX filters (e.g. bnxt VF when PF is unavailable). All existing implementations return 0 for now, maintaining current behavior. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260608154014.227538-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09 | RDMA/mana_ib: Allocate interrupt contexts on EQs | Long Li
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These interrupt contexts may be shared with Ethernet EQs when MSI-X vectors are limited. The driver now supports allocating dedicated MSI-X for each EQ. Indicate this capability through driver capability bits. The RDMA EQs pass use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the capability flag advertises that the driver supports per-vPort EQ separation when hardware has sufficient vectors. Populate eq.irq on all RDMA EQs for consistency with the Ethernet path. Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its numeric BIT(6) position among the other capability flags. Signed-off-by: Long Li <longli@microsoft.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://patch.msgid.link/20260605005717.2059954-7-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09 | net: mana: Create separate EQs for each vPort | Long Li
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ sharing among the vPorts and create dedicated EQs for each vPort. Move the EQ definition from struct mana_context to struct mana_port_context and update related support functions. Export mana_create_eq() and mana_destroy_eq() for use by the MANA RDMA driver. RSS QPs now take a vport reference via pd->vport_use_count to ensure EQs outlive all QP consumers. The vport must already be configured by a raw QP before an RSS QP can be created. EQs are only destroyed when the last QP (raw or RSS) on the PD releases its reference. Restrict each vport to a single RSS QP. The hardware only supports one steering configuration (indirection table / hash key) per vport, and mana_disable_vport_rx() on QP destroy disables RX globally for the vport. Previously, creating a second RSS QP would silently overwrite the first QP's steering config and destroy would blackhole all traffic. This is now explicitly rejected with -EBUSY. Existing applications (DPDK being the primary RDMA consumer) always create one RSS QP per vport, so no real-world flows are affected. Reject cross-port PD sharing for both raw and RSS QPs. Since EQs and vport configuration are per-port, a PD is bound to the port used by its first raw QP. Subsequent QPs on the same PD must use the same port or the creation fails with -EINVAL. Previously this was silently broken: with shared EQs it appeared to work, but with per-vPort EQs a cross-port PD would cause wrong-port EQ teardown and corruption. DPDK creates one PD per port so no existing flows are affected. Serialize mana_set_channels() and the async per-port queue reset handler against RDMA vport configuration to prevent RDMA from claiming the vport during the detach/attach window. A channel_changing flag is set under apc->vport_mutex before detach and checked by mana_cfg_vport() when called from the RDMA path, blocking RDMA from grabbing the vport during the entire window. 
When the port is down and RDMA already holds the vport, the channel change is rejected with -EBUSY. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/20260605005717.2059954-2-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09 | RDMA/efa: Implement the query port speed verb | Tom Sela
Implement the query port speed callback to report the port effective bandwidth directly in 100 Mb/s granularity. Link: https://patch.msgid.link/r/20260608083927.4116-1-tomsela@amazon.com Reviewed-by: Michael Margolin <mrgolin@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Tom Sela <tomsela@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-09 RDMA/efa: Report 800 and 1600 Gbps link speed Tom Sela
Add support for reporting 800 Gbps as 8X NDR and 1600 Gbps as 8X XDR link speeds. Link: https://patch.msgid.link/r/20260608083736.48454-1-tomsela@amazon.com Reviewed-by: Michael Margolin <mrgolin@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Tom Sela <tomsela@amazon.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
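The arithmetic behind these two mappings can be sketched as follows, assuming the nominal InfiniBand per-lane rates of 100 Gb/s for NDR and 200 Gb/s for XDR. The enum, struct and function names are hypothetical and are not the kernel's `ib_port_speed` or `ib_port_width` definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-lane rates in Gb/s (nominal values, assumed here). */
enum toy_lane_rate {
    TOY_RATE_NDR = 100,
    TOY_RATE_XDR = 200,
};

struct toy_link_mode {
    enum toy_lane_rate rate;
    unsigned int lanes;
};

/* Map an aggregate speed in Gb/s to a (lane rate, lane count) pair. */
int toy_link_mode_for(unsigned int gbps, struct toy_link_mode *out)
{
    assert(out != NULL);
    switch (gbps) {
    case 800:                       /* 8 lanes x 100 Gb/s */
        out->rate = TOY_RATE_NDR;
        out->lanes = 8;
        return 0;
    case 1600:                      /* 8 lanes x 200 Gb/s */
        out->rate = TOY_RATE_XDR;
        out->lanes = 8;
        return 0;
    default:
        return -EINVAL;             /* speeds outside this sketch */
    }
}
```

Multiplying `rate` by `lanes` in the returned mode gives back the aggregate speed, which is a quick consistency check on any such table.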
2026-06-09 RDMA/mlx5: Use strscpy() to copy strings into arrays David Laight
Replacing strcpy() with strscpy() ensures that overflow of the target buffer cannot happen. Link: https://patch.msgid.link/r/20260608095500.2567-2-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
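Why a bounded copy cannot overflow the destination can be shown with a userspace stand-in. `toy_bounded_copy()` is a hypothetical function modelled on the documented strscpy() contract (always NUL-terminate, return the copied length or -E2BIG on truncation); it is not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Copy at most size - 1 characters from src and always NUL-terminate dst.
 * Returns the number of characters copied, or -E2BIG if size is 0 or the
 * source had to be truncated.
 */
long toy_bounded_copy(char *dst, const char *src, size_t size)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    if (size == 0)
        return -E2BIG;

    /* Stop one byte early so the terminator always fits. */
    for (i = 0; i < size - 1 && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';

    /* If the source continues past what was copied, report truncation. */
    return src[i] == '\0' ? (long)i : -E2BIG;
}
```

With a 4-byte destination, copying "abcdef" stores "abc" plus the terminator and reports -E2BIG, where an unbounded strcpy() would have written past the end of the array.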
2026-06-09 RDMA/usnic: Use strscpy() to copy device name David Laight
Link: https://patch.msgid.link/r/20260606202633.5018-11-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-09 RDMA/iwcm: Use strscpy() to copy device name David Laight
Link: https://patch.msgid.link/r/20260606202633.5018-10-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-09 IB/mlx4: Fill in the access_flags if IB_MR_REREG_ACCESS is not specified Jason Gunthorpe
Sashiko noticed mlx4 was using whatever random access flags were provided when IB_MR_REREG_ACCESS is not used. Since IB_MR_REREG_TRANS needs access_flags it used the random ones which means it doesn't work sensibly if userspace provides only IB_MR_REREG_TRANS. Keep track of the current access_flag of the MR and use it if the user does not specify one. Also fixup a little confusion around mmr.access, it is the HW access flags so the convert_access() was missing. But nothing reads this by the time rereg_mr can happen. Fixes: 9376932d0c26 ("IB/mlx4_ib: Add support for user MR re-registration") Link: https://patch.msgid.link/r/0-v1-29ca7a402625+ddd6-mlx4_rereg_flags_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
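The idea of remembering the MR's current access flags and falling back to them can be sketched like this. Everything here is hypothetical: the `TOY_REREG_*` values are not the real `IB_MR_REREG_*` constants, and `toy_mr` is not the mlx4 MR structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for illustration only. */
#define TOY_REREG_TRANS  0x1
#define TOY_REREG_ACCESS 0x2

struct toy_mr {
    int access_flags;   /* access flags currently in effect for the MR */
};

/*
 * Return the access flags a re-registration should use. The caller's
 * value is trusted only when TOY_REREG_ACCESS is set; otherwise the
 * stored flags are used and the caller's value is ignored.
 */
int toy_rereg_access(struct toy_mr *mr, int rereg_flags, int user_access)
{
    assert(mr != NULL);
    if (rereg_flags & TOY_REREG_ACCESS)
        mr->access_flags = user_access;
    return mr->access_flags;
}
```

A translation-only rereg therefore keeps whatever access the MR already had, regardless of what happens to be in the unused access argument.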
2026-06-08 RDMA/rtrs-srv: Fix integer underflow in process_read and process_write Aurelien DESBRIERES
usr_len is read from a network-supplied message field (le16_to_cpu) and used to compute data_len = off - usr_len without validating that usr_len <= off. A malicious RDMA client can send usr_len > off causing an integer underflow, resulting in data_len wrapping to a huge size_t value which is then passed to the rdma_ev callback as a memory length, leading to out-of-bounds memory access. Fix by reading and validating usr_len <= off before rtrs_srv_get_ops_ids() in both process_read() and process_write(), ensuring the early return path acquires no reference and has no resource leak. Link: https://patch.msgid.link/r/20260608134802.5019-1-aurelien@hackers.camp Reported-by: Aurelien DESBRIERES <aurelien@hackers.camp> Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Aurelien DESBRIERES <aurelien@hackers.camp> Assisted-by: Claude <claude-sonnet-4-6> Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
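The underflow and the check that prevents it can be sketched in isolation. `toy_checked_data_len()` is a hypothetical helper, not the rtrs-srv code, and the -EINVAL return value is an assumption, since the commit message does not say which error the early return path uses.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute data_len = off - usr_len only after checking usr_len <= off.
 * Without the check, an unsigned subtraction with usr_len > off wraps
 * around to a very large size_t instead of going negative.
 */
int toy_checked_data_len(size_t off, uint16_t usr_len, size_t *data_len)
{
    assert(data_len != NULL);
    if (usr_len > off)
        return -EINVAL;     /* assumed error code; reject before any use */
    *data_len = off - usr_len;
    return 0;
}
```

Doing the check before any other work, as the commit describes, means the rejection path has nothing to undo: no reference has been taken and `*data_len` is left untouched.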
2026-06-08 IB/mlx5: Push pdn above pagefault_dmabuf_mr() Jason Gunthorpe
Remove the mlx5_mr_pdn() inside pagefault_dmabuf_mr(), the only user of the pdn is the init path which is inside an ioctl. Link: https://patch.msgid.link/r/10-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-08 IB/mlx5: Push pdn above pagefault_real_mr() Jason Gunthorpe
Remove the mlx5_mr_pdn() in pagefault_real_mr() by pushing the pdn up, all the callers use 0 since they don't pass MLX5_PF_FLAGS_ENABLE except the ioctl reg_mr path which can use the ioctl pd. Link: https://patch.msgid.link/r/9-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com Assisted-by: Codex:gpt-5-5 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-08 IB/mlx5: Push pdn above mlx5r_umr_update_xlt() Jason Gunthorpe
Keep pushing the pdn higher to remove more places touching mr->pd: - XLT combinations that don't use PDN can just pass 0 - Use local pd values instead of mr->pd - Implicit MR does not have inplace rereg, so the mr->pd is safe Link: https://patch.msgid.link/r/8-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com Assisted-by: Codex:gpt-5-5 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2026-06-08 IB/mlx5: Don't mangle the mr->pd inside the rereg callback Jason Gunthorpe
The rereg protocol expects the core code to change mr->pd and synchronize that change with the atomics and syncs. The driver should not touch it. mlx5 needed to update it in umr_rereg_pas() because mlx5r_umr_update_mr_pas() required the updated mr->pd to build the UMR. Simply switch mlx5r_umr_update_mr_pas() to use the pdn directly from the new pd and remove the mr->pd update. Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs") Link: https://patch.msgid.link/r/7-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com Assisted-by: Codex:gpt-5-5 Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>