| Age | Commit message (Collapse) | Author |
|
Pull rdma updates from Jason Gunthorpe:
"Many AI driven bug fixes, and several big driver API cleanups
- Driver bug fixes and minor cleanups in mlx5, hns, rxe, efa, siw,
rtrs, mana, irdma, mlx4. Commonly error path flows, integer
arithmetic overflows on unsafe data, out of bounds access, and use
after free issues under races.
- Second half of the new udata API for drivers focusing on uAPI
response
- bnxt_re supports more options for QP creation that will allow a dv
path in rdma-core
- Untangle the module dependencies so drivers don't link to
ib_uverbs.ko as was originall intended
- Provide a new way to handle umems with a consistent simplified uAPI
and update several drivers to use it. This brings dmabuf support to
more places and more drivers
- Support for mlx5 rate limit and packet pacing for UD and UC
- A batch of fixes for the new shared FRMR pools infrastructure"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (148 commits)
RDMA/irdma: Replace waitqueue and flag with completion
RDMA/hns: Fix memory leak of bonding resources
RDMA/rtrs-srv: Bound RDMA-Write length to chunk size in rdma_write_sg
docs: infiniband: correct name of option to enable the ib_uverbs module
RDMA/bnxt_re: Reject GET_TOGGLE_MEM when toggle page was not allocated
RDMA/bnxt_re: Fail DBR related page allocation UAPIs if the feature is disabled
RDMA/bnxt_re: Avoid repeated requests to allocate WC pages
RDMA/bnxt_re: Proper rollback if the ioremap fails
RDMA/bnxt_re: Add a max slot check for SQ
RDMA/bnxt_re: Avoid displaying the kernel pointer
RDMA/bnxt_re: Free CQ toggle page after firmware teardown
RDMA/bnxt_re: Free SRQ toggle page after firmware teardown
RDMA/bnxt_re: Initialize dpi variable to zero
ABI: sysfs-class-infiniband: minor cleanup
RDMA/mlx5: Release the HW‑provided UAR index rather than the SW one
RDMA/mlx5: Fix undefined shift of user RQ WQE size
RDMA/mlx5: Remove raw RSS QP restrack tracking
RDMA/mlx5: Remove DCT restrack tracking
RDMA/mlx5: Drop FRMR pool handle on UMR revoke failure
RDMA/core: Add ib_frmr_pool_drop for unrecoverable handles
...
|
|
The driver previously used a waitqueue along with an explicit
request_done flag, but without proper barriers around request_done.
An earlier patch by Gui-Dong Han <hanguidong02@gmail.com> attempted
to fix this by adding the missing memory barriers. Rather than
adding the barriers, this patch replaces the waitqueue+flag with
a completion, which is designed for this exact purpose.
Fixes: 44d9e52977a1 ("RDMA/irdma: Implement device initialization definitions")
Fixes: 915cc7ac0f8e ("RDMA/irdma: Add miscellaneous utility definitions")
Link: https://patch.msgid.link/r/20260616155601.1081448-1-jmoroni@google.com
Signed-off-by: Jacob Moroni <jmoroni@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In a corner case of concurrent driver removal and driver reset,
bonding resource is first released in hns_roce_hw_v2_exit() during
driver removal, and then is allocated again in hns_roce_register_device()
during driver reset. This leads to memory leak because the release
timing has already passed. This may also lead to a kernel panic
as below because of the leaked notifier callback:
Call trace:
0xffffa20fccc04978 (P)
raw_notifier_call_chain+0x20/0x38
call_netdevice_notifiers_info+0x60/0xb8
netdev_lower_state_changed+0x4c/0xb8
As Sashiko suggested, the teardown order of bonding resources should
be inverted to make sure the resources are released when the driver
is removed.
Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources")
Link: https://patch.msgid.link/r/20260613102045.811623-1-huangjunxian6@hisilicon.com
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When the server answers an RTRS READ, rdma_write_sg() builds the source
scatter/gather entry for the IB_WR_RDMA_WRITE that returns data to the
peer. Its length is taken directly from the wire descriptor:
plist->length = le32_to_cpu(id->rd_msg->desc[0].len);
rd_msg points into the chunk buffer that the remote peer filled via
RDMA-WRITE-WITH-IMM (rtrs_srv_rdma_done() -> process_io_req() ->
process_read()), so desc[0].len is attacker-controlled and, before this
change, was only rejected when zero. The source address is the fixed
chunk start (dma_addr[msg_id]) and the source lkey is the PD-wide
local_dma_lkey, which is not tied to the chunk's MR mapping, so the verbs
layer does not constrain the transfer length to max_chunk_size. msg_id
and off are bounded against queue_depth and max_chunk_size in
rtrs_srv_rdma_done(), but desc[0].len is a separate field that was not
checked against the chunk size.
A peer that advertises desc[0].len larger than max_chunk_size can make
the posted RDMA write read past the chunk's mapped region. The resulting
behaviour depends on the IOMMU configuration: with no IOMMU or in
passthrough mode the read may extend into memory adjacent to the chunk
and be returned to the peer, which can disclose host memory; with a
translating IOMMU the out-of-range access is expected to fault and abort
the connection. In either case the transfer exceeds what the protocol
permits and is driven by a remote peer.
Reject a descriptor length above max_chunk_size, mirroring the existing
off >= max_chunk_size bound in rtrs_srv_rdma_done(). Legitimate clients
do not exceed it: the client sets desc[0].len to its MR length, which is
capped at the negotiated max_io_size (max_chunk_size - MAX_HDR_SIZE).
Fixes: 9cb837480424 ("RDMA/rtrs: server: main functionality")
Link: https://patch.msgid.link/r/20260612-master-v1-1-70cde5c6fdc9@gmail.com
Reported-by: Yuhao Jiang <danisjiang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Zhenhao Wan <whi4ed0g@gmail.com>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
If a user calls BNXT_RE_METHOD_GET_TOGGLE_MEM on a device that does not
support the CQ/SRQ toggle feature, uctx_cq_page or uctx_srq_page will
be NULL.
Add an explicit -EOPNOTSUPP return after capturing the address from
uctx_cq_page / uctx_srq_page if the address is zero.
Fixes: e275919d9669 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace")
Fixes: 181028a0d84c ("RDMA/bnxt_re: Share a page to expose per SRQ info with userspace")
Link: https://patch.msgid.link/r/20260615224751.232802-16-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
No need to support the DBR related page allocations if the pacing feature
is disabled. Fail the request if pacing is disabled.
Fixes: ea2224857882 ("RDMA/bnxt_re: Update alloc_page uapi for pacing")
Link: https://patch.msgid.link/r/20260615224751.232802-15-selvin.xavier@broadcom.com
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Applications can request multiple WC pages for the same ucontext.
As of now, only 1 WC page per ucontext is supported. Add a lock to
avoid concurrent access and a check to fail repeated requests.
Also, if the mmap entry insert fails for the WC, free the Doorbell
page index mapped for the WC page.
Fixes: eee6268421a2 ("RDMA/bnxt_re: Move the UAPI methods to a dedicated file")
Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push")
Link: https://patch.msgid.link/r/20260615224751.232802-12-selvin.xavier@broadcom.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
bnxt_qplib_alloc_dpi returns success even if ioremap fails.
Add the proper rollback when the ioremap fails and return
-ENOMEM status.
Fixes: 0ac20faf5d83 ("RDMA/bnxt_re: Reorg the bar mapping")
Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push")
Link: https://patch.msgid.link/r/20260615224751.232802-11-selvin.xavier@broadcom.com
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The variable WQE mode must be validated against
the maximum slots supported by HW. The max supported
value is 64K. Adding a max and min check and fail if user
supplied value is more than the max supported and zero.
Fixes: d8ea645d6984 ("RDMA/bnxt_re: Handle variable WQE support for user applications")
Link: https://patch.msgid.link/r/20260615224751.232802-10-selvin.xavier@broadcom.com
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
While dumping the info on MR using the rdma tool, we
dump the mr_hwq which is a kernel pointer. There is
no need to expose this value for end user. So avoid
it.
Fixes: 7363eb76b7f3 ("RDMA/bnxt_re: Support driver specific data collection using rdma tool")
Link: https://patch.msgid.link/r/20260615224751.232802-9-selvin.xavier@broadcom.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Free the toggle page only after firmware teardown completes so that
an NQ interrupt arriving during bnxt_qplib_destroy_cq() won't write
the toggle value to an already-freed page. Move free_page() after
bnxt_qplib_destroy_cq.
Fixes: e275919d9669 ("RDMA/bnxt_re: Share a page to expose per CQ info with userspace")
Link: https://patch.msgid.link/r/20260615224751.232802-4-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Free the toggle page only after firmware teardown completes so that
an NQ interrupt arriving during bnxt_qplib_destroy_srq() won't write
the toggle values to an already-freed page. Move free_page() after
bnxt_qplib_destroy_srq().
Fixes: 181028a0d84c ("RDMA/bnxt_re: Share a page to expose per SRQ info with userspace")
Link: https://patch.msgid.link/r/20260615224751.232802-3-selvin.xavier@broadcom.com
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
dpi is initialized only for BNXT_RE_ALLOC_WC_PAGE, but copied
for all the cases. So initialize the dpi to 0.
Fixes: eee6268421a2 ("RDMA/bnxt_re: Move the UAPI methods to a dedicated file")
Fixes: 360da60d6c6e ("RDMA/bnxt_re: Enable low latency push")
Link: https://patch.msgid.link/r/20260615224751.232802-2-selvin.xavier@broadcom.com
Reviewed-by: Anantha Prabhu <anantha.prabhu@broadcom.com>
Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Cross-merge networking fixes after downstream PR (net-7.1-rc8).
Conflicts:
drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c
f67aead16e85 ("net: txgbe: rework service event handling")
57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices")
net/rds/info.c
512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()")
6e94eeb2a2a6 ("rds: convert to getsockopt_iter")
Adjacent changes:
include/net/sock.h
1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs")
f0de88303d5e ("net: make is_skb_wmem() available to modules")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Free the UAR index returned by the hardware.
Fixes: 4ed131d0bb15 ("IB/mlx5: Expose dynamic mmap allocation")
Link: https://patch.msgid.link/r/20260611-fix-uar-release-v1-1-f5464d845dbf@nvidia.com
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
set_rq_size() computes the RQ WQE size as "1 << rq_wqe_shift" based on
the user-provided rq_wqe_shift, which is only checked to be greater than
32, so shifts of 32 are still accepted. A shift of 31 also overflows a
signed integer, leading to undefined behavior.
Use check_shl_overflow() to compute the RQ WQE size and reject any
invalid values.
Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Link: https://patch.msgid.link/r/20260611-maher-sec-fixes-v1-1-cd8eb2542869@nvidia.com
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Raw RSS QP restrack tracking wasn't working to begin with as it was
only tracking the first raw RSS QP which was added, since at creation
the raw RSS QP number is reserved so the QP number for this qp type
was always zero.
The following raw RSS QP additions were always failing silently.
Since the fix isn't trivial and there were no users that required or
complained about this issue we are dropping this for now instead of fixing.
Fixes: 968f0b6f9c01 ("RDMA/mlx5: Consolidate into special function all create QP calls")
Link: https://patch.msgid.link/r/20260607-restrack-uaf-fix-v1-2-d72e45eb76c2@nvidia.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
DCT restrack tracking wasn't working to begin with as it was only
tracking the first DCT which was added, since at creation the DCT number
isn't yet initialized because the DCT FW object is only created during
modify. The following DCT additions were failing silently.
Since the fix isn't trivial and there were no users that required or
complained about this issue we are dropping this for now instead of fixing.
Fixes: fd3af5e21866 ("RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.")
Link: https://patch.msgid.link/r/20260607-restrack-uaf-fix-v1-1-d72e45eb76c2@nvidia.com
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When UMR revoke fails during MR cleanup, the handle is left in an
unknown state and cannot be returned to the pool. The driver already
destroys the mkey via the fallback path, but the pool's in_use counter
is never decremented, drifting upward over time.
Call ib_frmr_pool_drop on the revoke-failure path so the pool's
accounting stays consistent with the handles it has handed out.
Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-10-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
A driver that has popped a handle from an FRMR pool can hit failures
that leave the handle in a state where it can't safely be returned
for reuse. The driver destroys the handle itself, but the pool has
no way to learn about it, so the in_use counter drifts upward.
Add ib_frmr_pool_drop to balance the pool's accounting in this case.
Every pop is now balanced by exactly one push or drop.
Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-9-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Failure to push a handle to the pool, caused by ENOMEM on queue page
allocation, will trigger missing in_use counter update, skewing pool
state indefinitely.
Fix that by moving the handling of handle destruction in such case
into the FRMR code, ensuring the handle is either pushed to the pool
or destroyed inside the same function.
Adjust mlx5_ib call site accordingly.
Fixes: ce5df0b891ed ("IB/core: Introduce FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-8-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
In case a driver calls FRMR pop operation without a successful init,
return after triggering a warning to avoid the NULL dereference.
Fixes: ce5df0b891ed ("IB/core: Introduce FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-7-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add destruction of FRMR handles in case the push to the pool fails.
This prevents resources leak in case pool page allocation fails.
Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-6-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Tao Cui <cuitao@kylinos.cn>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Aging pools with pinned handles requires moving handles from the
active queue to a non-empty inactive queue that might fail on new page
allocation, we are currently not handling the fault and leaking any mkey
that fails the push.
Fix by Introducing push_queue_to_queue_locked() that fills the
destination's partial tail page from the source and then splices the
remaining source pages onto the destination, performing no allocation.
Replace the per-handle move loop in age_pinned_pool() and the
open-coded splice in pool_aging_work() with calls to the helper.
As the helper cannot fail under memory pressure, removing a class of
GFP_ATOMIC allocations under the pool lock and simplifying the error
flow.
Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-5-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When creating FRMR handles following a netlink command to pin handles,
use the key after driver callback instead of using the key passed directly
from user.
Fixes: 020d189d16a6 ("RDMA/core: Add pinned handles to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-4-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Fix reading the PH value from the FRMR pool key by shifting the pool key
to the relevant bits.
Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-3-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Fix the indices of mkeys destroyed in case of an error in batch mkey
creation.
Fixes: 36680ef7bceb ("RDMA/mlx5: Switch from MR cache to FRMR pools")
Link: https://patch.msgid.link/r/20260610000145.820592-2-michaelgur@nvidia.com
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Remove IB_QP_RATE_LIMIT from the qp_state_table and instead
pass it through ib_modify_qp_is_ok() unconditionally. This
delegates rate limit attribute validation to the individual
drivers that support it.
As rate limit support expands to additional QP types and transitions
across different vendors, centralizing this policy in the core becomes
impractical. Each driver is better positioned to enforce its own
supported QP types and transitions over non-standard attributes.
Future support for non-standard attributes will be handled per vendor
driver instead of in generic IB core qp_state_table.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-8-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Rate limit transition validation for RC QPs currently relies on
the IB core qp_state_table. Add a driver-level helper to validate
the rate limit attribute directly during QP modify, ensuring it
is only accepted for RC QPs in INIT->RTR, RTR->RTS and RTS->RTS
transitions.
This makes the driver responsible for rate limit validation
and prepares for a follow-up IB core change that delegates
IB_QP_RATE_LIMIT and all future non-standard modify attributes
handling to individual vendor drivers.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-7-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Rate limit transition validation for RC QPs currently relies on
the IB core qp_state_table. Add a driver-level helper to validate
the rate limit attribute directly during QP modify, ensuring it
is only accepted for RC QPs in INIT->RTR, RTR->RTS and RTS->RTS
transitions.
This makes the driver responsible for rate limit validation
and prepares for a follow-up IB core change that delegates
IB_QP_RATE_LIMIT and all future non-standard modify attributes
handling to individual vendor drivers.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-6-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When querying device, report packet pacing capabilities for UD and
UC QPs when device supports it.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-5-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Allow passing a rate limit attribute in modify QP flows even when the
QP is in a state that does not support packet pacing programming in
the lower layers.
When the user sets a rate limit during a QP transition that is not to
RTS, store the value in the mlx5 QP struct and program it to FW when
the QP later transitions to RTS, which is the state that allows
configuring the rate limit index in the QP context.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-4-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Rate limiting is currently supported only for raw packet QPs, where
the packet pacing index is programmed into the SQC during SQ modify.
Extend rate limit support to UD and UC QPs by setting the pacing
index in the QPC during RTR2RTS and RTS2RTS transitions.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-3-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Refactor the raw packet QP modify path to extract rate limit
configuration into a qp_rl_parse() helper that parses user attributes,
and a qp_rl_prepare() helper that handles FW rate limit table
adjustments before the SQ modify itself.
Use qp_rl_commit() to commit changes to QP once FW call
succeeds, and qp_rl_rollback() to rollback changes done to
the FW rate limit table in the prepare stage, in case the
modify operation fails.
These helpers will be reused for extending rate limit support to
additional QP types in the following patch.
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/20260524-packet-pacing-v1-2-3d79439f8d08@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
When rdma_resolve_addr() is invoked with a broadcast destination on an
IPoIB interface, is_dst_local() inspects the resolved route and
incorrectly concludes that the address is local. As a result, the
resolution fails with -ENODEV.
The issue stems from using '&' to compare rt_type with RTN_LOCAL. The
RTN_* values form a sequential enum, not a bitmask (RTN_LOCAL=2,
RTN_BROADCAST=3). Thus, "rt_type & RTN_LOCAL" yields a non-zero result
for a broadcast route as well.
Replace '&' with '==' when comparing rt_type against RTN_LOCAL.
Link: https://patch.msgid.link/r/20260609-fix-rdma-resolve-addr-v1-1-449b8b4e6c09@nvidia.com
Cc: stable@vger.kernel.org
Fixes: c31e4038c97f ("RDMA/core: Use route entry flag to decide on loopback traffic")
Signed-off-by: Maher Sanalla <msanalla@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
bnxt_re_debugfs_add_pdev() allocates per-file private data for the CC
configuration debugfs entries. The loop that initializes those entries
uses rdev->cc_config_params immediately, so allocation failure would lead
to NULL pointer dereferences while setting up debugfs.
Debugfs is best-effort. If the CC configuration private data cannot be
allocated just stop.
Link: https://patch.msgid.link/r/20260606040644.13-1-ruoyuw560@gmail.com
Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Change the return type of ndo_set_rx_mode_async from void to int to
allow drivers to report failures back to the core stack. This is a
prerequisite for adding retry logic in the core when drivers fail to
program RX filters (e.g. bnxt VF when PF is unavailable).
All existing implementations return 0 for now, maintaining current
behavior.
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260608154014.227538-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits. The RDMA EQs pass
use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the
capability flag advertises that the driver supports per-vPort EQ
separation when hardware has sufficient vectors.
Populate eq.irq on all RDMA EQs for consistency with the Ethernet path.
Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its
numeric BIT(6) position among the other capability flags.
Signed-off-by: Long Li <longli@microsoft.com>
Acked-by: Leon Romanovsky <leon@kernel.org>
Link: https://patch.msgid.link/20260605005717.2059954-7-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
RSS QPs now take a vport reference via pd->vport_use_count to ensure
EQs outlive all QP consumers. The vport must already be configured by
a raw QP before an RSS QP can be created. EQs are only destroyed when
the last QP (raw or RSS) on the PD releases its reference.
Restrict each vport to a single RSS QP. The hardware only supports one
steering configuration (indirection table / hash key) per vport, and
mana_disable_vport_rx() on QP destroy disables RX globally for the
vport. Previously, creating a second RSS QP would silently overwrite
the first QP's steering config and destroy would blackhole all traffic.
This is now explicitly rejected with -EBUSY. Existing applications
(DPDK being the primary RDMA consumer) always create one RSS QP per
vport, so no real-world flows are affected.
Reject cross-port PD sharing for both raw and RSS QPs. Since EQs and
vport configuration are per-port, a PD is bound to the port used by
its first raw QP. Subsequent QPs on the same PD must use the same
port or the creation fails with -EINVAL. Previously this was silently
broken: with shared EQs it appeared to work, but with per-vPort EQs
a cross-port PD would cause wrong-port EQ teardown and corruption.
DPDK creates one PD per port so no existing flows are affected.
Serialize mana_set_channels() and the async per-port queue reset
handler against RDMA vport configuration to prevent RDMA from claiming
the vport during the detach/attach window. A channel_changing flag is
set under apc->vport_mutex before detach and checked by
mana_cfg_vport() when called from the RDMA path, blocking RDMA from
grabbing the vport during the entire window. When the port is down
and RDMA already holds the vport, the channel change is rejected with
-EBUSY.
Signed-off-by: Long Li <longli@microsoft.com>
Link: https://patch.msgid.link/20260605005717.2059954-2-longli@microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Implement the query port speed callback to report the port effective
bandwidth directly in 100 Mb/s granularity.
Link: https://patch.msgid.link/r/20260608083927.4116-1-tomsela@amazon.com
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Tom Sela <tomsela@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Add support for reporting 800 Gbps as 8X NDR and 1600 Gbps as 8X XDR
link speeds.
Link: https://patch.msgid.link/r/20260608083736.48454-1-tomsela@amazon.com
Reviewed-by: Michael Margolin <mrgolin@amazon.com>
Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Tom Sela <tomsela@amazon.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Replacing strcpy() with strscpy() ensures that overflow of the target
buffer cannot happen.
Link: https://patch.msgid.link/r/20260608095500.2567-2-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Link: https://patch.msgid.link/r/20260606202633.5018-11-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Link: https://patch.msgid.link/r/20260606202633.5018-10-david.laight.linux@gmail.com
Signed-off-by: David Laight <david.laight.linux@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Sashiko noticed mlx4 was using whatever random access flags were provided
when IB_MR_REREG_ACCESS is not used. Since IB_MR_REREG_TRANS needs
access_flags it used the random ones which means it doesn't work sensibly
if userspace provides only IB_MR_REREG_TRANS.
Keep track of the current access_flag of the MR and use it if the user
does not specify one.
Also fixup a little confusion around mmr.access, it is the HW access flags
so the convert_access() was missing. But nothing reads this by the time
rereg_mr can happen.
Fixes: 9376932d0c26 ("IB/mlx4_ib: Add support for user MR re-registration")
Link: https://patch.msgid.link/r/0-v1-29ca7a402625+ddd6-mlx4_rereg_flags_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
usr_len is read from a network-supplied message field (le16_to_cpu)
and used to compute data_len = off - usr_len without validating that
usr_len <= off. A malicious RDMA client can send usr_len > off causing
an integer underflow, resulting in data_len wrapping to a huge size_t
value which is then passed to the rdma_ev callback as a memory length,
leading to out-of-bounds memory access.
Fix by reading and validating usr_len <= off before rtrs_srv_get_ops_ids()
in both process_read() and process_write(), ensuring the early return
path acquires no reference and has no resource leak.
Link: https://patch.msgid.link/r/20260608134802.5019-1-aurelien@hackers.camp
Reported-by: Aurelien DESBRIERES <aurelien@hackers.camp>
Reviewed-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Aurelien DESBRIERES <aurelien@hackers.camp>
Assisted-by: Claude <claude-sonnet-4-6>
Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Remove the mlx5_mr_pdn() inside pagefault_dmabuf_mr(), the only user of
the pdn is the init path which is inside an ioctl.
Link: https://patch.msgid.link/r/10-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Remove the mlx5_mr_pdn() in pagefault_real_mr() by pushing the pdn up, all
the callers use 0 since they don't pass MLX5_PF_FLAGS_ENABLE except the
ioctl reg_mr path which can use the ioctl pd.
Link: https://patch.msgid.link/r/9-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Keep pushing the pdn higher to remove more places touching mr->pd:
- XLT combinations that don't use PDN can just pass 0
- Use local pd values instead of mr->pd
- Implicit MR does not have inplace rereg, so the mr->pd is safe
Link: https://patch.msgid.link/r/8-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
The rereg protocol expects the core code to change mr->pd and synchronize
that change with the atomics and syncs. The driver should not touch it.
mlx5 needed to update it in umr_rereg_pas() because
mlx5r_umr_update_mr_pas() required the updated mr->pd to build the
UMR.
Simply switch mlx5r_umr_update_mr_pas() to use the pdn directly from
the new pd and remove the mr->pd update.
Fixes: 56e11d628c5d ("IB/mlx5: Added support for re-registration of MRs")
Link: https://patch.msgid.link/r/7-v1-29ebd2c229b5+fd5-ib_mr_pd_jgg@nvidia.com
Assisted-by: Codex:gpt-5-5
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|