linux.git/net/xdp, branch v6.17

xsk: Fix immature cq descriptor production

2025-09-09T22:09:55+00:00

Eryk reported an issue that I have put under Closes: tag, related to
umem addrs being prematurely produced onto pool's completion queue.
Let us make the skb's destructor responsible for producing all addrs
that given skb used.

Commit from fixes tag introduced the buggy behavior, it was not broken
from day 1, but rather when xsk multi-buffer got introduced.

In order to mitigate performance impact as much as possible, mimic the
linear and frag parts within skb by storing the first address from XSK
descriptor at sk_buff::destructor_arg. For fragments, store them at ::cb
via list. The nodes that will go onto list will be allocated via
kmem_cache. xsk_destruct_skb() will consume address stored at
::destructor_arg and optionally go through list from ::cb, if count of
descriptors associated with this particular skb is bigger than 1.

Previous approach where whole array for storing UMEM addresses from XSK
descriptors was pre-allocated during first fragment processing yielded
too big performance regression for 64b traffic. In current approach
impact is much reduced on my tests and for jumbo frames I observed
traffic being slower by at most 9%.

Magnus suggested to have this way of processing special cased for
XDP_SHARED_UMEM, so we would identify this during bind and set different
hooks for 'backpressure mechanism' on CQ and for skb destructor, but
given that results looked promising on my side I decided to have a
single data path for XSK generic Tx. I suppose other auxiliary stuff
would have to land as well in order to make it work.

Fixes: b7f72a30e9ac ("xsk: introduce wrappers and helpers for supporting multi-buffer in Tx path")
Reported-by: Eryk Kubanski
Closes: https://lore.kernel.org/netdev/20250530103456.53564-1-e.kubanski@partner.samsung.com/
Acked-by: Stanislav Fomichev
Signed-off-by: Maciej Fijalkowski
Tested-by: Jason Xing
Reviewed-by: Jason Xing
Link: https://lore.kernel.org/r/20250904194907.2342177-1-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov

net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockopt

2025-07-10T12:48:29+00:00

This patch provides a setsockopt method to let applications leverage to
adjust how many descs to be handled at most in one send syscall. It
mitigates the situation where the default value (32) that is too small
leads to higher frequency of triggering send syscall.

Considering the prosperity/complexity the applications have, there is no
absolutely ideal suggestion fitting all cases. So keep 32 as its default
value like before.

The patch does the following things:
- Add XDP_MAX_TX_SKB_BUDGET socket option.
- Set max_tx_budget to 32 by default in the initialization phase as a
  per-socket granular control.
- Set the range of max_tx_budget as [32, xs->tx->nentries].

The idea behind this comes out of real workloads in production. We use a
user-level stack with xsk support to accelerate sending packets and
minimize triggering syscalls. When the packets are aggregated, it's not
hard to hit the upper bound (namely, 32). The moment user-space stack
fetches the -EAGAIN error number passed from sendto(), it will loop to try
again until all the expected descs from tx ring are sent out to the driver.
Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of
sendto() and higher throughput/PPS.

Here is what I did in production, along with some numbers as follows:
For one application I saw lately, I suggested using 128 as max_tx_budget
because I saw two limitations without changing any default configuration:
1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by
net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind
this was I counted how many descs are transmitted to the driver at one
time of sendto() based on [1] patch and then I calculated the
possibility of hitting the upper bound. Finally I chose 128 as a
suitable value because 1) it covers most of the cases, 2) a higher
number would not bring evident results. After twisting the parameters,
a stable improvement of around 4% for both PPS and throughput and less
resources consumption were found to be observed by strace -c -p xxx:
1) %time was decreased by 7.8%
2) error counter was decreased from 18367 to 572

[1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/

Signed-off-by: Jason Xing 
Acked-by: Maciej Fijalkowski 
Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni

net: xsk: update tx queue consumer immediately after transmission

2025-07-09T01:28:07+00:00

For afxdp, the return value of sendto() syscall doesn't reflect how many
descs handled in the kernel. One of use cases is that when user-space
application tries to know the number of transmitted skbs and then decides
if it continues to send, say, is it stopped due to max tx budget?

The following formular can be used after sending to learn how many
skbs/descs the kernel takes care of:

  tx_queue.consumers_before - tx_queue.consumers_after

Prior to the current patch, in non-zc mode, the consumer of tx queue is
not immediately updated at the end of each sendto syscall when error
occurs, which leads to the consumer value out-of-dated from the perspective
of user space. So this patch requires store operation to pass the cached
value to the shared value to handle the problem.

More than those explicit errors appearing in the while() loop in
__xsk_generic_xmit(), there are a few possible error cases that might
be neglected in the following call trace:
__xsk_generic_xmit()
    xskq_cons_peek_desc()
        xskq_cons_read_desc()
	    xskq_cons_is_valid_desc()
It will also cause the premature exit in the while() loop even if not
all the descs are consumed.

Based on the above analysis, using @sent_frame could cover all the possible
cases where it might lead to out-of-dated global state of consumer after
finishing __xsk_generic_xmit().

The patch also adds a common helper __xsk_tx_release() to keep align
with the zc mode usage in xsk_tx_release().

Signed-off-by: Jason Xing 
Acked-by: Maciej Fijalkowski 
Acked-by: Stanislav Fomichev 
Link: https://patch.msgid.link/20250703141712.33190-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski

net: remove sock_i_uid()

2025-06-24T00:04:03+00:00

Difference between sock_i_uid() and sk_uid() is that
after sock_orphan(), sock_i_uid() returns GLOBAL_ROOT_UID
while sk_uid() returns the last cached sk->sk_uid value.

None of sock_i_uid() callers care about this.

Use sk_uid() which is much faster and inlined.

Note that diag/dump users are calling sock_i_ino() and
can not see the full benefit yet.

Signed-off-by: Eric Dumazet 
Cc: Lorenzo Colitti 
Reviewed-by: Maciej Żenczykowski 
Link: https://patch.msgid.link/20250620133001.4090592-3-edumazet@google.com
Signed-off-by: Jakub Kicinski

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2025-05-22T16:42:41+00:00

Cross-merge networking fixes after downstream PR (net-6.15-rc8).

Conflicts:
  80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X")
  4bcc063939a5 ("ice, irdma: fix an off by one in error handling code")
  c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers")
https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au

No extra adjacent changes.

Signed-off-by: Jakub Kicinski

xsk: Bring back busy polling support in XDP_COPY

2025-05-21T09:28:23+00:00

Commit 5ef44b3cb43b ("xsk: Bring back busy polling support") fixed the
busy polling support in xsk for XDP_ZEROCOPY after it was broken in
commit 86e25f40aa1e ("net: napi: Add napi_config"). The busy polling
support with XDP_COPY remained broken since the napi_id setup in
xsk_rcv_check was removed.

Bring back the setup of napi_id for XDP_COPY so socket level SO_BUSYPOLL
can be used to poll the underlying napi.

Do the setup of napi_id for XDP_COPY in xsk_bind, as it is done
currently for XDP_ZEROCOPY. The setup of napi_id for XDP_COPY in
xsk_bind is safe because xsk_rcv_check checks that the rx queue at which
the packet arrives is equal to the queue_id that was supplied in bind.
This is done for both XDP_COPY and XDP_ZEROCOPY mode.

Tested using AF_XDP support in virtio-net by running the xsk_rr AF_XDP
benchmarking tool shared here:
https://lore.kernel.org/all/20250320163523.3501305-1-skhawaja@google.com/T/

Enabled socket busy polling using following commands in qemu,

```
sudo ethtool -L eth0 combined 1
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
```

Fixes: 5ef44b3cb43b ("xsk: Bring back busy polling support")
Signed-off-by: Samiullah Khawaja 
Reviewed-by: Willem de Bruijn 
Acked-by: Stanislav Fomichev 
Signed-off-by: David S. Miller

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2025-05-01T22:11:38+00:00

Cross-merge networking fixes after downstream PR (net-6.15-rc5).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski

xsk: Fix race condition in AF_XDP generic RX path

2025-04-25T00:11:33+00:00

Move rx_lock from xsk_socket to xsk_buff_pool.
Fix synchronization for shared umem mode in
generic RX path where multiple sockets share
single xsk_buff_pool.

RX queue is exclusive to xsk_socket, while FILL
queue can be shared between multiple sockets.
This could result in race condition where two
CPU cores access RX path of two different sockets
sharing the same umem.

Protect both queues by acquiring spinlock in shared
xsk_buff_pool.

Lock contention may be minimized in the future by some
per-thread FQ buffering.

It's safe and necessary to move spin_lock_bh(rx_lock)
after xsk_rcv_check():
* xs->pool and spinlock_init is synchronized by
  xsk_bind() -> xsk_is_bound() memory barriers.
* xsk_rcv_check() may return true at the moment
  of xsk_release() or xsk_unbind_dev(),
  however this will not cause any data races or
  race conditions. xsk_unbind_dev() removes xdp
  socket from all maps and waits for completion
  of all outstanding rx operations. Packets in
  RX path will either complete safely or drop.

Signed-off-by: Eryk Kubanski 
Fixes: bf0bdd1343efb ("xdp: fix race on generic receive path")
Acked-by: Magnus Karlsson 
Link: https://patch.msgid.link/20250416101908.10919-1-e.kubanski@partner.samsung.com
Signed-off-by: Jakub Kicinski

net: designate XSK pool pointers in queues as "ops protected"

2025-04-10T00:01:51+00:00

Read accesses go via xsk_get_pool_from_qid(), the call coming
from the core and gve look safe (other "ops locked" drivers
don't support XSK).

Write accesses go via xsk_reg_pool_at_qid() and xsk_clear_pool_at_qid().
Former is already under the ops lock, latter is not (both coming from
the workqueue via xp_clear_dev() and NETDEV_UNREGISTER via xsk_notifier()).

Acked-by: Stanislav Fomichev 
Signed-off-by: Stanislav Fomichev 
Link: https://patch.msgid.link/20250408195956.412733-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski

xsk: Fix __xsk_generic_xmit() error code when cq is full

2025-04-03T04:55:43+00:00

When the cq reservation is failed, the error code is not set which is
initialized to zero in __xsk_generic_xmit(). That means the packet is not
send successfully but sendto() return ok.

Considering the impact on uapi, return -EAGAIN is a good idea. The cq is
full usually because it is not released in time, try to send msg again is
appropriate.

The bug was at the very early implementation of xsk, so the Fixes tag
targets the commit that introduced the changes in
xsk_cq_reserve_addr_locked where this fix depends on.

Fixes: e6c4047f5122 ("xsk: Use xsk_buff_pool directly for cq functions")
Suggested-by: Magnus Karlsson 
Signed-off-by: Wang Liang 
Signed-off-by: Martin KaFai Lau 
Acked-by: Stanislav Fomichev 
Link: https://patch.msgid.link/20250227081052.4096337-1-wangliang74@huawei.com
Signed-off-by: Alexei Starovoitov