linux.git/net/core/dev.h, branch v7.1-rc2

net: introduce ndo_set_rx_mode_async and netdev_rx_mode_work

2026-04-21T10:50:03+00:00

Add ndo_set_rx_mode_async callback that drivers can implement instead
of the legacy ndo_set_rx_mode. The legacy callback runs under the
netif_addr_lock spinlock with BHs disabled, preventing drivers from
sleeping. The async variant runs from a work queue with rtnl_lock and
netdev_lock_ops held, in fully sleepable context.

When __dev_set_rx_mode() sees ndo_set_rx_mode_async, it schedules
netdev_rx_mode_work instead of calling the driver inline. The work
function takes two snapshots of each address list (uc/mc) under
the addr_lock, then drops the lock and calls the driver with the
work copies. After the driver returns, it reconciles the snapshots
back to the real lists under the lock.

Add netif_rx_mode_sync() to opportunistically execute the pending
workqueue update inline, so that rx mode changes are committed
before returning to userspace:
  - dev_change_flags (SIOCSIFFLAGS / RTM_NEWLINK)
  - dev_set_promiscuity
  - dev_set_allmulti
  - dev_ifsioc SIOCADDMULTI / SIOCDELMULTI
  - do_setlink (RTM_SETLINK)

Note that some deep hierarchies still do skip the lower updates via:
  - dev_uc_sync
  - dev_mc_sync

If we do end up hitting user-visible issues, we can add more calls to
netif_rx_mode_sync in specific places. But hopefully we should not,
the actual user-visible lists are still synced, it's that just HW state
that might be lagging.

Signed-off-by: Stanislav Fomichev 
Link: https://patch.msgid.link/20260416185712.2155425-3-sdf@fomichev.me
Signed-off-by: Paolo Abeni

net: add address list snapshot and reconciliation infrastructure

2026-04-21T10:50:03+00:00

Introduce __hw_addr_list_snapshot() and __hw_addr_list_reconcile()
for use by the upcoming ndo_set_rx_mode_async callback.

The async rx_mode path needs to snapshot the device's unicast and
multicast address lists under the addr_lock, hand those snapshots
to the driver (which may sleep), and then propagate any sync_cnt
changes back to the real lists. Two identical snapshots are taken:
a work copy for the driver to pass to __hw_addr_sync_dev() and a
reference copy to compute deltas against.

__hw_addr_list_reconcile() walks the reference snapshot comparing
each entry against the work snapshot to determine what the driver
synced or unsynced. It then applies those deltas to the real list,
handling concurrent modifications:

  - If the real entry was concurrently removed but the driver synced
    it to hardware (delta > 0), re-insert a stale entry so the next
    work run properly unsyncs it from hardware.
  - If the entry still exists, apply the delta normally. An entry
    whose refcount drops to zero is removed.

  # dev_addr_test_snapshot_benchmark: 1024 addrs x 1000 snapshots: 89872802 ns total, 89872 ns/iter
  # dev_addr_test_snapshot_benchmark.speed: slow

Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
Link: https://patch.msgid.link/20260416185712.2155425-2-sdf@fomichev.me
Signed-off-by: Paolo Abeni

net: fix reference tracker mismanagement in netdev_put_lock()

2026-04-12T16:08:43+00:00

dev_put() releases a reference which didn't have a tracker.
References without a tracker are accounted in the tracking
code as "no_tracker". We can't free the tracker and then
call dev_put(). The references themselves will be fine
but the tracking code will think it's a double-release:

  refcount_t: decrement hit 0; leaking memory.

IOW commit under fixes confused dev_put() (release never tracked
reference) with __dev_put() (just release the reference, skipping
the reference tracking infra).

Since __netdev_put_lock() uses dev_put() we can't feed a previously
tracked netdev ref into it. Let's flip things around.
netdev_put(dev, NULL) is the same as dev_put(dev) so make
netdev_put_lock() the real function and have __netdev_put_lock()
feed it a NULL tracker for all the cases that were untracked.

Fixes: d04686d9bc86 ("net: Implement netdev_nl_queue_create_doit")
Acked-by: Daniel Borkmann 
Link: https://patch.msgid.link/20260410153600.1984522-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski

net: remove the netif_get_rx_queue_lease_locked() helpers

2026-04-10T01:26:28+00:00

The netif_get_rx_queue_lease_locked() API hides the locking
and the descend onto the leased queue. Making the code
harder to follow (at least to me). Remove the API and open
code the descend a bit. Most of the code now looks like:

 if (!leased)
     return __helper(x);

 hw_rxq = ..
 netdev_lock(hw_rxq->dev);
 ret = __helper(x);
 netdev_unlock(hw_rxq->dev);

 return ret;

Of course if we have more code paths that need the wrapping
we may need to revisit. For now, IMHO, having to know what
netif_get_rx_queue_lease_locked() does is not worth the 20LoC
it saves.

Link: https://patch.msgid.link/20260408151251.72bd2482@kernel.org
Signed-off-by: Jakub Kicinski

net: Proxy netif_mp_{open,close}_rxq for leased queues

2026-04-10T01:21:46+00:00

When a process in a container wants to setup a memory provider, it will
use the virtual netdev and a leased rxq, and call netif_mp_{open,close}_rxq
to try and restart the queue. At this point, proxy the queue restart on
the real rxq in the physical netdev.

For memory providers (io_uring zero-copy rx and devmem), it causes the
real rxq in the physical netdev to be filled from a memory provider that
has DMA mapped memory from a process within a container.

Signed-off-by: David Wei 
Co-developed-by: Daniel Borkmann 
Signed-off-by: Daniel Borkmann 
Reviewed-by: Nikolay Aleksandrov 
Link: https://patch.msgid.link/20260402231031.447597-7-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski

net: Implement netdev_nl_queue_create_doit

2026-04-10T01:21:45+00:00

Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ynl --family netdev --output-json --do queue-create \
        --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: Daniel Borkmann 
Co-developed-by: David Wei 
Signed-off-by: David Wei 
Reviewed-by: Nikolay Aleksandrov 
Link: https://patch.msgid.link/20260402231031.447597-3-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski

net: add xmit recursion limit to tunnel xmit functions

2026-03-10T12:30:30+00:00

Tunnel xmit functions (iptunnel_xmit, ip6tunnel_xmit) lack their own
recursion limit. When a bond device in broadcast mode has GRE tap
interfaces as slaves, and those GRE tunnels route back through the
bond, multicast/broadcast traffic triggers infinite recursion between
bond_xmit_broadcast() and ip_tunnel_xmit()/ip6_tnl_xmit(), causing
kernel stack overflow.

The existing XMIT_RECURSION_LIMIT (8) in the no-qdisc path is not
sufficient because tunnel recursion involves route lookups and full IP
output, consuming much more stack per level. Use a lower limit of 4
(IP_TUNNEL_RECURSION_LIMIT) to prevent overflow.

Add recursion detection using dev_xmit_recursion helpers directly in
iptunnel_xmit() and ip6tunnel_xmit() to cover all IPv4/IPv6 tunnel
paths including UDP encapsulated tunnels (VXLAN, Geneve, etc.).

Move dev_xmit_recursion helpers from net/core/dev.h to public header
include/linux/netdevice.h so they can be used by tunnel code.

 BUG: KASAN: stack-out-of-bounds in blake2s.constprop.0+0xe7/0x160
 Write of size 32 at addr ffff88810033fed0 by task kworker/0:1/11
 Workqueue: mld mld_ifc_work
 Call Trace:
  
  __build_flow_key.constprop.0 (net/ipv4/route.c:515)
  ip_rt_update_pmtu (net/ipv4/route.c:1073)
  iptunnel_xmit (net/ipv4/ip_tunnel_core.c:84)
  ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
  gre_tap_xmit (net/ipv4/ip_gre.c:779)
  dev_hard_start_xmit (net/core/dev.c:3887)
  sch_direct_xmit (net/sched/sch_generic.c:347)
  __dev_queue_xmit (net/core/dev.c:4802)
  bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
  bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
  bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
  dev_hard_start_xmit (net/core/dev.c:3887)
  __dev_queue_xmit (net/core/dev.c:4841)
  ip_finish_output2 (net/ipv4/ip_output.c:237)
  ip_output (net/ipv4/ip_output.c:438)
  iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
  gre_tap_xmit (net/ipv4/ip_gre.c:779)
  dev_hard_start_xmit (net/core/dev.c:3887)
  sch_direct_xmit (net/sched/sch_generic.c:347)
  __dev_queue_xmit (net/core/dev.c:4802)
  bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
  bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
  bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
  dev_hard_start_xmit (net/core/dev.c:3887)
  __dev_queue_xmit (net/core/dev.c:4841)
  ip_finish_output2 (net/ipv4/ip_output.c:237)
  ip_output (net/ipv4/ip_output.c:438)
  iptunnel_xmit (net/ipv4/ip_tunnel_core.c:86)
  ip_tunnel_xmit (net/ipv4/ip_tunnel.c:847)
  gre_tap_xmit (net/ipv4/ip_gre.c:779)
  dev_hard_start_xmit (net/core/dev.c:3887)
  sch_direct_xmit (net/sched/sch_generic.c:347)
  __dev_queue_xmit (net/core/dev.c:4802)
  bond_dev_queue_xmit (drivers/net/bonding/bond_main.c:312)
  bond_xmit_broadcast (drivers/net/bonding/bond_main.c:5279)
  bond_start_xmit (drivers/net/bonding/bond_main.c:5530)
  dev_hard_start_xmit (net/core/dev.c:3887)
  __dev_queue_xmit (net/core/dev.c:4841)
  mld_sendpack
  mld_ifc_work
  process_one_work
  worker_thread
  

Fixes: 745e20f1b626 ("net: add a recursion limit in xmit path")
Reported-by: Xiang Mei 
Signed-off-by: Weiming Shi 
Link: https://patch.msgid.link/20260306160133.3852900-2-bestswngs@gmail.com
Signed-off-by: Paolo Abeni

net: add queue config validation callback

2026-01-23T19:49:02+00:00

I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).

Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.

The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so adding it now so that we don't forget..

Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski

Revert "Merge branch 'netkit-support-for-io_uring-zero-copy-and-af_xdp'"

2026-01-21T02:06:01+00:00

This reverts commit 77b9c4a438fc66e2ab004c411056b3fb71a54f2c, reversing
changes made to 4515ec4ad58a37e70a9e1256c0b993958c9b7497:

 931420a2fc36 ("selftests/net: Add netkit container tests")
 ab771c938d9a ("selftests/net: Make NetDrvContEnv support queue leasing")
 6be87fbb2776 ("selftests/net: Add env for container based tests")
 61d99ce3dfc2 ("selftests/net: Add bpf skb forwarding program")
 920da3634194 ("netkit: Add xsk support for af_xdp applications")
 eef51113f8af ("netkit: Add netkit notifier to check for unregistering devices")
 b5ef109d22d4 ("netkit: Implement rtnl_link_ops->alloc and ndo_queue_create")
 b5c3fa4a0b16 ("netkit: Add single device mode for netkit")
 0073d2fd679d ("xsk: Proxy pool management for leased queues")
 1ecea95dd3b5 ("xsk: Extend xsk_rcv_check validation")
 804bf334d08a ("net: Proxy netdev_queue_get_dma_dev for leased queues")
 0caa9a8ddec3 ("net: Proxy net_mp_{open,close}_rxq for leased queues")
 ff8889ff9107 ("net, ethtool: Disallow leased real rxqs to be resized")
 9e2103f36110 ("net: Add lease info to queue-get response")
 31127deddef4 ("net: Implement netdev_nl_queue_create_doit")
 a5546e18f77c ("net: Add queue-create operation")

The series will conflict with io_uring work, and the code needs more
polish.

Signed-off-by: Jakub Kicinski

net: Implement netdev_nl_queue_create_doit

2026-01-20T10:58:49+00:00

Implement netdev_nl_queue_create_doit which creates a new rx queue in a
virtual netdev and then leases it to a rx queue in a physical netdev.

Example with ynl client:

  # ./pyynl/cli.py \
      --spec ~/netlink/specs/netdev.yaml \
      --do queue-create \
      --json '{"ifindex": 8, "type": "rx", "lease": {"ifindex": 4, "queue": {"type": "rx", "id": 15}}}'
  {'id': 1}

Note that the netdevice locking order is always from the virtual to
the physical device.

Signed-off-by: Daniel Borkmann 
Co-developed-by: David Wei 
Signed-off-by: David Wei 
Acked-by: Stanislav Fomichev 
Reviewed-by: Nikolay Aleksandrov 
Link: https://patch.msgid.link/20260115082603.219152-3-daniel@iogearbox.net
Signed-off-by: Paolo Abeni