linux.git/net/openvswitch, branch v7.2-rc1

openvswitch: conntrack: annotate ct limit hlist traversal

2026-06-25T15:38:00+00:00

ct_limit_set() is documented as being called with ovs_mutex held. It
walks the ct limit hlist with hlist_for_each_entry_rcu(), but the
iterator does not currently pass the OVS lockdep condition used
elsewhere for RCU-protected OVS objects.

Pass lockdep_ovsl_is_held() to the iterator. This matches the function's
existing caller contract and lets CONFIG_PROVE_RCU_LIST distinguish the
ovs_mutex-protected update path from the RCU read-side ct_limit_get()
path.

This was found by our static analysis tool and then manually reviewed
against the current tree. In the reviewed CONFIG_PROVE_RCU_LIST triage
run, the writer-side ct limit update produced the expected "RCU-list
traversed in non-reader section!!" warning while ovs_mutex was held,
with the stack matching ct_limit_set() and ovs_ct_limit_set_zone_limit().
The change is limited to documenting the existing protection contract.

This is a lockdep annotation cleanup. It does not change the conntrack
limit list update or release behavior.
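
The effect of the extra iterator argument can be sketched in userspace. In the kernel, hlist_for_each_entry_rcu() accepts an optional lockdep condition as its last argument, and the patch passes lockdep_ovsl_is_held() there. Everything below (the struct, the counters, the *_model names) is a hypothetical stand-in for illustration, not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for kernel state; all names here are illustrative. */
static int rcu_read_depth;	/* nesting of a modelled rcu_read_lock() */
static bool ovs_mutex_held;	/* what lockdep_ovsl_is_held() would report */
static int rcu_list_warnings;	/* "non-reader section" complaints so far */

struct ct_limit_model {
	int zone;
	unsigned int limit;
	struct ct_limit_model *next;
};

/* A traversal is accepted inside a reader section, or when the
 * caller-supplied condition says the list is otherwise protected. */
static void check_traversal(bool cond)
{
	if (rcu_read_depth == 0 && !cond)
		rcu_list_warnings++;
}

static struct ct_limit_model *
ct_limit_find(struct ct_limit_model *head, int zone, bool cond)
{
	struct ct_limit_model *pos;

	check_traversal(cond);
	for (pos = head; pos; pos = pos->next)
		if (pos->zone == zone)
			return pos;
	return NULL;
}

/* Writer-side update: the caller contract is that the mutex is held. */
static bool ct_limit_set_model(struct ct_limit_model *head, int zone,
			       unsigned int limit, bool annotate)
{
	struct ct_limit_model *ct;

	assert(ovs_mutex_held);
	/* Without the annotation the checker only knows about readers. */
	ct = ct_limit_find(head, zone, annotate ? ovs_mutex_held : false);
	if (!ct)
		return false;
	ct->limit = limit;
	return true;
}
```

With annotate set to false the model counts one warning per update even though the mutex is held; with it set to true the same update is silent, which is the only behavioral difference the annotation is meant to make.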

Signed-off-by: Runyu Xiao 
Reviewed-by: Eelco Chaudron 
Link: https://patch.msgid.link/20260624150149.3510541-1-runyu.xiao@seu.edu.cn
Signed-off-by: Jakub Kicinski

Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next

2026-06-15T21:09:57+00:00

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next.
More specifically, this contains conncount rework to address AI related
reports, assorted Netfilter updates and two small incremental updates on
IPVS:

1) Replace old obsolete workqueues (system_wq, system_unbound_wq)
   in IPVS, from Marco Crivellari.

2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables.
   In recent years, reporters say that the use of WARN_ON{_ONCE}
   in conjunction with panic_on_warn=1 results in DoS. Let's replace
   it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test
   infrastructure and fuzzers, while also providing context to AI
   agents. From Fernando F. Mancera.

Five patches from Florian Westphal to address AI reports in the conncount
infrastructures:

3) Fix missing rcu read lock section when calling
   __ovs_ct_limit_get_zone_limit().

4) Add a dedicated lock per rbtree; this increases memory
   usage but it should improve scalability.

5) Add a helper function to find the rbtree node, no functional
   changes are intended.

6) Add sequence counter to detect concurrent tree modifications
   and retry lookups.

7) Add locks to GC conncount walk and address other nitpicks.

Then, several assorted updates:

8) Defensive tree-wide addition of NULL checks for ct extensions.

9) Bail out if flowtable bypass cannot be fully set up from the
   flow offload expression, instead of lazy building a likely
   incomplete one.

10) Fix documentation for the new conn_max sysctl toggle in IPVS.

11) Add nf_dev_xmit_recursion*() helpers and use them, to address
    recent AI reports.

* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
  netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
  ipvs: fix doc syntax for conn_max sysctl
  netfilter: flowtable: bail out if forward path cannot be discovered
  netfilter: conntrack: check NULL when retrieving ct extension
  netfilter: nf_conncount: gc and rcu fixes
  netfilter: nf_conncount: add sequence counter to detect tree modifications
  netfilter: nf_conncount: split count_tree_node rbtree walk into helper
  netfilter: nf_conncount: use per nf_conncount_data spinlocks
  netfilter: nf_conncount: callers must hold rcu read lock
  netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
  ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
====================

Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski

netfilter: nf_conncount: callers must hold rcu read lock

2026-06-14T10:51:50+00:00

rcu_dereference_raw() should not have been used here, it concealed this bug.
It's used because struct rb_node lacks __rcu annotated pointers, so plain
rcu_dereference() causes sparse warnings.

The major tradeoff is that rcu_dereference_raw() doesn't warn when the caller
isn't in an rcu read section.

Extend the rcu read lock scope accordingly and accept the sparse warnings
that result; those warnings are the lesser evil.

Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
Closes: https://sashiko.dev/#/patchset/20260603230610.7900-1-fw%40strlen.de
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso

net: openvswitch: fix possible kfree_skb of ERR_PTR

2026-06-09T03:13:02+00:00

After the patch in the "Fixes" tag, the allocation of the "reply" skb
can happen either before or after locking the ovs_mutex.

However, error cleanups still follow the classical reversed order,
assuming "reply" is allocated before locking: it is freed after unlocking.

If "reply" allocation happens after locking the mutex and it fails,
"reply" is left with an ERR_PTR, and execution jumps to the corresponding
cleanup stage which will try to free an invalid pointer.

Fix this by setting the pointer to NULL after having saved its error
value.
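
A minimal userspace sketch of the pattern, assuming a shared cleanup label that frees "reply" after unlocking. The ERR_PTR helpers are small imitations of the kernel ones, and alloc_reply(), free_reply() and flow_cmd_model() are hypothetical names used only here:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Minimal userspace imitations of the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int frees_of_invalid_pointer;

/* Stand-in for kfree_skb(): NULL is a no-op, an ERR_PTR is a bug. */
static void free_reply(void *reply)
{
	if (IS_ERR(reply)) {
		frees_of_invalid_pointer++;	/* would be a bad free */
		return;
	}
	free(reply);
}

/* Hypothetical allocator that can be told to fail. */
static void *alloc_reply(int fail)
{
	void *p;

	if (fail)
		return ERR_PTR(-ENOMEM);
	p = malloc(64);
	return p ? p : ERR_PTR(-ENOMEM);
}

static int flow_cmd_model(int fail_alloc, int apply_fix)
{
	void *reply;
	int error = 0;

	/* The real code has taken the mutex by this point. */
	reply = alloc_reply(fail_alloc);
	if (IS_ERR(reply)) {
		error = (int)PTR_ERR(reply);
		if (apply_fix)
			reply = NULL;	/* drop the ERR_PTR once it is saved */
		goto err_unlock;
	}
	/* The success path would send the reply; the model just frees it. */
err_unlock:
	/* Unlock, then the shared cleanup frees "reply". */
	free_reply(reply);
	return error;
}
```

Without the reset, the failing path hands an ERR_PTR to the free routine; with it, the cleanup sees NULL and does nothing, while the caller still gets the saved error code.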

Fixes: 893f139b9a6c ("openvswitch: Minimize ovs_flow_cmd_new|set critical sections.")
Signed-off-by: Adrian Moreno 
Reviewed-by: Aaron Conole 
Acked-by: Eelco Chaudron 
Link: https://patch.msgid.link/20260604121946.942164-1-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski

openvswitch: vport: fix race between linking and the device notifier

2026-05-18T23:38:45+00:00

Sashiko reports that it is technically possible that we got the device
reference, but by the time we're linking it to the OVS datapath, it
may be already in the process of being deleted.  In this case if the
notifier wins the race for RTNL, it will see that the device is not
yet in the OVS datapath (ovs_netdev_get_vport() will fail in the
dp_device_event()) and will do nothing.  Then the ovs_netdev_link()
will take the RTNL and link the unregistering device to OVS datapath.

Eventually, netdev_wait_allrefs_any() will re-broadcast the event and
the device will be properly detached, but it will take at least a
second before that happens, so it's not something we should rely on.

Let's avoid linking the non-registered device in the first place.

Note: As per documentation, RTNL doesn't protect the reg_state, but
it actually does for all the state transitions we care about here,
so it should not be necessary to use READ_ONCE or to take the instance
lock.  We can still do that, but we have a few more places even in
this file where the reg_state is accessed without those while under
RTNL, and many more places like this across the kernel code, so it
might make more sense to change all of them in a more centralized
fashion in the future, if necessary.
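
The check itself is small. A userspace sketch, assuming the link step runs under RTNL and simply refuses any device that is not in the registered state; the enum mimics the kernel's NETREG_* states, and the struct, the function name and the -ENODEV return value are assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Mimics a subset of the kernel's NETREG_* registration states. */
enum reg_state_model {
	REG_UNINITIALIZED,
	REG_REGISTERED,
	REG_UNREGISTERING,
};

struct netdev_model {
	enum reg_state_model reg_state;
	bool linked_to_datapath;
};

/* Modelled as running with RTNL held, so reg_state cannot change
 * underneath us for the transitions that matter here. */
static int link_device_model(struct netdev_model *dev)
{
	/* A device already being torn down must not be linked: the
	 * notifier has run (or will run) without seeing it attached. */
	if (dev->reg_state != REG_REGISTERED)
		return -ENODEV;	/* error code chosen for the sketch */

	dev->linked_to_datapath = true;
	return 0;
}
```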

Fixes: ccb1352e76cf ("net: Add Open vSwitch kernel components.")
Signed-off-by: Ilya Maximets 
Reviewed-by: Aaron Conole 
Acked-by: Eelco Chaudron 
Link: https://patch.msgid.link/20260514184702.2461435-1-i.maximets@ovn.org
Signed-off-by: Jakub Kicinski

openvswitch: vport: fix self-deadlock on release of tunnel ports

2026-05-05T13:19:37+00:00

vports are used concurrently and protected by RCU, so netdev_put()
must happen after the RCU grace period.  So, either in an RCU call or
after the synchronize_net().  The rtnl_delete_link() must happen under
RTNL and so can't be executed in RCU context.  Calling synchronize_net()
while holding RTNL is not a good idea for performance and system
stability under load in general, so calling netdev_put() in RCU call
is the right solution here.

However, when the device is deleted, rtnl_unlock() will call
netdev_run_todo() and block until all the references are gone.  In the
current code this means that we never reach the call_rcu() and the vport
is never freed and the reference is never released, causing a
self-deadlock on device removal.

Fix that by moving the call_rcu() before the rtnl_unlock(), so the
scheduled RCU callback will be executed when synchronize_net() is
called from the rtnl_unlock()->netdev_run_todo() while the RTNL itself
is already released.
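
The ordering argument can be modelled in a few lines of userspace C. This is a deliberately simplified sketch: the unlock step runs any queued callback (standing in for synchronize_net()) and then reports whether the device reference count reached zero, where the real netdev_run_todo() would keep waiting. All names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

struct vport_model {
	int dev_refs;		/* references held on the tunnel device */
	bool rcu_free_queued;	/* a pending call_rcu() callback */
};

/* Models the RCU callback: frees the vport and drops its reference. */
static void vport_free_rcu_model(struct vport_model *vport)
{
	vport->dev_refs--;
	vport->rcu_free_queued = false;
}

/* Models rtnl_unlock() -> netdev_run_todo(): wait for a grace period,
 * which lets queued callbacks run, then wait for the references.
 * Returns false where the real code would block. */
static bool rtnl_unlock_model(struct vport_model *vport)
{
	if (vport->rcu_free_queued)
		vport_free_rcu_model(vport);
	return vport->dev_refs == 0;
}

static bool tunnel_destroy_model(struct vport_model *vport,
				 bool queue_before_unlock)
{
	/* The real code deletes the link here, under RTNL. */
	if (queue_before_unlock)
		vport->rcu_free_queued = true;	/* call_rcu() first */

	if (!rtnl_unlock_model(vport))
		return false;	/* stuck: the later call_rcu() is never reached */

	if (!queue_before_unlock)
		vport->rcu_free_queued = true;	/* old ordering, too late */
	return true;
}
```

In the old ordering the model gets stuck with one reference outstanding; queuing the callback first lets the unlock path drain it and complete.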

Fixes: 6931d21f87bc ("openvswitch: defer tunnel netdev_put to RCU release")
Cc: stable@vger.kernel.org
Acked-by: Eelco Chaudron 
Signed-off-by: Ilya Maximets 
Acked-by: Aaron Conole 
Link: https://patch.msgid.link/20260430233848.440994-2-i.maximets@ovn.org
Signed-off-by: Paolo Abeni

openvswitch: vport: fix race between tunnel creation and linking

2026-05-05T13:14:33+00:00

When a tunnel vport is created it first creates the tunnel device, e.g.,
with geneve_dev_create_fb(), then it calls ovs_netdev_link() to take a
reference and link it to the device that represents the openvswitch datapath.

The creation of the device is happening under RTNL, but then RTNL is
released and re-acquired to find the device by name.  It is technically
possible for the tunnel device to be re-named or deleted within that
window while RTNL is not held, and some other device created in its
place.  This will cause a non-tunnel device to be referenced in the
vport and tunnel-specific functions used on it, e.g. vxlan_get_options()
that directly casts the private netdev data into a struct vxlan_dev
causing an invalid memory access:

 BUG: KASAN: slab-use-after-free in vxlan_get_options+0x323/0x3a0
  vxlan_get_options+0x323/0x3a0
  ovs_vport_cmd_new+0x6e3/0xd30

Fix that by taking a reference to the just created device before
releasing RTNL.  This ensures that the device in the vport is always
the one that was just created.  The search by name is only needed
for a standard vport-netdev that links pre-existing devices, so that
functionality and device type checks are moved to netdev_create().

It is also awkward that ovs_netdev_link() takes ownership of the vport
and destroys it on failure.  It doesn't know the type of the port it is
dealing with, so we need to pass down the indicator that it's a tunnel,
so the link can be properly deleted on failure.

It's possible to refactor the logic to make the ovs_netdev_link() do
only the linking part and let the callers perform a proper destruction,
but it will be much more code for each legacy tunnel port type, so it
is not worth it for the bug fix.
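
The difference between "look the device up again by name" and "hold the device that was just created" can be sketched as follows. The table, the struct and both helpers are hypothetical userspace stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct dev_model {
	const char *name;
	bool is_tunnel;
	int refs;
};

#define MAX_DEVS 4
/* Models the name lookup table; entries may change while unlocked. */
static struct dev_model *dev_table[MAX_DEVS];

/* Old approach: find whatever currently carries the name. */
static struct dev_model *dev_get_by_name_model(const char *name)
{
	for (size_t i = 0; i < MAX_DEVS; i++) {
		if (dev_table[i] && strcmp(dev_table[i]->name, name) == 0) {
			dev_table[i]->refs++;
			return dev_table[i];
		}
	}
	return NULL;
}

/* New approach: take the reference on the device that was just
 * created, before the lock protecting the table is released. */
static struct dev_model *dev_hold_model(struct dev_model *dev)
{
	dev->refs++;
	return dev;
}
```

If another device takes over the name inside the unlocked window, the lookup returns the impostor, while the held pointer still refers to the tunnel device that was created.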

Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device")
Reported-by: Yuan Tan 
Reported-by: Yifan Wu 
Reported-by: Juefei Pu 
Reported-by: Xin Liu 
Reported-by: Yang Yang 
Signed-off-by: Ilya Maximets 
Acked-by: Eelco Chaudron 
Link: https://patch.msgid.link/20260430213349.407991-1-i.maximets@ovn.org
Signed-off-by: Paolo Abeni

openvswitch: cap upcall PID array size and pre-size vport replies

2026-04-20T18:43:04+00:00

The vport netlink reply helpers allocate a fixed-size skb with
nlmsg_new(NLMSG_DEFAULT_SIZE, ...) but serialize the full upcall PID
array via ovs_vport_get_upcall_portids().  Since
ovs_vport_set_upcall_portids() accepts any non-zero multiple of
sizeof(u32) with no upper bound, a CAP_NET_ADMIN user can install a PID
array large enough to overflow the reply buffer, causing nla_put() to
fail with -EMSGSIZE and hitting BUG_ON(err < 0).  On systems with
unprivileged user namespaces enabled (e.g., Ubuntu default), this is
reachable via unshare -Urn since OVS vport mutation operations use
GENL_UNS_ADMIN_PERM.

 kernel BUG at net/openvswitch/datapath.c:2414!
 Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
 CPU: 1 UID: 0 PID: 65 Comm: poc Not tainted 7.0.0-rc7-00195-geb216e422044 #1
 RIP: 0010:ovs_vport_cmd_set+0x34c/0x400
 Call Trace:
  
  genl_family_rcv_msg_doit (net/netlink/genetlink.c:1116)
  genl_rcv_msg (net/netlink/genetlink.c:1194)
  netlink_rcv_skb (net/netlink/af_netlink.c:2550)
  genl_rcv (net/netlink/genetlink.c:1219)
  netlink_unicast (net/netlink/af_netlink.c:1344)
  netlink_sendmsg (net/netlink/af_netlink.c:1894)
  __sys_sendto (net/socket.c:2206)
  __x64_sys_sendto (net/socket.c:2209)
  do_syscall_64 (arch/x86/entry/syscall_64.c:63)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
  
 Kernel panic - not syncing: Fatal exception

Reject attempts to set more PIDs than nr_cpu_ids in
ovs_vport_set_upcall_portids(), and pre-compute the worst-case reply
size in ovs_vport_cmd_msg_size() based on that bound, similar to the
existing ovs_dp_cmd_msg_size().  nr_cpu_ids matches the cap already
used by the per-CPU dispatch configuration on the datapath side
(ovs_dp_cmd_fill_info() serialises at most nr_cpu_ids PIDs), so the
two sides stay consistent.
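
A rough userspace sketch of the two halves of the fix: bounding the PID array on input, and sizing the attribute for the worst case the bound allows. The NLA_* constants follow the netlink attribute layout (4-byte header, 4-byte alignment); the function names and the error code used for the new bound are assumptions for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Netlink attribute layout: 4-byte header, payload padded to 4 bytes. */
#define NLA_ALIGNTO 4
#define NLA_ALIGN(len) (((len) + NLA_ALIGNTO - 1) & ~(size_t)(NLA_ALIGNTO - 1))
#define NLA_HDRLEN 4

/* Same arithmetic as the kernel's nla_total_size(). */
static size_t nla_total_size_model(size_t payload)
{
	return NLA_ALIGN(NLA_HDRLEN + payload);
}

/* Validate the length (in bytes) of a user-supplied PID array. */
static int check_upcall_portids_len(size_t len, unsigned int nr_cpu_ids)
{
	/* Must be a non-zero multiple of sizeof(u32). */
	if (len == 0 || len % sizeof(uint32_t))
		return -EINVAL;
	/* The added bound: no more PIDs than possible CPUs. */
	if (len / sizeof(uint32_t) > nr_cpu_ids)
		return -EINVAL;	/* error code assumed for the sketch */
	return 0;
}

/* Worst-case space the PID attribute can need in a reply. */
static size_t upcall_pids_attr_size(unsigned int nr_cpu_ids)
{
	return nla_total_size_model((size_t)nr_cpu_ids * sizeof(uint32_t));
}
```

Because the same bound feeds both the validation and the size computation, a reply skb allocated with that size cannot be overflowed by any array that passed validation.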

Fixes: 5cd667b0a456 ("openvswitch: Allow each vport to have an array of 'port_id's.")
Reported-by: Xiang Mei 
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Weiming Shi 
Reviewed-by: Ilya Maximets 
Link: https://patch.msgid.link/20260416024653.153456-2-bestswngs@gmail.com
Signed-off-by: Jakub Kicinski

net: use get_random_u{16,32,64}() where appropriate

2026-04-10T02:27:43+00:00

Use the typed random integer helpers instead of
get_random_bytes() when filling a single integer variable.
The helpers return the value directly, require no pointer
or size argument, and better express intent.

Skipped sites writing into __be16 (netdevsim) and __le64
(ceph) fields where a direct assignment would trigger
sparse endianness warnings.

Signed-off-by: David Carlier 
Reviewed-by: Matthieu Baerts (NGI0) 
Reviewed-by: Eric Dumazet 
Link: https://patch.msgid.link/20260407150758.5889-1-devnexen@gmail.com
Signed-off-by: Jakub Kicinski

net: convert remaining ipv6_stub users to direct function calls

2026-03-29T18:21:23+00:00

As IPv6 is built-in only, the ipv6_stub infrastructure is no longer
necessary.

Convert remaining ipv6_stub users to make direct function calls. The
fallback functions introduced previously will prevent linkage errors
when CONFIG_IPV6 is disabled.

Signed-off-by: Fernando Fernandez Mancera 
Tested-by: Ricardo B. Marlière 
Link: https://patch.msgid.link/20260325120928.15848-9-fmancera@suse.de
Signed-off-by: Jakub Kicinski