linux-stable.git/net/ipv4, branch master

tcp: Decrement tcp_md5_needed static branch

2026-06-30T01:14:30+00:00

In case of early freeing an unwanted TCP-MD5 key on TCP-AO connect(),
md5sig_info is freed right away (and set to NULL). Later, at
the moment of socket destruction, the static branch counter
is not getting decremented.

Add a missing decrement for TCP-MD5 static branch.

Reported-by: Qihang 
Fixes: 0aadc73995d0 ("net/tcp: Prevent TCP-MD5 with TCP-AO being set")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-3-1fd313d6c1e0@gmail.com
Signed-off-by: Jakub Kicinski

tcp: defer md5sig_info kfree past RCU grace period in tcp_connect

2026-06-30T01:14:30+00:00

The md5+ao reconciliation in tcp_connect() (net/ipv4/tcp_output.c)
has two symmetric branches:

	if (needs_md5) {
		tcp_ao_destroy_sock(sk, false);
	} else if (needs_ao) {
		tcp_clear_md5_list(sk);
		kfree(rcu_replace_pointer(tp->md5sig_info, NULL, ...));
	}

Both branches free a per-socket auth-info object while the socket is
in TCP_SYN_SENT and is already on the inet ehash (inserted by
inet_hash_connect() in tcp_v4_connect()). Both branches are reachable
by softirq RX-path readers that load the corresponding info pointer
via implicit RCU before bh_lock_sock_nested() is taken.

The needs_md5 branch is fixed in the prior patch by re-introducing
the call_rcu() free in tcp_ao_destroy_sock(): the equivalent per-key
loop runs inside tcp_ao_info_free_rcu(), the RCU callback, so by the
time it frees each tcp_ao_key all softirq readers that captured the
container have already completed rcu_read_unlock().

The needs_ao branch is not symmetric in the same way. The container
free can be deferred via kfree_rcu(md5sig, rcu) -- struct
tcp_md5sig_info already has the required rcu member
(include/net/tcp.h:1999-2002), and the rest of the tree already does
this in the tcp_md5sig_info_add() rollback paths
(net/ipv4/tcp_ipv4.c:1410, 1436). But the per-key teardown is done
by tcp_clear_md5_list() in process context BEFORE the container's
RCU grace period: it walks &md5sig->head and frees each
tcp_md5sig_key with bare hlist_del + kfree. A concurrent softirq
reader in __tcp_md5_do_lookup() / __tcp_md5_do_lookup_exact()
(tcp_ipv4.c:1253, 1298) walks the same list via
hlist_for_each_entry_rcu() and races with that bare kfree on the
keys themselves -- a per-key slab use-after-free of the same class
as the TCP-AO bug, on the same race window.

Fix this in two halves:

  1. Convert the bare kfree() in tcp_connect() to kfree_rcu() so the
     md5sig_info container joins the rest of the md5sig lifecycle.
     The local-variable lift is mechanical and required because
     kfree_rcu() is a macro that expects an lvalue.

  2. Make tcp_clear_md5_list() RCU-safe by replacing hlist_del +
     kfree(key) with hlist_del_rcu + kfree_rcu(key, rcu). struct
     tcp_md5sig_key already carries the rcu member
     (include/net/tcp.h:1995) and tcp_md5_do_del()
     (net/ipv4/tcp_ipv4.c:1456) already uses kfree_rcu, so this
     restores the lifecycle invariant the rest of the file follows
     rather than introducing a one-off.

The other caller of tcp_clear_md5_list() is tcp_md5_destruct_sock()
(net/ipv4/tcp.c:412), which runs from the sock destructor when the
socket is already unhashed and unreachable; the extra grace period
there is unnecessary but harmless. Making the helper unconditionally
RCU-safe is the cleaner contract.

The needs_ao branch is not reachable by the userns reproducer used
to demonstrate the AO-side splat (the repro installs both keys but
ends up in the needs_md5 branch because the connect peer matches
the MD5 key, not the AO key); however the symmetric race exists
and a maintainer touching this code should not have to think about
which branch escapes RCU and which one does not.

Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
Cc: stable@vger.kernel.org # v6.18+
Suggested-by: Eric Dumazet 
Signed-off-by: Michael Bommarito 
Reviewed-by: Dmitry Safonov 
Reviewed-by: Eric Dumazet 
[also credits to Qihang, who found that this races with tcp-diag]
Reported-by: Qihang 
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-2-1fd313d6c1e0@gmail.com
Signed-off-by: Jakub Kicinski

tcp: restore RCU grace period in tcp_ao_destroy_sock

2026-06-30T01:14:30+00:00

Commit 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
removed the call_rcu() callback from tcp_ao_destroy_sock(), arguing that
"the destruction of info/keys is delayed until the socket destructor"
and therefore "no one can discover it anymore".

That argument does not hold for the call site in tcp_connect()
(net/ipv4/tcp_output.c:4327-4332). At that point the socket is in
TCP_SYN_SENT, has already been inserted into the inet ehash by
inet_hash_connect() in tcp_v4_connect(), and is therefore very much
discoverable: any softirq running tcp_v4_rcv() on another CPU can take
the socket out of the ehash, walk into tcp_inbound_hash(), and load
tp->ao_info via implicit RCU before bh_lock_sock_nested() is taken on
the destroying CPU.

The reader path then enters __tcp_ao_do_lookup() (net/ipv4/tcp_ao.c:208)
which re-loads tp->ao_info via rcu_dereference_check(); the re-load can
still observe the (about-to-be-freed) pointer because there is no
synchronize_rcu() between rcu_assign_pointer(tp->ao_info, NULL) and
tcp_ao_info_free() in tcp_ao_destroy_sock(). The captured pointer is
then walked at line 223:

	hlist_for_each_entry_rcu(key, &ao->head, node, ...)

The writer's synchronous kfree() is free to complete between the line
218 re-fetch and the line 223 hlist iteration. The slab is reused
(or simply LIST_POISON1-stamped if not yet reused) and the iteration
walks attacker-controlled or poison memory in softirq context.

Reproducer (no debug shim, stock x86_64 v7.1-rc2 SMP+KASAN, QEMU+KVM):
an unprivileged uid=1000 process inside CLONE_NEWUSER|CLONE_NEWNET
installs TCP_MD5SIG + TCP_AO_ADD_KEY on a TCP socket, sprays forged
TCP-AO segments toward its eventual 4-tuple via raw sockets, then
calls connect(). The md5-wins reconciliation in tcp_connect() fires
tcp_ao_destroy_sock(); the softirq backlog reader on the loopback
NAPI path crashes on the freed ao->head.first walk:

  Oops: general protection fault, probably for non-canonical
    address 0xfbd59c000000002f
  KASAN: maybe wild-memory-access in range
    [0xdead000000000178-0xdead00000000017f]
  CPU: 0 UID: 1000 PID: 100 Comm: repro_userns
  RIP: 0010:__tcp_ao_do_lookup+0x107/0x1c0
  Call Trace: 
    __tcp_ao_do_lookup+0x107/0x1c0
    tcp_ao_inbound_lookup.constprop.0+0x12a/0x200
    tcp_inbound_ao_hash+0x5ea/0x1520
    tcp_inbound_hash+0x7ce/0x1240
    tcp_v4_rcv+0x1e7a/0x3e10
    ...

Restore the RCU grace period: re-add struct rcu_head to tcp_ao_info
and replace the synchronous tcp_ao_info_free() with a call_rcu()
callback. Readers that captured tp->ao_info before rcu_assign_pointer
NULLed it now see the object remain valid until rcu_read_unlock().
With the patch applied the reproducer runs cleanly for 2000 iterations
on the same kernel build.

Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
Cc: stable@vger.kernel.org # v6.18+
Reviewed-by: Dmitry Safonov 
Signed-off-by: Michael Bommarito 
Reviewed-by: Eric Dumazet 
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-1-1fd313d6c1e0@gmail.com
Signed-off-by: Jakub Kicinski

Merge tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

2026-06-25T19:25:36+00:00

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter and IPsec.

  Current release - regressions:

   - do not acquire dev->tx_global_lock in netdev_watchdog_up()

   - ethtool: keep rtnl_lock for ops using ethtool_op_get_link()

   - fix deadlock in nested UP notifier events

  Current release - new code bugs:

   - eth:
      - cn20k: fix subbank free list indexing for search order
      - airoha: fix BQL underflow in shared QDMA TX ring

  Previous releases - regressions:

   - netfilter:
     - flowtable: fix offloaded ct timeout never being extended
     - nf_conncount: prevent connlimit drops for early confirmed ct

  Previous releases - always broken:

   - require CAP_NET_ADMIN in the originating netns when modifying
     cross-netns devices

   - report NAPI thread PID in the caller's pid namespace

   - mac802154: fix dirty frag in in-place crypto for IOT radios

   - sctp: hold socket lock when dumping endpoints in sctp_diag, avoid
     an overflow

   - eth: gve: fix header buffer corruption with header-split and HW-GRO

   - af_key: initialize alg_key_len for IPComp states, prevent OOB read"

* tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits)
  selftests: bonding: add a test for VLAN propagation over a bonded real device
  vlan: defer real device state propagation to netdev_work
  net: add the driver-facing netdev_work scheduling API
  net: turn the rx_mode work into a generic netdev_work facility
  net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
  rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate
  rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
  rxrpc: Fix socket notification race
  rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
  rxrpc: Fix oob challenge leak in cleanup after notification failure
  rxrpc: Fix the reception of a reply packet before data transmission
  afs: Fix uncancelled rxrpc OOB message handler
  afs: Fix further netns teardown to cancel the preallocation charger
  rxrpc: Fix double unlock in rxrpc_recvmsg()
  rxrpc: Fix leak of connection from OOB challenge
  rxrpc: Fix ACKALL packet handling
  net: hns3: differentiate autoneg default values between copper and fiber
  net: hns3: fix permanent link down deadlock after reset
  net: hns3: refactor MAC autoneg and speed configuration
  net: hns3: unify copper port ksettings configuration path
  ...

net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync

2026-06-25T15:35:51+00:00

Yue Sun reported a use-after-free and debugobjects warning in
udp_tunnel_nic_device_sync_work() during concurrent device operations.

The workqueue core clears the internal pending bit before invoking the
worker. At that point, a concurrent thread can queue the work again.
When the already running worker eventually clears the work_pending flag
to 0, it mistakenly clears the flag for the newly queued instance.
udp_tunnel_nic_unregister() then observes work_pending as 0 and frees
the structure while the second work item is still active in the queue,
leading to UAF.

Fix this by returning early in udp_tunnel_nic_device_sync() if
work_pending is already set, preventing redundant work queueing.

Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure")
Reported-by: Yue Sun 
Suggested-by: Jakub Kicinski 
Signed-off-by: Eric Dumazet 
Link: https://patch.msgid.link/20260625065938.654652-2-edumazet@google.com
Signed-off-by: Jakub Kicinski

net/tcp-ao: fix use-after-free of key in del_async path

2026-06-25T02:25:35+00:00

In tcp_ao_delete_key(), the del_async path skips the current_key
and rnext_key validity checks present in the synchronous path,
assuming these pointers are always NULL on LISTEN sockets.  However,
if a key was added with set_current=1/set_rnext=1 while the socket
was in CLOSE state, current_key and rnext_key will be non-NULL
after listen() transitions the socket to LISTEN.

When such a key is deleted with del_async=1, hlist_del_rcu() and
call_rcu() free the key without clearing the dangling pointers.
After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
current_key->sndid and rnext_key->rcvid from freed slab memory.

Clear current_key and rnext_key in the del_async path when they
reference the key being deleted.

Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)")
Signed-off-by: HanQuan 
Reviewed-by: Eric Dumazet 
Link: https://patch.msgid.link/20260623015208.1191687-1-eilaimemedsnaimel@gmail.com
Signed-off-by: Jakub Kicinski

Merge tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

2026-06-23T23:22:24+00:00

Steffen Klassert says:

====================
pull request (net): ipsec 2026-06-22

1) xfrm: use compat translator only for u64 alignment mismatch
   Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT
   so 32-bit compat tasks on arches whose 32-bit ABI already matches
   the native 64-bit layout are no longer rejected with -EOPNOTSUPP.
   From Sanman Pradhan.

2) net: af_key: initialize alg_key_len for IPComp states
   Initialize the alg_key_len to 0 in the IPComp branch of
   pfkey_msg2xfrm_state() so an uninitialized value cannot drive
   xfrm_alg_len() into a slab-out-of-bounds kmemdup during
   XFRM_MSG_MIGRATE. From Zijing Yin.

3) xfrm: Fix dev use-after-free in xfrm async resumption
   Stash the original skb->dev and extend the RCU critical section
   across xfrm_rcv_cb() and transport_finish() to prevent a
   tunnel-device UAF and original-device refcount leak when a
   callback replaces skb->dev. From Dong Chenchen.

4) xfrm: Fix xfrm state cache insertion race
   Move the state-validity check inside xfrm_state_lock in the
   input state cache insertion path so a state cannot be killed
   between the check and the insert. From Herbert Xu.

5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
   Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count
   and xfrm_policy_default to silence the KCSAN data race reported
   on net->xfrm.policy_count. From Eric Dumazet.

6) espintcp: use sk_msg_free_partial to fix partial send
   Replace the manual skmsg accounting in espintcp with
   sk_msg_free_partial() so the skmsg stays consistent on every
   iteration and the partial-send accounting bugs go away.
   From Sabrina Dubroca.

7) xfrm: validate selector family and prefixlen during match
   Reject mismatched address families in xfrm_selector_match() and
   bound prefixlen in addr4_match()/addr_match() to prevent the
   shift-out-of-bounds syzbot reported when an AF_UNSPEC selector
   with a large prefixlen is matched against an IPv4 flow.
   From Eric Dumazet.

* tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: validate selector family and prefixlen during match
  espintcp: use sk_msg_free_partial to fix partial send
  xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[]
  xfrm: Fix xfrm state cache insertion race
  xfrm: Fix dev use-after-free in xfrm async resumption
  net: af_key: initialize alg_key_len for IPComp states
  xfrm: use compat translator only for u64 alignment mismatch
====================

Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski

Merge tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

2026-06-22T17:33:38+00:00

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net. This batches
fixes for real crashes with trivial/correctness fixes. There is too
a rework of the conntrack expectation timeout strategy to deal with
a possible race when removing an expectation.

1) Fix the incorrect flowtable timeout extension for entries in
   hw offload, from Adrian Bente. This is correcting a defect in
   the functionality, no crash.

2) Hold reference to device under the fake dst in br_netfilter,
   from Haoze Xie. This is fixing a possible UaF if the device
   is removed while packet is sitting in nfqueue.

3) Reject template conntrack in xt_cluster, otherwise access to
   uninitialize conntrack fields are possible leading to WARN_ON
   due to unset layer 3 protocol. From Wyatt Feng.

4) Make sure the IPv6 tunnel header is in the linear skb data
   area before pulling. While at it remove incomplete NEXTHDR_DEST
   support. From Lorenzo Bianconi. This possibly leading to crash
   if IPv4 header is not in the linear area.

5) Use test_bit_acquire in ipset hash set to avoid reordering
   of subsequent memory access. This is addressing a LLM related
   report, no crash has been observed. From Jozsef Kadlecsik.

6) Use test_bit_acquire in ipset bitmap set too, for the same
   reason as in the previous patch, from Jozsef Kadlecsik.

7) Call kfree_rcu() after rcu_assign_pointer() to address a
   possible UaF if kfree_rcu() runs inmediately, which to my
   understanding never happens. Never observed in practise,
   reported by LLM. Also from Jozsef Kadlecsik.

8) Use disable_delayed_work_sync() instead cancel_delayed_work_sync()
   to avoid that ipset GC handler re-queues work as reported by LLM.
   From Jozsef Kadlecsik. This is for correctness.

9) Restore the check in nft_payload for exceeding payloda offset
    over 2^16. From Florian Westphal. This fixes a silent truncation,
    not a big deal, but better be assertive and reject it.

10) Validate NFT_META_BRI_IIFHWADDR can only run from bridge
    prerouting. From Florian Westphal. Harmless but it could allow
    to read bytes from skb->cb.

11) Zero out destination hardware address during the flowtable
    path setup, also from Florian. This is a correctness fix, LLM
    points that possible infoleak can happen but topology to achieve
    it is not clear.

12) Skip IPv4 options if present when building the IPV4 reject reply.
    Otherwise bytes in the IPv4 options header can be sent back to
    origin where the ICMP header is being expected. Again from
    Florian Westphal.

13) Replace timer API for expectation by GC worker approach. This
    is implicitly fixing a race between nf_ct_remove_expectations()
    which might fail to remove the expectation due to timer_del()
    returning false because timer has expired and callback is
    being run concurrently. This fix is addressing a crash that has
    been already reported with a reproducer.

14) Check if br_vlan_get_pvid_rcu() fails, otherwise possible stack
    infoleak of 4-bytes. From Florian Westphal.

* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak
  netfilter: nf_conntrack_expect: use conntrack GC to reap expectations
  netfilter: nf_reject: skip iphdr options when looking for icmp header
  netfilter: nft_flow_offload: zero device address for non-ether case
  netfilter: nft_meta_bridge: add validate callback for get operations
  netfilter: nft_payload: reject offsets exceeding 65535 bytes
  netfilter: ipset: make sure gc is properly stopped
  netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer()
  netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types
  netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types
  netfilter: flowtable: fix and simplify IP6IP6 tunnel handling
  netfilter: xt_cluster: reject template conntracks in hash match
  netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst
  netfilter: flowtable: fix offloaded ct timeout never being extended
====================

Link: https://patch.msgid.link/20260620222738.112506-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski

ipv4: account for fraggap on the paged allocation path

2026-06-21T22:24:49+00:00

In __ip_append_data(), when the paged-allocation branch is taken,
alloclen and pagedlen are computed as

	alloclen = fragheaderlen + transhdrlen;
	pagedlen = datalen - transhdrlen;

datalen already includes fraggap, but the fraggap bytes carried over
from the previous skb are copied into the new skb's linear area at
offset transhdrlen by the subsequent skb_copy_and_csum_bits(). The
linear area is therefore undersized by fraggap bytes while pagedlen is
overstated by the same amount.

The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.

After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.

Fixes: 8eb77cc73977 ("ipv4: avoid partial copy for zc")
Signed-off-by: Jungwoo Lee 
Signed-off-by: Wongi Lee 
Reviewed-by: Ido Schimmel 
Link: https://patch.msgid.link/ajFR1eLAIs42TN3g@DESKTOP-19IMU7U.localdomain
Signed-off-by: Jakub Kicinski

netfilter: nf_reject: skip iphdr options when looking for icmp header

2026-06-20T22:18:27+00:00

Not a big deal but this hould have used the real ip header length and not the
base header size.  As-is, if there are options then
nf_skb_is_icmp_unreach() result will be random.

Fixes: db99b2f2b3e2 ("netfilter: nf_reject: don't reply to icmp error messages")
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso