linux-stable.git/net/ipv4, branch v7.1.3

net/tcp-ao: fix use-after-free of key in del_async path

2026-07-04T11:45:09+00:00

commit 5ba9950bc9078e19b69cca1e56d1553b125c6857 upstream.

In tcp_ao_delete_key(), the del_async path skips the current_key
and rnext_key validity checks present in the synchronous path,
assuming these pointers are always NULL on LISTEN sockets.  However,
if a key was added with set_current=1/set_rnext=1 while the socket
was in CLOSE state, current_key and rnext_key will be non-NULL
after listen() transitions the socket to LISTEN.

When such a key is deleted with del_async=1, hlist_del_rcu() and
call_rcu() free the key without clearing the dangling pointers.
After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences
current_key->sndid and rnext_key->rcvid from freed slab memory.

Clear current_key and rnext_key in the del_async path when they
reference the key being deleted.

Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)")
Signed-off-by: HanQuan 
Reviewed-by: Eric Dumazet 
Link: https://patch.msgid.link/20260623015208.1191687-1-eilaimemedsnaimel@gmail.com
Signed-off-by: Jakub Kicinski 
Signed-off-by: Greg Kroah-Hartman

net: ip_gre: require CAP_NET_ADMIN in the device netns for changelink

2026-07-04T11:45:03+00:00

commit 8165f7ff57d9667d2bb477ef6af83ede7fed4ad7 upstream.

A tunnel changelink() operates on at most two netns, dev_net(dev) and
the tunnel link netns t->net. They differ once the device is created in
or moved to a netns other than the one the request runs in. The rtnl
changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a
caller privileged there but not in t->net can rewrite a tunnel that
lives in t->net.

Add rtnl_dev_link_net_capable() next to rtnl_get_net_ns_capable() in
net/core/rtnetlink.c. It requires CAP_NET_ADMIN in the link netns and is
skipped when the link netns is dev_net(dev), where the rtnl path already
checked it. The other patches in this series use the same helper.

Gate ipgre_changelink() and erspan_changelink() with it, at the top of
the op before any attribute is parsed, because the parsers update live
tunnel fields first. ipgre_netlink_parms() sets t->collect_md before
ip_tunnel_changelink() runs.

Commit 8b484efd5cb4 ("ip6: vti: Use ip6_tnl.net in
vti6_siocdevprivate().") added the same check on the ioctl path. This
adds it on RTM_NEWLINK.

Reported-by: Xiao Liang 
Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/
Fixes: b57708add314 ("gre: add x-netns support")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie 
Reviewed-by: Kuniyuki Iwashima 
Link: https://patch.msgid.link/20260612085941.3158249-2-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski 
Signed-off-by: Greg Kroah-Hartman

ipv4: account for fraggap on the paged allocation path

2026-07-04T11:45:02+00:00

[ Upstream commit eca856950f7cb1a221e02b99d758409f2c5cec42 ]

In __ip_append_data(), when the paged-allocation branch is taken,
alloclen and pagedlen are computed as

	alloclen = fragheaderlen + transhdrlen;
	pagedlen = datalen - transhdrlen;

datalen already includes fraggap, but the fraggap bytes carried over
from the previous skb are copied into the new skb's linear area at
offset transhdrlen by the subsequent skb_copy_and_csum_bits(). The
linear area is therefore undersized by fraggap bytes while pagedlen is
overstated by the same amount.

The non-paged branch sets alloclen to fraglen, which already accounts
for fraggap because datalen does. Bring the paged branch in line by
adding fraggap to alloclen and subtracting it from pagedlen.

After this adjustment, copy no longer collapses to -fraggap on the
paged path, so remove the stale comment describing that old arithmetic.

Fixes: 8eb77cc73977 ("ipv4: avoid partial copy for zc")
Signed-off-by: Jungwoo Lee 
Signed-off-by: Wongi Lee 
Reviewed-by: Ido Schimmel 
Link: https://patch.msgid.link/ajFR1eLAIs42TN3g@DESKTOP-19IMU7U.localdomain
Signed-off-by: Jakub Kicinski 
Signed-off-by: Sasha Levin

Merge tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

2026-06-11T10:30:00+00:00

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Revalidate bridge ports, add missing NULL checks to fetch the bridge
   device by the port. From Florian Westphal.

2) Fix netdevice refcount leak in the error path of nft_fwd hardware
   offload function, also from Florian.

3) Unregister helper expectfn callback on conntrack helper module
   removal, otherwise dangling pointer remains in place,
   from Weiming Shi.

4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES,
   From Kyle Zeng.

5) Validate that device MAC header is present before nf_syslog
   accesses it. From Xiang Mei.

6-8) Three patches to address a possible infoleak of stale stack
     data in three nf_tables expressions, due to mismatch in the
     _init() and _eval() function which is possible since 14fb07130c7d.
     From Davide Ornaghi and Florian Westphal.

netfilter pull request 26-06-10

* tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
  netfilter: nft_fib: fix stale stack leak via the OIFNAME register
  netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
  netfilter: nf_log: validate MAC header was set before dumping it
  netfilter: x_tables: avoid leaking percpu counter pointers
  netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
  netfilter: nf_tables_offload: drop device refcount on error
  netfilter: revalidate bridge ports
====================

Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni

Merge tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec

2026-06-11T10:00:49+00:00

Steffen Klassert says:

====================
pull request (net): ipsec 2026-06-10

1) xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
   Propagate SKBFL_SHARED_FRAG when paged fragments are moved between
   skbs so ESP can decide whether in-place crypto is safe.

2) xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload
   Replace the unlocked read of xtfs->ra_newskb with a local flag so a
   concurrent reassembly can no longer free first_skb between
   spin_unlock and the post-loop check.

3) xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()
   Prune the inexact bin under xfrm_policy_lock so a concurrent
   xfrm_hash_rebuild() can no longer free it before xfrm_policy_kill()
   dereferences it.

4) xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
   Move hrtimer_cancel() for the output and drop timers ahead of their
   spinlocks, breaking the softirq/lock cycle that could deadlock
   against the timer callbacks on SMP.

5) xfrm: espintcp: do not reuse an in-progress partial send
   Fail a new send when espintcp_push_msgs() returns with emsg->len
   still set, so a blocking caller can no longer overwrite ctx->partial
   while a previous transfer still owns it.

6) esp: fix page frag reference leak on skb_to_sgvec failure
   Add a flag to esp_ssg_unref() to unconditionally unref the source
   scatterlist, releasing the old page references that are otherwise
   leaked when the second skb_to_sgvec() in esp_output_tail() fails.

Please pull or let me know if there are problems.

ipsec-2026-06-10

* tag 'ipsec-2026-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  esp: fix page frag reference leak on skb_to_sgvec failure
  xfrm: espintcp: do not reuse an in-progress partial send
  xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state()
  xfrm: policy: fix use-after-free on inexact bin in xfrm_policy_bysel_ctx()
  xfrm: iptfs: fix use-after-free on first_skb in __input_process_payload
  xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags()
====================

Link: https://patch.msgid.link/20260610140800.2562818-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni

netfilter: nft_fib: fix stale stack leak via the OIFNAME register

2026-06-10T16:00:19+00:00

For NFT_FIB_RESULT_OIFNAME the destination register is declared with
len = IFNAMSIZ (four 32-bit registers), but on the lookup-fail,
RTN_LOCAL and oif-mismatch paths nft_fib{4,6}_eval() only writes one
register via "*dest = 0". The remaining three registers are left as
whatever was on the stack in nft_do_chain()'s struct nft_regs, and a
downstream expression that loads the register span can leak that
uninitialised kernel stack to userspace.

The NFTA_FIB_F_PRESENT existence check has the same shape: it is only
meaningful for NFT_FIB_RESULT_OIF, yet it was accepted for any result type
while the eval stores a single byte via nft_reg_store8(), leaving the rest
of the declared span stale.

Fix both:

 - replace the bare "*dest = 0" in the eval with nft_fib_store_result(),
   which strscpy_pad()s the whole IFNAMSIZ for OIFNAME (and is already
   used on the other early-return path), and

 - restrict NFTA_FIB_F_PRESENT to NFT_FIB_RESULT_OIF and declare its
   destination as a single u8, so the marked span matches the one byte
   the eval writes.

Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression")
Suggested-by: Florian Westphal 
Cc: stable@vger.kernel.org
Signed-off-by: Davide Ornaghi 
Signed-off-by: Pablo Neira Ayuso

netfilter: x_tables: avoid leaking percpu counter pointers

2026-06-10T15:59:01+00:00

The native and compat get-entries paths copy the fixed rule entry header
from the kernelized rule blob to userspace before overwriting the entry's
counter fields with a sanitized counter snapshot.

On SMP kernels, entry->counters.pcnt contains the percpu allocation
address used by x_tables rule counters. A caller can provide a userspace
buffer that faults during the initial fixed-header copy after pcnt has
been copied but before the later sanitized counter copy runs. The syscall
then returns -EFAULT while leaving the raw percpu pointer in userspace.

Copy only the fixed entry prefix before counters from the kernelized rule
blob, then copy the sanitized counter snapshot into the counter field.
Apply this ordering to the IPv4, IPv6, and ARP native and compat
get-entries implementations so a fault cannot expose the internal percpu
counter pointer.

Fixes: 71ae0dff02d7 ("netfilter: xtables: use percpu rule counters")
Signed-off-by: Kyle Zeng 
Signed-off-by: Pablo Neira Ayuso

netfilter: nf_conntrack: destroy stale expectfn expectations on unregister

2026-06-10T15:58:39+00:00

NAT helpers such as nf_nat_h323 store a raw pointer to module text in
exp->expectfn (e.g. ip_nat_q931_expect). nf_ct_helper_expectfn_unregister()
only unlinks the callback descriptor and never walks the expectation table,
so an expectation pending at module removal survives with a dangling
exp->expectfn into freed module text.

When the expected connection arrives, init_conntrack() invokes
exp->expectfn(), now a stale pointer into the unloaded module. Reproduced
on a KASAN build by loading the H.323 helpers, creating a Q.931
expectation, unloading nf_nat_h323, then connecting to the expected port:

 Oops: int3: 0000 [#1] SMP KASAN NOPTI
 RIP: 0010:0xffffffffa06102d1
  init_conntrack.isra.0 (net/netfilter/nf_conntrack_core.c:1862)
  nf_conntrack_in (net/netfilter/nf_conntrack_core.c:2049)
  ipv4_conntrack_local (net/netfilter/nf_conntrack_proto.c:223)
  nf_hook_slow (net/netfilter/core.c:619)
  __ip_local_out (net/ipv4/ip_output.c:120)
  __tcp_transmit_skb (net/ipv4/tcp_output.c:1715)
  tcp_connect (net/ipv4/tcp_output.c:4374)
  tcp_v4_connect (net/ipv4/tcp_ipv4.c:345)
  __sys_connect (net/socket.c:2167)
 Modules linked in: nf_conntrack_h323 [last unloaded: nf_nat_h323]

Reaching the dangling state requires CAP_SYS_MODULE in the initial user
namespace to remove a NAT helper that still has live expectations, so this
is a robustness fix; leaving an expectation pointing at freed text is wrong
regardless.

Add nf_ct_helper_expectfn_destroy(), which walks the expectation table and
drops every expectation whose ->expectfn matches the descriptor being torn
down. Call it from each NAT helper's exit path after the existing RCU grace
period, so no expectation outlives the code it points at and no extra
synchronize_rcu() is introduced. With the fix, the same reproducer runs to
completion without the Oops.

Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port")
Reported-by: Xiang Mei 
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Weiming Shi 
Signed-off-by: Pablo Neira Ayuso

esp: fix page frag reference leak on skb_to_sgvec failure

2026-06-09T13:58:17+00:00

In esp_output_tail(), when esp->inplace is false, the old skb page frags
are replaced with a new page from the xfrm page_frag cache The source
scatterlist (sg) is built from the old frags before the replacement, and
esp_ssg_unref() is responsible for releasing the old page references
after the crypto operation completes

However, if the second skb_to_sgvec() call (which builds the destination
scatterlist from the new page) fails, the code jumps to error_free which
only calls kfree(tmp). The old page frag references captured in the
source scatterlist are never released:

  1 sg[] is built from old frags via skb_to_sgvec() (no extra get_page)
  2 nr_frags is set to 1 and frag[0] is replaced with the new page
  3 Second skb_to_sgvec() fails -> goto error_free

Fix this by adding a bool parameter to esp_ssg_unref() that, when true,
unconditionally unrefs the source scatterlist frags. Since req->src is
not yet initialized by aead_request_set_crypt() at the point of the
error, the source scatterlist is obtained directly via esp_req_sg()
Existing callers pass false to preserve the original behavior

The same issue exists in both esp4 and esp6 as the code is identical

Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
Signed-off-by: Alessandro Schino <7991aleschino@gmail.com>
Signed-off-by: Steffen Klassert

inet: frags: fix use-after-free caused by the fqdir_pre_exit() flush

2026-06-05T01:05:23+00:00

On netns teardown, fqdir_pre_exit() walks the fqdir rhashtable and
flushes every fragment queue that is not yet complete using
inet_frag_queue_flush(). That helper frees all the skbs queued on the
fragment queue but does not set INET_FRAG_COMPLETE, and leaves
q->fragments_tail and q->last_run_head pointing at the freed skbs.
The queue itself stays in the rhashtable.

fqdir_pre_exit() first lowers high_thresh to 0 to stop new queue lookups,
but it cannot stop a fragment that already obtained the queue through
inet_frag_find() earlier and stalled just before taking the queue lock.
Once that fragment resumes after the flush and takes the queue lock,
it passes the INET_FRAG_COMPLETE check and then dereferences the freed
fragments_tail. inet_frag_queue_insert() reads FRAG_CB() and ->len of
that pointer and, on the append path, writes ->next_frag, causing a
slab use-after-free. IPv6, nf_conntrack_reasm6 and 6lowpan reassembly
share the same flush path and are affected as well.

Reset rb_fragments, fragments_tail and last_run_head in
inet_frag_queue_flush() so a flushed queue no longer points at the
freed skbs. A fragment that resumes after the flush and takes the
queue lock then finds an empty queue and starts a new run instead of
dereferencing the freed fragments_tail. ip_frag_reinit() already
performed this reset after its own flush, so drop the now duplicate
code there.

Cc: stable@vger.kernel.org
Fixes: 006a5035b495 ("inet: frags: flush pending skbs in fqdir_pre_exit()")
Suggested-by: Eric Dumazet 
Signed-off-by: Hyunwoo Kim 
Link: https://patch.msgid.link/ah6ukYq5G98LshdA@v4bel
Signed-off-by: Jakub Kicinski