linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
4 hours	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Eduard Zingerman:
	 - Fix tcp_bpf_sendmsg() error path mistaking a concurrently-freed sk_psock->cork for the local temporary message and freeing it again (Chengfeng Ye)
	 - Reject passing scalar NULL to nonnull arg of a global subprog. Previously the verifier did not account for the cases directly passing scalars to a global subprog, e.g.: 'global_func(0);' would pass even if 'global_func' argument was marked nonnull (Amery Hung)
	* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
	  bpf, sockmap: Fix cork use-after-free in tcp_bpf_sendmsg()
	  selftests/bpf: Test passing scalar NULL to nonnull global subprog
	  bpf: Reject passing scalar NULL to nonnull arg of a global subprog
8 hours	bpf, sockmap: Fix cork use-after-free in tcp_bpf_sendmsg()	Chengfeng Ye
	tcp_bpf_sendmsg() keeps msg_tx across sk_stream_wait_memory(), which drops and reacquires the socket lock. Its error path tries to decide whether msg_tx names the local temporary message by comparing it with the current value of psock->cork.
	This comparison is unsafe when two threads send on the same socket:
	  Thread A                        Thread B
	  msg_tx = psock->cork
	  sk_msg_alloc() fails
	  sk_stream_wait_memory()
	    releases the socket lock
	                                  acquires the socket lock
	                                  completes the cork
	                                  psock->cork = NULL
	                                  frees the cork
	    reacquires the socket lock
	  msg_tx != psock->cork
	  sk_msg_free(msg_tx)
	The stale cork is therefore mistaken for the local temporary message and freed again. KASAN reported:
	  BUG: KASAN: slab-use-after-free in sk_msg_free+0x49/0x50
	  Read of size 4 at addr ffff88810c908800 by task poc/90
	  Call Trace:
	   sk_msg_free+0x49/0x50
	   tcp_bpf_sendmsg+0x14f5/0x1cc0
	   __sys_sendto+0x32c/0x3a0
	   __x64_sys_sendto+0xdb/0x1b0
	  Allocated by task 89:
	   __kasan_kmalloc+0x8f/0xa0
	   tcp_bpf_sendmsg+0x16b3/0x1cc0
	  Freed by task 91:
	   __kasan_slab_free+0x43/0x70
	   kfree+0x131/0x3c0
	   tcp_bpf_sendmsg+0xec3/0x1cc0
	msg_tx can only name the stack-local tmp or the shared cork. Check for tmp directly so a changed psock->cork cannot turn a shared message into an apparent local one.
	Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
	Signed-off-by: Chengfeng Ye <nicoyip.dev@gmail.com>
	Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
	Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
	Link: https://lore.kernel.org/bpf/87fr18lmzo.fsf%40cloudflare.com/
	Link: https://lore.kernel.org/netdev/20260719161630.2901208-1-nicoyip.dev%40gmail.com/ [v1]
	Link: https://patch.msgid.link/20260724103856.3399001-1-nicoyip.dev@gmail.com
	Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
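The difference between the two checks can be modelled outside the kernel. What follows is a minimal sketch, not the patch itself: `struct msg`, `struct psock_model` and both helper names are hypothetical stand-ins, and the race is replayed sequentially by clearing the cork pointer by hand.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's message and psock objects. */
struct msg { int len; };
struct psock_model { struct msg *cork; };

/*
 * Old-style test: "not the current cork" is taken to mean "local tmp".
 * This gives the wrong answer once another thread has cleared the cork
 * pointer while the caller slept without the socket lock.
 */
static bool looks_local_by_cork(const struct msg *msg_tx,
				const struct psock_model *psock)
{
	return msg_tx != psock->cork;
}

/*
 * Test described in the commit message: compare against tmp itself,
 * so the answer does not depend on the current value of the cork pointer.
 */
static bool is_local_tmp(const struct msg *msg_tx, const struct msg *tmp)
{
	return msg_tx == tmp;
}
```

If `msg_tx` was taken from the cork and the cork pointer is then reset to NULL, `looks_local_by_cork()` reports the shared message as local, while `is_local_tmp()` still reports it as shared.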
34 hours	Merge tag 'net-7.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Linus Torvalds
	Pull networking fixes from Jakub Kicinski:
	"Lots of fixes, double the count even for the 'new normal'. Largely due to my time off followed by a networking conference which distracted most maintainers (less so the AI generators).
	Including fixes from Bluetooth and WiFi.
	Current release - regressions:
	 - wifi: mt76: fix MAC address for non OF pcie cards
	Current release - new code bugs:
	 - mptcp: fix BUILD_BUG_ON on legacy ARM config
	 - wifi: cfg80211: guard optional PMSR nominal time
	Previous releases - regressions:
	 - qrtr: ns: raise node count limit to 512, we arbitrarily picked 256 as a limit, turns out it was too low for real world deployments
	 - vhost-net: fix TX stall when vhost owns virtio-net header
	 - eth: amd-xgbe: fix MAC_AUTO_SW handling in CL37 AN
	 - wifi: ath12k: fix low MLO RX throughput on WCN7850
	Previous releases - always broken:
	 - number of random AI fixes for SCTP, RDS and TIPC protocols
	 - more AI-looking fixes for WiFi drivers
	 - number of fixes for missing pointer reloading after skb pull
	 - reject BPF redirect use from qdisc qevent block
	 - tcp: initialize standalone TCP-AO response padding
	 - vsock/virtio: collapse receive queue under memory pressure to avoid client OOMing the host with tiny messages
	 - ipv4: icmp: fill flow parameters in icmp_route_lookup decoy lookup, make sure the ICMP response routing follows the routing policy
	 - gro: fix double aggregation of flush-marked skbs
	 - ovpn: fix various refcount bugs
	 - tls: device: push pending open record on splice EOF
	 - eth: mlx5:
	    - use sender devcom for MPV master-up
	    - fix MCIA register buffer overflow on 32 dword reads"
	* tag 'net-7.2-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (234 commits)
	  drop_monitor: perform u64_stats updates under IRQ-disabled section
	  drop_monitor: fix size calculations for 64-bit attributes
	  net: drop_monitor: fix info leak in NET_DM_ATTR_PAYLOAD
	  mptcp: fix BUILD_BUG_ON on legacy ARM config
	  selftests: mptcp: userspace_pm: fix undefined variable port
	  mptcp: fix stale skb->sk reference on subflow close
	  mptcp: pm: userspace: fix use-after-free in get_local_id
	  mptcp: decrement subflows counter on failed passive join
	  mac802154: hold an interface reference across the scan worker
	  sctp: don't free the ASCONF's own transport in DEL-IP processing
	  phonet: check register_netdevice_notifier() error in phonet_device_init()
	  phonet: pep: fix use-after-free in pep_get_sb()
	  bnge/bng_re: fix ring ID widths
	  tipc: fix integer overflow in tipc_recvmsg() and tipc_recvstream()
	  net: airoha: fix ETS channel derivation in airoha_tc_setup_qdisc_ets()
	  mctp: check register_netdevice_notifier() error in mctp_device_init()
	  ptp: netc: explicitly clear TMR_OFF during initialization
	  rds: tcp: unregister sysctl before tearing down listen socket
	  ipv6: Change allocation flags to match rcu_read_lock section requirements
	  net: slip: serialize receive against buffer reallocation
	  ...
39 hours	tcp: challenge ACK for non-exact RST in SYN-RECEIVED	Yuxiang Yang
	The SYN-RECEIVED request-socket path in tcp_check_req() accepts an in-window RST without requiring SEG.SEQ to exactly match RCV.NXT. A non-exact RST therefore removes the request instead of eliciting a challenge ACK. RFC 9293 section 3.10.7.4 applies the RFC 5961 reset check in SYN-RECEIVED: an exact RST resets the connection, while a non-exact in-window RST must trigger a challenge ACK and be dropped. Apply that check before the ACK-field validation, following the RFC sequence-number, RST, then ACK processing order. Factor the per-netns challenge ACK quota out of tcp_send_challenge_ack() so request sockets can share it. Use the request socket's send_ack() callback and its own out-of-window ACK timestamp to send and rate-limit the response. Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Fixes: 282f23c6ee34 ("tcp: implement RFC 5961 3.2") Cc: stable@vger.kernel.org Signed-off-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260717081443.809393-2-yangyx22@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
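The three-way RST decision described above can be sketched as a small userspace function. This is an illustration under stated assumptions rather than the kernel code: `classify_rst()`, `seq_before()` and the enum names are hypothetical, and the receive window is modelled as the half-open range starting at RCV.NXT.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for the RFC 5961 style reset check. */
enum rst_action { RST_DROP, RST_CHALLENGE_ACK, RST_RESET };

/* Wrap-safe "a is before b" comparison on 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Classify an incoming RST:
 *  - SEG.SEQ == RCV.NXT       -> reset the connection
 *  - in window, but not exact -> send a challenge ACK, drop the segment
 *  - outside the window       -> drop silently
 */
static enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt,
				    uint32_t rcv_wnd)
{
	if (seg_seq == rcv_nxt)
		return RST_RESET;
	if (!seq_before(seg_seq, rcv_nxt) &&
	    seq_before(seg_seq, rcv_nxt + rcv_wnd))
		return RST_CHALLENGE_ACK;
	return RST_DROP;
}
```

With RCV.NXT = 1000 and a 500-byte window, sequence 1000 yields a reset, 1200 a challenge ACK, and 999 or 1500 a silent drop. The sketch leaves out the rate limiting of challenge ACKs that the commit message mentions.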
39 hours	raw: annotate lockless match fields in raw_v4_match()	Runyu Xiao
	raw_v4_match() is a lockless match helper under sk_for_each_rcu(). It still reads inet->inet_daddr, inet->inet_rcv_saddr and sk->sk_bound_dev_if with plain loads while bind, connect and bind-to-device paths can update the same match fields concurrently. Annotate only those mutable match fields in raw_v4_match(), and do so at the point of use instead of hoisting the bound-device read before the earlier short-circuit tests. Also annotate the raw bind writer and the shared IPv4 datagram connect writer used by raw sockets, so the address fields updated on bind and connect match explicit WRITE_ONCE() updates. This version intentionally leaves the shared disconnect-side IPv4 writers to follow-up cleanup and limits the writer changes here to the raw bind path and the datagram connect path directly exercised by raw sockets. Fixes: 0daf07e52709 ("raw: convert raw sockets to RCU") Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn> Link: https://patch.msgid.link/20260716142958.3064224-1-runyu.xiao@seu.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
40 hours	ipv4: icmp: fill flow parameters in icmp_route_lookup decoy lookup	Eric Dumazet
	When Linux forwards a packet and needs to generate an ICMP error, icmp_route_lookup() performs a reverse-path relookup. For non-local destinations, it performs a decoy lookup to find the expected egress interface (rt2->dst.dev) before validating the path with ip_route_input().
	Currently, the decoy flow structure (fl4_2) only sets .daddr = fl4_dec.saddr, leaving .saddr, .flowi4_dscp, .flowi4_proto, .flowi4_mark, .flowi4_oif, .fl4_sport, .fl4_dport, and .flowi4_uid zeroed out.
	When policy routing rules (such as ip rule add from $SRC lookup 100, or dscp/fwmark/ipproto/port rules, or VRF bindings) are configured:
	 1. The decoy lookup fails to match the policy rule because saddr and other key flow selectors are missing in fl4_2.
	 2. It resolves a route using the default table instead, returning an incorrect egress netdev.
	 3. Passing the wrong netdev to ip_route_input() causes strict reverse-path filtering (rp_filter=1) to fail, logging false-positive "martian source" warnings and causing the relookup to fail.
	Fix this by initializing fl4_2 from fl4_dec and:
	 - Swapping source/destination IP addresses.
	 - Swapping L4 ports for transport protocols with ports (TCP, UDP, SCTP, DCCP) so port-based policy routing matches correctly. Non-port protocols (such as ICMP or GRE) leave the flowi_uli union fields intact to prevent corruption.
	 - Setting .flowi4_oif = l3mdev_master_ifindex(route_lookup_dev) to ensure VRF routing tables are respected.
	 - Setting .flowi4_flags |= FLOWI_FLAG_ANYSRC to allow output route lookups for non-local source IP addresses.
	 - Using __ip_route_output_key() instead of ip_route_output_key() for fl4_2 so that raw FIB routing is used without triggering spurious XFRM policy lookups on the decoy flow (the actual XFRM lookup is performed later using fl4_dec).
	Fixes: 415b3334a21a ("icmp: Fix regression in nexthop resolution during replies.")
	Reported-by: Muhammad Ziad <muhzi100@gmail.com>
	Closes: https://lore.kernel.org/netdev/CAOAwikA60AYKdFr_UDLyja3oU4hqyAE7uFZWqum5uRdaQsgRYg@mail.gmail.com/
	Signed-off-by: Eric Dumazet <edumazet@google.com>
	Reviewed-by: David Ahern <dsahern@kernel.org>
	Link: https://patch.msgid.link/20260722104236.2938082-1-edumazet@google.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
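The "copy the flow, then reverse it" step can be sketched in plain C. Everything here is a simplified assumption: `struct flow_key` is a hypothetical flat structure, not the kernel's `struct flowi4`, and the VRF, ANYSRC and XFRM aspects of the fix are not modelled.

```c
#include <stdint.h>

/* Protocol numbers as assigned by IANA. */
enum { PROTO_ICMP = 1, PROTO_TCP = 6, PROTO_UDP = 17,
       PROTO_DCCP = 33, PROTO_SCTP = 132 };

/* Hypothetical, simplified flow key used only for this illustration. */
struct flow_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
	uint32_t mark;
};

static int proto_has_ports(uint8_t proto)
{
	return proto == PROTO_TCP || proto == PROTO_UDP ||
	       proto == PROTO_SCTP || proto == PROTO_DCCP;
}

/*
 * Build the reverse-direction ("decoy") key. Starting from a full copy
 * keeps selectors such as proto and mark, so policy rules can match.
 */
static struct flow_key make_decoy(const struct flow_key *orig)
{
	struct flow_key d = *orig;

	d.saddr = orig->daddr;
	d.daddr = orig->saddr;
	/* Only swap the port fields for protocols that actually have ports. */
	if (proto_has_ports(orig->proto)) {
		d.sport = orig->dport;
		d.dport = orig->sport;
	}
	return d;
}
```

In the kernel the port fields share a union with other per-protocol data, which is why the commit message insists on leaving them alone for ICMP or GRE. The flat struct above only imitates that rule.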
43 hours	net: gre: fix lltx regression for GRE tunnels with SEQ/CSUM	Yun Zhou
	Before commit 00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx"), NETIF_F_LLTX was set unconditionally in both __gre_tunnel_init() and ip6gre_tnl_init_features() alongside GRE_FEATURES:
	  dev->features |= GRE_FEATURES | NETIF_F_LLTX;
	When that commit converted NETIF_F_LLTX to the dev->lltx flag, it placed 'dev->lltx = true' after the SEQ/CSUM early returns instead of before them. This causes GRE/GRETAP/ip6gre tunnels with SEQ or CSUM+encap to lose lockless TX, reintroducing _xmit_lock acquisition around their ndo_start_xmit. Since GRE xmit re-enters the stack via ip_tunnel_xmit(), holding _xmit_lock risks ABBA deadlock with the underlay device.
	        CPU0                    CPU1
	        ----                    ----
	   lock(&qdisc_xmit_lock_key#6);
	                                lock(&qdisc_xmit_lock_key#3);
	                                lock(&qdisc_xmit_lock_key#6);
	   lock(&qdisc_xmit_lock_key#3);
	Fix by moving dev->lltx = true before the early returns in both functions, restoring the original unconditional behavior.
	Fixes: 00d066a4d4ed ("netdev_features: convert NETIF_F_LLTX to dev->lltx")
	Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260713150945.1779628-1-yun.zhou@windriver.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
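The ordering issue is easy to show in isolation. The sketch below is a hypothetical model, not the GRE code: `struct netdev_model`, its `gso_offload` field and `tunnel_init_features()` are invented for illustration, and `gso_offload` merely stands in for whatever the real functions configure after the early returns.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a net device's feature state. */
struct netdev_model {
	bool lltx;		/* lockless TX, must always be set for tunnels */
	bool gso_offload;	/* placeholder for features skipped on SEQ/CSUM */
};

static void tunnel_init_features(struct netdev_model *dev,
				 bool seq, bool csum_with_encap)
{
	/* Set before any early return so that no configuration loses it. */
	dev->lltx = true;

	if (seq)
		return;
	if (csum_with_encap)
		return;

	dev->gso_offload = true;
}
```

Had the `dev->lltx = true` assignment been placed after the two early returns, a tunnel created with `seq` set would leave the function with `lltx` still false, which is the regression the commit describes.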
44 hours	bpf: tcp: fix double sock release on batch realloc	Xiang Mei (Microsoft)
	bpf_iter_tcp_batch() releases the current batch via bpf_iter_tcp_put_batch(), which drops the socket refs and rewrites each slot with the socket cookie, then grows the batch. cur_sk/end_sk are kept for bpf_iter_tcp_resume(), but on realloc failure the function returns ERR_PTR() before resume runs, leaving cur_sk < end_sk over slots that now hold cookies rather than sock pointers. bpf_iter_tcp_seq_stop() then calls bpf_iter_tcp_put_batch() again and dereferences a cookie as a struct sock.
	Empty the batch on the failure path so stop() does not release it again. The sockets were already freed by the first bpf_iter_tcp_put_batch(), so nothing leaks, and a later read() rescans the bucket from the start instead of skipping it. The sibling GFP_NOWAIT failure path still holds real socket references and is left for stop() to release.
	  BUG: KASAN: null-ptr-deref in __sock_gen_cookie
	  Read of size 8 at addr 0000000000000059 by task exploit
	  ...
	  __sock_gen_cookie (net/core/sock_diag.c:28)
	  bpf_iter_tcp_put_batch (net/ipv4/tcp_ipv4.c:2918)
	  bpf_iter_tcp_seq_stop (net/ipv4/tcp_ipv4.c:3270)
	  bpf_seq_read (kernel/bpf/bpf_iter.c:205)
	  vfs_read (fs/read_write.c:572)
	  ksys_read (fs/read_write.c:716)
	  do_syscall_64
	  entry_SYSCALL_64_after_hwframe
	  Kernel panic - not syncing: Fatal exception
	Fixes: cdec67a489d4 ("bpf: tcp: Make sure iter->batch always contains a full bucket snapshot")
	Reported-by: AutonomousCodeSecurity@microsoft.com
	Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
	Reviewed-by: Jordan Rife <jordan@jrife.io>
	Link: https://patch.msgid.link/20260713233230.3553593-1-xmei5@asu.edu
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 days	tcp: initialize standalone TCP-AO response padding	Yizhou Zhao
	tcp_v4_send_ack() and tcp_v6_send_response() construct standalone TCP responses with TCP-AO options. The option length carries the actual MAC length, but the TCP header length includes the option rounded up to a four-byte boundary. tcp_ao_hash_hdr() writes the MAC only. Thus, when the MAC length is not four-byte aligned, the one to three bytes after the MAC are left uninitialized and may be transmitted. For the normal TCP-AO hashing mode, those bytes also have to be initialized before computing the MAC. Initialize only the alignment padding in the TCP-AO branches, before hashing the header. Use TCPOPT_NOP, as in the normal TCP-AO output path. This avoids adding work to non-AO TCP responses while preserving a valid authenticated header. Fixes: decde2586b34 ("net/tcp: Add TCP-AO sign to twsk") Fixes: da7dfaa6d6f7 ("net/tcp: Consistently align TCP-AO option in the header") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260713105631.8616-1-zhaoyz24@mails.tsinghua.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
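The padding rule can be shown with a small helper. This is a sketch under assumptions, not the kernel change: `ao_pad_option()` is a hypothetical name, the caller is assumed to provide a buffer large enough for the rounded-up length, and only the NOP option kind (value 1) is taken from the TCP specification.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TCPOPT_NOP 1	/* TCP "no-operation" option kind */

/*
 * Fill the bytes between the end of an option (opt_len) and the next
 * four-byte boundary with NOPs, and return the padded length.
 * The buffer must have room for the padded length.
 */
static size_t ao_pad_option(uint8_t *opt, size_t opt_len)
{
	size_t padded = (opt_len + 3) & ~(size_t)3;

	memset(opt + opt_len, TCPOPT_NOP, padded - opt_len);
	return padded;
}
```

For a 13-byte option the helper writes three NOP bytes and returns 16. For an option that is already aligned it writes nothing. As the commit message notes, the padding has to be in place before the MAC is computed over the header.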
3 days	nexthop: initialize extack in nh_res_bucket_migrate()	Xiang Mei (Microsoft)
	nh_res_bucket_migrate() passes an uninitialized netlink_ext_ack to call_nexthop_res_bucket_notifiers(). When nh_notifier_res_bucket_info_init() fails (e.g. the kzalloc returns -ENOMEM), the error is propagated back before any notifier sets extack._msg, and the error path formats the stale pointer with pr_err_ratelimited("%s\n", extack._msg). With CONFIG_INIT_STACK_NONE this dereferences uninitialized stack memory:
	  Oops: general protection fault, probably for non-canonical address ...
	  KASAN: maybe wild-memory-access in range [...]
	  RIP: 0010:string (lib/vsprintf.c:730)
	  vsnprintf (lib/vsprintf.c:2945)
	  _printk (kernel/printk/printk.c:2504)
	  nh_res_bucket_migrate (net/ipv4/nexthop.c:1816)
	  nh_res_table_upkeep (net/ipv4/nexthop.c:1866)
	  rtm_new_nexthop (net/ipv4/nexthop.c:3323)
	  rtnetlink_rcv_msg (net/core/rtnetlink.c:7076)
	  netlink_sendmsg (net/netlink/af_netlink.c:1900)
	  Kernel panic - not syncing: Fatal exception
	Zero-initialize extack so _msg is NULL on error paths that never set it.
	Fixes: 7c37c7e00411 ("nexthop: Implement notifiers for resilient nexthop groups")
	Reported-by: AutonomousCodeSecurity@microsoft.com
	Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260713221551.3344650-1-xmei5@asu.edu
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 days	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Kumar Kartikeya Dwivedi:
	 - Fix a UAF in socket clone early bailout paths (Matt Bobrowski)
	 - Reject unhashed UDP sockets on sockmap update to prevent refcount leaks (Michal Luczaj)
	 - Account for receive queue data in FIONREAD on sockmap sockets without a verdict program (Mattia Meleleo)
	 - Reject negative constant offsets for verifier buffer pointers (Sun Jian)
	 - Fix for tracing of kfuncs with implicit arguments (Ihor Solodrai)
	* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
	  selftests/bpf: Cover tracing implicit kfunc args
	  bpf: Fix tracing of kfuncs with implicit args
	  selftests/bpf: Cover negative buffer pointer offsets
	  bpf: Reject negative const offsets for buffer pointers
	  selftests/bpf: Test FIONREAD on a sockmap socket without a verdict program
	  bpf, sockmap: Account for receive queue in FIONREAD without a verdict program
	  selftests/bpf: Fail unbound UDP on sockmap update
	  selftests/bpf: Adapt sockmap update error handling
	  bpf, sockmap: Reject unhashed UDP sockets on sockmap update
	  selftests/bpf: Ensure UDP sockets are bound
	  bpf: Fix UAF in sock clone early bailouts
8 days	tcp: fix TIME_WAIT socket reference leak on PSP policy failure	Eric Dumazet
	Release the TIME_WAIT socket reference and jump to discard_it upon PSP policy failure in both IPv4 and IPv6 receive paths. This prevents a memory leak of tcp_tw_bucket structures. Fixes: 659a2899a57d ("tcp: add datapath logic for PSP with inline key exchange") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260710181317.4060230-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
10 days	bpf, sockmap: Account for receive queue in FIONREAD without a verdict program	Mattia Meleleo
	tcp_bpf_ioctl() answers SIOCINQ from psock->msg_tot_len, which only counts bytes in ingress_msg. Without a stream/skb verdict program nothing is diverted there: data stays in sk_receive_queue, so FIONREAD returns 0 even though read() returns data. Add tcp_inq() to the reported value when the psock has no verdict program. The two queues are disjoint, so bytes redirected into ingress_msg from other sockets stay correctly accounted through msg_tot_len. Remove unused sk_psock_msg_inq(). Fixes: 929e30f93125 ("bpf, sockmap: Fix FIONREAD for sockmap") Signed-off-by: Mattia Meleleo <mattia.meleleo@coralogix.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20260708-fionread-no-verdict-v3-1-b4ee31b3af53@coralogix.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
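The accounting rule from this commit message can be written as a tiny model. It is a sketch only: `struct sock_model` and `fionread_bytes()` are hypothetical, with `msg_tot_len` standing for the bytes queued in ingress_msg and `receive_queue_bytes` for what tcp_inq() would report.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two disjoint queues on a sockmap socket. */
struct sock_model {
	size_t msg_tot_len;		/* bytes queued in ingress_msg */
	size_t receive_queue_bytes;	/* bytes still in sk_receive_queue */
	bool   has_verdict_prog;	/* stream/skb verdict program attached */
};

/*
 * Bytes reported for FIONREAD/SIOCINQ: ingress_msg is always counted,
 * the plain receive queue only when no verdict program diverts data.
 */
static size_t fionread_bytes(const struct sock_model *s)
{
	size_t n = s->msg_tot_len;

	if (!s->has_verdict_prog)
		n += s->receive_queue_bytes;
	return n;
}
```

Without a verdict program and with 100 bytes sitting in the receive queue, the model reports 100 rather than 0, which matches what a subsequent read() would return.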
2026-07-10	ipv4: fib: free fib_alias with kfree_rcu() on insert error path	Weiming Shi
	fib_table_insert() publishes new_fa into the leaf's fa_list with fib_insert_alias() before calling the fib entry notifiers. When a notifier fails, the error path removes new_fa with fib_remove_alias() (hlist_del_rcu) and frees it right away with kmem_cache_free(). fib_table_lookup() walks that list under rcu_read_lock() only, so a concurrent lookup that already reached new_fa keeps reading it after the free:
	  BUG: KASAN: slab-use-after-free in fib_table_lookup (net/ipv4/fib_trie.c:1601)
	  Read of size 1 at addr ffff88810676d4eb by task exploit/297
	  Call Trace:
	   fib_table_lookup (net/ipv4/fib_trie.c:1601)
	   ip_route_output_key_hash_rcu (net/ipv4/route.c:2814)
	   ip_route_output_key_hash (net/ipv4/route.c:2705)
	   __ip4_datagram_connect (net/ipv4/datagram.c:49)
	   udp_connect (net/ipv4/udp.c:2144)
	   __sys_connect (net/socket.c:2167)
	   __x64_sys_connect (net/socket.c:2173)
	   do_syscall_64
	   entry_SYSCALL_64_after_hwframe
	  which belongs to the cache ip_fib_alias of size 56
	Triggering the error path needs CAP_NET_ADMIN and a registered fib notifier that can reject a route; a netdevsim device whose IPv4 FIB resource is exhausted is enough.
	Free new_fa with alias_free_mem_rcu(), as fib_table_delete() already does for a fib_alias removed from the trie.
	Fixes: a6c76c17df02 ("ipv4: Notify route after insertion to the routing table")
	Reported-by: Xiang Mei <xmei5@asu.edu>
	Assisted-by: Claude:claude-opus-4-8
	Signed-off-by: Weiming Shi <bestswngs@gmail.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260704171421.1786806-1-bestswngs@gmail.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-07-08	ipv4: igmp: Fix potential memory leaks in igmp_mod_timer() and igmp_stop_timer()	Eric Dumazet
	When a timer is deleted and not re-armed in igmp_mod_timer(), or stopped in igmp_stop_timer(), the code currently decrements the reference counter of the multicast list entry @im using refcount_dec(&im->refcnt).
	However, both functions can be called from the RCU reader path:
	 - igmp_mod_timer() via igmp_heard_query() -> for_each_pmc_rcu()
	 - igmp_stop_timer() via igmp_rcv() -> igmp_heard_report()
	If the group im was concurrently removed from the list by ip_mc_dec_group(), its reference count might have already been decremented to 1. In this case, timer_delete() succeeds, and refcount_dec() decrements the refcount from 1 to 0. Since refcount_dec() does not free the object when it hits 0 (unlike ip_ma_put()), the im structure is leaked.
	Fix this by using ip_ma_put(im) instead of refcount_dec(&im->refcnt), and deferring the put until after the spinlock is released.
	Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
	Signed-off-by: Eric Dumazet <edumazet@google.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260705181756.963063-4-edumazet@google.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-07-08	ipv4: igmp: Fix potential UAF in igmp_gq_start_timer()	Eric Dumazet
	A race condition exists between device teardown (inetdev_destroy) and incoming IGMP query processing (igmp_rcv), leading to a Use-After-Free in the IGMP timer callback. During device destruction, inetdev_destroy() drops the primary reference to in_device, which can drop its refcount to 0. The actual freeing of in_device memory is deferred via RCU (using call_rcu()). Concurrently, igmp_rcv() runs under RCU read lock and obtains the in_device pointer. Because the memory is RCU-protected, CPU-0 can safely dereference in_device even if its refcount has hit 0. However, if CPU-0 calls igmp_gq_start_timer() and re-arms the timer, it attempts to acquire a reference using in_dev_hold(). This increments the refcount from 0 to 1, triggering a "refcount_t: addition on 0" warning. Since the in_device memory is still scheduled to be freed after the RCU grace period (as the free callback does not check the refcount again), the device is freed while the timer is still armed. When the timer expires, it accesses the freed memory, causing a kernel panic. Fix this by using refcount_inc_not_zero() (via a new helper in_dev_hold_safe()) to prevent acquiring a reference if the device is already being destroyed. If the refcount is 0, we do not arm the timer. A similar issue in IPv6 MLD is fixed in a subsequent patch. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260705181756.963063-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
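The "take a reference only if the count is not already zero" idea can be illustrated with C11 atomics. This is a userspace sketch, not the kernel's refcount_t implementation: `struct dev_model` and `dev_hold_safe()` are hypothetical names chosen to echo the helper mentioned in the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object whose lifetime is governed by a reference count. */
struct dev_model {
	atomic_int refcnt;
};

/*
 * Increment the count unless it has already dropped to zero.
 * Returns true if a reference was taken, false if the object is
 * already on its way to being freed and must not be revived.
 */
static bool dev_hold_safe(struct dev_model *dev)
{
	int old = atomic_load(&dev->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&dev->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A caller that wants to arm a timer would do so only when `dev_hold_safe()` returns true, mirroring the rule in the commit message that the timer is not armed if the refcount is 0.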
2026-07-07	ipv4: igmp: remove multicast group from hash table on device destruction	Yuyang Huang
	When a device is destroyed under RTNL, ip_mc_destroy_dev() iterates through the multicast list and calls ip_ma_put() on each membership, scheduling them for RCU reclamation. However, they are not unlinked from the device's multicast hash table (mc_hash).
	Since the device remains published in dev->ip_ptr until after ip_mc_destroy_dev() completes, concurrent RCU readers traversing mc_hash can still locate and access the multicast group after its refcount is decremented. If the RCU callback runs and frees the group while a reader is accessing it, a use-after-free occurs.
	Fix this by unlinking the multicast group from mc_hash using ip_mc_hash_remove() before scheduling it for reclamation.
	  BUG: KASAN: slab-use-after-free in ip_check_mc_rcu+0x149/0x3f0
	  Read of size 4 at addr ffff888009bf1408 by task mausezahn/2276
	  Call Trace:
	   <IRQ>
	   dump_stack_lvl+0x67/0x90
	   print_report+0x175/0x7c0
	   kasan_report+0x147/0x180
	   ip_check_mc_rcu+0x149/0x3f0
	   udp_v4_early_demux+0x36d/0x12d0
	   ip_rcv_finish_core+0xb8b/0x1390
	   ip_rcv_finish+0x54/0x120
	   NF_HOOK+0x213/0x2b0
	   __netif_receive_skb+0x126/0x340
	   process_backlog+0x4f2/0xf00
	   __napi_poll+0x92/0x2c0
	   net_rx_action+0x583/0xc60
	   handle_softirqs+0x236/0x7f0
	   do_softirq+0x57/0x80
	   </IRQ>
	  Allocated by task 2239:
	   kasan_save_track+0x3e/0x80
	   __kasan_kmalloc+0x72/0x90
	   ____ip_mc_inc_group+0x31a/0xa40
	   __ip_mc_join_group+0x334/0x3f0
	   do_ip_setsockopt+0x16fa/0x2010
	   ip_setsockopt+0x3f/0x90
	   do_sock_setsockopt+0x1ad/0x300
	  Freed by task 0:
	   kasan_save_track+0x3e/0x80
	   kasan_save_free_info+0x40/0x50
	   __kasan_slab_free+0x3a/0x60
	   __rcu_free_sheaf_prepare+0xd4/0x220
	   rcu_free_sheaf+0x36/0x190
	   rcu_core+0x8d9/0x12f0
	   handle_softirqs+0x236/0x7f0
	Fixes: e9897071350b ("igmp: hash a hash table to speedup ip_check_mc_rcu()")
	Cc: stable@vger.kernel.org
	Signed-off-by: Yuyang Huang <yuyanghuang@google.com>
	Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
	Reviewed-by: Ido Schimmel <idosch@nvidia.com>
	Link: https://patch.msgid.link/20260701235014.73505-1-yuyanghuang@google.com
	Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-06-29	tcp: Decrement tcp_md5_needed static branch	Dmitry Safonov
	In case of early freeing an unwanted TCP-MD5 key on TCP-AO connect(), md5sig_info is freed right away (and set to NULL). Later, at the moment of socket destruction, the static branch counter is not getting decremented. Add a missing decrement for TCP-MD5 static branch. Reported-by: Qihang <q.h.hack.winter@gmail.com> Fixes: 0aadc73995d0 ("net/tcp: Prevent TCP-MD5 with TCP-AO being set") Cc: stable@vger.kernel.org Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-3-1fd313d6c1e0@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-29	tcp: defer md5sig_info kfree past RCU grace period in tcp_connect	Michael Bommarito
	The md5+ao reconciliation in tcp_connect() (net/ipv4/tcp_output.c) has two symmetric branches:
	  if (needs_md5) {
	          tcp_ao_destroy_sock(sk, false);
	  } else if (needs_ao) {
	          tcp_clear_md5_list(sk);
	          kfree(rcu_replace_pointer(tp->md5sig_info, NULL, ...));
	  }
	Both branches free a per-socket auth-info object while the socket is in TCP_SYN_SENT and is already on the inet ehash (inserted by inet_hash_connect() in tcp_v4_connect()). Both branches are reachable by softirq RX-path readers that load the corresponding info pointer via implicit RCU before bh_lock_sock_nested() is taken.
	The needs_md5 branch is fixed in the prior patch by re-introducing the call_rcu() free in tcp_ao_destroy_sock(): the equivalent per-key loop runs inside tcp_ao_info_free_rcu(), the RCU callback, so by the time it frees each tcp_ao_key all softirq readers that captured the container have already completed rcu_read_unlock().
	The needs_ao branch is not symmetric in the same way. The container free can be deferred via kfree_rcu(md5sig, rcu) -- struct tcp_md5sig_info already has the required rcu member (include/net/tcp.h:1999-2002), and the rest of the tree already does this in the tcp_md5sig_info_add() rollback paths (net/ipv4/tcp_ipv4.c:1410, 1436). But the per-key teardown is done by tcp_clear_md5_list() in process context BEFORE the container's RCU grace period: it walks &md5sig->head and frees each tcp_md5sig_key with bare hlist_del + kfree. A concurrent softirq reader in __tcp_md5_do_lookup() / __tcp_md5_do_lookup_exact() (tcp_ipv4.c:1253, 1298) walks the same list via hlist_for_each_entry_rcu() and races with that bare kfree on the keys themselves -- a per-key slab use-after-free of the same class as the TCP-AO bug, on the same race window.
	Fix this in two halves:
	 1. Convert the bare kfree() in tcp_connect() to kfree_rcu() so the md5sig_info container joins the rest of the md5sig lifecycle. The local-variable lift is mechanical and required because kfree_rcu() is a macro that expects an lvalue.
	 2. Make tcp_clear_md5_list() RCU-safe by replacing hlist_del + kfree(key) with hlist_del_rcu + kfree_rcu(key, rcu). struct tcp_md5sig_key already carries the rcu member (include/net/tcp.h:1995) and tcp_md5_do_del() (net/ipv4/tcp_ipv4.c:1456) already uses kfree_rcu, so this restores the lifecycle invariant the rest of the file follows rather than introducing a one-off.
	The other caller of tcp_clear_md5_list() is tcp_md5_destruct_sock() (net/ipv4/tcp.c:412), which runs from the sock destructor when the socket is already unhashed and unreachable; the extra grace period there is unnecessary but harmless. Making the helper unconditionally RCU-safe is the cleaner contract.
	The needs_ao branch is not reachable by the userns reproducer used to demonstrate the AO-side splat (the repro installs both keys but ends up in the needs_md5 branch because the connect peer matches the MD5 key, not the AO key); however the symmetric race exists and a maintainer touching this code should not have to think about which branch escapes RCU and which one does not.
	Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
	Cc: stable@vger.kernel.org # v6.18+
	Suggested-by: Eric Dumazet <edumazet@google.com>
	Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
	Reviewed-by: Dmitry Safonov <dima@arista.com>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
	[also credits to Qihang, who found that this races with tcp-diag]
	Reported-by: Qihang <q.h.hack.winter@gmail.com>
	Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
	Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-2-1fd313d6c1e0@gmail.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-29	tcp: restore RCU grace period in tcp_ao_destroy_sock	Michael Bommarito
	Commit 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU") removed the call_rcu() callback from tcp_ao_destroy_sock(), arguing that "the destruction of info/keys is delayed until the socket destructor" and therefore "no one can discover it anymore".
	That argument does not hold for the call site in tcp_connect() (net/ipv4/tcp_output.c:4327-4332). At that point the socket is in TCP_SYN_SENT, has already been inserted into the inet ehash by inet_hash_connect() in tcp_v4_connect(), and is therefore very much discoverable: any softirq running tcp_v4_rcv() on another CPU can take the socket out of the ehash, walk into tcp_inbound_hash(), and load tp->ao_info via implicit RCU before bh_lock_sock_nested() is taken on the destroying CPU.
	The reader path then enters __tcp_ao_do_lookup() (net/ipv4/tcp_ao.c:208) which re-loads tp->ao_info via rcu_dereference_check(); the re-load can still observe the (about-to-be-freed) pointer because there is no synchronize_rcu() between rcu_assign_pointer(tp->ao_info, NULL) and tcp_ao_info_free() in tcp_ao_destroy_sock(). The captured pointer is then walked at line 223:
	  hlist_for_each_entry_rcu(key, &ao->head, node, ...)
	The writer's synchronous kfree() is free to complete between the line 218 re-fetch and the line 223 hlist iteration. The slab is reused (or simply LIST_POISON1-stamped if not yet reused) and the iteration walks attacker-controlled or poison memory in softirq context.
	Reproducer (no debug shim, stock x86_64 v7.1-rc2 SMP+KASAN, QEMU+KVM): an unprivileged uid=1000 process inside CLONE_NEWUSER|CLONE_NEWNET installs TCP_MD5SIG + TCP_AO_ADD_KEY on a TCP socket, sprays forged TCP-AO segments toward its eventual 4-tuple via raw sockets, then calls connect(). The md5-wins reconciliation in tcp_connect() fires tcp_ao_destroy_sock(); the softirq backlog reader on the loopback NAPI path crashes on the freed ao->head.first walk:
	  Oops: general protection fault, probably for non-canonical address 0xfbd59c000000002f
	  KASAN: maybe wild-memory-access in range [0xdead000000000178-0xdead00000000017f]
	  CPU: 0 UID: 1000 PID: 100 Comm: repro_userns
	  RIP: 0010:__tcp_ao_do_lookup+0x107/0x1c0
	  Call Trace:
	   <IRQ>
	   __tcp_ao_do_lookup+0x107/0x1c0
	   tcp_ao_inbound_lookup.constprop.0+0x12a/0x200
	   tcp_inbound_ao_hash+0x5ea/0x1520
	   tcp_inbound_hash+0x7ce/0x1240
	   tcp_v4_rcv+0x1e7a/0x3e10
	   ...
	Restore the RCU grace period: re-add struct rcu_head to tcp_ao_info and replace the synchronous tcp_ao_info_free() with a call_rcu() callback. Readers that captured tp->ao_info before rcu_assign_pointer NULLed it now see the object remain valid until rcu_read_unlock().
	With the patch applied the reproducer runs cleanly for 2000 iterations on the same kernel build.
	Fixes: 51e547e8c89c ("tcp: Free TCP-AO/TCP-MD5 info/keys without RCU")
	Cc: stable@vger.kernel.org # v6.18+
	Reviewed-by: Dmitry Safonov <dima@arista.com>
	Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
	Reviewed-by: Eric Dumazet <edumazet@google.com>
	Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
	Link: https://patch.msgid.link/20260625-tcp-md5-connect-v3-1-1fd313d6c1e0@gmail.com
	Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-25	Merge tag 'net-7.2-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and IPsec. Current release - regressions: - do not acquire dev->tx_global_lock in netdev_watchdog_up() - ethtool: keep rtnl_lock for ops using ethtool_op_get_link() - fix deadlock in nested UP notifier events Current release - new code bugs: - eth: - cn20k: fix subbank free list indexing for search order - airoha: fix BQL underflow in shared QDMA TX ring Previous releases - regressions: - netfilter: - flowtable: fix offloaded ct timeout never being extended - nf_conncount: prevent connlimit drops for early confirmed ct Previous releases - always broken: - require CAP_NET_ADMIN in the originating netns when modifying cross-netns devices - report NAPI thread PID in the caller's pid namespace - mac802154: fix dirty frag in in-place crypto for IOT radios - sctp: hold socket lock when dumping endpoints in sctp_diag, avoid an overflow - eth: gve: fix header buffer corruption with header-split and HW-GRO - af_key: initialize alg_key_len for IPComp states, prevent OOB read" * tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits) selftests: bonding: add a test for VLAN propagation over a bonded real device vlan: defer real device state propagation to netdev_work net: add the driver-facing netdev_work scheduling API net: turn the rx_mode work into a generic netdev_work facility net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link() rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate rxrpc: Fix leak of released call in recvmsg(MSG_PEEK) rxrpc: Fix socket notification race rxrpc: Fix potential infinite loop in rxrpc_recvmsg() rxrpc: Fix oob challenge leak in cleanup after notification failure rxrpc: Fix the reception of a reply packet before data transmission afs: Fix uncancelled rxrpc OOB message handler afs: Fix further netns teardown to cancel the preallocation charger 
rxrpc: Fix double unlock in rxrpc_recvmsg() rxrpc: Fix leak of connection from OOB challenge rxrpc: Fix ACKALL packet handling net: hns3: differentiate autoneg default values between copper and fiber net: hns3: fix permanent link down deadlock after reset net: hns3: refactor MAC autoneg and speed configuration net: hns3: unify copper port ksettings configuration path ...
2026-06-25	net: udp_tunnel: prevent double queueing in udp_tunnel_nic_device_sync	Eric Dumazet
	Yue Sun reported a use-after-free and debugobjects warning in udp_tunnel_nic_device_sync_work() during concurrent device operations. The workqueue core clears the internal pending bit before invoking the worker. At that point, a concurrent thread can queue the work again. When the already running worker eventually clears the work_pending flag to 0, it mistakenly clears the flag for the newly queued instance. udp_tunnel_nic_unregister() then observes work_pending as 0 and frees the structure while the second work item is still active in the queue, leading to UAF. Fix this by returning early in udp_tunnel_nic_device_sync() if work_pending is already set, preventing redundant work queueing. Fixes: cc4e3835eff4 ("udp_tunnel: add central NIC RX port offload infrastructure") Reported-by: Yue Sun <samsun1006219@gmail.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260625065938.654652-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
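The early-return guard described above can be sketched in userspace C. This is a single-threaded model only: `struct nic_state`, `device_sync()` and `worker_done()` are hypothetical stand-ins rather than the kernel's types or functions, and the locking that the real code relies on is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device state; not the kernel struct. */
struct nic_state {
	bool work_pending;	/* set while a sync work item is owed */
	int queued;		/* how many times work was actually queued */
};

/* Models the fixed sync entry point: do nothing if work is already pending. */
static void device_sync(struct nic_state *s)
{
	if (s->work_pending)
		return;
	s->work_pending = true;
	s->queued++;		/* stands in for queueing the work item */
}

/* Models the worker finishing and clearing the flag. */
static void worker_done(struct nic_state *s)
{
	assert(s->work_pending);
	s->work_pending = false;
}
```

With the guard in place, one pending flag always corresponds to exactly one queued work item, so the flag cleared by the worker cannot belong to a second, still-queued instance.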
2026-06-24	net/tcp-ao: fix use-after-free of key in del_async path	HanQuan
	In tcp_ao_delete_key(), the del_async path skips the current_key and rnext_key validity checks present in the synchronous path, assuming these pointers are always NULL on LISTEN sockets. However, if a key was added with set_current=1/set_rnext=1 while the socket was in CLOSE state, current_key and rnext_key will be non-NULL after listen() transitions the socket to LISTEN. When such a key is deleted with del_async=1, hlist_del_rcu() and call_rcu() free the key without clearing the dangling pointers. After the RCU grace period, getsockopt(TCP_AO_INFO) dereferences current_key->sndid and rnext_key->rcvid from freed slab memory. Clear current_key and rnext_key in the del_async path when they reference the key being deleted. Fixes: d6732b95b6fb ("net/tcp: Allow asynchronous delete for TCP-AO keys (MKTs)") Signed-off-by: HanQuan <eilaimemedsnaimel@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260623015208.1191687-1-eilaimemedsnaimel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
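A minimal userspace sketch of the fix's idea: clear any cached pointer that names the key before the key is handed off for deferred freeing. The struct layouts and the helper name `clear_refs_to_key()` are assumptions made for illustration; the kernel code operates on its own structures and uses its own accessors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified shapes; field names follow the commit text. */
struct ao_key {
	int sndid;
	int rcvid;
};

struct ao_info {
	struct ao_key *current_key;
	struct ao_key *rnext_key;
};

/* Drop every cached reference to 'key' so nothing dangles once it is freed. */
static void clear_refs_to_key(struct ao_info *info, const struct ao_key *key)
{
	assert(info != NULL);
	if (info->current_key == key)
		info->current_key = NULL;
	if (info->rnext_key == key)
		info->rnext_key = NULL;
}
```

A later reader of `current_key` or `rnext_key` then sees NULL instead of a pointer into freed slab memory.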
2026-06-23	Merge tag 'ipsec-2026-06-22' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-06-22 1) xfrm: use compat translator only for u64 alignment mismatch Gate the XFRM_USER_COMPAT translator on COMPAT_FOR_U64_ALIGNMENT so 32-bit compat tasks on arches whose 32-bit ABI already matches the native 64-bit layout are no longer rejected with -EOPNOTSUPP. From Sanman Pradhan. 2) net: af_key: initialize alg_key_len for IPComp states Initialize the alg_key_len to 0 in the IPComp branch of pfkey_msg2xfrm_state() so an uninitialized value cannot drive xfrm_alg_len() into a slab-out-of-bounds kmemdup during XFRM_MSG_MIGRATE. From Zijing Yin. 3) xfrm: Fix dev use-after-free in xfrm async resumption Stash the original skb->dev and extend the RCU critical section across xfrm_rcv_cb() and transport_finish() to prevent a tunnel-device UAF and original-device refcount leak when a callback replaces skb->dev. From Dong Chenchen. 4) xfrm: Fix xfrm state cache insertion race Move the state-validity check inside xfrm_state_lock in the input state cache insertion path so a state cannot be killed between the check and the insert. From Herbert Xu. 5) xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[] Add READ_ONCE()/WRITE_ONCE() annotations on xfrm_policy_count and xfrm_policy_default to silence the KCSAN data race reported on net->xfrm.policy_count. From Eric Dumazet. 6) espintcp: use sk_msg_free_partial to fix partial send Replace the manual skmsg accounting in espintcp with sk_msg_free_partial() so the skmsg stays consistent on every iteration and the partial-send accounting bugs go away. From Sabrina Dubroca. 7) xfrm: validate selector family and prefixlen during match Reject mismatched address families in xfrm_selector_match() and bound prefixlen in addr4_match()/addr_match() to prevent the shift-out-of-bounds syzbot reported when an AF_UNSPEC selector with a large prefixlen is matched against an IPv4 flow. 
From Eric Dumazet. * tag 'ipsec-2026-06-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: validate selector family and prefixlen during match espintcp: use sk_msg_free_partial to fix partial send xfrm: annotate data-races around xfrm_policy_count[] and xfrm_policy_default[] xfrm: Fix xfrm state cache insertion race xfrm: Fix dev use-after-free in xfrm async resumption net: af_key: initialize alg_key_len for IPComp states xfrm: use compat translator only for u64 alignment mismatch ==================== Link: https://patch.msgid.link/20260622075726.29685-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
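Item 7 above bounds prefixlen before it is used as a shift count. The sketch below shows one way such a bound could look for IPv4 in host byte order. The function name is hypothetical, and treating an out-of-range prefixlen as "no match" is an assumption; the pull request text does not say how the real addr4_match() handles it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare the top 'prefixlen' bits of two host-order IPv4 addresses.
 * Assumption: prefixlen > 32 is rejected as a mismatch.
 */
static bool addr4_match_bounded(uint32_t a1, uint32_t a2, unsigned int prefixlen)
{
	if (prefixlen == 0)
		return true;	/* empty prefix matches everything */
	if (prefixlen > 32)
		return false;	/* would otherwise shift out of bounds */

	/* Shift count is now in 0..31, which is well defined for uint32_t. */
	uint32_t mask = ~(uint32_t)0 << (32 - prefixlen);

	assert(mask != 0);
	return ((a1 ^ a2) & mask) == 0;
}
```

For example, 192.0.2.1 and 192.0.2.255 match under a /24 but not under a /32, and a prefixlen of 128 never reaches the shift.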
2026-06-22	Merge tag 'nf-26-06-21' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net. This batches fixes for real crashes with trivial/correctness fixes. There is also a rework of the conntrack expectation timeout strategy to deal with a possible race when removing an expectation. 1) Fix the incorrect flowtable timeout extension for entries in hw offload, from Adrian Bente. This corrects a functional defect, no crash. 2) Hold a reference to the device under the fake dst in br_netfilter, from Haoze Xie. This fixes a possible UaF if the device is removed while the packet is sitting in nfqueue. 3) Reject template conntrack in xt_cluster, otherwise access to uninitialized conntrack fields is possible, leading to WARN_ON due to unset layer 3 protocol. From Wyatt Feng. 4) Make sure the IPv6 tunnel header is in the linear skb data area before pulling. While at it, remove incomplete NEXTHDR_DEST support. From Lorenzo Bianconi. This could possibly lead to a crash if the IPv4 header is not in the linear area. 5) Use test_bit_acquire in ipset hash set to avoid reordering of subsequent memory access. This addresses an LLM-related report, no crash has been observed. From Jozsef Kadlecsik. 6) Use test_bit_acquire in ipset bitmap set too, for the same reason as in the previous patch, from Jozsef Kadlecsik. 7) Call kfree_rcu() after rcu_assign_pointer() to address a possible UaF if kfree_rcu() runs immediately, which to my understanding never happens. Never observed in practice, reported by LLM. Also from Jozsef Kadlecsik. 8) Use disable_delayed_work_sync() instead of cancel_delayed_work_sync() to prevent the ipset GC handler from re-queueing work, as reported by LLM. From Jozsef Kadlecsik. This is for correctness. 9) Restore the check in nft_payload for payload offsets exceeding 2^16. From Florian Westphal. 
This fixes a silent truncation, not a big deal, but better to be assertive and reject it. 10) Validate that NFT_META_BRI_IIFHWADDR can only run from bridge prerouting. From Florian Westphal. Harmless, but it could allow reading bytes from skb->cb. 11) Zero out the destination hardware address during the flowtable path setup, also from Florian. This is a correctness fix; an LLM points out that an infoleak is possible, but the topology needed to achieve it is not clear. 12) Skip IPv4 options if present when building the IPv4 reject reply. Otherwise bytes in the IPv4 options header can be sent back to the origin where the ICMP header is expected. Again from Florian Westphal. 13) Replace the timer API for expectations with a GC worker approach. This implicitly fixes a race in nf_ct_remove_expectations(), which might fail to remove the expectation because timer_del() returns false when the timer has expired and the callback is running concurrently. This fix addresses a crash that has already been reported with a reproducer. 14) Check if br_vlan_get_pvid_rcu() fails, otherwise a 4-byte stack infoleak is possible. From Florian Westphal. 
* tag 'nf-26-06-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_meta_bridge: fix NFT_META_BRI_IIFPVID stack leak netfilter: nf_conntrack_expect: use conntrack GC to reap expectations netfilter: nf_reject: skip iphdr options when looking for icmp header netfilter: nft_flow_offload: zero device address for non-ether case netfilter: nft_meta_bridge: add validate callback for get operations netfilter: nft_payload: reject offsets exceeding 65535 bytes netfilter: ipset: make sure gc is properly stopped netfilter: ipset: fix order of kfree_rcu() and rcu_assign_pointer() netfilter: ipset: Don't use test_bit() in lockless RCU readers in bitmap types netfilter: ipset: Don't use test_bit() in lockless RCU readers in hash types netfilter: flowtable: fix and simplify IP6IP6 tunnel handling netfilter: xt_cluster: reject template conntracks in hash match netfilter: nf_queue: pin bridge device while NFQUEUE holds fake dst netfilter: flowtable: fix offloaded ct timeout never being extended ==================== Link: https://patch.msgid.link/20260620222738.112506-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
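Item 9 in the list above (nft_payload offsets above 65535) comes down to rejecting a value instead of silently truncating it to 16 bits. A hedged userspace sketch follows; the function name is hypothetical and the choice of -ERANGE as the error code is an assumption, not something the pull request states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store a 32-bit offset into a 16-bit field only if it fits.
 * Assumption: out-of-range offsets are reported as -ERANGE.
 */
static int payload_offset_to_u16(uint32_t offset, uint16_t *out)
{
	assert(out != NULL);
	if (offset > UINT16_MAX)
		return -ERANGE;	/* reject rather than truncate */
	*out = (uint16_t)offset;
	return 0;
}
```

Without the check, an offset of 65536 would be stored as 0 and the expression would read from the wrong place without any error.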
2026-06-21	ipv4: account for fraggap on the paged allocation path	Wongi Lee
	In __ip_append_data(), when the paged-allocation branch is taken, alloclen and pagedlen are computed as alloclen = fragheaderlen + transhdrlen; pagedlen = datalen - transhdrlen; datalen already includes fraggap, but the fraggap bytes carried over from the previous skb are copied into the new skb's linear area at offset transhdrlen by the subsequent skb_copy_and_csum_bits(). The linear area is therefore undersized by fraggap bytes while pagedlen is overstated by the same amount. The non-paged branch sets alloclen to fraglen, which already accounts for fraggap because datalen does. Bring the paged branch in line by adding fraggap to alloclen and subtracting it from pagedlen. After this adjustment, copy no longer collapses to -fraggap on the paged path, so remove the stale comment describing that old arithmetic. Fixes: 8eb77cc73977 ("ipv4: avoid partial copy for zc") Signed-off-by: Jungwoo Lee <jwlee2217@gmail.com> Signed-off-by: Wongi Lee <qw3rtyp0@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/ajFR1eLAIs42TN3g@DESKTOP-19IMU7U.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
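The adjusted arithmetic can be checked with a small worked example. The sketch below only mirrors the two formulas from the message, with fraggap added to alloclen and subtracted from pagedlen; `struct frag_len` and `paged_lengths()` are names invented for illustration, and unsigned types are used here for simplicity.

```c
#include <assert.h>

/* Hypothetical container for the two lengths computed on the paged path. */
struct frag_len {
	unsigned int alloclen;	/* bytes placed in the skb linear area */
	unsigned int pagedlen;	/* bytes placed in page fragments */
};

/* Paged-branch lengths after the fix; datalen already includes fraggap. */
static struct frag_len paged_lengths(unsigned int fragheaderlen,
				     unsigned int transhdrlen,
				     unsigned int datalen,
				     unsigned int fraggap)
{
	struct frag_len l;

	assert(datalen >= transhdrlen + fraggap);
	l.alloclen = fragheaderlen + transhdrlen + fraggap;
	l.pagedlen = datalen - transhdrlen - fraggap;
	return l;
}
```

With fragheaderlen = 20, transhdrlen = 0, datalen = 1008 and fraggap = 8, this yields alloclen = 28 and pagedlen = 1000. The two still sum to fragheaderlen + datalen, so the fix only moves the fraggap bytes from the paged part to the linear part.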
2026-06-21	netfilter: nf_reject: skip iphdr options when looking for icmp header	Florian Westphal
	Not a big deal but this should have used the real ip header length and not the base header size. As-is, if there are options then nf_skb_is_icmp_unreach() result will be random. Fixes: db99b2f2b3e2 ("netfilter: nf_reject: don't reply to icmp error messages") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-17	net: ipv4: bound TCP reordering sysctl writes and MTU probe sizes	Wyatt Feng
	Reject invalid `net.ipv4.tcp_reordering` values before they reach TCP socket state. The sysctl is stored as an `int` but copied into the `u32` `tp->reordering` field for new sockets, so negative writes wrap to large values. With `tcp_mtu_probing=2`, the wrapped value can overflow the `tcp_mtu_probe()` size calculation and drive the MTU probing path into an out-of-bounds read. Route `tcp_reordering` writes through `proc_dointvec_minmax()` and require it to be at least 1. Also require `tcp_max_reordering` to be at least 1 so the configured maximum cannot become negative either. When registering the table for a non-init network namespace, relocate `extra2` pointers that refer into `init_net.ipv4` so the `tcp_reordering` upper bound follows that namespace's `tcp_max_reordering`. Harden `tcp_mtu_probe()` itself by computing `size_needed` as `u64`. This keeps the send queue and window checks from being bypassed through signed integer overflow. Fixes: 91cc17c0e5e5 ("[TCP]: MTUprobe: receiver window & data available checks fixed") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/1a5b7e1ef4d70fbad8c8ee0b82d8405f3c964a3d.1781395200.git.bronzed_45_vested@icloud.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
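The wrap and the two mitigations can be illustrated in userspace C. All function names below are hypothetical. The `size_needed` formula (probe size plus `(reordering + 1)` segments of `mss`) is an assumption about what `tcp_mtu_probe()` computes; the message itself only says that the value is now computed as `u64`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An int sysctl value copied into a u32 field: negative values wrap. */
static uint32_t store_reordering(int sysctl_val)
{
	return (uint32_t)sysctl_val;
}

/* Models a minmax-style bound: at least 1, at most the configured maximum. */
static bool reordering_valid(int val, int max_reordering)
{
	assert(max_reordering >= 1);
	return val >= 1 && val <= max_reordering;
}

/* Widen before multiplying so a huge reordering value cannot wrap to a small size. */
static uint64_t size_needed_u64(uint32_t probe_size, uint32_t reordering, uint32_t mss)
{
	return (uint64_t)probe_size + ((uint64_t)reordering + 1) * mss;
}
```

Writing -1 stores 4294967295 in the 32-bit field. With the bound in place that write is rejected, and even if such a value reached the probe path, the 64-bit result stays far above any send-queue or window size rather than wrapping around.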
2026-06-17	net: ip_vti: require CAP_NET_ADMIN in the device netns for changelink	Maoyi Xie
	vti_changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate vti_changelink() on rtnl_dev_link_net_capable() at its top, before any attribute is parsed. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: 895de9a3488a ("vti4: Enable namespace changing") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-4-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-17	net: ipip: require CAP_NET_ADMIN in the device netns for changelink	Maoyi Xie
	ipip_changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Gate ipip_changelink() on rtnl_dev_link_net_capable() at its top, before any attribute is parsed. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: 6c742e714d8c ("ipip: add x-netns support") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-17	net: ip_gre: require CAP_NET_ADMIN in the device netns for changelink	Maoyi Xie
	A tunnel changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net. Add rtnl_dev_link_net_capable() next to rtnl_get_net_ns_capable() in net/core/rtnetlink.c. It requires CAP_NET_ADMIN in the link netns and is skipped when the link netns is dev_net(dev), where the rtnl path already checked it. The other patches in this series use the same helper. Gate ipgre_changelink() and erspan_changelink() with it, at the top of the op before any attribute is parsed, because the parsers update live tunnel fields first. ipgre_netlink_parms() sets t->collect_md before ip_tunnel_changelink() runs. Commit 8b484efd5cb4 ("ip6: vti: Use ip6_tnl.net in vti6_siocdevprivate().") added the same check on the ioctl path. This adds it on RTM_NEWLINK. Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/ Fixes: b57708add314 ("gre: add x-netns support") Cc: stable@vger.kernel.org Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260612085941.3158249-2-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
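The rule the helper enforces can be modelled in a few lines. This is a sketch under stated assumptions, not the kernel helper: `struct netns_model` and `dev_link_net_capable()` are hypothetical, and the real rtnl_dev_link_net_capable() signature is not shown in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a netns: does the caller hold CAP_NET_ADMIN there? */
struct netns_model {
	bool caller_has_net_admin;
};

/*
 * Extra check for changelink: require CAP_NET_ADMIN in the link netns,
 * skipped when it is the same netns the rtnl path already checked.
 */
static bool dev_link_net_capable(const struct netns_model *dev_net,
				 const struct netns_model *link_net)
{
	assert(dev_net != NULL && link_net != NULL);
	if (link_net == dev_net)
		return true;
	return link_net->caller_has_net_admin;
}
```

A caller privileged in dev_net(dev) but not in t->net is refused. Running the check at the top of the op matters because, as the message notes, the attribute parsers already update live tunnel fields.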
2026-06-17	Merge tag 'bpf-next-7.2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "Major changes: - Recover from BPF arena page faults using a scratch page and add ptep_try_set() for lockless empty-slot installs on x86 and arm64. This allows BPF kfuncs to access arena pointers directly. The 'arena_direct_access' stable branch was created for this work and was pulled into sched-ext and bpf-next trees (Tejun Heo, Kumar Kartikeya Dwivedi) - Lift old restriction and support 6+ arguments in BPF programs and kfuncs on x86 and arm64 (Yonghong Song, Puranjay Mohan) Other features and fixes: - Add 24-bit BTF vlen and reclaim unused bits in the BTF UAPI to ease addition of new BTF kinds (Alan Maguire) - Raise the maximum BPF call chain depth from 8 to 16 frames (Alexei Starovoitov) - Refactor object relationship tracking in the verifier and fix a dynptr use-after-free bug (Amery Hung) - Harden the signed program loader and reject exclusive maps as inner maps (Daniel Borkmann) - Replace the verifier min/max bounds fields with a circular number (cnum) representation and improve 32->64 bit range refinements (Eduard Zingerman) - Introduce the arena library and runtime (libarena) with a buddy allocator, rbtree and SPMC queue data structures, ASAN support and a parallel test harness. Allow subprograms to return arena pointers and switch to a BTF type-tag based __arena annotation (Emil Tsalapatis) - Cache build IDs in the sleepable stackmap path and avoid faultable build ID reads under mm locks (Ihor Solodrai) - Introduce the tracing_multi link to attach a single BPF program to many kernel functions at once. 
Allow specifying the uprobe_multi target via FD (Jiri Olsa) - Extend the bpf_list family of kfuncs with bpf_list_add/del(), and bpf_list_is_first/is_last/empty() (Kaitao Cheng) - Extend the BPF syscall with common attributes support for prog_load, btf_load and map_create (Leon Hwang) - Wrap rhashtable as BPF map (Mykyta Yatsenko, Herbert Xu) - Add sleepable support for tracepoint programs and fix deadlocks in LRU map due to NMI reentry (Mykyta Yatsenko) - Fix OOB access in bpf_flow_keys, fix nullness analysis of inner arrays, enforce write checks for global subprograms (Nuoqi Gui) - Report the maximum combined stack depth and print a breakdown of instructions processed per subprogram (Paul Chaignon) - Add an XDP load-balancer benchmark and arm64 JIT support for stack arguments (Puranjay Mohan) - Add kfuncs to traverse over wakeup_sources (Samuel Wu) - Allow sleepable BPF programs to use LPM trie maps directly (Vlad Poenaru) - Many more fixes and cleanups across the verifier, BTF, sockmap, devmap, bpffs, security hooks, s390/riscv/loongarch JITs, rqspinlock, libbpf, bpftool, selftests" * tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (336 commits) selftests/bpf: Work around llvm stack overflow in crypto progs selftests/bpf: add test for bpf_msg_pop_data() overflow bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check sockmap: Fix use-after-free in udp_bpf_recvmsg() bpf, sockmap: keep sk_msg copy state in sync bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data() bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data() selftsets/bpf: Retry map update on helper_fill_hashmap() selftests/bpf: Add test for sleepable lsm_cgroup rejection selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket selftests/bpf: Avoid static LLVM linking for cross builds selftests/bpf: Use common CFLAGS for urandom_read selftests/bpf: Initialize 
operation name before use tools/bpf: build: Append extra cflags libbpf: Initialize CFLAGS before including Makefile.include bpftool: Append extra host flags bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS bpftool: Pass host flags to bootstrap libbpf selftests/bpf: correct CONFIG_PPC64 macro name in comment ...
2026-06-16	ipv4: fib_rule: Move fib4_rules_exit() to ->exit().	Kuniyuki Iwashima
	syzbot reported use-after-free of net->ipv4.rules_ops. [0] It can be reproduced with these commands: while true; do ip netns add ns1 ip -n ns1 link set dev lo up ip -n ns1 address add 192.0.2.1/24 dev lo ip -n ns1 link add name dummy1 up type dummy ip -n ns1 address add 198.51.100.1/24 dev dummy1 ip -n ns1 rule add ipproto tcp sport 12345 table 12345 ip -n ns1 fou add port 5555 ipproto 47 local 192.0.2.1 peer 198.51.100.2 peer_port 54321 ip netns del ns1 done The cited commit moved fib4_rules_exit() earlier to ->exit_rtnl(), but the kernel socket destroyed in ->exit() could eventually reach __fib_lookup(). I left fib4_rules_exit() in ->exit_rtnl() because fib4_rule_delete() calls fib_unmerge(), which requires RTNL. However, when ->delete() is called, ->configure() has already been called, thus fib_unmerge() in ->delete() has no effect. Let's remove fib_unmerge() in fib4_rule_delete() and move fib4_rules_exit() to ->exit(). Many thanks to Ido Schimmel for providing the nice repro very quickly. Note that we can make fib_rules_ops.delete() return void once net-next opens. 
[0]: BUG: KASAN: slab-use-after-free in fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321 Read of size 8 at addr ffff88804ec4c680 by task kworker/u8:21/12641 CPU: 0 UID: 0 PID: 12641 Comm: kworker/u8:21 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026 Workqueue: netns cleanup_net Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description+0x55/0x1e0 mm/kasan/report.c:378 print_report+0x58/0x70 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 fib_rules_lookup+0x15e/0xeb0 net/core/fib_rules.c:321 __fib_lookup+0x106/0x210 net/ipv4/fib_rules.c:96 ip_route_output_key_hash_rcu+0x294/0x2720 net/ipv4/route.c:2811 ip_route_output_key_hash+0x18d/0x2a0 net/ipv4/route.c:2702 __ip_route_output_key include/net/route.h:169 [inline] ip_route_output_flow+0x2a/0x150 net/ipv4/route.c:2929 ip4_datagram_release_cb+0x89d/0xbe0 net/ipv4/datagram.c:118 release_sock+0x206/0x260 net/core/sock.c:3861 inet_shutdown+0x2b1/0x390 net/ipv4/af_inet.c:950 udp_tunnel_sock_release+0x6d/0x80 net/ipv4/udp_tunnel_core.c:197 fou_release net/ipv4/fou_core.c:562 [inline] fou_exit_net+0x17d/0x1f0 net/ipv4/fou_core.c:1230 ops_exit_list net/core/net_namespace.c:199 [inline] ops_undo_list+0x43d/0x8d0 net/core/net_namespace.c:252 cleanup_net+0x572/0x810 net/core/net_namespace.c:702 process_one_work kernel/workqueue.c:3314 [inline] process_scheduled_works+0xa8e/0x14e0 kernel/workqueue.c:3397 worker_thread+0xa47/0xfb0 kernel/workqueue.c:3478 kthread+0x389/0x470 kernel/kthread.c:436 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Fixes: 759923cf03b0 ("ipv4: fib: Convert fib_net_exit_batch() to ->exit_rtnl().") Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6a315824.b0403584.28d0ff.0000.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> 
Link: https://patch.msgid.link/20260616191359.4142661-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-16	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Merge in late fixes in preparation for the net-next PR. Conflicts: net/tls/tls_sw.c 406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms") 79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path") drivers/net/ethernet/microsoft/mana/mana_en.c f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check") d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size") https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	tcp: rehash onto different local ECMP path on retransmit timeout	Neil Spring
	Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO, PLB, and spurious-retransmission events, but the cached route is reused and the new hash is not propagated into the ECMP path selection logic. Two changes are needed to make rehash select a different local ECMP path: 1. Add __sk_dst_reset() alongside sk_rethink_txhash() in tcp_write_timeout(), tcp_rcv_spurious_retrans(), and tcp_plb_check_rehash() so the cached dst is invalidated and the next transmit triggers a fresh route lookup. 2. Set fl6->mp_hash from sk_txhash (or tcp_rsk(req)->txhash for SYN/ACK retransmits and syncookies) in tcp_v6_connect(), inet6_sk_rebuild_header(), inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_send_response(), and cookie_v6_check() so fib6_select_path() picks a path based on the new hash. The mp_hash override only applies to fib_multipath_hash_policy 0 (the default L3 policy). Its hash includes the flow label, but that is 0 by default -- np->flow_label is unset, and auto_flowlabels only computes the on-wire label later, per packet -- so flows to the same peer share one local path. Keying the hash on sk_txhash makes the local path per-connection and lets a rehash re-select it. Policies 1-3 are left unchanged. The mp_hash assignment is factored into a small helper, ip6_ecmp_set_mp_hash(), shared by inet6_csk_route_req(), inet6_csk_route_socket(), tcp_v6_connect(), inet6_sk_rebuild_header(), tcp_v6_send_response(), and cookie_v6_check(). It applies (txhash >> 1) ?: 1 for policy 0 (the >> 1 keeps mp_hash in the 31-bit range; ?: 1 keeps it non-zero, since 0 would fall back to rt6_multipath_hash()). inet6_csk_route_socket() calls it only for sk_protocol == IPPROTO_TCP so that non-TCP callers (e.g., L2TP via inet6_csk_xmit) fall through to rt6_multipath_hash() and retain their existing flow-key-based ECMP behavior. 
tcp_v6_send_response() also sets mp_hash from the response txhash so that a control packet (a RST from the full socket, or an ACK from a time-wait socket) selects the same local ECMP nexthop as the connection's txhash rather than falling back to the flow hash. The time-wait socket's tw_txhash is copied from sk_txhash when the connection enters TIME_WAIT, so it reflects any rehash that occurred. Setting mp_hash explicitly is necessary because the default ECMP hash derives from fl6->flowlabel via np->flow_label, which is not updated from sk_txhash (REPFLOW is off by default). ip6_make_flowlabel() cannot help either, as it runs after the route lookup. As a consequence, for policy 0 the local ECMP path of an IPv6 TCP flow follows sk_txhash even when fl6->flowlabel is non-zero, e.g. a reflected (REPFLOW) or explicitly set (IPV6_FLOWLABEL_MGR) flow label. This is intentional: only local path selection changes, so rehash can recover from a failed path; the on-wire flow label is unchanged. sk_set_txhash() is moved before ip6_dst_lookup_flow() in tcp_v6_connect() so the initial ECMP path is selected by the same txhash that subsequent route rebuilds will use. This avoids unintended path changes when the cached dst is naturally invalidated (e.g., by PMTU discovery or route changes). The rehash sites (tcp_write_timeout(), tcp_plb_check_rehash(), and tcp_rcv_spurious_retrans()) call __sk_rethink_txhash_reset_dst(), which re-rolls the txhash and, when it changed, drops the cached dst so the next transmit re-runs route selection. The dst reset is guarded by sk->sk_family == AF_INET6 since IPv4 ECMP does not currently use sk_txhash for path selection. For IPv4-mapped IPv6 sockets this produces a redundant dst reset on a cold path (RTO/PLB); the subsequent IPv4 route lookup returns the same result. 
The helper is deliberately separate from sk_rethink_txhash() itself: dst_negative_advice() calls sk_rethink_txhash() before its own dst op, so resetting the dst inside sk_rethink_txhash() would skip that op (e.g. rt6_remove_exception_rt()). For syncookies, cookie_init_sequence() computes the cookie value before route_req() and sets txhash so the SYN-ACK selects the same ECMP path that cookie_v6_check() will use when the full socket is created. cookie_tcp_reqsk_init() derives txhash from the cookie so the full socket's ECMP path matches the SYN-ACK. Both the SYN-ACK assignment in tcp_conn_request() and the full-socket assignment in cookie_tcp_reqsk_init() set txhash from the cookie for IPv4 and IPv6 alike. On IPv6 this drives ECMP path selection; on IPv4, which does not use sk_txhash for ECMP, it only affects TX-queue selection. That selection scales the hash by its high bits (reciprocal_scale()), which are uniform in the keyed secure_tcp_syn_cookie() output -- the MSS index only perturbs the low bits -- so the queue distribution matches net_tx_rndhash(). cookie_init_sequence() is split from the former version that also called tcp_synq_overflow() and incremented SYNCOOKIESSENT; those side effects are now in cookie_record_sent(), called after route_req() succeeds so they are not bumped when route_req() fails. cookie_record_sent() is guarded by CONFIG_SYN_COOKIES to match the guard on tcp_synq_overflow(). route_req() receives 0 as tw_isn for the syncookie path so that tcp_v6_init_req() still saves ireq->pktopts for REPFLOW flowlabel reflection and IPv6 cmsg options. The ecn_ok clear for syncookies without timestamps stays after tcp_ecn_create_request() so it takes precedence. Signed-off-by: Neil Spring <ntspring@meta.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260615042158.1600746-2-ntspring@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
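The policy 0 mapping from txhash to mp_hash, (txhash >> 1) ?: 1, can be written in portable C as below. The function name is hypothetical; only the expression comes from the message, and the GNU `?:` shorthand is expanded into a standard conditional.

```c
#include <assert.h>
#include <stdint.h>

/* Map a 32-bit txhash to a non-zero 31-bit multipath hash. */
static uint32_t ecmp_mp_hash_from_txhash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;	/* keep the value in the 31-bit range */
	uint32_t mp_hash = h ? h : 1;	/* 0 would mean "fall back to the flow hash" */

	assert(mp_hash != 0 && mp_hash <= 0x7FFFFFFFu);
	return mp_hash;
}
```

Both txhash 0 and txhash 1 map to 1, and 0xFFFFFFFF maps to 0x7FFFFFFF, so the result is never 0 and never exceeds 31 bits.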
2026-06-15	Merge tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next	Jakub Kicinski
	Pablo Neira Ayuso says:
	====================
	Netfilter/IPVS updates for net-next
	The following patchset contains Netfilter/IPVS updates for net-next. More specifically, this contains a conncount rework to address AI-related reports, assorted Netfilter updates and two small incremental updates on IPVS:
	1) Replace old obsolete workqueues (system_wq, system_unbound_wq) in IPVS, from Marco Crivellari.
	2) Replace WARN_ON{_ONCE} by DEBUG_NET_WARN_ON_ONCE in nf_tables. In recent years, reporters have said that the use of WARN_ON{_ONCE} in conjunction with panic_on_warn=1 results in DoS. Let's replace it by DEBUG_NET_WARN_ON_ONCE so this is only exercised by test infrastructure and fuzzers, while also providing context to AI agents. From Fernando F. Mancera.
	Five patches from Florian Westphal to address AI reports in the conncount infrastructure:
	3) Fix missing rcu read lock section when calling __ovs_ct_limit_get_zone_limit().
	4) Add a dedicated lock per rbtree; this increases memory usage but should improve scalability.
	5) Add a helper function to find the rbtree node, no functional changes are intended.
	6) Add sequence counter to detect concurrent tree modifications and retry lookups.
	7) Add locks to GC conncount walk and address other nitpicks.
	Then, several assorted updates:
	8) Defensive tree-wide addition of NULL checks for ct extensions.
	9) Bail out if flowtable bypass cannot be fully set up from the flow offload expression, instead of lazily building a likely incomplete one.
	10) Fix documentation for the new conn_max sysctl toggle in IPVS.
	11) Add nf_dev_xmit_recursion() helpers and use them, to address recent AI reports.
	* tag 'nf-next-26-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
	  netfilter: nf_dup_netdev: add nf_dev_xmit_recursion*() helpers and use them
	  ipvs: fix doc syntax for conn_max sysctl
	  netfilter: flowtable: bail out if forward path cannot be discovered
	  netfilter: conntrack: check NULL when retrieving ct extension
	  netfilter: nf_conncount: gc and rcu fixes
	  netfilter: nf_conncount: add sequence counter to detect tree modifications
	  netfilter: nf_conncount: split count_tree_node rbtree walk into helper
	  netfilter: nf_conncount: use per nf_conncount_data spinlocks
	  netfilter: nf_conncount: callers must hold rcu read lock
	  netfilter: nf_tables: use DEBUG_NET_WARN_ON_ONCE in packet and control paths
	  ipvs: Replace use of system_unbound_wq with system_dfl_long_wq
	====================
	Link: https://patch.msgid.link/20260614114605.474783-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	ipv4: fib: Convert fib_net_exit_batch() to ->exit_rtnl().	Kuniyuki Iwashima
	Currently, IPv4 routes are flushed in ->exit_batch() after all devices are unregistered. Unlike IPv6, IPv4 routes are not added from the fast path, so we can flush routes before default_device_exit_batch(). Let's call ip_fib_net_exit() from ->exit_rtnl() to save one RTNL locking dance. ip_fib_net_exit() must use list_del_rcu() for fib_table for the fast path on dying dev. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260612063225.455191-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	ipv4: fib: Avoid calling fib_trie_table() in fib_new_table() for dying net.	Kuniyuki Iwashima
	We will call ip_fib_net_exit() from ->exit_rtnl(). All fib_table will be destroyed before devices are unregistered. During device unregistration, inetdev_destroy() could call fib_del_ifaddr(), which calls fib_magic(RTM_DELROUTE). fib_magic() calls fib_new_table(), but we do not want to create a new table after ip_fib_net_exit() destroys all tables. As a prep, let's add check_net() before fib_trie_table() in fib_new_table(). fib_trie_table() is also called from fib_trie_unmerge(), but fib_get_table() fails first in fib_unmerge(), so the same problem does not occur there. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260612063225.455191-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	ipv4: fib: Free net->ipv4.{fib_table_hash,notifier_ops} without RTNL.	Kuniyuki Iwashima
	We will call ip_fib_net_exit() from ->exit_rtnl(). However, some paths will still access net->ipv4.fib_table_hash after ->exit_rtnl(). For example, fib_flush() is called from fib_disable_ip() for NETDEV_UNREGISTER. Let's move kfree(net->ipv4.fib_table_hash) and fib4_notifier_exit() from ip_fib_net_exit() to its caller. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260612063225.455191-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	ipv4: fib: Call fib_proc_exit() and nl_fib_lookup_exit() at ->pre_exit().	Kuniyuki Iwashima
	We will call ip_fib_net_exit() from ->exit_rtnl(). Since the exit callbacks are called in the following order,
	  1. ->pre_exit()
	     ~~~ synchronize_rcu() ~~~
	  2. ->exit_rtnl()  : ip_fib_net_exit()
	  3. ->exit()       : fib_proc_exit() / nl_fib_lookup_exit()
	  4. ->exit_batch() : fib4_semantics_exit()
	the reverse order of fib_net_init() would get messed up. Let's move fib_proc_exit() and nl_fib_lookup_exit() to ->pre_exit(). This is fine because procfs/netlink access from userspace cannot occur at this point and synchronize_rcu() is not needed. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260612063225.455191-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-15	ipv4: fib: Flush all fib_info in fib_table_flush() during netns dismantle.	Kuniyuki Iwashima
	Even when fib_table_flush() is called with flush_all set to true, it does not flush every fib_info, because entries matching this condition are skipped: !(fi->fib_flags & RTNH_F_DEAD) && !fib_props[fa->fa_type].error This creates an implicit ordering between default_device_exit_batch() and fib_net_exit_batch(): fib_table_flush(flush_all=true) must be called after all devices are NETDEV_UNREGISTERed, that is, after nexthop_flush_dev() has marked RTNH_F_DEAD. Reversing the order would cause a memory leak. In contrast, fib_table_flush() does not skip non-dead error routes when flush_all is true, because that skip already depends on !flush_all: !flush_all && !(fi->fib_flags & RTNH_F_DEAD) && fib_props[fa->fa_type].error Let's merge the two conditions so that no non-dead fib_info is skipped during netns dismantle. Note that we could further apply !flush_all to the basic table id check and the rtmsg_fib() call in the loop. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260612063225.455191-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
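The effect of merging the two skip conditions can be modelled in userspace. This is a sketch under assumptions: struct entry and both functions are hypothetical stand-ins, skip_after() is one reading of the merged condition described in the message rather than a copy of the patch, and the real loop also checks the table id and other state.

```c
#include <stdbool.h>

/* Simplified stand-ins for the per-entry state named in the message. */
struct entry {
	bool dead;   /* models fi->fib_flags & RTNH_F_DEAD */
	bool error;  /* models fib_props[fa->fa_type].error */
};

/* Before: the two separate skip conditions quoted in the message. */
static bool skip_before(struct entry e, bool flush_all)
{
	if (!e.dead && !e.error)
		return true;
	if (!flush_all && !e.dead && e.error)
		return true;
	return false;
}

/* After (assumed merged form): a non-dead entry is skipped only when
 * this is not a full flush. */
static bool skip_after(struct entry e, bool flush_all)
{
	return !flush_all && !e.dead;
}
```

The two predicates agree for every input with flush_all false. With flush_all true they differ in exactly one case, a non-dead, non-error entry, which the old form skips and the merged form flushes.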
2026-06-14	sockmap: Fix use-after-free in udp_bpf_recvmsg()	Kuniyuki Iwashima
	syzbot reported use-after-free of struct sk_msg in sk_msg_recvmsg(). [0] sk_msg_recvmsg() peeks sk_msg from psock->ingress_msg under a lock, but its processing is lockless. Thus, sk_msg_recvmsg() must be serialised by callers, otherwise multiple threads could touch the same sk_msg. For example, TCP uses lock_sock(), and AF_UNIX uses unix_sk(sk)->iolock. Initially, udp_bpf_recvmsg() had used lock_sock(), but the cited commit removed it. Let's serialise sk_msg_recvmsg() with lock_sock() in udp_bpf_recvmsg(). Note that holding spin_lock_bh(&sk->sk_receive_queue.lock) is not an option due to copy_page_to_iter() in sk_msg_recvmsg(). [0]: BUG: KASAN: slab-use-after-free in sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 Read of size 4 at addr ffff88814cdcf000 by task syz.0.24/6020 CPU: 1 UID: 0 PID: 6020 Comm: syz.0.24 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 01/13/2026 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xba/0x230 mm/kasan/report.c:482 kasan_report+0x117/0x150 mm/kasan/report.c:595 sk_msg_recvmsg+0xb54/0xc30 net/core/skmsg.c:428 udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 sock_recvmsg_nosec net/socket.c:1078 [inline] sock_recvmsg+0x1a8/0x270 net/socket.c:1100 ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 ___sys_recvmsg+0x215/0x590 net/socket.c:2854 do_recvmmsg+0x334/0x800 net/socket.c:2949 __sys_recvmmsg net/socket.c:3023 [inline] __do_sys_recvmmsg net/socket.c:3046 [inline] __se_sys_recvmmsg net/socket.c:3039 [inline] __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb319f9aeb9 Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 
c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb31ad97028 EFLAGS: 00000246 ORIG_RAX: 000000000000012b RAX: ffffffffffffffda RBX: 00007fb31a216090 RCX: 00007fb319f9aeb9 RDX: 0000000000000001 RSI: 0000200000000400 RDI: 0000000000000004 RBP: 00007fb31a008c1f R08: 0000000000000000 R09: 0000000000000000 R10: 0000000040000021 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fb31a216128 R14: 00007fb31a216090 R15: 00007ffe21dd0a98 </TASK> Allocated by task 6019: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 poison_kmalloc_redzone mm/kasan/common.c:398 [inline] __kasan_kmalloc+0x93/0xb0 mm/kasan/common.c:415 kasan_kmalloc include/linux/kasan.h:263 [inline] __kmalloc_cache_noprof+0x3d1/0x6e0 mm/slub.c:5780 kmalloc_noprof include/linux/slab.h:957 [inline] kzalloc_noprof include/linux/slab.h:1094 [inline] alloc_sk_msg net/core/skmsg.c:510 [inline] sk_psock_skb_ingress_self+0x60/0x350 net/core/skmsg.c:612 sk_psock_verdict_apply net/core/skmsg.c:1038 [inline] sk_psock_verdict_recv+0x7d9/0x8d0 net/core/skmsg.c:1236 udp_read_skb+0x73e/0x7e0 net/ipv4/udp.c:2045 sk_psock_verdict_data_ready+0x12d/0x550 net/core/skmsg.c:1257 __udp_enqueue_schedule_skb+0xc54/0x10b0 net/ipv4/udp.c:1789 __udp_queue_rcv_skb net/ipv4/udp.c:2346 [inline] udp_queue_rcv_one_skb+0xac5/0x19c0 net/ipv4/udp.c:2475 __udp4_lib_mcast_deliver+0xc06/0xcf0 net/ipv4/udp.c:2585 __udp4_lib_rcv+0x10f6/0x2620 net/ipv4/udp.c:2724 ip_protocol_deliver_rcu+0x282/0x440 net/ipv4/ip_input.c:207 ip_local_deliver_finish+0x3bb/0x6f0 net/ipv4/ip_input.c:241 NF_HOOK+0x336/0x3c0 include/linux/netfilter.h:318 dst_input include/net/dst.h:474 [inline] ip_sublist_rcv_finish+0x221/0x2a0 net/ipv4/ip_input.c:584 ip_list_rcv_finish net/ipv4/ip_input.c:628 [inline] ip_sublist_rcv+0x5c6/0xa70 net/ipv4/ip_input.c:644 ip_list_rcv+0x3f1/0x450 net/ipv4/ip_input.c:678 __netif_receive_skb_list_ptype net/core/dev.c:6195 [inline] 
__netif_receive_skb_list_core+0x7e5/0x810 net/core/dev.c:6242 __netif_receive_skb_list net/core/dev.c:6294 [inline] netif_receive_skb_list_internal+0x995/0xcf0 net/core/dev.c:6385 netif_receive_skb_list+0x54/0x410 net/core/dev.c:6437 xdp_recv_frames net/bpf/test_run.c:269 [inline] xdp_test_run_batch net/bpf/test_run.c:350 [inline] bpf_test_run_xdp_live+0x1946/0x1cf0 net/bpf/test_run.c:379 bpf_prog_test_run_xdp+0x81c/0x1160 net/bpf/test_run.c:1396 bpf_prog_test_run+0x2c7/0x340 kernel/bpf/syscall.c:4703 __sys_bpf+0x5cb/0x920 kernel/bpf/syscall.c:6182 __do_sys_bpf kernel/bpf/syscall.c:6274 [inline] __se_sys_bpf kernel/bpf/syscall.c:6272 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:6272 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 6021: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free mm/slub.c:6674 [inline] kfree+0x1be/0x650 mm/slub.c:6882 kfree_sk_msg include/linux/skmsg.h:385 [inline] sk_msg_recvmsg+0xaa8/0xc30 net/core/skmsg.c:483 udp_bpf_recvmsg+0x4bd/0xe00 net/ipv4/udp_bpf.c:84 inet_recvmsg+0x260/0x270 net/ipv4/af_inet.c:891 sock_recvmsg_nosec net/socket.c:1078 [inline] sock_recvmsg+0x1a8/0x270 net/socket.c:1100 ____sys_recvmsg+0x1e6/0x4a0 net/socket.c:2812 ___sys_recvmsg+0x215/0x590 net/socket.c:2854 do_recvmmsg+0x334/0x800 net/socket.c:2949 __sys_recvmmsg net/socket.c:3023 [inline] __do_sys_recvmmsg net/socket.c:3046 [inline] __se_sys_recvmmsg net/socket.c:3039 [inline] __x64_sys_recvmmsg+0x198/0x250 net/socket.c:3039 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94 
entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 9f2470fbc4cb ("skmsg: Improve udp_bpf_recvmsg() accuracy") Reported-by: syzbot+9307c991a6d07ce6e6d8@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69922ac9.a70a0220.2c38d7.00e0.GAE@google.com/ Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260615021959.140010-5-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14	netfilter: conntrack: check NULL when retrieving ct extension	Pablo Neira Ayuso
	nf_ct_ext_find() might return NULL if the ct extension is not found. Also add NULL checks to: - nfct_help() - nfct_help_data() - nfct_seqadj() - nfct_nat() This is defensive, for safety reasons. nf_ct_ext_find() used to return NULL for unconfirmed conntracks when the extension was stale, i.e. when the genid validation failed. Skip the NULL check in nf_nat_inet_fn(), since NULL is valid there for non-initialized ct nat extensions. While at it, fetch the ct helper area in nf_ct_expect_related_report() only once and pass it on to other ancillary functions. Replace WARN_ON() by WARN_ON_ONCE() in nf_ct_unlink_expect_report(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-13	ipv4: handle devconf post-set actions on netlink updates	Fernando Fernandez Mancera
	When IPv4 device configuration parameters are updated via netlink, the kernel currently only updates the value. This bypasses several post-modification actions that occur when these same parameters are updated via sysctl, such as flushing the routing cache or emitting RTM_NEWNETCONF notifications. This patch addresses the inconsistency by calling the devinet_conf_post_set() helper inside inet_set_link_af(). If a flush is required, we defer it until the netlink attribute parsing loop completes. This ensures consistent behavior and side-effects for devconf changes, regardless of whether they are initiated via sysctl or netlink. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260609204520.4670-2-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
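The "defer the flush until the attribute loop completes" pattern can be sketched as follows. All names here (ATTR_PLAIN, ATTR_NEEDS_FLUSH, post_set(), cache_flush(), apply_attrs()) are hypothetical illustrations; the sketch models only the control flow, not inet_set_link_af() or devinet_conf_post_set() themselves.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute ids, for illustration only. */
enum { ATTR_PLAIN, ATTR_NEEDS_FLUSH };

static int flush_count;  /* counts how often the modelled flush runs */

static void cache_flush(void)
{
	flush_count++;
}

/* Stand-in for a post-set helper: returns true when a flush is required. */
static bool post_set(int attr)
{
	return attr == ATTR_NEEDS_FLUSH;
}

/* Process every attribute first, then flush at most once. */
static void apply_attrs(const int *attrs, size_t n)
{
	bool need_flush = false;

	for (size_t i = 0; i < n; i++)
		need_flush |= post_set(attrs[i]);

	if (need_flush)
		cache_flush();
}
```

Returning a boolean instead of flushing inside the helper means a request carrying several flush-worthy attributes triggers one flush rather than one per attribute.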
2026-06-13	ipv4: centralize devconf sysctl handling	Fernando Fernandez Mancera
	The logic for handling IPv4 devconf sysctls is scattered. Notification and cache flushes are managed in devinet_conf_proc(), while a separate ipv4_doint_and_flush() function and the DEVINET_SYSCTL_FLUSHING_ENTRY macro are used for properties that solely require a cache flush. This patch refactors the sysctl handling by introducing a centralized helper, devinet_conf_post_set(). This new function evaluates the changed attribute and handles all necessary operations, such as triggering netlink notifications. It returns a boolean indicating whether a routing cache flush is required. The boolean is necessary because this function will be reused for netlink IPv4 devconf handling, where the cache flush must wait until all the attributes have been processed. Finally, this introduces a small change in behavior for IPV4_DEVCONF_ROUTE_LOCALNET. As commit d0daebc3d622 ("ipv4: Add interface option to enable routing of 127.0.0.0/8") intended, the cache flush should only be performed when ROUTE_LOCALNET changes from 1 to 0. Unfortunately, this was not the case: the implementation used DEVINET_SYSCTL_FLUSHING_ENTRY for the attribute, which made the related code in devinet_conf_proc() dead. IPV4_DEVCONF_FORWARDING is still handled separately as it requires more operations. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Link: https://patch.msgid.link/20260609204520.4670-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-13	tcp: refine tcp_sequence() for the FIN exception	Eric Dumazet
	Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction receiver requirements") removed the special FIN case that was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of FIN packets when RWIN is 0"). If a peer sends a segment containing data and a FIN flag before it learns about our window retraction and has a buggy TCP stack, it might place the FIN one byte beyond what it thinks is the right edge of the window (i.e., max_window_edge + 1). The data portion (end_seq - th->fin) will end exactly at max_window_edge. In this case, we will drop the packet if our receive queue is not empty, even though the data was sent within the window we previously allowed. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Baatz <gmbnomis@gmail.com> Link: https://patch.msgid.link/20260608151452.706822-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
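The arithmetic behind the FIN exception can be illustrated in isolation. This is a minimal sketch, not the tcp_sequence() change itself: seq_after() mirrors the usual wrap-safe sequence comparison, while data_beyond_edge() and its parameters are hypothetical names, and the receive-queue condition mentioned above is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe comparison: true if seq1 is after seq2 in sequence space. */
static bool seq_after(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq2 - seq1) < 0;
}

/* Hypothetical check: does the segment extend past the right edge once
 * the one sequence number consumed by FIN is excluded? */
static bool data_beyond_edge(uint32_t end_seq, bool fin,
			     uint32_t max_window_edge)
{
	return seq_after(end_seq - (fin ? 1u : 0u), max_window_edge);
}
```

With a right edge of 1000, a segment ending at 1001 with FIN set has data ending exactly at 1000 and is not beyond the edge, while the same end_seq without FIN is.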
2026-06-13	Merge tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next	Jakub Kicinski
	Steffen Klassert says:
	====================
	pull request (net-next): ipsec-next 2026-06-12
	1) Replace the open-coded manual cleanup in xfrm_add_policy() error path with xfrm_policy_destroy() for consistency with xfrm_policy_construct(). From Deepanshu Kartikey.
	2) Limit XFRMA_TFCPAD to a sensible maximum (max IP length, 64k) since u32 is excessive for traffic flow confidentiality padding. From David Ahern.
	3) Add a new netlink message XFRM_MSG_MIGRATE_STATE that allows migrating individual IPsec SAs independently of their policies. The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA migration, lacks SPI for unique SA identification, and cannot express reqid changes or migrate Transport mode selectors. The new interface identifies the SA via SPI and mark, supports reqid changes, address family changes, encap removal, and uses an atomic create+install flow under x->lock to prevent SN/IV reuse during AEAD SA migration. From Antony Antony.
	* tag 'ipsec-next-2026-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
	  xfrm: add documentation for XFRM_MSG_MIGRATE_STATE
	  xfrm: restrict netlink attributes for XFRM_MSG_MIGRATE_STATE
	  xfrm: add XFRM_MSG_MIGRATE_STATE for single SA migration
	  xfrm: make xfrm_dev_state_add xuo parameter const
	  xfrm: extract address family and selector validation helpers
	  xfrm: refactor XFRMA_MTIMER_THRESH validation into a helper
	  xfrm: move encap and xuo into struct xfrm_migrate
	  xfrm: add error messages to state migration
	  xfrm: add state synchronization after migration
	  xfrm: check family before comparing addresses in migrate
	  xfrm: split xfrm_state_migrate into create and install functions
	  xfrm: rename reqid in xfrm_migrate
	  xfrm: fix NAT-related field inheritance in SA migration
	  xfrm: allow migration from UDP encapsulated to non-encapsulated ESP
	  xfrm: add extack to xfrm_init_state
	  xfrm: remove redundant assignments
	  xfrm: Reject excessive values for XFRMA_TFCPAD
	  xfrm: cleanup error path in xfrm_add_policy()
	====================
	Link: https://patch.msgid.link/20260612074725.1760473-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-12	net: remove some unused EXPORT_SYMBOL()s	Sabrina Dubroca
	chtls was using a lot of symbols that no other module requires. Remove those EXPORT_SYMBOL()s. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/d124db74f6f0838b652f0ee4b4530964f3cf8d49.1781165969.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-12	ip_tunnel: annotate data-races around t->err_count and t->err_time	Eric Dumazet
	ip_tunnel_xmit() runs locklessly (dev->lltx == true). ipgre_err() and ipip_err() also run locklessly. We need to add READ_ONCE() and WRITE_ONCE() annotations around t->err_count and t->err_time. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260611165247.2710257-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
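What such annotations look like can be sketched in userspace. The macros below are simplified approximations of READ_ONCE()/WRITE_ONCE() built on a volatile access (GNU C __typeof__), and struct tunnel, note_error() and consume_error() are hypothetical; only the field names come from the message. The annotations keep the compiler from tearing or refetching the accesses; they do not make the read-modify-write atomic.

```c
/* Userspace approximations of the kernel macros. */
#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical reduced tunnel state; field names follow the message. */
struct tunnel {
	int err_count;
	unsigned long err_time;
};

/* Error-handler side: record one more error and when it happened. */
static void note_error(struct tunnel *t, unsigned long now)
{
	WRITE_ONCE(t->err_count, READ_ONCE(t->err_count) + 1);
	WRITE_ONCE(t->err_time, now);
}

/* Transmit side: consume one pending error, if any.
 * Returns 1 if an error was pending, 0 otherwise. */
static int consume_error(struct tunnel *t)
{
	int count = READ_ONCE(t->err_count);

	if (count <= 0)
		return 0;
	WRITE_ONCE(t->err_count, count - 1);
	return 1;
}
```

Reading the counter once into a local and working from that copy is the point of the pattern: both the test and the decrement see the same value even if another CPU updates the field in between.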
2026-06-12	tcp: clear sock_ops cb flags before force-closing a child socket	Sechang Lim
	A child socket inherits the listener's bpf_sock_ops_cb_flags via sk_clone_lock(). If its setup fails in tcp_v4_syn_recv_sock() / tcp_v6_syn_recv_sock(), the child is freed through put_and_exit, where inet_csk_prepare_forced_close() drops the socket lock and tcp_done() runs without it. If BPF_SOCK_OPS_STATE_CB_FLAG was inherited, tcp_done() -> tcp_set_state() calls tcp_call_bpf(), which expects the lock and trips sock_owned_by_me(): WARNING: include/net/sock.h:1799 at tcp_set_state+0x433/0x550 RIP: 0010:tcp_set_state+0x433/0x550 include/net/sock.h:1799 Call Trace: <IRQ> tcp_done+0xba/0x250 net/ipv4/tcp.c:5095 tcp_v4_syn_recv_sock+0x850/0xa50 net/ipv4/tcp_ipv4.c:1787 tcp_check_req+0xf30/0x1360 net/ipv4/tcp_minisocks.c:926 tcp_v4_rcv+0x1047/0x1b50 net/ipv4/tcp_ipv4.c:2164 </IRQ> The child is freed before it is ever established, so it should run no sock_ops callback. Clear its cb flags in inet_csk_prepare_for_destroy_sock(), the common point for the IPv4, IPv6 and chtls forced-close paths and for the MPTCP ->syn_recv_sock() failure path (dispose_child), which reaches tcp_done() on a child that was never established too. Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Fixes: d44874910a26 ("bpf: Add BPF_SOCK_OPS_STATE_CB") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260611092923.1895982-1-rhkrqnwk98@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>