linux.git - Linux kernel source tree

Age	Commit message	Author
9 days	netfilter: nft_payload: fix mask build for partial field offload	Xiang Mei (Microsoft)
	nft_payload_offload_mask() builds the offload match mask for a payload expression that covers only part of a header field. For a partial IPv6 address match (field_len = 16, priv_len = 1) that shift is 1 << 120, which is undefined on the 32-bit int operand. It also trims only one word, so the remaining words stay 0xffffffff (and when priv_len is a multiple of 4 the trim is skipped entirely), leaving the mask covering more bytes than the rule matches. UBSAN: shift-out-of-bounds in net/netfilter/nft_payload.c:278:20 shift exponent 120 is too large for 32-bit type 'int' ... The match is byte-granular and struct nft_data is zero-initialised, so the correct mask is simply the first priv_len bytes set to 0xff. Set those bytes directly and drop the word/shift trimming; this removes the undefined shift and no longer over-masks the trailing bytes. Fixes: a5d45bc0dc50 ("netfilter: nftables_offload: build mask based from the matching bytes") Reported-by: AutonomousCodeSecurity@microsoft.com Signed-off-by: Xiang Mei (Microsoft) <xmei5@asu.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
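The byte-granular mask described above can be sketched in userspace C. This is only an illustration, not the kernel function: `build_prefix_mask` and its parameters are hypothetical names, and the first `memset()` stands in for the zero-initialised `struct nft_data`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the kernel tree): build a mask that
 * covers only the first match_len bytes of a field_len-byte field. */
static void build_prefix_mask(unsigned char *mask, size_t field_len,
                              size_t match_len)
{
    if (match_len > field_len)
        match_len = field_len;

    memset(mask, 0x00, field_len);  /* models the zero-initialised data */
    memset(mask, 0xff, match_len);  /* set exactly the matched bytes */
}
```

For a partial IPv6 address match (field length 16, match length 1) this leaves byte 0 at 0xff and bytes 1 to 15 at zero, with no shift involved.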
9 days	ipvs: clear the nfct flag under lock	Julian Anastasov
	Sashiko warns that cp->flags should be changed under cp->lock. Fixes: 35dfb013149f ("ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1") Fixes: f0a5e4d7a594 ("ipvs: allow connection reuse for unconfirmed conntrack") Link: https://sashiko.dev/#/patchset/CALMqdkR704S2BG_QD_bgHTFp2%2B1QCi7n0T4zoZyTo8mDZevYSA%40mail.gmail.com Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: do not mangle ICMP replies for non-first fragments	Julian Anastasov
	Sashiko warns that ip_vs_nat_icmp() unconditionally mangles the payload for embedded non-first IPv4 fragments. The problem is in the very old inverted pp->dont_defrag check which should not continue when embedded is a non-first TCP/UDP/SCTP fragment. Check for embedded non-first fragment is also missing from ip_vs_out_icmp_v6(), it is needed before any connection lookups that expect ports after the network headers. Drop the blocking code from ip_vs_in_icmp_v6() which prevents ICMPv6 from local clients to use non-MASQ forwarding. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://sashiko.dev/#/patchset/20260720201122.79882-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: fix places with wrong packet offsets	Julian Anastasov
	The offsets we use to packet headers and payloads should be based on skb->data. We even already respect non-zero network offset in ip_vs_fill_iph_skb() but some places do it wrongly and support only zero offset which is expected for the IP layer where IPVS has hooks. Change all places that instead of skb->data use offsets based on the network header (skb_network_header, ip_hdr, etc) because this doubles the network offset as noted by Sashiko. For ip_vs_nat_icmp_v6() we can even rely on the IPv6 header parsing done by the caller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://sashiko.dev/#/patchset/20260710143733.29741-2-fw%40strlen.de Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: fix the checksum validations	Julian Anastasov
	ip_vs_in_icmp_v6() is missing checksum validation for ICMPv6 packets from clients. In fact, as for TCP/UDP we should validate the checksum for ICMP packets only when we mangle the packets on MASQ or on reply for tunnel. Also, Sashiko points out that handle_response_icmp() being common for IPv4 and IPv6 is missing the pseudo-header calculation while validating ICMPv6 messages from real servers which is a problem if checksum is not validated by the hardware. Fix the problems by creating ip_vs_checksum_common_check() helper and use it for TCP/UDP/ICMP both for IPv4 and IPv6. Rely on the nf_checksum() for validating the ICMP messages but use it also for TCP and UDP. Use correct IP offset for IP_VS_DBG_RL_PKT for TCP/UDP/SCTP. IPVS packets (TCP/UDP/SCTP/ICMP) do not need checksum validation on LOCAL_OUT (local clients or local real servers) and on FORWARD (traffic from servers on LAN). Do it only on LOCAL_IN, in case nf_checksum() is not called on PRE_ROUTING. Also, ip_vs_checksum_complete() can be marked static. Fixes: 2a3b791e6e11 ("IPVS: Add/adjust Netfilter hook functions and helpers for v6") Link: https://sashiko.dev/#/patchset/20260708180315.77413-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	netfilter: xt_hashlimit: validate hashtable supports XT_HASHLIMIT_RATE_MATCH	Pablo Neira Ayuso
	The XT_HASHLIMIT_RATE_MATCH flag mode changes the semantics of the dsthash_ent structure which represents an entry in the hashtable. There is a union area which uses a different layout to express the rate match mode. Update the .checkentry path to validate the XT_HASHLIMIT_RATE_MATCH mode flag when two or more different rules refer to the same hashtable. Otherwise, uninitialized access to the burst field in the union is possible. Also reject the XT_HASHLIMIT_RATE_MATCH mode flag if it is set by a revision lower than 3. Fixes: bea74641e378 ("netfilter: xt_hashlimit: add rate match mode") Reported-and-tested-by: Talha Berk Arslan <talha.anything.info@gmail.com> Link: https://patch.msgid.link/20260721074629.668-1-talha.anything.info@gmail.com/ Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	netfilter: nf_tables: make nft_object rhltable per table	Pablo Neira Ayuso
	The nft_object rhltable is global, which allows objects that are being dismantled to be accessed from the lookup path by other existing netns. Given that nft_obj_destroy() releases the object immediately, this might lead to use-after-free of the objects that are being released. Make the existing rhltable per table to address this issue and to deal with the nft_rcv_nl_event() path too. Update nft_obj_lookup() to take the table as non-const, otherwise the compiler complains when passing the objname_ht to rhltable_lookup(). Fixes: 4d44175aa5bb ("netfilter: nf_tables: handle nft_object lookups via rhltable") Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: adjust double hashing when fwd method changes	Julian Anastasov
	Synced conns can be created with one forwarding method and later updated with a different one after the dest server is configured. This requires adjusting the hashing for node hn1 because only MASQ supports double hashing. Modify conn_tab_lock() to support seeking for hash node hn0 together with adding for hn1. This way we can safely modify the forwarding method and hn1.hash_key under the bucket lock for the first node hn0. The forwarding method is also protected by cp->lock as it is part of cp->flags. Fix the usage of stale idx/idx2 values in conn_tab_lock() after jumping to the retry label. Instead, use idx/idx2 values just to order the locking for the old/new tables. Reported-by: Zhiling Zou <roxy520tt@gmail.com> Link: https://patch.msgid.link/1b914f41d725bc064c9ba9830dc8169329737270.1782540466.git.roxy520tt@gmail.com/ Link: https://sashiko.dev/#/patchset/CALMqdkR704S2BG_QD_bgHTFp2%2B1QCi7n0T4zoZyTo8mDZevYSA%40mail.gmail.com Fixes: f20c73b0460d ("ipvs: use more keys for connection hashing") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: do not propagate one-packet flag to synced conns	Zhiling Zou
	Synced connections can be created before their destination exists. When the destination is later added, ip_vs_bind_dest() copies connection flags from the destination into cp->flags. IP_VS_CONN_F_ONE_PACKET connections are not synced. If a synced connection inherits IP_VS_CONN_F_ONE_PACKET while it is already hashed, expiry can treat it as a one-packet connection and skip unlinking the existing conn_tab node, leaving stale hash nodes pointing at a freed struct ip_vs_conn. Drop IP_VS_CONN_F_ONE_PACKET from destination flags when binding synced connections. Fixes: 26ec037f9841 ("IPVS: one-packet scheduling") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Zhiling Zou <roxy520tt@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	netfilter: ipset: do not update comments from kernel-side hash adds	David Lee
	mtype_resize() copies comment pointers with memcpy(), not the comment objects themselves. During the window after an entry has been copied but before the table swap and backlog replay, the old table is still published for packet-side updates while the replacement-table entry already holds the same ip_set_comment_rcu pointer. If xt_SET --add-set ... --exist hits that old entry in this window, mtype_add() calls ip_set_init_comment() even though packet-side adds carry no comment payload. That call frees the shared comment through the old entry, so the replacement-table entry now holds a stale pointer. When the queued add is replayed on the new table, mtype_add() calls ip_set_init_comment() again and strlen() dereferences the stale pointer. Fix this in mtype_add() by skipping ip_set_init_comment() when ext->target marks a packet-side add. Userspace adds still update comments, while packet-side adds can no longer free comment storage shared with a resize copy. Fixes: f66ee0410b1c ("netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports") Cc: stable@vger.kernel.org Signed-off-by: David Lee <david.lee@trailofbits.com> Assisted-by: Codex:gpt-5.5 Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
10 days	netfilter: nf_conntrack_expect: add and use nf_ct_expect_related_pair()	Pablo Neira Ayuso
	Add a new function to insert a pair of expectations, this is required by the SIP and H323 NAT helpers. The spinlock is held to check if there is a slot for both expectations, in such case, insert them. This removes the need for nf_ct_unexpect_related() inside the loop to find a pair of consecutive ports, otherwise inserting expectations whose dead flag is already set on can happen. Bump master_help->expecting for the expectation class after checking if the expectation fits in the master expectation list, which is needed for this new _pair() function variant to run the eviction routine including the preallocated slot for the first expectation in the pair. Fixes: b8b09dc2bf35 ("netfilter: nf_conntrack_expect: use conntrack GC to reap expectations") Reported-by: Jaeyeong Lee <iostreampy@proton.me> Link: https://patch.msgid.link/178377968720.33756.12204817361601593230@proton.me/ Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
10 days	netfilter: nf_conntrack_sip: widen NAT rewrite delta to s32 in sip_help_tcp()	Xiang Mei
	sip_help_tcp() stores the size change of each NAT-rewritten SIP message in s16 diff and accumulates it in s16 tdiff, but a single message can grow by more than S16_MAX while the packet stays under the 65535 enlarge_skb() limit: nf_nat_sip() rewrites every matching URI, and a long Contact list expands the message by tens of kilobytes. diff then wraps, and "datalen = datalen + diff - msglen" yields a huge unsigned datalen, so the next iteration's ct_sip_get_header() reads past the linearized skb tail. Widen diff, tdiff and the seq_adjust hook to s32. Both are bounded by the 65535 byte packet limit, and the seqadj core is already s32 (nf_ct_seqadj_set() takes s32), so no previously accepted input is rejected. BUG: KASAN: use-after-free in ct_sip_get_header (net/netfilter/nf_conntrack_sip.c:464) Read of size 1 at addr ffff888010800000 by task ksoftirqd/1/25 ct_sip_get_header (net/netfilter/nf_conntrack_sip.c:464) sip_help_tcp (net/netfilter/nf_conntrack_sip.c:1694) nf_confirm (net/netfilter/nf_conntrack_proto.c:183) nf_hook_slow (net/netfilter/core.c:619) ip6_output (net/ipv6/ip6_output.c:246) ip6_forward (net/ipv6/ip6_output.c:690) ipv6_rcv (net/ipv6/ip6_input.c:351) __netif_receive_skb_one_core (net/core/dev.c:6212) process_backlog (net/core/dev.c:6676) __napi_poll (net/core/dev.c:7735) net_rx_action (net/core/dev.c:7955) handle_softirqs (kernel/softirq.c:622) run_ksoftirqd (kernel/softirq.c:1076) ... Fixes: f5b321bd37fb ("netfilter: nf_conntrack_sip: add TCP support") Reported-by: Weiming Shi <bestswngs@gmail.com> Link: https://patch.msgid.link/netfilter-devel/20260712234201.3213635-1-xmei5@asu.edu Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
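The wraparound can be reproduced with a small userspace model of the "datalen = datalen + diff - msglen" update. The function names and the numbers in the note below are illustrative assumptions, not values from the report. Converting an out-of-range value to `int16_t` is implementation-defined in C; common compilers wrap it, which is the behaviour assumed here.

```c
#include <stdint.h>

/* Model with a 16-bit delta: a growth above S16_MAX wraps negative. */
static unsigned int remaining_s16(unsigned int datalen, unsigned int origlen,
                                  unsigned int msglen)
{
    int16_t diff = (int16_t)(msglen - origlen);

    return datalen + diff - msglen;
}

/* Same update with the delta widened to 32 bits. */
static unsigned int remaining_s32(unsigned int datalen, unsigned int origlen,
                                  unsigned int msglen)
{
    int32_t diff = (int32_t)(msglen - origlen);

    return datalen + diff - msglen;
}
```

With a 1000-byte message that grows to 41000 bytes, the 32-bit variant leaves 0 bytes remaining, while the 16-bit variant yields a huge unsigned length because the delta of 40000 no longer fits.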
2026-07-10	netfilter: xt_physdev: masks are not c-strings	Florian Westphal
	... and must not be subjected to the 'nul terminated' constraint. If the interface name is 15 characters long, the mask is 16-bytes '0xff' (to cover for \0) and the valid device name is rejected. Fixes: 8df772afc9d0 ("netfilter: x_physdev: reject empty or not-nul terminated device names") Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/neutron/+bug/2159935 Signed-off-by: Florian Westphal <fw@strlen.de>
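A minimal sketch of why a mask fails a C-string check, assuming an interface name buffer of 16 bytes (IFNAMSIZ). The helper name `name_is_cstring` is hypothetical and only models the "non-empty and nul-terminated" constraint mentioned above.

```c
#include <stdbool.h>
#include <string.h>

#define IFNAMSIZ 16

/* Hypothetical check: a device name must be non-empty and contain a
 * terminating nul within the IFNAMSIZ-byte buffer. */
static bool name_is_cstring(const char name[IFNAMSIZ])
{
    return name[0] != '\0' && memchr(name, '\0', IFNAMSIZ) != NULL;
}
```

A 15-character name passes this check, but its mask is 16 bytes of 0xff with no nul byte, so applying the same check to the mask rejects a valid rule.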
2026-07-10	ipvs: fix more places with wrong ipv6 transport offsets	Julian Anastasov
	Sashiko reports more incorrect IPv6 transport offsets. The app code for TCP was assuming an IPv4 network header even after the ipvsh argument was provided. This can cause problems with apps over IPv6. As for the only official app in the kernel tree (FTP), this problem is harmless because we use Netfilter to mangle the FTP ports and we do not adjust the TCP seq numbers. Also, provide the correct offset of the ICMPv6 header in ip_vs_out_icmp_v6() for correct checksum checks when the IPv6 packet has extension headers. Fixes: d12e12299a69 ("ipvs: add ipv6 support to ftp") Fixes: 2a3b791e6e11 ("IPVS: Add/adjust Netfilter hook functions and helpers for v6") Cc: stable@vger.kernel.org Link: https://sashiko.dev/#/patchset/20260706101624.69471-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-10	ipvs: reload ip header after head reallocation	Florian Westphal
	__ip_vs_get_out_rt() calls skb_ensure_writable() which may reallocate skb->head. Fixes: 8d8e20e2d7bb ("ipvs: Decrement ttl") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
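The general pattern is the same as with `realloc()` in userspace: once the backing buffer may have moved, any cached header pointer has to be recomputed from its offset. The sketch below is a loose analogy with hypothetical names (`struct buf`, `buf_grow`, `buf_hdr`); it does not use any skb API.

```c
#include <stdlib.h>
#include <string.h>

struct buf {
    unsigned char *head;
    size_t len;
};

/* Grow the buffer. Like a head reallocation, this may move the data,
 * so pointers derived from the old head become invalid. */
static int buf_grow(struct buf *b, size_t new_len)
{
    unsigned char *p = realloc(b->head, new_len);

    if (!p)
        return -1;
    b->head = p;
    b->len = new_len;
    return 0;
}

/* Recompute the header pointer from its offset instead of caching it. */
static unsigned char *buf_hdr(const struct buf *b, size_t off)
{
    return off < b->len ? b->head + off : NULL;
}
```

A caller would call `buf_hdr()` again after every `buf_grow()` rather than keep using the pointer obtained before the call.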
2026-07-10	netfilter: flowtable: use correct direction to set up tunnel route	Pablo Neira Ayuso
	The layer 2 encapsulation and layer 3 tunnel information in the xmit path is taken from the other tuple, because the tunnel information that is included in the tuple for hashtable lookups is also used to perform the egress encapsulation in the transmit path. This patch uses the correct direction when setting up the tunnel, the original proposed patch to address this fix uses the reversed direction. While at it, remove the redundant check to call dst_release() to drop the reference on the dst that was obtained from the forward path, which is not useful in the direct xmit path unless tunneling is performed. Fixes: fa7395c02d95 ("netfilter: flowtable: support IPIP tunnel with direct xmit") Cc: stable@vger.kernel.org Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-10	netfilter: nf_conncount: fix zone comparison in tuple dedup	Yizhou Zhao
	The "already exists" dedup logic in __nf_conncount_add() decides whether a connection has already been counted and can be skipped instead of incrementing the connlimit count. It compares the conntrack zone of a list entry with the zone of the connection being added using nf_ct_zone_id() and nf_ct_zone_equal(), passing conn->zone.dir or zone->dir as the direction argument. Those helpers take enum ip_conntrack_dir values: IP_CT_DIR_ORIGINAL is 0 and IP_CT_DIR_REPLY is 1. However, zone->dir is a u8 bitmask: NF_CT_ZONE_DIR_ORIG is 1, NF_CT_ZONE_DIR_REPL is 2 and NF_CT_DEFAULT_ZONE_DIR is 3. Passing that bitmask as the enum direction shifts the meaning of every non-zero value. An ORIG-only zone passes 1 and is tested as REPLY, while REPL-only and default zones pass 2 or 3 and test bits beyond the valid direction range. In those cases nf_ct_zone_id() can fall back to NF_CT_DEFAULT_ZONE_ID instead of using the real zone id, so different zones can be treated as equal and dedup collapses to tuple equality alone. nf_conncount stores and compares the original-direction tuple for a connection. If an skb already has an attached conntrack entry, get_ct_or_tuple_from_skb() explicitly copies ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple, regardless of the packet's ctinfo. Therefore the zone comparison in the tuple dedup path must use IP_CT_DIR_ORIGINAL as well; the zone direction bitmask describes where a zone id applies, not which direction this conncount tuple represents. Fix the two dedup comparisons by passing IP_CT_DIR_ORIGINAL directly. Do not special-case NF_CT_DEFAULT_ZONE_DIR and do not compare raw zone ids: using the existing helpers with IP_CT_DIR_ORIGINAL preserves the direction-aware NF_CT_DEFAULT_ZONE_ID fallback. A default bidirectional zone contains the ORIG bit, so it naturally returns the real zone id; reply-only zones continue to fall back for original-direction tuple comparisons. 
Fixes: 21ba8847f857 ("netfilter: nf_conncount: Fix garbage collection with zones") Fixes: b36e4523d4d5 ("netfilter: nf_conncount: fix garbage collection confirm race") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
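The effect of passing the bitmask as a direction can be modelled in a few lines. This is a simplified userspace sketch: the type and function names are hypothetical, the constants mirror the values quoted in the message, and the default zone id is assumed to be 0.

```c
#include <stdint.h>

enum ct_dir { DIR_ORIGINAL = 0, DIR_REPLY = 1 };  /* IP_CT_DIR_* values */

#define ZONE_DIR_ORIG    1u  /* NF_CT_ZONE_DIR_ORIG */
#define ZONE_DIR_REPL    2u  /* NF_CT_ZONE_DIR_REPL */
#define ZONE_DIR_DEFAULT 3u  /* NF_CT_DEFAULT_ZONE_DIR */
#define DEFAULT_ZONE_ID  0u  /* assumed value of NF_CT_DEFAULT_ZONE_ID */

struct zone {
    uint16_t id;
    uint8_t dir;  /* bitmask of directions the id applies to */
};

/* Return the zone id if it applies to direction dir, else the default. */
static uint16_t zone_id(const struct zone *z, unsigned int dir)
{
    return (z->dir & (1u << dir)) ? z->id : DEFAULT_ZONE_ID;
}
```

For an ORIG-only zone with id 5, `zone_id(&z, z.dir)` tests the REPLY bit and falls back to the default id, whereas `zone_id(&z, DIR_ORIGINAL)` returns 5. A reply-only zone still falls back for the original direction, as the message states.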
2026-07-10	netfilter: ecache: fix inverted time_after() check	Yizhou Zhao
	ecache_work_evict_list() redelivers DESTROY events for conntracks that were moved to the per-netns dying_list after event delivery failed. It sets a 10ms deadline: stop = jiffies + ECACHE_MAX_JIFFIES but then tests: time_after(stop, jiffies) This condition is true while the deadline is still in the future, so the worker returns STATE_RESTART after the first successful redelivery in the usual case. ecache_work() maps STATE_RESTART to delay 0, which turns the redelivery path into one dying conntrack per workqueue dispatch and makes the sent > 16 batching/cond_resched() path effectively unreachable. A conntrack netlink listener whose receive queue is congested can make DESTROY event delivery fail with -ENOBUFS. With sustained conntrack churn, entries then accumulate on the dying_list and are only drained at the degraded one-entry-per-dispatch rate once delivery succeeds again, wasting CPU on back-to-back workqueue reschedules and prolonging conntrack memory/resource pressure. In a KASAN QEMU test with CONFIG_NF_CONNTRACK_EVENTS=y and nf_conntrack.enable_hooks=1, a congested DESTROY listener caused 8192 nf_ct_delete() calls to return false and move entries to the dying_list. After closing the listener, the unfixed kernel needed 7670 ecache_work() entries to destroy 7669 conntracks. With this change, the same 8192 entries were destroyed by 2 ecache_work() entries. Swap the comparison so the worker restarts only after the deadline has expired. Fixes: 2ed3bf188b33 ("netfilter: ecache: use dedicated list for event redelivery") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
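The inverted test is easy to see with a userspace copy of the wrap-safe comparison behind time_after(). The macro below follows the usual signed-difference form without the kernel's type checking, and the two function names are hypothetical.

```c
/* True if a is after b, tolerant of counter wraparound. */
#define time_after(a, b) ((long)((b) - (a)) < 0)

/* Inverted check: true while the deadline is still in the future. */
static int restart_inverted(unsigned long now, unsigned long stop)
{
    return time_after(stop, now);
}

/* Intended check: true only once the deadline has passed. */
static int restart_after_deadline(unsigned long now, unsigned long stop)
{
    return time_after(now, stop);
}
```

With `now = 1000` and `stop = 1010`, the inverted form asks for a restart immediately, while the intended form does so only once `now` exceeds 1010.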
2026-07-10	netfilter: xt_nat: reject unsupported target families	Wyatt Feng
	xt_nat SNAT and DNAT target handlers assume IP-family conntrack state is present and can dereference a NULL pointer when instantiated from an unsupported family through nft_compat. A bridge-family compat rule can therefore trigger a NULL-dereference in nf_nat_setup_info(). Reject non-IP families in xt_nat_checkentry() so unsupported targets cannot be installed. Keep NFPROTO_INET allowed for valid inet NAT compat users and leave the runtime fast path unchanged. [ The crash was fixed via 9dbba7e694ec ("netfilter: nft_compat: ebtables emulation must reject non-bridge targets"), so this patch is no longer critical. Nevertheless, NAT is only relevant for ipv4/ipv6, so this extra family check is a good idea in any case. ] Fixes: c7232c9979cb ("netfilter: add protocol independent NAT core") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
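A family check of this kind can be sketched as follows. The enum values mirror the kernel's NFPROTO_* numbering, but the function name is hypothetical and this is not the code of xt_nat_checkentry().

```c
#include <stdbool.h>

/* Subset of the NFPROTO_* values, redefined so the sketch builds alone. */
enum {
    NFPROTO_INET   = 1,
    NFPROTO_IPV4   = 2,
    NFPROTO_BRIDGE = 7,
    NFPROTO_IPV6   = 10,
};

/* Accept only the families for which NAT is meaningful. */
static bool nat_family_supported(unsigned int family)
{
    switch (family) {
    case NFPROTO_IPV4:
    case NFPROTO_IPV6:
    case NFPROTO_INET:
        return true;
    default:
        return false;
    }
}
```

A bridge-family rule is rejected when it is installed, so the packet path never sees it.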
2026-07-08	ipvs: ensure inner headers in ICMP errors are in headroom	Julian Anastasov
	Sashiko points out that after stripping the outer headers with pskb_pull() we should ensure the inner IP headers in ICMP errors from tunnels are present in the skb headroom for functions like ipv4_update_pmtu(), icmp_send() and IP_VS_DBG(). Also, add more checks for the length of the inner headers. Fixes: f2edb9f7706d ("ipvs: implement passive PMTUD for IPIP packets") Link: https://sashiko.dev/#/patchset/20260702073430.67680-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: use parsed transport offset in SCTP state lookup	Yizhou Zhao
	set_sctp_state() reads the SCTP chunk header again in order to drive the IPVS SCTP state table. For IPv6 it computes the offset with sizeof(struct ipv6hdr), while the surrounding IPVS code uses iph.len from ip_vs_fill_iph_skb(), where ipv6_find_hdr() has already skipped extension headers and found the real transport header. This makes the state machine read from the wrong offset for IPv6 SCTP packets that carry extension headers. For example, an INIT packet with an 8-byte destination options header can be scheduled correctly by sctp_conn_schedule(), but set_sctp_state() reads the first byte of the SCTP verification tag as a DATA chunk type. The connection then moves from NONE to ESTABLISHED instead of INIT1, gets the longer established timeout, and updates the active/inactive destination counters incorrectly. This happens even though the SCTP handshake has not completed. Use the parsed transport offset passed down from ip_vs_set_state() for the SCTP chunk-header lookup. For IPv4 and IPv6 packets without extension headers this preserves the existing offset. Fixes: 2906f66a5682 ("ipvs: SCTP Trasport Loadbalancing Support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/20260705123040.35755-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
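The INIT example can be reproduced on a flat byte buffer. This is a sketch under stated assumptions: a 40-byte IPv6 header, the 8-byte destination options header from the message, a 12-byte SCTP common header, and an INIT chunk whose verification tag is zero. The helper names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define IPV6_HDR_LEN  40  /* fixed IPv6 header */
#define DSTOPTS_LEN    8  /* extension header from the example */
#define SCTP_HDR_LEN  12  /* ports, verification tag, checksum */
#define SCTP_CID_DATA  0
#define SCTP_CID_INIT  1

/* Zero the packet and place an INIT chunk type after the SCTP common
 * header. pkt must hold at least 64 bytes. Returns the real transport
 * offset, i.e. the offset after the extension header. */
static size_t build_init_packet(unsigned char *pkt, size_t len)
{
    size_t transport_off = IPV6_HDR_LEN + DSTOPTS_LEN;

    memset(pkt, 0, len);
    pkt[transport_off + SCTP_HDR_LEN] = SCTP_CID_INIT;
    return transport_off;
}

/* Read the first chunk type, given where the transport header starts. */
static unsigned char first_chunk_type(const unsigned char *pkt,
                                      size_t transport_off)
{
    return pkt[transport_off + SCTP_HDR_LEN];
}
```

Reading with `IPV6_HDR_LEN` as the transport offset lands on byte 52, the first byte of the verification tag, which is 0 and so looks like a DATA chunk. Reading with the parsed offset of 48 returns the INIT type.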
2026-07-08	ipvs: use parsed transport offset in TCP state lookup	Yizhou Zhao
	TCP state handling reparses the skb to find the TCP header. For IPv6 it uses sizeof(struct ipv6hdr), while the surrounding IPVS code already parsed the packet with ip_vs_fill_iph_skb() and has the real transport-header offset in iph.len. This makes TCP state handling look at the wrong bytes when an IPv6 packet carries extension headers. Use the parsed transport offset passed down from ip_vs_set_state() when reading the TCP header. For IPv4 and for IPv6 packets without extension headers, the passed offset matches the previous value. Fixes: 0bbdd42b7efa6 ("IPVS: Extend protocol DNAT/SNAT and state handlers") Link: https://lore.kernel.org/netdev/20260705125659.37744-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: pass parsed transport offset to state handlers	Yizhou Zhao
	IPVS callers already parse the packet into struct ip_vs_iphdr before updating connection state. For IPv6 this records the real transport-header offset after extension headers in iph.len. Pass this parsed transport offset through ip_vs_set_state() and the protocol state_transition() callback so protocol handlers can use the same packet context as scheduling and NAT handling. This patch only changes the common callback plumbing and adapts the protocol callback signatures; TCP and SCTP start using the value in follow-up patches. Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: handle unreadable frags	Florian Westphal
	sashiko reports: When an skb with unreadable fragments (such as from devmem TCP, where skb_frags_readable(skb) returns false) is processed by the u32 module, skb_copy_bits() will safely return a negative error code [..] xt_u32: bail out with hotdrop in this case. gather_frags: return -1, just as if we had no fragment header. nfnetlink_queue: restrict to the linear part. nfnetlink_log: restrict to the linear part. v2: - skb_zerocopy helpers don't copy readable flag, i.e. nfnetlink_queue is broken too xt_u32 shouldn't return true if hotdrop was set. Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags") Cc: stable@vger.kernel.org Acked-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: flowtable: support IPIP tunnel with direct xmit	Pablo Neira Ayuso
	The combination of IPIP tunnel with direct xmit, eg. bridge device, breaks because no dst_entry is provided to check the skb headroom and to set the iph->frag_off field. This leads to invalid dst usage and can trigger a crash in the tunnel transmit path. Fix this by moving dst_cache and dst_cookie out of the runtime union so that they can be shared by neighbour, xfrm, and direct tunnel flows. For FLOW_OFFLOAD_XMIT_DIRECT tuples carrying tunnel metadata, preserve route state in these shared fields and release it through the common dst release path. Since dst_entry is now available to the three supported xmit modes and dst_release() already deals with NULL dst, remove the xmit type check in nft_flow_dst_release(). Moreover, skip the check if the dst entry is NULL in nf_flow_dst_check() which is now the case for the direct xmit case. Based on patch from Rein Wei <n05ec@lzu.edu.cn>. Fixes: d30301ba4b07 ("netfilter: flowtable: Add IPIP tx sw acceleration") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Zhengyang Chen <chzhengyang2023@lzu.edu.cn> Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: flowtable: IPIP tunnel hardware offload is not yet support	Pablo Neira Ayuso
	No driver supports IPIP tunnels yet, so give up early on setting up the hardware offload for this scenario. This patch adds a stub that can be enhanced to cover more configurations that are currently not supported. As of now, the offload work is enqueued to the worker, then ignored if the hardware offload configuration is not supported. Check the NF_FLOW_HW flag to know if offloading this entry was already tried once, so that it is not retried on refresh when unsupported. Move the NF_FLOW_HW flag check to nf_flow_offload_add(). If the NF_FLOW_HW flag is unset, the _del and _stats variants are never called. This can be updated later on to skip queueing the hardware offload work in case hardware offload does not support it. Fixes: d98103575dcd ("netfilter: flowtable: Add IP6IP6 rx sw acceleration") Fixes: ab427db17885 ("netfilter: flowtable: Add IPIP rx sw acceleration") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported-by: Zhengyang Chen <chzhengyang2023@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: flowtable: use dst in this direction when pushing IPIP header	Pablo Neira Ayuso
	When pushing the IPIP header, the route of the other direction is used to calculate the headroom, use the route in this direction. Accessing the other tuple to set the IP source and destination is fine because this tuple does not provide such information to avoid storing redundant information. However, this tuple already provides the dst for this direction, this went unnoticed because this bug affects headroom and iph->frag_off only at this stage. Fixes: d30301ba4b07 ("netfilter: flowtable: Add IPIP tx sw acceleration") Fixes: 93cf357fa797 ("netfilter: flowtable: Add IP6IP6 tx sw acceleration") Cc: stable@vger.kernel.org Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: ipset: allocate the proper memory for the generic hash structure	Jozsef Kadlecsik
	Because a single create function is emitted for every hash type, the last of the IPv4 and IPv6 generic hash structure definitions, i.e. the IPv6 one, was in effect for IPv4 too. Use the proper size when allocating the structure. Add a comment noting that because create() refers to elements of the generic hash structure, all referred ones must come before the IPv4/IPv6 dependent 'next' member. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: ipset: cleanup the add/del backlog when resize failed	Jozsef Kadlecsik
	Sashiko pointed out that the add/del backlog was not cleaned up when resize failed. Fix it in the corresponding error path. Also, make sure that the add/del backlog is htable-specific so when resize creates a new htable, old/new backlog can't be mixed up. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: ipset: exclude gc when resize is in progress	Jozsef Kadlecsik
	Zhengchuan Liang and Eulgyu Kim reported that because resize does not copy the comment extension into the resized set but uses its pointer, ongoing gc can free the extension in the original set, which then results in a stale pointer in the resized one. The proposed patch was to recreate the extensions for every element in the resized set. That is both expensive and wastes memory, so it is better to exclude gc when a resize in progress is detected: resizing will destroy the original set anyway, so doing gc on it is unnecessary. Introduce a new spinlock to exclude parallel gc and resize. Because we just set and check a bool value, there's no need for the parameter to be atomic_t; also rename it for better readability. Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Reported by: Zhengchuan Liang <zcliangcn@gmail.com> Reported by: Eulgyu Kim <eulgyukim@snu.ac.kr> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: ipset: mark the rcu locked areas properly	Jozsef Kadlecsik
	When we bump the uref counter, there's no need to keep the rcu lock because the referred hash table can't disappear. Also, for the same reason, in mtype_gc we need the rcu lock and not a spinlock. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	netfilter: nft_lookup: fix catchall element handling with inverted lookups	Tamaki Yanagawa
	nft_lookup_eval() decides whether a lookup matched (`found`) from the direct set lookup and priv->invert before falling back to the catchall element used by interval sets (e.g. nft_set_rbtree) for the open-ended default range. Since `found` is never recomputed after `ext` is replaced by the catchall lookup, inverted lookups (NFT_LOOKUP_F_INV, "!= @set") can wrongly match or wrongly skip the catchall element, producing the wrong verdict. Fold the catchall lookup into `ext` before computing `found`, matching the order already used by nft_objref_map_eval(). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Tamaki Yanagawa <ty@000ty.net> Assisted-by: Claude:claude-sonnet-5 Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	netfilter: xt_connmark: reject invalid shift parameters	Wyatt Feng
	Revision 2 of the CONNMARK target accepts user-controlled shift parameters and applies them to 32-bit mark values in connmark_tg_shift(). A shift_bits value of 32 or more triggers an undefined-shift bug when the rule is evaluated. Invalid shift_dir values are also accepted and silently fall back to the left-shift path. Reject invalid revision-2 shift parameters in connmark_tg_check() so malformed rules fail at installation time, before they can reach the packet path. Fixes: 472a73e00757 ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <dstsmallbird@foxmail.com> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Reviewed-by: Ren Wei <enjou1224z@gmail.com> Reviewed-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	ipvs: reset full ip_vs_seq structs in ip_vs_conn_new	Yizhou Zhao
	Commit 9a05475cebdd ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") changed ip_vs_conn_new() to allocate an ip_vs_conn object with kmem_cache_alloc(). The function then initializes many fields explicitly, but only resets in_seq.delta and out_seq.delta in the two struct ip_vs_seq members. That leaves init_seq and previous_delta uninitialized. This is normally harmless while the corresponding IP_VS_CONN_F_IN_SEQ or IP_VS_CONN_F_OUT_SEQ flag is clear. For connections learned from a sync message, however, ip_vs_proc_conn() preserves those flags from IP_VS_CONN_F_BACKUP_MASK and passes opt=NULL when the message omits IPVS_OPT_SEQ_DATA. In that case the new connection can be hashed with SEQ flags set but with the rest of in_seq/out_seq still containing stale slab data. When a packet for such a connection is later handled by an IPVS application helper, vs_fix_seq() and vs_fix_ack_seq() use previous_delta and init_seq to rewrite TCP sequence numbers. A malformed sync message can therefore make forwarded packets carry stale slab bytes in their TCP seq/ack numbers, and can also corrupt the forwarded TCP flow. Reset both struct ip_vs_seq members completely before publishing the connection. This matches the existing "reset struct ip_vs_seq" comment and keeps the sequence-adjustment gates inactive unless valid sequence data is installed later. Fixes: 9a05475cebdd ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	ipvs: fix PMTU for GUE/GRE tunnel ICMP errors	Yizhou Zhao
	When an ICMP Fragmentation Needed error is received for a tunneled IPVS connection, ip_vs_in_icmp() recomputes the MTU that the original packet can use by subtracting the tunnel overhead from the reported next-hop MTU. The current code always subtracts sizeof(struct iphdr), which is only the IPIP overhead. For GUE and GRE tunnels, ipvs_udp_decap() and ipvs_gre_decap() already compute the additional tunnel header length, but that value is scoped to the decapsulation block and is lost before the ICMP_FRAG_NEEDED handling. As a result, the ICMP error sent back to the client advertises an MTU that is too large, so PMTUD can fail to converge for GUE/GRE-tunneled real servers. With a reported next-hop MTU of 1400, a GUE tunnel currently returns 1380 to the client. The correct value is 1368: 1400 - sizeof(struct iphdr) - sizeof(struct udphdr) - sizeof(struct guehdr) Hoist the tunnel header length into the main ip_vs_in_icmp() scope and subtract sizeof(struct iphdr) + ulen in the Fragmentation Needed path. The IPIP path keeps ulen as 0, so its existing 1400 - 20 = 1380 result is unchanged. Fixes: 508f744c0de3 ("ipvs: strip udp tunnel headers from icmp errors") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
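The arithmetic in the commit message above can be checked with a small userspace sketch. The header sizes are hard-coded assumptions (a 20-byte IPv4 header without options, an 8-byte UDP header, and a 4-byte base GUE header), and `tunnel_client_mtu` is a hypothetical helper written for illustration, not the kernel function.

```c
/* Assumed on-wire sizes; the kernel code uses sizeof() on its own structs. */
enum {
    IPHDR_LEN  = 20, /* struct iphdr, no IP options */
    UDPHDR_LEN = 8,  /* struct udphdr */
    GUEHDR_LEN = 4   /* struct guehdr, base header only */
};

/*
 * Hypothetical helper: MTU to advertise to the client for a tunneled
 * connection. ulen is the tunnel header length beyond the outer IP
 * header (0 for IPIP, UDP + GUE for a GUE tunnel).
 */
unsigned int tunnel_client_mtu(unsigned int next_hop_mtu, unsigned int ulen)
{
    return next_hop_mtu - IPHDR_LEN - ulen;
}
```

With a next-hop MTU of 1400, `tunnel_client_mtu(1400, 0)` gives the IPIP value and `tunnel_client_mtu(1400, UDPHDR_LEN + GUEHDR_LEN)` gives the GUE value described in the message.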
2026-07-03	netfilter: nft_set_rbtree: get command skips end element with open interval	Pablo Neira Ayuso
	The get command on intervals provides partial matches such as subranges for usability reasons. However, an open interval has no closing end element. If the closing element matches within the range of the open interval, i.e. its closest match is the start element of the open range, then return 0 but offer no matching element to userspace through netlink as a special case. Userspace provides at least a matching start element in this case and the closing end element matching the open interval is ignored. Another possibility is to report the matching start element of the open interval for this end interval. However, this results in duplicated matches being listed in userspace because userspace does not expect a start element as response to an end element. Fixes: 2aa34191f06f ("netfilter: nft_set_rbtree: use binary search array in get command") Reported-by: Melbin K Mathew <mlbnkm1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master on updates	Pablo Neira Ayuso
	Really cap it to NF_CT_EXPECT_MAX_CNT (255) on updates. The commit ("netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master") only covers creation of helpers, not updates. Fixes: 397c8300972f ("netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	netfilter: xt_rateest: fix u64 truncation in xt_rateest_mt()	Feng Wu
	On links faster than ~34 Gbps, where byte rate may exceed 2^32-1 (~ 4.3 GBps), the comparison result becomes incorrect because the truncated value no longer reflects the actual estimator rate. Fix by changing the local variables to u64. Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: Feng Wu <wufengwufengwufeng@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
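A minimal sketch of the truncation effect, assuming a 64-bit estimator rate compared against a 32-bit threshold. Both function names are hypothetical and only contrast a 32-bit local with a 64-bit one; they are not the code in xt_rateest_mt().

```c
#include <stdbool.h>
#include <stdint.h>

/* Buggy pattern: the 64-bit rate is narrowed to 32 bits before comparing. */
bool rate_gt_truncated(uint64_t bps, uint32_t limit)
{
    uint32_t local = (uint32_t)bps; /* upper 32 bits are dropped here */

    return local > limit;
}

/* Fixed pattern: keep the full 64-bit value in the local variable. */
bool rate_gt_u64(uint64_t bps, uint32_t limit)
{
    uint64_t local = bps;

    return local > limit;
}
```

For a rate of 5000000000 bytes per second and a limit of 1000000000, the truncated variant compares 705032704 against the limit and reports "not greater", while the 64-bit variant reports "greater".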
2026-07-03	netfilter: xt_u32: reject invalid shift counts	Wyatt Feng
	u32_match_it() executes rule-supplied shift operands on a 32-bit value. A malformed u32 rule can provide a shift count of 32 or more, triggering an undefined shift out-of-bounds during packet evaluation. Validate XT_U32_LEFTSH and XT_U32_RIGHTSH operands in u32_mt_checkentry() and reject malformed rules before they reach the packet path. Fixes: 1b50b8a371e9 ("[NETFILTER]: Add u32 match") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Assisted-by: Codex:GPT-5.4 Signed-off-by: Wyatt Feng <bronzed_45_vested@icloud.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
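The install-time check described above might look roughly like the following userspace sketch. `shift_count_valid` and `apply_left_shift` are hypothetical names, and the real validation in u32_mt_checkentry() walks the rule's operand list, which is not shown here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Shifting a 32-bit value by 32 or more is undefined behaviour in C,
 * so such counts are rejected when the rule is installed.
 */
bool shift_count_valid(uint32_t count)
{
    return count < 32;
}

/* Stand-in for the packet path: only call this with a validated count. */
uint32_t apply_left_shift(uint32_t val, uint32_t count)
{
    return val << count;
}
```

Rejecting the rule up front keeps the per-packet path free of range checks, which is the same approach the CONNMARK fix in this log takes.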
2026-07-03	netfilter: nf_nat_sip: reload possible stale data pointer	Florian Westphal
	quoting sashiko: ------------------------------------------------------------------------ [..] noticed a potential memory bug and header corruption involving the SIP NAT helper. In net/netfilter/nf_nat_sip.c:nf_nat_sip(): if (skb_ensure_writable(skb, skb->len)) { nf_ct_helper_log(skb, ct, "cannot mangle packet"); return NF_DROP; } uh = (void *)skb->data + protoff; uh->dest = ct_sip_info->forced_dport; if (!nf_nat_mangle_udp_packet(skb, ct, ctinfo, protoff, 0, 0, NULL, 0)) { If a cloned or fragmented SKB is reallocated by skb_ensure_writable(), the old data buffer is freed. However, nf_nat_sip() fails to update dptr to point to the new buffer. It also appears to use nf_nat_mangle_udp_packet() on what could be a TCP packet, which would overwrite the sequence number with a checksum update. ------------------------------------------------------------------------ nf_conntrack_sip linearizes skbs, hence no fragmented skb can be seen. But clones are possible, so rebuild dptr. Disable nf_nat_mangle_udp_packet() branch for TCP streams. It doesn't look like this can ever happen, else we should have received bug reports about this, so just check the conntrack is UDP and drop otherwise. The calling conntrack_sip sets ->forced_dport for SIP_HDR_VIA_UDP messages, so I don't think this is ever expected to be true for a TCP stream. Fixes: 7266507d8999 ("netfilter: nf_ct_sip: support Cisco 7941/7945 IP phones") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-02	Merge tag 'net-7.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Linus Torvalds
	Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and batman-adv. Current release - new code bugs: - netfilter: cthelper: cap to maximum number of expectation per master Previous releases - regressions: - netpoll: fix a use-after-free on shutdown path - tcp: restore RCU grace period in tcp_ao_destroy_sock - ipv6: fix NULL deref in fib6_walk_continue() on multi-batch dump - batman-adv: dat: ensure accessible eth_hdr proto field - eth: - virtio_net: disable cb when NAPI is busy-polled - lan743x: Initialize eth_syslock spinlock before use Previous releases - always broken: - netfilter: - nft_set_pipapo: don't leak bad clone into future transaction - sched: - sch_teql: Introduce slaves_lock to avoid race condition and UAF - replace direct dequeue call with peek and qdisc_dequeue_peeked - sctp: add INIT verification after cookie unpacking - tipc: fix out-of-bounds read in broadcast Gap ACK blocks - seg6: validate SRH length before reading fixed fields - eth: - mlx5e: fix use-after-free of metadata_dst on RX SC delete - enetc: check the number of BDs needed for xdp_frame - fbnic: don't cache shinfo across skb realloc" * tag 'net-7.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits) net/mlx5: HWS, fix matcher leak on resize target setup failure net/sched: hhf: clear heavy-hitter state on reset net/sched: dualpi2: clear stale classification on filter miss net/sched: act_bpf: use rcu_dereference_bh() to read the filter selftests: drv-net: tso: don't touch dangerous feature bits cxgb4: Fix decode strings dump for T6 adapters virtio_net: disable cb when NAPI is busy-polled sctp: fix addr_wq_timer race in sctp_free_addr_wq() selftests: net: bump default cmd() timeout to 20 seconds bridge: stp: Fix a potential use-after-free when deleting a bridge net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF net: gianfar: dispose irq mappings on probe failure and device removal net: lan743x: Initialize eth_syslock spinlock before use net: libwx: fix VMDQ mask for 1-queue mode net: airoha: fix max receive size configuration fsl/fman: Free init resources on KeyGen failure in fman_init() netfilter: nftables: restrict checkum update offset netfilter: nftables: restrict linklayer and network header writes netfilter: nfnetlink_queue: restrict writes to network header netfilter: nft_fib: reject fib expression on the netdev egress hook ...
2026-06-30	netfilter: nftables: restrict checkum update offset	Florian Westphal
	After previous patch, writes to network header are restricted. However, there is another way to manipulate the l3 header: The checksum update function. Restrict this for network header writes, only the ipv4 header is allowed. This needs run-time checks because BRIDGE, INET, NETDEV families can carry l3 headers other than IP. checksum updates to the udp/tcp (l4) headers are not restricted. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nftables: restrict linklayer and network header writes	Florian Westphal
	Don't permit arbitrary writes to linklayer and network header data. Several spots in the network stack trust header validation performed in ipv4/ipv6 before PRE_ROUTING hook. For linklayer, allow writes for netdev ingress. For other hooks, only allow link layer writes that do not spill into network header. For network header, check the offset/length combinations: - changing dscp requires store at offset 0 for checksum fixups, so make sure ip version + length field isn't altered. - ip6 dscp starts directly after the version field, so make sure it remains 6. Several of these checks could already be done at rule insertion time. Risk is that this might cause ruleset load failures for existing rulesets. With this change such writes are silently skipped and packet passes unchanged. Transport and inner header bases are not checked / restricted. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nfnetlink_queue: restrict writes to network header	Florian Westphal
	nfnetlink_queue doesn't allow selective replacements of some part of the payload, only complete replacement. If the new data is shorter, skb is trimmed, otherwise expanded. Add minimal validation of the new ip/ipv6 header. Check total len matches skb length. Disallow ip option modifications. IPv6 extension headers are also disabled. IP options and exthdrs could be allowed later after validation pass or ip option recompile. Transport header is not checked. Bridge modifications are rejected. Given userspace doesn't even receive L2 headers, use is limited and I don't think there are any users of bridge nfnetlink_queue, let alone users that modify payload. Arp isn't supported at all. Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nft_fib: reject fib expression on the netdev egress hook	Theodor Arsenij Larionov-Trichkine
	A fib expression in a netdev egress base chain dereferences nft_in(pkt), NULL on the transmit path, causing a NULL pointer dereference at eval. nft_fib_validate() masks the hook with NF_INET_* values, but netdev hook numbers are a separate enum that aliases them (NF_NETDEV_EGRESS == NF_INET_LOCAL_IN), so an egress chain passes validation and then faults. Add nft_fib_netdev_validate() that limits each result/flag to the netdev hook where the device it reads exists: the input-device cases (OIF, OIFNAME, ADDRTYPE with F_IIF) to ingress, the output-device case (ADDRTYPE with F_OIF) to egress, ADDRTYPE with no device flag to both. Also restrict nft_fib_validate() to NFPROTO_IPV4/IPV6/INET so its NF_INET_* masks are not applied to another family's hooks. Fixes: 42df6e1d221d ("netfilter: Introduce egress hook") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netfilter-devel/ajxsjcDOnwllMfoR@strlen.de/ Signed-off-by: Theodor Arsenij Larionov-Trichkine <theodorlarionov@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
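The aliasing problem described above can be illustrated with a small sketch. The `DEMO_*` enumerators and `hook_allowed` are hypothetical stand-ins; the commit message only states that NF_NETDEV_EGRESS and NF_INET_LOCAL_IN compare equal, and the concrete value 1 used here is an assumption.

```c
#include <stdbool.h>

/* Stand-ins for two separate hook enums whose values overlap (assumed values). */
enum demo_inet_hook   { DEMO_INET_PRE_ROUTING = 0, DEMO_INET_LOCAL_IN = 1 };
enum demo_netdev_hook { DEMO_NETDEV_INGRESS = 0, DEMO_NETDEV_EGRESS = 1 };

/* Generic bitmask test: is bit number 'hook' set in 'allowed_mask'? */
bool hook_allowed(unsigned int hook, unsigned int allowed_mask)
{
    return ((1u << hook) & allowed_mask) != 0;
}
```

A mask built from `DEMO_INET_LOCAL_IN` also accepts `DEMO_NETDEV_EGRESS`, because the check sees only the number. This is why the fix restricts the NF_INET_* masks to the families whose hooks they describe.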
2026-06-30	netfilter: nfnetlink_cthelper: cap to maximum number of expectation per master	Pablo Neira Ayuso
	If a userspace helper policy update sets the maximum number of expectations to zero, cap it to NF_CT_EXPECT_MAX_CNT (255) on updates too. Fixes: 397c8300972f ("netfilter: nf_conntrack_helper: cap maximum number of expectation at helper registration") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nf_conntrack_sip: validate skb_dst() before accessing it	Pablo Neira Ayuso
	tc ingress and openvswitch do not guarantee routing information to be available. These subsystems use the conntrack helper infrastructure, and the SIP helper relies on the skb_dst() to be present if sip_external_media is set to 1 (which is disabled by default as a module parameter). This effectively disables the sip_external_media toggle for these subsystems without resulting in a crash. Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Cc: stable@vger.kernel.org Reported-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: ipset: fix race between dump and ip_set_list resize	Xiang Mei
	The release path of ip_set_dump_do() and ip_set_dump_done() read inst->ip_set_list via ip_set_ref_netlink(), a plain rcu_dereference_raw() of the array pointer. These run from netlink_recvmsg() without the nfnl mutex and without an RCU read-side critical section. A concurrent ip_set_create() can grow the array: it publishes the new array, calls synchronize_net() and then kvfree()s the old one. Since the dump paths read the array outside any RCU reader, synchronize_net() does not wait for them and the old array can be freed while they still index into it, causing a use-after-free. The dumped set itself stays pinned via set->ref_netlink, so only the array load needs protecting. Take rcu_read_lock() around it, matching ip_set_get_byname() and __ip_set_put_byindex(). BUG: KASAN: slab-use-after-free in ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1697) Read of size 8 at addr ffff88800b5c4018 by task exploit/150 Call Trace: ... kasan_report (mm/kasan/report.c:595) ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1697) netlink_dump (net/netlink/af_netlink.c:2325) netlink_recvmsg (net/netlink/af_netlink.c:1976) sock_recvmsg (net/socket.c:1159) __sys_recvfrom (net/socket.c:2315) ... Oops: general protection fault, probably for non-canonical address ... KASAN NOPTI KASAN: maybe wild-memory-access in range [0x02d6...d0-0x02d6...d7] RIP: 0010:ip_set_dump_do (net/netfilter/ipset/ip_set_core.c:1698) Kernel panic - not syncing: Fatal exception Fixes: 8a02bdd50b2e ("netfilter: ipset: Fix calling ip_set() macro at dumping") Cc: stable@vger.kernel.org Reported-by: Weiming Shi <bestswngs@gmail.com> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Xiang Mei <xmei5@asu.edu> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nft_set_pipapo: don't leak bad clone into future transaction	Florian Westphal
	On memory allocation failure the cloned nft_pipapo_match can enter a bad state: - some fields can have their lookup tables resized while others did not - bits might have been toggled - scratch map can be undersized which also means m->bsize_max can be lower than what is required This means that the next insertion in the same batch can trigger out-of-bounds writes. Furthermore, a failure in the first insertion can cause the bad clone to leak into the next transaction because the abort callback is never executed in this case (the upper layer saw an error and no attempt to allocate a transactional request was made). Record a state for the nft_pipapo_match structure: - NEW (pristine clone) - MOD (modified clone with good state) - ERR (potentially bogus content) Then make it so that deletes and insertions fail when the clone entered ERR state. In case the very first insert attempt results in an error, free the clone right away. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Cc: stable@vger.kernel.org Reported-and-tested-by: Seesee <cjc000013@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-30	netfilter: nf_conntrack_expect: zero at allocation time	Florian Westphal
	There are occasional LLM hints wrt. leaking uninitialized data to userspace via ctnetlink. Just zero at allocation time, expectations are not frequently used these days. Intentionally keeps _init as-is because we could theoretically support re-init, so add the missing exp->dir there. Signed-off-by: Florian Westphal <fw@strlen.de>