linux-stable.git/net/netfilter, branch v6.18.2

netfilter: nft_connlimit: update the count if add was skipped

2025-12-18T13:03:26+00:00

[ Upstream commit 69894e5b4c5e28cda5f32af33d4a92b7a4b93b0e ]

Connlimit expression can be used for all kind of packets and not only
for packets with connection state new. See this ruleset as example:

table ip filter {
        chain input {
                type filter hook input priority filter; policy accept;
                tcp dport 22 ct count over 4 counter
        }
}

Currently, if the connection count goes over the limit the counter will
count the packets. When a connection is closed, the connection count
won't decrement as it should because it is only updated for new
connections due to an optimization on __nf_conncount_add() that prevents
updating the list if the connection is duplicated.

To solve this problem, check whether the connection was skipped and if
so, update the list. Adjust count_tree() too so the same fix is applied
for xt_connlimit.

Fixes: 976afca1ceba ("netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup")
Closes: https://lore.kernel.org/netfilter/trinity-85c72a88-d762-46c3-be97-36f10e5d9796-1761173693813@3c-app-mailcom-bs12/
Signed-off-by: Fernando Fernandez Mancera 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Sasha Levin

netfilter: nf_conncount: rework API to use sk_buff directly

2025-12-18T13:03:26+00:00

[ Upstream commit be102eb6a0e7c03db00e50540622f4e43b2d2844 ]

When using nf_conncount infrastructure for non-confirmed connections a
duplicated track is possible due to an optimization introduced since
commit d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC").

In order to fix this introduce a new conncount API that receives
directly an sk_buff struct.  It fetches the tuple and zone and the
corresponding ct from it. It comes with both existing conncount variants
nf_conncount_count_skb() and nf_conncount_add_skb(). In addition remove
the old API and adjust all the users to use the new one.

This way, for each sk_buff struct it is possible to check if there is a
ct present and already confirmed. If so, skip the add operation.

Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC")
Signed-off-by: Fernando Fernandez Mancera 
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Sasha Levin

netfilter: flowtable: check for maximum number of encapsulations in bridge vlan

2025-12-18T13:03:26+00:00

[ Upstream commit 634f3853cc98d73bdec8918010ee29b06981583e ]

Add a sanity check to skip path discovery if the maximum number of
encapsulation is reached. While at it, check for underflow too.

Fixes: 26267bf9bb57 ("netfilter: flowtable: bridge vlan hardware offload and switchdev")
Signed-off-by: Pablo Neira Ayuso 
Signed-off-by: Sasha Levin

netfilter: nft_ct: add seqadj extension for natted connections

2025-10-29T13:47:59+00:00

Sequence adjustment may be required for FTP traffic with PASV/EPSV modes.
due to need to re-write packet payload (IP, port) on the ftp control
connection. This can require changes to the TCP length and expected
seq / ack_seq.

The easiest way to reproduce this issue is with PASV mode.
Example ruleset:
table inet ftp_nat {
        ct helper ftp_helper {
                type "ftp" protocol tcp
                l3proto inet
        }

        chain prerouting {
                type filter hook prerouting priority 0; policy accept;
                tcp dport 21 ct state new ct helper set "ftp_helper"
        }
}
table ip nat {
        chain prerouting {
                type nat hook prerouting priority -100; policy accept;
                tcp dport 21 dnat ip prefix to ip daddr map {
			192.168.100.1 : 192.168.13.2/32 }
        }

        chain postrouting {
                type nat hook postrouting priority 100 ; policy accept;
                tcp sport 21 snat ip prefix to ip saddr map {
			192.168.13.2 : 192.168.100.1/32 }
        }
}

Note that the ftp helper gets assigned *after* the dnat setup.

The inverse (nat after helper assign) is handled by an existing
check in nf_nat_setup_info() and will not show the problem.

Topoloy:

 +-------------------+     +----------------------------------+
 | FTP: 192.168.13.2 | <-> | NAT: 192.168.13.3, 192.168.100.1 |
 +-------------------+     +----------------------------------+
                                      |
                         +-----------------------+
                         | Client: 192.168.100.2 |
                         +-----------------------+

ftp nat changes do not work as expected in this case:
Connected to 192.168.100.1.
[..]
ftp> epsv
EPSV/EPRT on IPv4 off.
ftp> ls
227 Entering passive mode (192,168,100,1,209,129).
421 Service not available, remote server has closed connection.

Kernel logs:
Missing nfct_seqadj_ext_add() setup call
WARNING: CPU: 1 PID: 0 at net/netfilter/nf_conntrack_seqadj.c:41
[..]
 __nf_nat_mangle_tcp_packet+0x100/0x160 [nf_nat]
 nf_nat_ftp+0x142/0x280 [nf_nat_ftp]
 help+0x4d1/0x880 [nf_conntrack_ftp]
 nf_confirm+0x122/0x2e0 [nf_conntrack]
 nf_hook_slow+0x3c/0xb0
 ..

Fix this by adding the required extension when a conntrack helper is assigned
to a connection that has a nat binding.

Fixes: 1a64edf54f55 ("netfilter: nft_ct: add helper set support")
Signed-off-by: Andrii Melnychenko 
Signed-off-by: Florian Westphal

netfilter: nft_connlimit: fix possible data race on connection count

2025-10-29T13:47:59+00:00

nft_connlimit_eval() reads priv->list->count to check if the connection
limit has been exceeded. This value is being read without a lock and can
be modified by a different process. Use READ_ONCE() for correctness.

Fixes: df4a90250976 ("netfilter: nf_conncount: merge lookup and add functions")
Signed-off-by: Fernando Fernandez Mancera 
Signed-off-by: Florian Westphal

netfilter: nft_ct: enable labels for get case too

2025-10-29T13:47:59+00:00

conntrack labels can only be set when the conntrack has been created
with the "ctlabel" extension.

For older iptables (connlabel match), adding an "-m connlabel" rule
turns on the ctlabel extension allocation for all future conntrack
entries.

For nftables, its only enabled for 'ct label set foo', but not for
'ct label foo' (i.e. check).
But users could have a ruleset that only checks for presence, and rely
on userspace to set a label bit via ctnetlink infrastructure.

This doesn't work without adding a dummy 'ct label set' rule.
We could also enable extension infra for the first (failing) ctnetlink
request, but unlike ruleset we would not be able to disable the
extension again.

Therefore turn on ctlabel extension allocation if an nftables ruleset
checks for a connlabel too.

Fixes: 1ad8f48df6f6 ("netfilter: nftables: add connlabel set support")
Reported-by: Antonio Ojea 
Closes: https://lore.kernel.org/netfilter-devel/aPi_VdZpVjWujZ29@strlen.de/
Signed-off-by: Florian Westphal

netfilter: nft_objref: validate objref and objrefmap expressions

2025-10-08T11:17:25+00:00

Referencing a synproxy stateful object from OUTPUT hook causes kernel
crash due to infinite recursive calls:

BUG: TASK stack guard page was hit at 000000008bda5b8c (stack is 000000003ab1c4a5..00000000494d8b12)
[...]
Call Trace:
 __find_rr_leaf+0x99/0x230
 fib6_table_lookup+0x13b/0x2d0
 ip6_pol_route+0xa4/0x400
 fib6_rule_lookup+0x156/0x240
 ip6_route_output_flags+0xc6/0x150
 __nf_ip6_route+0x23/0x50
 synproxy_send_tcp_ipv6+0x106/0x200
 synproxy_send_client_synack_ipv6+0x1aa/0x1f0
 nft_synproxy_do_eval+0x263/0x310
 nft_do_chain+0x5a8/0x5f0 [nf_tables
 nft_do_chain_inet+0x98/0x110
 nf_hook_slow+0x43/0xc0
 __ip6_local_out+0xf0/0x170
 ip6_local_out+0x17/0x70
 synproxy_send_tcp_ipv6+0x1a2/0x200
 synproxy_send_client_synack_ipv6+0x1aa/0x1f0
[...]

Implement objref and objrefmap expression validate functions.

Currently, only NFT_OBJECT_SYNPROXY object type requires validation.
This will also handle a jump to a chain using a synproxy object from the
OUTPUT hook.

Now when trying to reference a synproxy object in the OUTPUT hook, nft
will produce the following error:

synproxy_crash.nft: Error: Could not process rule: Operation not supported
  synproxy name mysynproxy
  ^^^^^^^^^^^^^^^^^^^^^^^^

Fixes: ee394f96ad75 ("netfilter: nft_synproxy: add synproxy stateful object support")
Reported-by: Georg Pfuetzenreuter 
Closes: https://bugzilla.suse.com/1250237
Signed-off-by: Fernando Fernandez Mancera 
Reviewed-by: Pablo Neira Ayuso 
Signed-off-by: Florian Westphal

netfilter: nf_conntrack: do not skip entries in /proc/net/nf_conntrack

2025-09-24T09:50:28+00:00

ct_seq_show() has an opportunistic garbage collector :

if (nf_ct_should_gc(ct)) {
    nf_ct_kill(ct);
    goto release;
}

So if one nf_conn is killed there, next time ct_get_next() runs,
we skip the following item in the bucket, even if it should have
been displayed if gc did not take place.

We can decrement st->skip_elems to tell ct_get_next() one of the items
was removed from the chain.

Fixes: 58e207e4983d ("netfilter: evict stale entries when user reads /proc/net/nf_conntrack")
Signed-off-by: Eric Dumazet 
Signed-off-by: Florian Westphal

netfilter: nft_set_pipapo_avx2: fix skip of expired entries

2025-09-24T09:50:28+00:00

KASAN reports following splat:
BUG: KASAN: slab-out-of-bounds in pipapo_get_avx2+0x941/0x25d0
Read of size 1 at addr ffff88814c561be0 by task nft/3944
Call Trace:
 pipapo_get_avx2+0x941/0x25d0
 nft_pipapo_insert+0x440/0x11b0
 nf_tables_newsetelem+0x220a/0x3a00
 ..

This bisects to commit 84c1da7b38d9 ("netfilter: nft_set_pipapo: use AVX2
algorithm for insertions too").

However, that change merely uncovers this bug.

When we find a match but that match has expired or timed out, the AVX2
implementation restarts the full match loop.

At that point, the pointer to the key data has already been changed and
points to the keys last field.
This will then result in out-of-bounds read once its incremented again
for the next field.

The restart logic in AVX2 is different compared to the plain C
implementation, but both should follow the same logic.

The C implementation just calls pipapo_refill() again do check the next
entry.  Do the same in the AVX2 implementation.

Note that with this change, due to implementation differences of
pipapo_refill vs. nft_pipapo_avx2_refill, the refill call will return
the same element again. Then, on the next call, it will move to the next
entry as expected.  This is because avx2_refill doesn't clear the bitmap
in the 'last' conditional.  This is harmless. Expired/timed out elements
are also not expected to be frequent.

selftest is added in a followup commit.

Fixes: 7400b063969b ("nft_set_pipapo: Introduce AVX2-based lookup implementation")
Reviewed-by: Stefano Brivio 
Signed-off-by: Florian Westphal

netfilter: nft_set_pipapo: use 0 genmask for packetpath lookups

2025-09-24T09:50:28+00:00

In commit c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from
packetpath lookups") I replaced genmask_cur() with NFT_GENMASK_ANY, but
this change has no effect in the pipapo set type.

New entries are unreachable from the active copy, so NFT_GENMASK_ANY has
same result as genmask_cur():

current-gen elements are disabled and the new-generation
elements cannot be found.

Tests did not catch this incomplete fix because the change also dropped
the genmask test from the AVX2 version of the algorithm, so test only
fails if host cpu lacks AVX2 support.

Use genmask test only from the control plane (inserts, deletions, ..).

Packet path has to skip the check, use of 0 is enough for this because
ext->genmask has a the relevant bit set when the element is INACTIVE
in that generation: using a 0 genmask thus makes nft_set_elem_active()
always return true.

Fix the comment and replace NFT_GENMASK_ANY with 0.

Fixes: c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups")
Signed-off-by: Florian Westphal