linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
9 days	ipvs: clear the nfct flag under lock	Julian Anastasov
	Sashiko warns that cp->flags should be changed under cp->lock Fixes: 35dfb013149f ("ipvs: queue delayed work to expire no destination connections if expire_nodest_conn=1") Fixes: f0a5e4d7a594 ("ipvs: allow connection reuse for unconfirmed conntrack") Link: https://sashiko.dev/#/patchset/CALMqdkR704S2BG_QD_bgHTFp2%2B1QCi7n0T4zoZyTo8mDZevYSA%40mail.gmail.com Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: do not mangle ICMP replies for non-first fragments	Julian Anastasov
	Sashiko warns that ip_vs_nat_icmp() unconditionally mangles the payload for embedded non-first IPv4 fragments. The problem is in the very old inverted pp->dont_defrag check which should not continue when embedded is a non-first TCP/UDP/SCTP fragment. Check for embedded non-first fragment is also missing from ip_vs_out_icmp_v6(), it is needed before any connection lookups that expect ports after the network headers. Drop the blocking code from ip_vs_in_icmp_v6() which prevents ICMPv6 from local clients to use non-MASQ forwarding. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://sashiko.dev/#/patchset/20260720201122.79882-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: fix places with wrong packet offsets	Julian Anastasov
	The offsets we use to packet headers and payloads should be based on skb->data. We even already respect non-zero network offset in ip_vs_fill_iph_skb() but some places do it wrongly and support only zero offset which is expected for the IP layer where IPVS has hooks. Change all places that instead of skb->data use offsets based on the network header (skb_network_header, ip_hdr, etc) because this doubles the network offset as noted by Sashiko. For ip_vs_nat_icmp_v6() we can even rely on the IPv6 header parsing done by the caller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://sashiko.dev/#/patchset/20260710143733.29741-2-fw%40strlen.de Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: fix the checksum validations	Julian Anastasov
	ip_vs_in_icmp_v6() is missing checksum validation for ICMPv6 packets from clients. In fact, as for TCP/UDP we should validate the checksum for ICMP packets only when we mangle the packets on MASQ or on reply for tunnel. Also, Sashiko points out that handle_response_icmp() being common for IPv4 and IPv6 is missing the pseudo-header calculation while validating ICMPv6 messages from real servers which is a problem if checksum is not validated by the hardware. Fix the problems by creating ip_vs_checksum_common_check() helper and use it for TCP/UDP/ICMP both for IPv4 and IPv6. Rely on the nf_checksum() for validating the ICMP messages but use it also for TCP and UDP. Use correct IP offset for IP_VS_DBG_RL_PKT for TCP/UDP/SCTP. IPVS packets (TCP/UDP/SCTP/ICMP) do not need checksum validation on LOCAL_OUT (local clients or local real servers) and on FORWARD (traffic from servers on LAN). Do it only on LOCAL_IN, in case nf_checksum() is not called on PRE_ROUTING. Also, ip_vs_checksum_complete() can be marked static. Fixes: 2a3b791e6e11 ("IPVS: Add/adjust Netfilter hook functions and helpers for v6") Link: https://sashiko.dev/#/patchset/20260708180315.77413-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: adjust double hashing when fwd method changes	Julian Anastasov
	Synced conns can be created with one forwarding method and later updated with different one after the dest server is configured. This needs adjusting the hashing for node hn1 because only MASQ supports double hashing. Modify conn_tab_lock() to support seeking for hash node hn0 together with adding for hn1. By this way we can safely modify the forwarding method and hn1.hash_key under bucket lock for the first node hn0. The forwarding method is also protected by cp->lock as it is part of cp->flags. Fix the usage of stale idx/idx2 values in conn_tab_lock after jumping to the retry label. Instead, use idx/idx2 values just to order the locking for the old/new tables. Reported-by: Zhiling Zou <roxy520tt@gmail.com> Link: https://patch.msgid.link/1b914f41d725bc064c9ba9830dc8169329737270.1782540466.git.roxy520tt@gmail.com/ Link: https://sashiko.dev/#/patchset/CALMqdkR704S2BG_QD_bgHTFp2%2B1QCi7n0T4zoZyTo8mDZevYSA%40mail.gmail.com Fixes: f20c73b0460d ("ipvs: use more keys for connection hashing") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
9 days	ipvs: do not propagate one-packet flag to synced conns	Zhiling Zou
	Synced connections can be created before their destination exists. When the destination is later added, ip_vs_bind_dest() copies connection flags from the destination into cp->flags. IP_VS_CONN_F_ONE_PACKET connections are not synced. If a synced connection inherits IP_VS_CONN_F_ONE_PACKET while it is already hashed, expiry can treat it as a one-packet connection and skip unlinking the existing conn_tab node, leaving stale hash nodes pointing at a freed struct ip_vs_conn. Drop IP_VS_CONN_F_ONE_PACKET from destination flags when binding synced connections. Fixes: 26ec037f9841 ("IPVS: one-packet scheduling") Cc: stable@vger.kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Zhiling Zou <roxy520tt@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-07-10	ipvs: fix more places with wrong ipv6 transport offsets	Julian Anastasov
	Sashiko reports for more incorrect IPv6 transport offsets. The app code for TCP was assuming IPv4 network header even after the ipvsh argument was provided. This can cause problems with apps over IPv6. As for the only official app in the kernel tree (FTP) this problem is harmless because we use Netfilter to mangle the FTP ports and we do not adjust the TCP seq numbers. Also, provide correct offset of the ICMPV6 header in ip_vs_out_icmp_v6() for correct checksum checks when the IPv6 packet has extension headers. Fixes: d12e12299a69 ("ipvs: add ipv6 support to ftp") Fixes: 2a3b791e6e11 ("IPVS: Add/adjust Netfilter hook functions and helpers for v6") Cc: stable@vger.kernel.org Link: https://sashiko.dev/#/patchset/20260706101624.69471-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-10	ipvs: reload ip header after head reallocation	Florian Westphal
	__ip_vs_get_out_rt() calls skb_ensure_writable() which may reallocate skb->head. Fixes: 8d8e20e2d7bb ("ipvs: Decrement ttl") Cc: stable@vger.kernel.org Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: ensure inner headers in ICMP errors are in headroom	Julian Anastasov
	Sashiko points out that after stripping the outer headers with pskb_pull() we should ensure the inner IP headers in ICMP errors from tunnels are present in the skb headroom for functions like ipv4_update_pmtu(), icmp_send() and IP_VS_DBG(). Also, add more checks for the length of the inner headers. Fixes: f2edb9f7706d ("ipvs: implement passive PMTUD for IPIP packets") Link: https://sashiko.dev/#/patchset/20260702073430.67680-1-zhaoyz24%40mails.tsinghua.edu.cn Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: use parsed transport offset in SCTP state lookup	Yizhou Zhao
	set_sctp_state() reads the SCTP chunk header again in order to drive the IPVS SCTP state table. For IPv6 it computes the offset with sizeof(struct ipv6hdr), while the surrounding IPVS code uses iph.len from ip_vs_fill_iph_skb(), where ipv6_find_hdr() has already skipped extension headers and found the real transport header. This makes the state machine read from the wrong offset for IPv6 SCTP packets that carry extension headers. For example, an INIT packet with an 8-byte destination options header can be scheduled correctly by sctp_conn_schedule(), but set_sctp_state() reads the first byte of the SCTP verification tag as a DATA chunk type. The connection then moves from NONE to ESTABLISHED instead of INIT1, gets the longer established timeout, and updates the active/inactive destination counters incorrectly. This happens even though the SCTP handshake has not completed. Use the parsed transport offset passed down from ip_vs_set_state() for the SCTP chunk-header lookup. For IPv4 and IPv6 packets without extension headers this preserves the existing offset. Fixes: 2906f66a5682 ("ipvs: SCTP Trasport Loadbalancing Support") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/netdev/20260705123040.35755-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: use parsed transport offset in TCP state lookup	Yizhou Zhao
	TCP state handling reparses the skb to find the TCP header. For IPv6 it uses sizeof(struct ipv6hdr), while the surrounding IPVS code already parsed the packet with ip_vs_fill_iph_skb() and has the real transport-header offset in iph.len. This makes TCP state handling look at the wrong bytes when an IPv6 packet carries extension headers. Use the parsed transport offset passed down from ip_vs_set_state() when reading the TCP header. For IPv4 and for IPv6 packets without extension headers, the passed offset matches the previous value. Fixes: 0bbdd42b7efa6 ("IPVS: Extend protocol DNAT/SNAT and state handlers") Link: https://lore.kernel.org/netdev/20260705125659.37744-1-zhaoyz24@mails.tsinghua.edu.cn/ Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-08	ipvs: pass parsed transport offset to state handlers	Yizhou Zhao
	IPVS callers already parse the packet into struct ip_vs_iphdr before updating connection state. For IPv6 this records the real transport-header offset after extension headers in iph.len. Pass this parsed transport offset through ip_vs_set_state() and the protocol state_transition() callback so protocol handlers can use the same packet context as scheduling and NAT handling. This patch only changes the common callback plumbing and adapts the protocol callback signatures; TCP and SCTP start using the value in follow-up patches. Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	ipvs: reset full ip_vs_seq structs in ip_vs_conn_new	Yizhou Zhao
	Commit 9a05475cebdd ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") changed ip_vs_conn_new() to allocate an ip_vs_conn object with kmem_cache_alloc(). The function then initializes many fields explicitly, but only resets in_seq.delta and out_seq.delta in the two struct ip_vs_seq members. That leaves init_seq and previous_delta uninitialized. This is normally harmless while the corresponding IP_VS_CONN_F_IN_SEQ or IP_VS_CONN_F_OUT_SEQ flag is clear. For connections learned from a sync message, however, ip_vs_proc_conn() preserves those flags from IP_VS_CONN_F_BACKUP_MASK and passes opt=NULL when the message omits IPVS_OPT_SEQ_DATA. In that case the new connection can be hashed with SEQ flags set but with the rest of in_seq/out_seq still containing stale slab data. When a packet for such a connection is later handled by an IPVS application helper, vs_fix_seq() and vs_fix_ack_seq() use previous_delta and init_seq to rewrite TCP sequence numbers. A malformed sync message can therefore make forwarded packets carry stale slab bytes in their TCP seq/ack numbers, and can also corrupt the forwarded TCP flow. Reset both struct ip_vs_seq members completely before publishing the connection. This matches the existing "reset struct ip_vs_seq" comment and keeps the sequence-adjustment gates inactive unless valid sequence data is installed later. Fixes: 9a05475cebdd ("ipvs: avoid kmem_cache_zalloc in ip_vs_conn_new") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-07-03	ipvs: fix PMTU for GUE/GRE tunnel ICMP errors	Yizhou Zhao
	When an ICMP Fragmentation Needed error is received for a tunneled IPVS connection, ip_vs_in_icmp() recomputes the MTU that the original packet can use by subtracting the tunnel overhead from the reported next-hop MTU. The current code always subtracts sizeof(struct iphdr), which is only the IPIP overhead. For GUE and GRE tunnels, ipvs_udp_decap() and ipvs_gre_decap() already compute the additional tunnel header length, but that value is scoped to the decapsulation block and is lost before the ICMP_FRAG_NEEDED handling. As a result, the ICMP error sent back to the client advertises an MTU that is too large, so PMTUD can fail to converge for GUE/GRE-tunneled real servers. With a reported next-hop MTU of 1400, a GUE tunnel currently returns 1380 to the client. The correct value is 1368: 1400 - sizeof(struct iphdr) - sizeof(struct udphdr) - sizeof(struct guehdr) Hoist the tunnel header length into the main ip_vs_in_icmp() scope and subtract sizeof(struct iphdr) + ulen in the Fragmentation Needed path. The IPIP path keeps ulen as 0, so its existing 1400 - 20 = 1380 result is unchanged. Fixes: 508f744c0de3 ("ipvs: strip udp tunnel headers from icmp errors") Cc: stable@vger.kernel.org Reported-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Reported-by: Yuxiang Yang <yangyx22@mails.tsinghua.edu.cn> Reported-by: Ao Wang <wangao@seu.edu.cn> Reported-by: Xuewei Feng <fengxw06@126.com> Reported-by: Qi Li <qli01@tsinghua.edu.cn> Reported-by: Ke Xu <xuke@tsinghua.edu.cn> Assisted-by: Claude-Code:GLM-5.2 Signed-off-by: Yizhou Zhao <zhaoyz24@mails.tsinghua.edu.cn> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-06-14	ipvs: Replace use of system_unbound_wq with system_dfl_long_wq	Marco Crivellari
	This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. This specific work is considered long, so enqueue it using system_dfl_long_wq instead of system_dfl_wq. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-05	ipvs: add conn_max sysctl to limit connections	Julian Anastasov
	Currently, we are using atomic_t to track the number of connections. On 64-bit setups with large memory there is a risk this counter to overflow. Also, setups with many containers may need to tune the limit for connections. Add sysctl control to limit the number of connections to 1,073,741,824 (64-bit) and 16,777,216 (32-bit). Depending on the admin's privilege, the value is used to change a soft or hard limit allowing unprivileged admins to change the soft limit in range determined by privileged admins. Link: https://sashiko.dev/#/patchset/20260523172715.94795-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260430074420.26697-7-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260522105546.13732-1-ja%40ssi.bg Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-04	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-7.1-rc7). Silent conflicts: net/wireless/nl80211.c cb9959ab5f99 ("wifi: cfg80211: enforce HE/EHT cap/oper consistency") a384ae969902 ("wifi: cfg80211: move AP HT/VHT/... operation to beacon info") https://lore.kernel.org/aiGJDaHV4UlCexIQ@sirena.org.uk Conflicts: drivers/net/wireless/intel/iwlwifi/mld/ap.c a342c99cb70d ("wifi: iwlwifi: mld: honor BSS_CHANGED_BEACON_ENABLED") 9bf1b409afc7 ("wifi: iwlwifi: mld: send tx power constraints before link activation") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk drivers/net/wireless/intel/iwlwifi/pcie/drv.c 093305d801fa ("wifi: iwlwifi: pcie: simplify the resume flow if fast resume is not used") e2323929a68a ("wifi: iwlwifi: pcie: add debug print for resume flow if powered off") https://lore.kernel.org/ah2bfedhV45ZxMO8@sirena.org.uk Adjacent changes: drivers/net/ethernet/airoha/airoha_eth.c b38cae85d1c4 ("net: airoha: Fix use-after-free in metadata dst teardown") ec6c391bcca7 ("net: airoha: Introduce airoha_gdm_dev struct") drivers/net/ethernet/microchip/lan743x_main.c 8173d22b211f ("net: lan743x: permit VLAN-tagged packets up to configured MTU") e3c6508a46f5 ("net: lan743x: avoid netdev-based logging before netdev registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-01	ipvs: clear the svc scheduler ptr early on edit	Julian Anastasov
	ip_vs_edit_service() while unbinding the old scheduler clears the svc->scheduler ptr after the scheduler module initiates RCU callbacks. This can cause packets to use the old scheduler at the time when svc->sched_data is already freed after RCU grace period. Fix it by clearing the ptr early in ip_vs_unbind_scheduler(), before the done_service method schedules any RCU callbacks. Also, if the new scheduler fails to initialize when replacing the old scheduler, try to restore the old scheduler while still returning the error code. Link: https://sashiko.dev/#/patchset/20260519015506.634185-1-rosenp%40gmail.com Fixes: 05f00505a89a ("ipvs: fix crash if scheduler is changed") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-24	netfilter: add option for GCOV profiling	Florian Westphal
	Similar to a few other subsystems: add a new config toggle to enable netfilter gcov profiling in netfilter, including ebtables, arptables and so on. ipset and ipvs gain their own, dedicated toggles. Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-05-16	ipvs: avoid possible loop in ip_vs_dst_event on resizing	Julian Anastasov
	Sashiko points out that unprivileged user can frequently call ip_vs_flush() or ip_vs_del_service() to trigger svc_table_changes updates that can lead to infinite loop in ip_vs_dst_event(). This can also happen if the user triggers frequent table resizing without deleting all services. We should also consider the possible effects if the user triggers many NETDEV_DOWN events. One way to solve it is to hold svc_resize_sem in ip_vs_dst_event() but this can block the dev notifier during the whole resizing process. Instead, use new rw_semaphore svc_replace_sem to protect just the svc_table replacement which is a short code section. Then hold svc_replace_sem in ip_vs_dst_event() to serialize with replacing the svc_table. As result, loop is avoided as there is no need to repeat the table walking from the start. By this way changes in svc_table_changes can happen only when all services are removed and all dev references dropped which allows us to abort the table walking. As IP_VS_WORK_SVC_NORESIZE is the flag used to stop the svc_resize_work under service_mutex, we should check only this flag often but not while under service_mutex. To remove the mutex_trylock() for service_mutex in the second phase where the resizer installs the new table after rehashing, we will avoid holding the service_mutex there. As result, the code in configuration context which is under service_mutex should access ipvs->svc_table under RCU because it can be replaced at anytime and released after a RCU grace period. As for ip_vs_zero_all(), it needs different solution as a table walker which can escape single RCU read-side critical section: to hold the svc_replace_sem to prevent table to be replaced. In ip_vs_status_show() prefer to hold svc_replace_sem to avoid many loops, just detect if the svc_table is removed. Prefer the newly attached table for the u_thresh/l_thresh checks to know when to grow/shrink while adding or deleting services because the new table size is based on the latest parameters. Link: https://sashiko.dev/#/patchset/20260505001648.360569-1-pablo%40netfilter.org Fixes: 840aac3d900d ("ipvs: use resizable hash table for services") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: Guard access of HK_TYPE_KTHREAD cpumask with RCU	Waiman Long
	The ip_vs_ctl.c file and the associated ip_vs.h file are the only places in the kernel where HK_TYPE_KTHREAD cpumask is being retrieved and used. Now that HK_TYPE_KTHREAD/HK_TYPE_DOMAIN cpumask can be changed at run time. We need to use RCU to guard access to this cpumask to avoid a potential UAF problem as the returned cpumask may be freed before it is being used. We can replace HK_TYPE_KTHREAD by HK_TYPE_DOMAIN as they are aliases of each other, but keeping the HK_TYPE_KTHREAD name can highlight the fact that it is the kthread initiated by ipvs that is being controlled. Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: fix shift-out-of-bounds in ip_vs_rht_desired_size	Julian Anastasov
	Calling roundup_pow_of_two() with 0 has undefined result: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'unsigned long' CPU: 1 UID: 0 PID: 77 Comm: kworker/u8:4 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/18/2026 Workqueue: events_unbound conn_resize_work_handler Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 ubsan_epilogue+0xa/0x30 lib/ubsan.c:233 __ubsan_handle_shift_out_of_bounds+0x385/0x410 lib/ubsan.c:494 __roundup_pow_of_two include/linux/log2.h:57 [inline] ip_vs_rht_desired_size+0x2cf/0x410 net/netfilter/ipvs/ip_vs_core.c:240 ip_vs_conn_desired_size net/netfilter/ipvs/ip_vs_conn.c:765 [inline] conn_resize_work_handler+0x1b6/0x14c0 net/netfilter/ipvs/ip_vs_conn.c:822 process_one_work kernel/workqueue.c:3302 [inline] process_scheduled_works+0xb5d/0x1860 kernel/workqueue.c:3385 worker_thread+0xa53/0xfc0 kernel/workqueue.c:3466 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reported-by: syzbot+217f1db9c791e27fe54a@syzkaller.appspotmail.com Fixes: b655388111cf ("ipvs: add resizable hash tables") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: fix races around est_mutex and est_cpulist	Julian Anastasov
	Sashiko reports for races and possible crash around the usage of est_cpulist_valid and sysctl_est_cpulist. The problem is that we do not lock est_mutex in some places which can lead to wrong write ordering and as result problems when calling cpumask_weight() and cpumask_empty(). Fix them by moving the est_max_threads read/write under locked est_mutex. Do the same for one ip_vs_est_reload_start() call to protect the cpumask_empty() usage of sysctl_est_cpulist. To remove the chance of deadlock while stopping the estimation kthreads, keep the data structure for kthread 0 even after last estimator is removed and do not hold mutexes while stopping this task. Now we will use a new flag 'needed' to know when kthread 0 should run. The kthreads above 0 do not use mutexes, so stop them under est_mutex because their kthread data still can be destroyed if they do not serve estimators. Now all kthreads will be started by the est_reload_work to properly serialize the stop/start for kthread 0. Reduce the use of service_mutex in ip_vs_est_calc_phase() because under est_mutex we can safely walk est_kt_arr to stop the kthreads above slot 0. As ip_vs_stop_estimator() for tot_stats should be called under service_mutex, do it early in the netns exit path in ip_vs_flush() to avoid locking the mutex again later. It still should be called in ip_vs_control_net_cleanup_sysctl() when we are called during netns init error. Use -2 for ktid as indicator if estimator was already stopped. Finally, fix use-after-free for kd->est_row in ip_vs_est_calc_phase(). est->ktrow should simply switch to a delay value while estimator is linked to est_temp_list. Link: https://sashiko.dev/#/patchset/20260331165015.2777765-1-longman%40redhat.com Link: https://sashiko.dev/#/patchset/20260420171308.87192-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260422125123.40658-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260424175858.54752-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260425103918.7447-1-ja%40ssi.bg Fixes: f0be83d54217 ("ipvs: add est_cpulist and est_nice sysctl vars") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: do not leak dest after get from dest trash	Julian Anastasov
	Sashiko warns about leaked dest if ip_vs_start_estimator() fails in ip_vs_add_dest(). Add ip_vs_trash_put_dest() to put back the dest into dest trash. Link: https://sashiko.dev/#/patchset/20260428175725.72050-1-ja%40ssi.bg Fixes: 705dd3444081 ("ipvs: use kthreads for stats estimation") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: fix the spin_lock usage for RT build	Julian Anastasov
	syzbot reports for sleeping function called from invalid context [1]. The recently added code for resizable hash tables uses hlist_bl bit locks in combination with spin_lock for the connection fields (cp->lock). Fix the following problems: * avoid using spin_lock(&cp->lock) under locked bit lock because it sleeps on PREEMPT_RT * as the recent changes call ip_vs_conn_hash() only for newly allocated connection, the spin_lock can be removed there because the connection is still not linked to table and does not need cp->lock protection. * the lock can be removed also from ip_vs_conn_unlink() where we are the last connection user. * the last place that is fixed is ip_vs_conn_fill_cport() where now the cp->lock is locked before the other locks to ensure other packets do not modify the cp->flags in non-atomic way. Here we make sure cport and flags are changed only once if two or more packets race to fill the cport. Also, we fill cport early, so that if we race with resizing there will be valid cport key for the hashing. Add a warning if too many hash table changes occur for our RCU read-side critical section which is error condition but minor because the connection still can expire gracefully. Still, restore the cport to 0 to allow retransmitted packet to properly fill the cport. Problems reported by Sashiko. [1]: BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0 preempt_count: 2, expected: 0 RCU nest depth: 3, expected: 3 8 locks held by ktimers/0/16: #0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 #1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline] #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: timer_base_lock_expiry kernel/time/timer.c:1502 [inline] #2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384 #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock kernel/locking/spinlock_rt.c:50 [inline] #3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57 #4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at: call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745 #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline] #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline] #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:315 [inline] #5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 #6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163 #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline] #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline] #7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 Preemption disabled at: [<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline] [<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149 CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L syzkaller #0 PREEMPT_{RT,(full)} Tainted: [W]=WARN, [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 __might_resched+0x329/0x480 kernel/sched/core.c:9162 __rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline] rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57 spin_lock include/linux/spinlock_rt.h:45 [inline] ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline] ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260 call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748 expire_timers kernel/time/timer.c:1799 [inline] __run_timers kernel/time/timer.c:2374 [inline] __run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386 run_timer_base kernel/time/timer.c:2395 [inline] run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405 handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622 __do_softirq kernel/softirq.c:656 [inline] run_ktimerd+0x69/0x100 kernel/softirq.c:1151 smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160 kthread+0x388/0x470 kernel/kthread.c:436 ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com Link: https://sashiko.dev/#/patchset/20260415200216.79699-1-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260420165539.85174-4-ja%40ssi.bg Link: https://sashiko.dev/#/patchset/20260422135823.50489-4-ja%40ssi.bg Fixes: 2fa7cc9c7025 ("ipvs: switch to per-net connection table") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: fix races around the conn_lfactor and svc_lfactor sysctl vars	Julian Anastasov
	Sashiko warns that the new sysctls vars can be changed after the hash tables are destroyed and their respective resizing works canceled, leading to mod_delayed_work() being called for canceled works. Solve this in different ways. conn_tab can be present even without services and is destroyed only on netns exit, so use disable_delayed_work_sync() to disable the work instead of adding more synchronization mechanisms. As for the svc_table, it is destroyed when the services are deleted, so we must be sure that netns exit is not called yet (the check for 'enable') and the work is not canceled by checking all under same mutex lock. Also, use WRITE_ONCE when updating the sysctl vars as we already read them with READ_ONCE. Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de Fixes: 8d7de5477e47 ("ipvs: add conn_lfactor and svc_lfactor sysctl vars") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-05	ipvs: fixes for the new ip_vs_status info	Julian Anastasov
	Sashiko reports some problems for the recently added /proc/net/ip_vs_status: * ip_vs_status_show() as a table reader may run long after the conn_tab and svc_table table are released. While ip_vs_conn_flush() properly changes the conn_tab_changes counter when conn_tab is removed, ip_vs_del_service() and ip_vs_flush() were missing such change for the svc_table_changes counter. As result, readers like ip_vs_dst_event() and ip_vs_status_show() may continue to use a freed table after a cond_resched_rcu() call. * While counting the buckets in ip_vs_status_show() make sure we traverse only the needed number of entries in the chain. This also prevents possible overflow of the 'count' variable. * Add check for 'loops' to prevent infinite loops while restarting the traversal on table change. * While IP_VS_CONN_TAB_MAX_BITS is 20 on 32-bit platforms and there is no risk to overflow when multiplying the number of conn_tab buckets to 100, prefer the div_u64() helper to make the following dividing safer. * Use 0440 permissions for ip_vs_status to restrict the info only to root due to the exported information for hash distribution. Link: https://sashiko.dev/#/patchset/20260410112352.23599-1-fw%40strlen.de Fixes: 9a9ccef907a7 ("ipvs: add ip_vs_status info") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-20	ipvs: fix MTU check for GSO packets in tunnel mode	Yingnan Zhang
	Currently, IPVS skips MTU checks for GSO packets by excluding them with the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel mode encapsulates GSO packets with IPIP headers. The issue manifests in two ways: 1. MTU violation after encapsulation: When a GSO packet passes through IPVS tunnel mode, the original MTU check is bypassed. After adding the IPIP tunnel header, the packet size may exceed the outgoing interface MTU, leading to unexpected fragmentation at the IP layer. 2. Fragmentation with problematic IP IDs: When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments is fragmented after encapsulation, each segment gets a sequentially incremented IP ID (0, 1, 2, ...). This happens because: a) The GSO packet bypasses MTU check and gets encapsulated b) At __ip_finish_output, the oversized GSO packet is split into separate SKBs (one per segment), with IP IDs incrementing c) Each SKB is then fragmented again based on the actual MTU This sequential IP ID allocation differs from the expected behavior and can cause issues with fragment reassembly and packet tracking. Fix this by properly validating GSO packets using skb_gso_validate_network_len(). This function correctly validates whether the GSO segments will fit within the MTU after segmentation. If validation fails, send an ICMP Fragmentation Needed message to enable proper PMTU discovery. Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling") Signed-off-by: Yingnan Zhang <342144303@qq.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-04-10	ipvs: add conn_lfactor and svc_lfactor sysctl vars	Julian Anastasov
	Allow the default load factor for the connection and service tables to be configured. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-04-10	ipvs: add ip_vs_status info	Julian Anastasov
	Add /proc/net/ip_vs_status to show current state of IPVS. The motivation for this new /proc interface is to provide the output for the users to help them decide when to tune the load factor for hash tables, which is possible with the new sysctl knobs coming in followup patch. The output also includes information for the kthreads used for stats. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-04-10	ipvs: show the current conn_tab size to users	Julian Anastasov
	As conn_tab is per-net, better to show the current hash table size to users instead of the ip_vs_conn_tab_size (max). Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-04-09	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-7.0-rc8). Conflicts: net/ipv6/seg6_iptunnel.c c3812651b522f ("seg6: separate dst_cache for input and output paths in seg6 lwtunnel") 78723a62b969a ("seg6: add per-route tunnel source address") https://lore.kernel.org/adZhwtOYfo-0ImSa@sirena.org.uk net/ipv4/icmp.c fde29fd934932 ("ipv4: icmp: fix null-ptr-deref in icmp_build_probe()") d98adfbdd5c01 ("ipv4: drop ipv6_stub usage and use direct function calls") https://lore.kernel.org/adO3dccqnr6j-BL9@sirena.org.uk Adjacent changes: drivers/net/ethernet/stmicro/stmmac/chain_mode.c 51f4e090b9f8 ("net: stmmac: fix integer underflow in chain mode") 6b4286e05508 ("net: stmmac: rename STMMAC_GET_ENTRY() -> STMMAC_NEXT_ENTRY()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-08	ipvs: fix NULL deref in ip_vs_add_service error path	Weiming Shi
	When ip_vs_bind_scheduler() succeeds in ip_vs_add_service(), the local variable sched is set to NULL. If ip_vs_start_estimator() subsequently fails, the out_err cleanup calls ip_vs_unbind_scheduler(svc, sched) with sched == NULL. ip_vs_unbind_scheduler() passes the cur_sched NULL check (because svc->scheduler was set by the successful bind) but then dereferences the NULL sched parameter at sched->done_service, causing a kernel panic at offset 0x30 from NULL. Oops: general protection fault, [..] [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] RIP: 0010:ip_vs_unbind_scheduler (net/netfilter/ipvs/ip_vs_sched.c:69) Call Trace: <TASK> ip_vs_add_service.isra.0 (net/netfilter/ipvs/ip_vs_ctl.c:1500) do_ip_vs_set_ctl (net/netfilter/ipvs/ip_vs_ctl.c:2809) nf_setsockopt (net/netfilter/nf_sockopt.c:102) [..] Fix by simply not clearing the local sched variable after a successful bind. ip_vs_unbind_scheduler() already detects whether a scheduler is installed via svc->scheduler, and keeping sched non-NULL ensures the error path passes the correct pointer to both ip_vs_unbind_scheduler() and ip_vs_scheduler_put(). While the bug is older, the problem popups in more recent kernels (6.2), when the new error path is taken after the ip_vs_start_estimator() call. Fixes: 705dd3444081 ("ipvs: use kthreads for stats estimation") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-04	ipvs: use more keys for connection hashing	Julian Anastasov
	Simon Kirby reported long time ago that IPVS connection hashing based only on the client address/port (caddr, cport) as hash keys is not suitable for setups that accept traffic on multiple virtual IPs and ports. It can happen for multiple VIP:VPORT services, for single or many fwmark service(s) that match multiple virtual IPs and ports or even for passive FTP with peristence in DR/TUN mode where we expect traffic on multiple ports for the virtual IP. Fix it by adding virtual addresses and ports to the hash function. This causes the traffic from NAT real servers to clients to use second hashing for the in->out direction. As result: - the IN direction from client will use hash node hn0 where the source/dest addresses and ports used by client will be used as hash keys - the OUT direction from NAT real servers will use hash node hn1 for the traffic from real server to client - the persistence templates are hashed only with parameters based on the IN direction, so they now will also use the virtual address, port and fwmark from the service. OLD: - all methods: c_list node: proto, caddr:cport - persistence templates: c_list node: proto, caddr_net:0 - persistence engine templates: c_list node: per-PE, PE-SIP uses jhash NEW: - all methods: hn0 node (dir 0): proto, caddr:cport -> vaddr:vport - MASQ method: hn1 node (dir 1): proto, daddr:dport -> caddr:cport - persistence templates: hn0 node (dir 0): proto, caddr_net:0 -> vaddr:vport_or_0 proto, caddr_net:0 -> fwmark:0 - persistence engine templates: hn0 node (dir 0): as before Also reorder the ip_vs_conn fields, so that hash nodes are on same read-mostly cache line while write-mostly fields are on separate cache line. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-04	ipvs: switch to per-net connection table	Julian Anastasov
	Use per-net resizable hash table for connections. The global table is slow to walk when using many namespaces. The table can be resized in the range of [256 - ip_vs_conn_tab_size]. Table is attached only while services are present. Resizing is done by delayed work based on load (the number of connections). Add a hash_key field into the connection to store the table ID in the highest bit and the entry's hash value in the lowest bits. The lowest part of the hash value is used as bucket ID, the remaining part is used to filter the entries in the bucket before matching the keys and as result, helps the lookup operation to access only one cache line. By knowing the table ID and bucket ID for entry, we can unlink it without calculating the hash value and doing lookup by keys. We need only to validate the saved hash_key under lock. For better security switch from jhash to siphash for the default connection hashing but the persistence engines may use their own function. Keeping the hash table loaded with entries below the size (12%) allows to avoid collision for 96+% of the conns. ip_vs_conn_fill_cport() now will rehash the connection with proper locking because unhash+hash is not safe for RCU readers. To invalidate the templates setting just dport to 0xffff is enough, no need to rehash them. As result, ip_vs_conn_unhash() is now unused and removed. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-04	ipvs: use resizable hash table for services	Julian Anastasov
	Make the hash table for services resizable in the bit range of 4-20. Table is attached only while services are present. Resizing is done by delayed work based on load (the number of hashed services). Table grows when load increases 2+ times (above 12.5% with lfactor=-3) and shrinks 8+ times when load decreases 16+ times (below 0.78%). Switch to jhash hashing to reduce the collisions for multiple services. Add a hash_key field into the service to store the table ID in the highest bit and the entry's hash value in the lowest bits. The lowest part of the hash value is used as bucket ID, the remaining part is used to filter the entries in the bucket before matching the keys and as result, helps the lookup operation to access only one cache line. By knowing the table ID and bucket ID for entry, we can unlink it without calculating the hash value and doing lookup by keys. We need only to validate the saved hash_key under lock. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-03-04	ipvs: add resizable hash tables	Julian Anastasov
	Add infrastructure for resizable hash tables based on hlist_bl which we will use in followup patches. The tables allow RCU lookups during resizing, bucket modifications are protected with per-bucket bit lock and additional custom locking, the tables are resized when load reaches thresholds determined based on load factor parameter. Compared to other implementations we rely on: * fast entry removal by using node unlinking without pre-lookup * entry rehashing when hash key changes * entries can contain multiple hash nodes * custom locking depending on different contexts * adjustable load factor to customize the grow/shrink process Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-26	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-7.0-rc2). Conflicts: tools/testing/selftests/drivers/net/hw/rss_ctx.py 19c3a2a81d2b ("selftests: drv-net: rss: Generate unique ports for RSS context tests") ce5a0f4612db ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up") include/net/inet_connection_sock.h 858d2a4f67ff6 ("tcp: fix potential race in tcp_v6_syn_recv_sock()") fcd3d039fab69 ("tcp: make tcp_v{4,6}_send_check() static") https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c 69050f8d6d075 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") bf4afc53b77ae ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") 8a96b9144f18a ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()") Adjacent changes: net/netfilter/ipvs/ip_vs_ctl.c c59bd9e62e06 ("ipvs: use more counters to avoid service lookups") bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: no_cport and dropentry counters can be per-net	Julian Anastasov
	Change the no_cport counters to be per-net and address family. This should reduce the extra conn lookups done during present NO_CPORT connections. By changing from global to per-net dropentry counters, one net will not affect the drop rate of another net. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: use more counters to avoid service lookups	Julian Anastasov
	When new connection is created we can lookup for services multiple times to support fallback options. We already have some counters to skip specific lookups because it costs CPU cycles for hash calculation, etc. Add more counters for fwmark/non-fwmark services (fwm_services and nonfwm_services) and make all counters per address family. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: do not keep dest_dst after dest is removed	Julian Anastasov
	Before now dest->dest_dst is not released when server is moved into dest_trash list after removal. As result, we can keep dst/dev references for long time without actively using them. It is better to avoid walking the dest_trash list when ip_vs_dst_event() receives dev events. So, make sure we do not hold dev references in dest_trash list. As packets can be flying while server is being removed, check the IP_VS_DEST_F_AVAILABLE flag in slow path to ensure we do not save new dev references to removed servers. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-5-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: use single svc table	Julian Anastasov
	fwmark based services and non-fwmark based services can be hashed in same service table. This reduces the burden of working with two tables. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: some service readers can use RCU	Julian Anastasov
	Some places walk the services under mutex but they can just use RCU: * ip_vs_dst_event() uses ip_vs_forget_dev() which uses its own lock to modify dest * ip_vs_genl_dump_services(): ip_vs_genl_fill_service() just fills skb * ip_vs_genl_parse_service(): move RCU lock to callers ip_vs_genl_set_cmd(), ip_vs_genl_dump_dests() and ip_vs_genl_get_cmd() * ip_vs_genl_dump_dests(): just fill skb Signed-off-by: Julian Anastasov <ja@ssi.bg> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-3-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-25	ipvs: make ip_vs_svc_table and ip_vs_svc_fwm_table per netns	Jiejian Wu
	Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global "ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are tens of thousands of services from different netns in the table, it takes a long time to look up the table, for example, using "ipvsadm -ln" from different netns simultaneously. We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we add "service_mutex" per netns to keep these two tables safe instead of the global "__ip_vs_mutex" in current version. To this end, looking up services from different netns simultaneously will not get stuck, shortening the time consumption in large-scale deployment. It can be reproduced using the simple scripts below. init.sh: #!/bin/bash for((i=1;i<=4;i++));do ip netns add ns$i ip netns exec ns$i ip link set dev lo up ip netns exec ns$i sh add-services.sh done add-services.sh: #!/bin/bash for((i=0;i<30000;i++)); do ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr done runtest.sh: #!/bin/bash for((i=1;i<4;i++));do ip netns exec ns$i ipvsadm -ln > /dev/null & done ip netns exec ns4 ipvsadm -ln > /dev/null Run "sh init.sh" to initiate the network environment. Then run "time ./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core Intel Xeon ECS. The result of the original version is around 8 seconds, while the result of the modified version is only 0.8 seconds. Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com> Co-developed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-02-22	Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL uses	Kees Cook
	Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21	Convert more 'alloc_obj' cases to default GFP_KERNEL arguments	Linus Torvalds
	This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	Linus Torvalds
	This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/$alloc_objs(.*$, GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21	treewide: Replace kmalloc with kmalloc_obj for non-scalar types	Kees Cook
	This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-17	ipvs: do not keep dest_dst if dev is going down	Julian Anastasov
	There is race between the netdev notifier ip_vs_dst_event() and the code that caches dst with dev that is going down. As the FIB can be notified for the closed device after our handler finishes, it is possible valid route to be returned and cached resuling in a leaked dev reference until the dest is not removed. To prevent new dest_dst to be attached to dest just after the handler dropped the old one, add a netif_running() check to make sure the notifier handler is not currently running for device that is closing. Fixes: 7a4f0761fce3 ("IPVS: init and cleanup restructuring") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-02-17	ipvs: skip ipv6 extension headers for csum checks	Julian Anastasov
	Protocol checksum validation fails for IPv6 if there are extension headers before the protocol header. iph->len already contains its offset, so use it to fix the problem. Fixes: 2906f66a5682 ("ipvs: SCTP Trasport Loadbalancing Support") Fixes: 0bbdd42b7efa ("IPVS: Extend protocol DNAT/SNAT and state handlers") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>