summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-11-26socket: Split out a getsockname helper for io_uringGabriel Krisman Bertazi
Similar to getsockopt, split out a helper to check security and issue the operation from the main handler that can be used by io_uring. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26socket: Unify getsockname and getpeername implementationGabriel Krisman Bertazi
They are already implemented by the same get_name hook in the protocol level. Bring the unification one level up to reduce code duplication in preparation to supporting these as io_uring operations. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26net/sched: em_canid: fix uninit-value in em_canid_matchShaurya Rane
Use pskb_may_pull() to ensure a complete CAN frame is present in the linear data buffer before reading the CAN ID. A simple skb->len check is insufficient because it only verifies the total data length but does not guarantee the data is present in skb->data (it could be in fragments). pskb_may_pull() both validates the length and pulls fragmented data into the linear buffer if necessary, making it safe to directly access skb->data. Reported-by: syzbot+5d8269a1e099279152bc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5d8269a1e099279152bc Fixes: f057bbb6f9ed ("net: em_canid: Ematch rule to match CAN frames according to their identifiers") Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Link: https://patch.msgid.link/20251126085718.50808-1-ssranevjti@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26can: raw: instantly reject unsupported CAN framesOliver Hartkopp
For real CAN interfaces the CAN_CTRLMODE_FD and CAN_CTRLMODE_XL control modes indicate whether an interface can handle those CAN FD/XL frames. In the case a CAN XL interface is configured in CANXL-only mode with disabled error-signalling neither CAN CC nor CAN FD frames can be sent. The checks are performed on CAN_RAW sockets to give an instant feedback to the user when writing unsupported CAN frames to the interface. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-16-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26wifi: mac80211: allow sharing identical chanctx for S1G interfacesLachlan Hodges
Introduce support for sharing identical channel contexts for S1G interfaces. Additionally, do not downgrade channel requests for S1G interfaces. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20251126015758.149034-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-25xsk: avoid data corruption on cq descriptor numberFernando Fernandez Mancera
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor production"), the descriptor number is stored in skb control block and xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto pool's completion queue. skb control block shouldn't be used for this purpose as after transmit xsk doesn't have control over it and other subsystems could use it. This leads to the following kernel panic due to a NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:xsk_destruct_skb+0xd0/0x180 [...] Call Trace: <IRQ> ? napi_complete_done+0x7a/0x1a0 ip_rcv_core+0x1bb/0x340 ip_rcv+0x30/0x1f0 __netif_receive_skb_one_core+0x85/0xa0 process_backlog+0x87/0x130 __napi_poll+0x28/0x180 net_rx_action+0x339/0x420 handle_softirqs+0xdc/0x320 ? handle_edge_irq+0x90/0x1e0 do_softirq.part.0+0x3b/0x60 </IRQ> <TASK> __local_bh_enable_ip+0x60/0x70 __dev_direct_xmit+0x14e/0x1f0 __xsk_generic_xmit+0x482/0xb70 ? __remove_hrtimer+0x41/0xa0 ? __xsk_generic_xmit+0x51/0xb70 ? _raw_spin_unlock_irqrestore+0xe/0x40 xsk_sendmsg+0xda/0x1c0 __sys_sendto+0x1ee/0x200 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x84/0x2f0 ? __pfx_pollwake+0x10/0x10 ? __rseq_handle_notify_resume+0xad/0x4c0 ? restore_fpregs_from_fpstate+0x3c/0x90 ? switch_fpu_return+0x5b/0xe0 ? do_syscall_64+0x204/0x2f0 ? do_syscall_64+0x204/0x2f0 ? do_syscall_64+0x204/0x2f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> [...] Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Instead use the skb destructor_arg pointer along with pointer tagging. As pointers are always aligned to 8B, use the bottom bit to indicate whether this a single address or an allocated struct containing several addresses. Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production") Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20251124171409.3845-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: remove icsk->icsk_retransmit_timerEric Dumazet
Now sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: introduce icsk->icsk_keepalive_timerEric Dumazet
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx groupEric Dumazet
These two fields are mostly read in TCP tx path, move them in an more appropriate group for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: rename icsk_timeout() to tcp_timeout_expires()Eric Dumazet
In preparation of sk->tcp_timeout_timer introduction, rename icsk_timeout() helper and change its argument to plain 'const struct sock *sk'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tools: ynl-gen: add regeneration commentAsbjørn Sloth Tønnesen
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: append ethtool counters of all hidden ports to conduitVladimir Oltean
Currently there is no way to see packet counters on cascade ports, and no clarity on how the API for that would look like. Because it's something that is currently needed, just extend the hack where ethtool -S on the conduit interface dumps CPU port counters, and also use it to dump counters of cascade ports. Note that the "pXX_" naming convention changes to "sXX_pYY", to distinguish between ports having the same index but belonging to different switches. This has a slight chance of causing regressions to existing tooling: - grepping for "p04_counter_name" still works, but might return more than one string now - grepping for " p04_counter_name" no longer works Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: use kernel data types for ethtool ops on conduitVladimir Oltean
Suppress some checkpatch 'CHECK' messages about u8 being preferable over uint8_t, etc. No functional change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: cpu_dp->orig_ethtool_ops might be NULLVladimir Oltean
In theory this would have been seen by now, but it seems that all drivers used as DSA conduit interfaces thus far have had ethtool_ops set, and it's hard to even find modern Ethernet drivers (and not VF ones) which don't use ethtool. Here is the unfiltered list of drivers which register any sort of net_device but don't set its ethtool_ops pointer. I don't think any of them 'risks' being used as a DSA conduit, maybe except for moxart, rnpbge and icssm, I'm not sure. - drivers/net/can/dev/dev.c - drivers/net/wwan/qcom_bam_dmux.c - drivers/net/wwan/t7xx/t7xx_netdev.c - drivers/net/arcnet/arcnet.c - drivers/net/hamradio/ - drivers/net/slip/slip.c - drivers/net/ethernet/ezchip/nps_enet.c - drivers/net/ethernet/moxa/moxart_ether.c - drivers/net/ethernet/wangxun/txgbevf/txgbevf_main.c - drivers/net/ethernet/wangxun/ngbevf/ngbevf_main.c - drivers/net/ethernet/huawei/hinic3/hinic3_main.c - drivers/net/ethernet/i825xx/ - drivers/net/ethernet/ti/icssm/icssm_prueth.c - drivers/net/ethernet/seeq/ - drivers/net/ethernet/litex/litex_liteeth.c - drivers/net/ethernet/sunplus/spl2sw_driver.c - drivers/net/ethernet/mucse/rnpgbe/rnpgbe_main.c - drivers/net/ipa/ - drivers/net/wireless/microchip/wilc1000/ - drivers/net/wireless/mediatek/mt76/dma.c - drivers/net/wireless/ath/ath12k/ - drivers/net/wireless/ath/ath11k/ - drivers/net/wireless/ath/ath6kl/ - drivers/net/wireless/ath/ath10k/ - drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/trans.c - drivers/net/wireless/virtual/mac80211_hwsim.c - drivers/net/wireless/quantenna/qtnfmac/pcie/pcie.c - drivers/net/wireless/realtek/rtw89/core.c - drivers/net/wireless/realtek/rtw88/pci.c - drivers/net/caif/ - drivers/net/plip/ - drivers/net/wan/ - drivers/net/mctp/ - drivers/net/ppp/ - drivers/net/thunderbolt/ Nonetheless, it's good for the framework not to make such assumptions, and not panic when coming across such kind of host device in the future. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codelEric Dumazet
cake, codel and fq_codel can drop many packets from dequeue(). Use qdisc_dequeue_drop() so that the freeing can happen outside of the qdisc spinlock scope. Add TCQ_F_DEQUEUE_DROPS to sch->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-15-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: add qdisc_dequeue_drop() helperEric Dumazet
Some qdisc like cake, codel, fq_codel might drop packets in their dequeue() method. This is currently problematic because dequeue() runs with the qdisc spinlock held. Freeing skbs can be extremely expensive. Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS so that these qdiscs can opt-in to defer the skb frees after the socket spinlock is released. TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs with an extra cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: add tcf_kfree_skb_list() helperEric Dumazet
Using kfree_skb_list_reason() to free list of skbs from qdisc operations seems wrong as each skb might have a different drop reason. Cleanup __dev_xmit_skb() to call tcf_kfree_skb_list() once in preparation of the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: annotate a data-race in __dev_xmit_skb()Eric Dumazet
q->limit is read locklessly, add a READ_ONCE(). Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-12-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: prefech skb->priority in __dev_xmit_skb()Eric Dumazet
Most qdiscs need to read skb->priority at enqueue time(). In commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption") I added a prefetch(next), lets add another one for the second half of skb. Note that skb->priority and skb->hash share a common cache line, so this patch helps qdiscs needing both fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-11-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: sch_fq: prefetch one skb ahead in dequeue()Eric Dumazet
prefetch the skb that we are likely to dequeue at the next dequeue(). Also call fq_dequeue_skb() a bit sooner in fq_dequeue(). This reduces the window between read of q.qlen and changes of fields in the cache line that could be dirtied by another cpu trying to queue a packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-10-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()Eric Dumazet
Group together changes to qdisc fields to reduce chances of false sharing if another cpu attempts to acquire the qdisc spinlock. qdisc_qstats_backlog_dec(sch, skb); sch->q.qlen--; qdisc_bstats_update(sch, skb); Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-9-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: cake: use qdisc_pkt_segs()Eric Dumazet
Use new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue() for non GSO packets. cake_overhead() does not have to recompute it. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-7-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()Eric Dumazet
Avoid up to two cache line misses in qdisc dequeue() to fetch skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held. This gives a 5 % improvement in a TX intensive workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()Eric Dumazet
sch_handle_ingress() sets qdisc_skb_cb(skb)->pkt_len. We also need to initialize qdisc_skb_cb(skb)->pkt_segs. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init()Eric Dumazet
qdisc_pkt_len_init() is currently initalizing qdisc_skb_cb(skb)->pkt_len. Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function to qdisc_pkt_len_segs_init(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: init shinfo->gso_segs from qdisc_pkt_len_init()Eric Dumazet
Qdisc use shinfo->gso_segs for their pkts stats in bstats_update(), but this field needs to be initialized for SKB_GSO_DODGY users. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: make room for (struct qdisc_skb_cb)->pkt_segsEric Dumazet
Add a new u16 field, next to pkt_len : pkt_segs This will cache shinfo->gso_segs to speed up qdisc deqeue(). Move slave_dev_queue_mapping at the end of qdisc_skb_cb, and move three bits from tc_skb_cb : - post_ct - post_ct_snat - post_ct_dnat Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25wifi: cfg80211: include s1g_primary_2mhz when sending chandefLachlan Hodges
The chandef now includes a flag denoting the use of a 2MHz primary channel for S1G interfaces, include this when sending the chandef. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20251125025927.245280-2-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-24mptcp: leverage the backlog for RX packet processingPaolo Abeni
When the msk socket is owned or the msk receive buffer is full, move the incoming skbs in a msk level backlog list. This avoid traversing the joined subflows and acquiring the subflow level socket lock at reception time, improving the RX performances. When processing the backlog, use the fwd alloc memory borrowed from the incoming subflow. skbs exceeding the msk receive space are not dropped; instead they are kept into the backlog until the receive buffer is freed. Dropping packets already acked at the TCP level is explicitly discouraged by the RFC and would corrupt the data stream for fallback sockets. Special care is needed to avoid adding skbs to the backlog of a closed msk and to avoid leaving dangling references into the backlog at subflow closing time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: introduce mptcp-level backlogPaolo Abeni
We are soon using it for incoming data processing. MPTCP can't leverage the sk_backlog, as the latter is processed before the release callback, and such callback for MPTCP releases and re-acquire the socket spinlock, breaking the sk_backlog processing assumption. Add a skb backlog list inside the mptcp sock struct, and implement basic helper to transfer packet to and purge such list. Packets in the backlog are memory accounted and still use the incoming subflow receive memory, to allow back-pressure. The backlog size is implicitly bounded to the sum of subflows rcvbuf. When a subflow is closed, references from the backlog to such sock are removed. No packet is currently added to the backlog, so no functional changes intended here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-13-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: borrow forward memory from subflowPaolo Abeni
In the MPTCP receive path, we release the subflow allocated fwd memory just to allocate it again shortly after for the msk. That could increases the failures chances, especially when we will add backlog processing, with other actions could consume the just released memory before the msk socket has a chance to do the rcv allocation. Replace the skb_orphan() call with an open-coded variant that explicitly borrows, the fwd memory from the subflow socket instead of releasing it. The borrowed memory does not have PAGE_SIZE granularity; rounding to the page size will make the fwd allocated memory higher than what is strictly required and could make the incoming subflow fwd mem consistently negative. Instead, keep track of the accumulated frag and borrow the full page at subflow close time. This allow removing the last drop in the TCP to MPTCP transition and the associated, now unused, MIB. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-12-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: handle first subflow closing consistentlyPaolo Abeni
Currently, as soon as the PM closes a subflow, the msk stops accepting data from it, even if the TCP socket could be still formally open in the incoming direction, with the notable exception of the first subflow. The root cause of such behavior is that code currently piggy back two separate semantic on the subflow->disposable bit: the subflow context must be released and that the subflow must stop accepting incoming data. The first subflow is never disposed, so it also never stop accepting incoming data. Use a separate bit to mark the latter status and set such bit in __mptcp_close_ssk() for all subflows. Beyond making per subflow behaviour more consistent this will also simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-11-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: drop the __mptcp_data_ready() helperPaolo Abeni
It adds little clarity and there is a single user of such helper, just inline it in the caller. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-10-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: make mptcp_destroy_common() staticPaolo Abeni
Such function is only used inside protocol.c, there is no need to expose it to the whole stack. Note that the function definition most be moved earlier to avoid forward declaration. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-9-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: do not miss early first subflow close event notificationPaolo Abeni
The MPTCP protocol is not currently emitting the NL event when the first subflow is closed before msk accept() time. By replacing the in use close helper is such scenario, implicitly introduce the missing notification. Note that in such scenario we want to be sure that mptcp_close_ssk() will not trigger any PM work, move the msk state change update earlier, so that the previous patch will offer such guarantee. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-8-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: ensure the kernel PM does not take action too latePaolo Abeni
The PM hooks can currently take place when the msk is already shutting down. Subflow creation will fail, thanks to the existing check at join time, but we can entirely avoid starting the to be failed operations. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-7-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: cleanup fallback dummy mapping generationPaolo Abeni
MPTCP currently access ack_seq outside the msk socket log scope to generate the dummy mapping for fallback socket. Soon we are going to introduce backlog usage and even for fallback socket the ack_seq value will be significantly off outside of the msk socket lock scope. Avoid relying on ack_seq for dummy mapping generation, using instead the subflow sequence number. Note that in case of disconnect() and (re)connect() we must ensure that any previous state is re-set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-6-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: cleanup fallback data fin receptionPaolo Abeni
MPTCP currently generate a dummy data_fin for fallback socket when the fallback subflow has completed data reception using the current ack_seq. We are going to introduce backlog usage for the msk soon, even for fallback sockets: the ack_seq value will not match the most recent sequence number seen by the fallback subflow socket, as it will ignore data_seq sitting in the backlog. Instead use the last map sequence number to set the data_fin, as fallback (dummy) map sequences are always in sequence. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-5-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: fix memcg accounting for passive socketsPaolo Abeni
The passive sockets never got proper memcg accounting: the msk socket is associated with the memcg at accept time, but the passive subflows never got it right. At accept time, traverse the subflows list and associate each of them with the msk memcg, and try to do the same at join completion time, if the msk has been already accepted. Fixes: cf7da0d66cc1 ("mptcp: Create SUBFLOW socket for incoming connections") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/298 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/597 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-4-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: grafting MPJ subflow earlierPaolo Abeni
Later patches need to ensure that all MPJ subflows are grafted to the msk socket before accept() completion. Currently the grafting happens under the msk socket lock: potentially at msk release_cb time which make satisfying the above condition a bit tricky. Move the MPJ subflow grafting earlier, under the msk data lock, so that we can use such lock as a synchronization point. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-3-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: factor-out cgroup data inherit helperPaolo Abeni
MPTCP will soon need the same functionality for passive sockets, factor them out in a common helper. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-2-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: factor-out _sk_charge() helperPaolo Abeni
Move out of __inet_accept() the code dealing charging newly accepted socket to memcg. MPTCP will soon use it to on a per subflow basis, in different contexts. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Geliang Tang <geliang@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-1-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: optimize eth_type_trans() vs CONFIG_STACKPROTECTOR_STRONG=yEric Dumazet
Some platforms exhibit very high costs with CONFIG_STACKPROTECTOR_STRONG=y when a function needs to pass the address of a local variable to external functions. eth_type_trans() (and its callers) is showing this anomaly on AMD EPYC 7B12 platforms (and maybe others). We could : 1) inline eth_type_trans() This would help if its callers also has the same issue, and the canary cost would be paid by the callers already. This is a bit cumbersome because netdev_uses_dsa() is pulling whole <net/dsa.h> definitions. 2) Compile net/ethernet/eth.c with -fno-stack-protector This would weaken security. 3) Hack eth_type_trans() to temporarily use skb->dev as a place holder if skb_header_pointer() needs to pull 2 bytes not present in skb->head. This patch implements 3), and brings a 5% improvement on TX/RX intensive workload (tcp_rr 10,000 flows) on AMD EPYC 7B12. Removing CONFIG_STACKPROTECTOR_STRONG on this platform can improve performance by 25 %. This means eth_type_trans() issue is not an isolated artifact. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121061725.206675-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: sched: fix TCF_LAYER_TRANSPORT handling in tcf_get_base_ptr()Eric Dumazet
syzbot reported that tcf_get_base_ptr() can be called while transport header is not set [1]. Instead of returning a dangling pointer, return NULL. Fix tcf_get_base_ptr() callers to handle this NULL value. [1] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 skb_transport_header include/linux/skbuff.h:3071 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 tcf_get_base_ptr include/net/pkt_cls.h:539 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 em_nbyte_match+0x2d8/0x3f0 net/sched/em_nbyte.c:43 Modules linked in: CPU: 1 UID: 0 PID: 6019 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> tcf_em_match net/sched/ematch.c:494 [inline] __tcf_em_tree_match+0x1ac/0x770 net/sched/ematch.c:520 tcf_em_tree_match include/net/pkt_cls.h:512 [inline] basic_classify+0x115/0x2d0 net/sched/cls_basic.c:50 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x4cf/0x1140 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x4e/0x260 net/core/dev.c:4118 __dev_xmit_skb net/core/dev.c:4214 [inline] __dev_queue_xmit+0xe83/0x3b50 net/core/dev.c:4729 packet_snd net/packet/af_packet.c:3076 [inline] packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x21c/0x270 net/socket.c:742 ____sys_sendmsg+0x505/0x830 net/socket.c:2630 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+f3a497f02c389d86ef16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6920855a.a70a0220.2ea503.0058.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251121154100.1616228-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24wifi: cfg80211: stop radar detection in cfg80211_leave()Johannes Berg
If an interface is set down or, per the previous patch, changes type, radar detection for it should be cancelled. This is done for AP mode in mac80211 (somewhat needlessly, since cfg80211 can do it, but didn't until now), but wasn't handled for mesh, so if radar detection was started and then the interface set down or its type switched (the latter sometimes happning in the hwsim test 'mesh_peer_connected_dfs'), radar detection would be around with the interface unknown to the driver, later leading to some warnings around chanctx usage. Link: https://patch.msgid.link/20251121174021.290120e419e3.I2a5650c9062e29c988992dd8ce0d8eb570d23267@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-24wifi: cfg80211: use cfg80211_leave() in iftype changeJohannes Berg
When changing the interface type, all activity on the interface has to be stopped first. This was done independent of existing code in cfg80211_leave(), so didn't handle e.g. background radar detection. Use cfg80211_leave() to handle it the same way. Note that cfg80211_leave() behaves slightly differently for IBSS in wireless extensions, it won't send an event in that case. We could handle that, but since nl80211 was used to change the type, IBSS is rare, and wext is already a corner case, it doesn't seem worth it. Link: https://patch.msgid.link/20251121174021.922ef48ce007.I970c8514252ef8a864a7fbdab9591b71031dee03@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-24wifi: mac80211: fix channel switching codeJohannes Berg
My prior commit here introduced a bug due to copy/paste, it was iterating the links assigned to 'ctx->replace_ctx' and I replaced it by iterating links assigned to 'ctx' by accident, then modified it for the iteration later. Fix it to iterate the users of the correct chanctx, i.e. 'ctx->replace_ctx'. Ultimately, this issue led to a crash in a hwsim test (multi_ap_wps_shared_apdev_csa) because it would actually do the switch (rather than refuse here) and then later have a double-free of the original chanctx, because it was still in use by another interface yet freed as part of the switching. Fixes: a1dc648aa76d ("wifi: mac80211: remove chanctx to link back-references") Link: https://patch.msgid.link/20251121113733.7710a58d45eb.Ie9ec010b52b1baed93dbe44f968c2119b6b5d98d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-23SUNRPC: new helper function for stopping backchannel serverOlga Kornievskaia
Create a new backchannel function to stop the backchannel server and clear the bc_serv in transport protected under the bc_pa_lock. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-23SUNRPC: cleanup common code in backchannel requestOlga Kornievskaia
Create a helper function for common code between rdma and tcp backchannel handling of the backchannel request. Make sure that access is protected by the bc_pa_lock lock. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-11-21Merge tag 'for-net-2025-11-21' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_sock: Prevent race in socket write iter and sock bind - hci_core: Fix triggering cmd_timer for HCI_OP_NOP - hci_core: lookup hci_conn on RX path on protocol side - SMP: Fix not generating mackey and ltk when repairing - btusb: mediatek: Fix kernel crash when releasing mtk iso interface - btusb: mediatek: Avoid btusb_mtk_claim_iso_intf() NULL deref * tag 'for-net-2025-11-21' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: SMP: Fix not generating mackey and ltk when repairing Bluetooth: btusb: mediatek: Avoid btusb_mtk_claim_iso_intf() NULL deref Bluetooth: hci_core: lookup hci_conn on RX path on protocol side Bluetooth: hci_sock: Prevent race in socket write iter and sock bind Bluetooth: hci_core: Fix triggering cmd_timer for HCI_OP_NOP Bluetooth: btusb: mediatek: Fix kernel crash when releasing mtk iso interface ==================== Link: https://patch.msgid.link/20251121145332.177015-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>