| Age | Commit message (Collapse) | Author |
|
Use msg->msg_namelen as a place holder instead of a
temporary variable, notably in inet[6]_recvmsg().
This removes stack canaries and allows tail-calls.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux
add/remove: 0/0 grow/shrink: 2/19 up/down: 26/-532 (-506)
Function old new delta
rawv6_recvmsg 744 767 +23
vsock_dgram_recvmsg 55 58 +3
vsock_connectible_recvmsg 50 47 -3
unix_stream_recvmsg 161 158 -3
unix_seqpacket_recvmsg 62 59 -3
unix_dgram_recvmsg 42 39 -3
tcp_recvmsg 546 543 -3
mptcp_recvmsg 1568 1565 -3
ping_recvmsg 806 800 -6
tcp_bpf_recvmsg_parser 983 974 -9
ip_recv_error 588 576 -12
ipv6_recv_rxpmtu 442 428 -14
udp_recvmsg 1243 1224 -19
ipv6_recv_error 1046 1024 -22
udpv6_recvmsg 1487 1461 -26
raw_recvmsg 465 437 -28
udp_bpf_recvmsg 1027 984 -43
sock_common_recvmsg 103 27 -76
inet_recvmsg 257 175 -82
inet6_recvmsg 257 175 -82
tcp_bpf_recvmsg 663 568 -95
Total: Before=25143834, After=25143328, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260227151120.1346573-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Currently key installation is only supported for netdev. For NAN,
support most key operations (except setting default data key) on
wdevs instead of netdevs, and adjust all the APIs and tracing to
match.
Since nothing currently sets NL80211_EXT_FEATURE_SECURE_NAN, this
doesn't change anything (P2P Device already isn't allowed.)
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260107150057.69a0cfad95fa.I00efdf3b2c11efab82ef6ece9f393382bcf33ba8@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
cfg80211_nan_conf::cluster_id is currently a pointer, but there is no real
reason to not have it an array. It makes things easier as there is no
need to check the pointer validity each time.
If a cluster ID wasn't provided by user space it will be randomized.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260302091108.2b12e4ccf5bb.Ib16bf5cca55463d4c89e18099cf1dfe4de95d405@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Per IEEE Std 802.11be-2024, 12.5.2.3.3, if the MPDU is an
individually addressed Data frame between an AP MLD and a
non-AP MLD associated with the AP MLD, then A1/A2/A3
will be MLD MAC addresses. Otherwise, Al/A2/A3 will be
over-the-air link MAC addresses.
Currently, during AAD and Nonce computation for software based
encryption/decryption cases, mac80211 directly uses the addresses it
receives in the skb frame header. However, after the first
authentication, management frame addresses for non-AP MLD stations
are translated to MLD addresses from over the air link addresses in
software. This means that the skb header could contain translated MLD
addresses, which when used as is, can lead to incorrect AAD/Nonce
computation.
In the following manner, ensure that the right set of addresses are used:
In the receive path, stash the pre-translated link addresses in
ieee80211_rx_data and use them for the AAD/Nonce computations
when required.
In the transmit path, offload the encryption for a CCMP/GCMP key
to the hwsim driver that can then ensure that encryption and hence
the AAD/Nonce computations are performed on the frame containing the
right set of addresses, i.e, MLD addresses if unicast data frame and
link addresses otherwise.
To do so, register the set key handler in hwsim driver so mac80211 is
aware that it is the driver that would take care of encrypting the
frame. Offload encryption for a CCMP/GCMP key, while keeping the
encryption for WEP/TKIP and MMIE generation for a AES_CMAC or a
AES_GMAC key still at the SW crypto in mac layer
Co-developed-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Rohan Dutta <quic_drohan@quicinc.com>
Signed-off-by: Sai Pratyusha Magam <sai.magam@oss.qualcomm.com>
Link: https://patch.msgid.link/20260226042959.3766157-1-sai.magam@oss.qualcomm.com
[only store and apply link_addrs for unicast non-data
rather storing always and applying for !unicast_data]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Currently, the unsolicited probe response template is always fetched from
the default link of a virtual interface in both Multi-Link Operation (MLO)
and non-MLO cases. However, in the MLO case there is a need to fetch the
unsolicited probe response template from a specific link instead of the
default link.
Hence, add support for fetching the unsolicited probe response template
based on the link ID from the corresponding link data.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-2-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Currently, the FILS discovery template is always fetched from the default
link of a virtual interface in both Multi-Link Operation (MLO) and
non-MLO cases. However, in the MLO case there is a need to fetch the FILS
discovery template from a specific link instead of the default link.
Hence, add support for fetching the FILS discovery template based on the
link ID from the corresponding link data.
Signed-off-by: Sriram R <quic_srirrama@quicinc.com>
Co-developed-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Signed-off-by: Raj Kumar Bhagat <raj.bhagat@oss.qualcomm.com>
Link: https://patch.msgid.link/20260220-fils-prob-by-link-v1-1-a2746a853f75@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Currently, a station can only be added to a netdev interface,
mainly because there was no need for a station of a non-netdev
interface.
But for NAN, we will have stations that belong to the NL80211_IFTYPE_NAN
interface.
Prepare for adding/changing/deleting a station that belongs to a non-netdev
interface. This doesn't actually allow such stations - this will be done
in a different patch.
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://patch.msgid.link/20260219114327.65c9cc96f814.Ic02066b88bb8ad6b21e15cbea8d720280008c83b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
mac80211/driver
When any incumbent signal is detected by an AP/mesh interface operating
in 6 GHz band, FCC mandates the AP/mesh to vacate the channels affected
by it [1].
Add a new API cfg80211_incumbent_signal_notify() that can be used
by mac80211 or drivers to notify the higher layers about the signal
interference event with the interference bitmap in which each bit
denotes the affected 20 MHz in the operating channel.
Add support for the new nl80211 event and nl80211 attribute as well to
notify userspace on the details about the interference event. Userspace is
expected to process it and take further action - vacate the channel, or
reduce the bandwidth.
[1] - https://apps.fcc.gov/kdb/GetAttachment.html?id=nXQiRC%2B4mfiA54Zha%2BrW4Q%3D%3D&desc=987594%20D02%20U-NII%206%20GHz%20EMC%20Measurement%20v03&tracking_number=277034
Signed-off-by: Hari Chandrakanthan <quic_haric@quicinc.com>
Signed-off-by: Amith A <amith.a@oss.qualcomm.com>
Link: https://patch.msgid.link/20260216032027.2310956-2-amith.a@oss.qualcomm.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Allow to track and check CAC state from user mode by
simple check phy channels eg. using iw phy1 channels
command.
This is done for regular CAC and background CAC.
It is important for background CAC while we can start
it from any app (eg. iw or hostapd).
Signed-off-by: Janusz Dziedzic <janusz.dziedzic@gmail.com>
Link: https://patch.msgid.link/20260206171830.553879-3-janusz.dziedzic@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
DualPI2 drops packets during dequeue but was using kfree_skb_reason()
directly, bypassing trace_qdisc_drop. Convert to qdisc_dequeue_drop()
and add QDISC_DROP_L4S_STEP_NON_ECN to the qdisc drop reason enum.
- Set TCQ_F_DEQUEUE_DROPS flag in dualpi2_init()
- Use enum qdisc_drop_reason in drop_and_retry()
- Replace kfree_skb_reason() with qdisc_dequeue_drop()
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211351978.3011628.11267023360997620069.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename QDISC_DROP_CAKE_FLOOD to QDISC_DROP_FLOOD_PROTECTION to use a
generic name without embedding the qdisc name. This follows the
principle that drop reasons should describe the drop mechanism rather
than being tied to a specific qdisc implementation.
The flood protection drop reason is used by qdiscs implementing
probabilistic drop algorithms (like BLUE) that detect unresponsive
flows indicating potential DoS or flood attacks. CAKE uses this via
its Cobalt AQM component.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211347537.3011628.13759059534638729639.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Rename FQ-specific drop reasons to generic names:
- QDISC_DROP_FQ_BAND_LIMIT -> QDISC_DROP_BAND_LIMIT
- QDISC_DROP_FQ_HORIZON_LIMIT -> QDISC_DROP_HORIZON_LIMIT
This follows the principle that drop reasons should describe the drop
mechanism rather than being tied to a specific qdisc implementation.
These concepts (priority band limits, timestamp horizon) could apply
to other qdiscs as well.
Remove the local macro define FQDR() and instead use the
full QDISC_DROP_* name to make it easier to navigate code.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211346902.3011628.12523261489552097455.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Convert SFQ to use the new qdisc-specific drop reason infrastructure.
This patch demonstrates how to convert a flow-based qdisc to use the
new enum qdisc_drop_reason. As part of this conversion:
- Add QDISC_DROP_MAXFLOWS for flow table exhaustion
- Rename FQ_FLOW_LIMIT to generic FLOW_LIMIT, now shared by FQ and SFQ
- Use QDISC_DROP_OVERLIMIT for sfq_drop() when overall limit exceeded
- Use QDISC_DROP_FLOW_LIMIT for per-flow depth limit exceeded
The FLOW_LIMIT reason is now a common drop reason for per-flow limits,
applicable to both FQ and SFQ qdiscs.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345946.3011628.12770616071857185664.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Create new enum qdisc_drop_reason and trace_qdisc_drop tracepoint
for qdisc layer drop diagnostics with direct qdisc context visibility.
The new tracepoint includes qdisc handle, parent, kind (name), and
device information. Existing SKB_DROP_REASON_QDISC_DROP is retained
for backwards compatibility via kfree_skb_reason().
Convert qdiscs with drop reasons to use the new infrastructure.
Change CAKE's cobalt_should_drop() return type from enum skb_drop_reason
to enum qdisc_drop_reason to fix implicit enum conversion warnings.
Use QDISC_DROP_UNSPEC as the 'not dropped' sentinel instead of
SKB_NOT_DROPPED_YET. Both have the same compiled value (0), so the
comparison logic remains semantically equivalent.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/177211345275.3011628.1974310302645218067.stgit@firesoul
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After commit b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node"),
the list_node field is reused for both the xskb pool list and the buffer
free list, this causes a buffer leak as described below.
xp_free() checks if a buffer is already on the free list using
list_empty(&xskb->list_node). When list_del() is used to remove a node
from the xskb pool list, it doesn't reinitialize the node pointers.
This means list_empty() will return false even after the node has been
removed, causing xp_free() to incorrectly skip adding the buffer to the
free list.
Fix this by using list_del_init() instead of list_del() in all fragment
handling paths, this ensures the list node is reinitialized after removal,
allowing the list_empty() to work correctly.
Fixes: b692bf9a7543 ("xsk: Get rid of xdp_buff_xsk::xskb_list_node")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Nikhil P. Rao <nikhil.rao@amd.com>
Link: https://patch.msgid.link/20260225000456.107806-2-nikhil.rao@amd.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
As Paolo said earlier [1]:
"Since the blamed commit below, classify can return TC_ACT_CONSUMED while
the current skb being held by the defragmentation engine. As reported by
GangMin Kim, if such packet is that may cause a UaF when the defrag engine
later on tries to tuch again such packet."
act_ct was never meant to be used in the egress path, however some users
are attaching it to egress today [2]. Attempting to reach a middle
ground, we noticed that, while most qdiscs are not handling
TC_ACT_CONSUMED, clsact/ingress qdiscs are. With that in mind, we
address the issue by only allowing act_ct to bind to clsact/ingress
qdiscs and shared blocks. That way it's still possible to attach act_ct to
egress (albeit only with clsact).
[1] https://lore.kernel.org/netdev/674b8cbfc385c6f37fb29a1de08d8fe5c2b0fbee.1771321118.git.pabeni@redhat.com/
[2] https://lore.kernel.org/netdev/cc6bfb4a-4a2b-42d8-b9ce-7ef6644fb22b@ovn.org/
Reported-by: GangMin Kim <km.kim1503@gmail.com>
Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")
CC: stable@vger.kernel.org
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20260225134349.1287037-1-victor@mojatatu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a fast path in lock_sock_nested(), to avoid acquiring
the socket spinlock only to set @owned to one:
spin_lock_bh(&sk->sk_lock.slock);
if (unlikely(sock_owned_by_user_nocheck(sk)))
__lock_sock(sk);
sk->sk_lock.owned = 1;
spin_unlock_bh(&sk->sk_lock.slock);
On x86_64 compiler generates something quite efficient:
00000000000077c0 <lock_sock_nested>:
77c0: f3 0f 1e fa endbr64
77c4: e8 00 00 00 00 call __fentry__
77c9: b9 01 00 00 00 mov $0x1,%ecx
77ce: 31 c0 xor %eax,%eax
77d0: f0 48 0f b1 8f 48 01 00 00 lock cmpxchg %rcx,0x148(%rdi)
77d9: 75 06 jne slow_path
77db: 2e e9 00 00 00 00 cs jmp __x86_return_thunk-0x4
slow_path: ...
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260226021215.1764237-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
UDP/TCP lookups are using RCU, thus isk->inet_num accesses
should use READ_ONCE() and WRITE_ONCE() where needed.
Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260225203545.1512417-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The gate action can be replaced while the hrtimer callback or dump path is
walking the schedule list.
Convert the parameters to an RCU-protected snapshot and swap updates under
tcf_lock, freeing the previous snapshot via call_rcu(). When REPLACE omits
the entry list, preserve the existing schedule so the effective state is
unchanged.
Fixes: a51c328df310 ("net: qos: introduce a gate control flow action")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Reviewed-by: Victor Nogueira <victor@mojatatu.com>
Link: https://patch.msgid.link/20260223150512.2251594-2-p@1g4.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Now that the pp fields in net_iov have no users, remove them from
net_iov and clean up.
Signed-off-by: Byungchul Park <byungchul@sk.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20260224061424.11219-1-byungchul@sk.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-7.0-rc2).
Conflicts:
tools/testing/selftests/drivers/net/hw/rss_ctx.py
19c3a2a81d2b ("selftests: drv-net: rss: Generate unique ports for RSS context tests")
ce5a0f4612db ("selftests: drv-net: rss_ctx: test RSS contexts persist after ifdown/up")
include/net/inet_connection_sock.h
858d2a4f67ff6 ("tcp: fix potential race in tcp_v6_syn_recv_sock()")
fcd3d039fab69 ("tcp: make tcp_v{4,6}_send_check() static")
https://lore.kernel.org/aZ8PSFLzBrEU3I89@sirena.org.uk
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c
69050f8d6d075 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types")
bf4afc53b77ae ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
8a96b9144f18a ("net/mlx5e: Alloc xsk channel param out of mlx5e_open_xsk()")
Adjacent changes:
net/netfilter/ipvs/ip_vs_ctl.c
c59bd9e62e06 ("ipvs: use more counters to avoid service lookups")
bf4afc53b77a ("Convert 'alloc_obj' family to use the new default GFP_KERNEL argument")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from IPsec, Bluetooth and netfilter
Current release - regressions:
- wifi: fix dev_alloc_name() return value check
- rds: fix recursive lock in rds_tcp_conn_slots_available
Current release - new code bugs:
- vsock: lock down child_ns_mode as write-once
Previous releases - regressions:
- core:
- do not pass flow_id to set_rps_cpu()
- consume xmit errors of GSO frames
- netconsole: avoid OOB reads, msg is not nul-terminated
- netfilter: h323: fix OOB read in decode_choice()
- tcp: re-enable acceptance of FIN packets when RWIN is 0
- udplite: fix null-ptr-deref in __udp_enqueue_schedule_skb().
- wifi: brcmfmac: fix potential kernel oops when probe fails
- phy: register phy led_triggers during probe to avoid AB-BA deadlock
- eth:
- bnxt_en: fix deleting of Ntuple filters
- wan: farsync: fix use-after-free bugs caused by unfinished tasklets
- xscale: check for PTP support properly
Previous releases - always broken:
- tcp: fix potential race in tcp_v6_syn_recv_sock()
- kcm: fix zero-frag skb in frag_list on partial sendmsg error
- xfrm:
- fix race condition in espintcp_close()
- always flush state and policy upon NETDEV_UNREGISTER event
- bluetooth:
- purge error queues in socket destructors
- fix response to L2CAP_ECRED_CONN_REQ
- eth:
- mlx5:
- fix circular locking dependency in dump
- fix "scheduling while atomic" in IPsec MAC address query
- gve: fix incorrect buffer cleanup for QPL
- team: avoid NETDEV_CHANGEMTU event when unregistering slave
- usb: validate USB endpoints"
* tag 'net-7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (72 commits)
netfilter: nf_conntrack_h323: fix OOB read in decode_choice()
dpaa2-switch: validate num_ifs to prevent out-of-bounds write
net: consume xmit errors of GSO frames
vsock: document write-once behavior of the child_ns_mode sysctl
vsock: lock down child_ns_mode as write-once
selftests/vsock: change tests to respect write-once child ns mode
net/mlx5e: Fix "scheduling while atomic" in IPsec MAC address query
net/mlx5: Fix missing devlink lock in SRIOV enable error path
net/mlx5: E-switch, Clear legacy flag when moving to switchdev
net/mlx5: LAG, disable MPESW in lag_disable_change()
net/mlx5: DR, Fix circular locking dependency in dump
selftests: team: Add a reference count leak test
team: avoid NETDEV_CHANGEMTU event when unregistering slave
net: mana: Fix double destroy_workqueue on service rescan PCI path
MAINTAINERS: Update maintainer entry for QUALCOMM ETHQOS ETHERNET DRIVER
dpll: zl3073x: Remove redundant cleanup in devm_dpll_init()
selftests/net: packetdrill: Verify acceptance of FIN packets when RWIN is 0
tcp: re-enable acceptance of FIN packets when RWIN is 0
vsock: Use container_of() to get net namespace in sysctl handlers
net: usb: kaweth: validate USB endpoints
...
|
|
Two administrator processes may race when setting child_ns_mode as one
process sets child_ns_mode to "local" and then creates a namespace, but
another process changes child_ns_mode to "global" between the write and
the namespace creation. The first process ends up with a namespace in
"global" mode instead of "local". While this can be detected after the
fact by reading ns_mode and retrying, it is fragile and error-prone.
Make child_ns_mode write-once so that a namespace manager can set it
once and be sure it won't change. Writing a different value after the
first write returns -EBUSY. This applies to all namespaces, including
init_net, where an init process can write "local" to lock all future
namespaces into local mode.
Fixes: eafb64f40ca4 ("vsock: add netns to vsock core")
Suggested-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Co-developed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Link: https://patch.msgid.link/20260223-vsock-ns-write-once-v3-2-c0cde6959923@meta.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
This facility was disabled in commit
9e539c5b6d9c ("netfilter: nf_tables: disable expression reduction infra"),
because not all nft_exprs guarantee they will update the destination
register: some may set NFT_BREAK instead to cancel evaluation of the
rule.
This has been dead code ever since.
There are no plans to salvage this at this time, so remove this.
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-10-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Change the no_cport counters to be per-net and address family.
This should reduce the extra conn lookups done during present
NO_CPORT connections.
By changing from global to per-net dropentry counters, one net
will not affect the drop rate of another net.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-7-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When new connection is created we can lookup for services multiple
times to support fallback options. We already have some counters
to skip specific lookups because it costs CPU cycles for hash
calculation, etc.
Add more counters for fwmark/non-fwmark services (fwm_services and
nonfwm_services) and make all counters per address family.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-6-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
fwmark based services and non-fwmark based services can be hashed
in same service table. This reduces the burden of working with two
tables.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-4-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Current ipvs uses one global mutex "__ip_vs_mutex" to keep the global
"ip_vs_svc_table" and "ip_vs_svc_fwm_table" safe. But when there are
tens of thousands of services from different netns in the table, it
takes a long time to look up the table, for example, using "ipvsadm
-ln" from different netns simultaneously.
We make "ip_vs_svc_table" and "ip_vs_svc_fwm_table" per netns, and we
add "service_mutex" per netns to keep these two tables safe instead of
the global "__ip_vs_mutex" in current version. To this end, looking up
services from different netns simultaneously will not get stuck,
shortening the time consumption in large-scale deployment. It can be
reproduced using the simple scripts below.
init.sh: #!/bin/bash
for((i=1;i<=4;i++));do
ip netns add ns$i
ip netns exec ns$i ip link set dev lo up
ip netns exec ns$i sh add-services.sh
done
add-services.sh: #!/bin/bash
for((i=0;i<30000;i++)); do
ipvsadm -A -t 10.10.10.10:$((80+$i)) -s rr
done
runtest.sh: #!/bin/bash
for((i=1;i<4;i++));do
ip netns exec ns$i ipvsadm -ln > /dev/null &
done
ip netns exec ns4 ipvsadm -ln > /dev/null
Run "sh init.sh" to initiate the network environment. Then run "time
./runtest.sh" to evaluate the time consumption. Our testbed is a 4-core
Intel Xeon ECS. The result of the original version is around 8 seconds,
while the result of the modified version is only 0.8 seconds.
Signed-off-by: Jiejian Wu <jiejian@linux.alibaba.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20260224205048.4718-2-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
bond_start_xmit() spends some cycles in is_netpoll_tx_blocked():
if (unlikely(is_netpoll_tx_blocked(dev)))
return NETDEV_TX_BUSY;
because of the "pushf;pop reg" sequence (aka irqs_disabled()).
Let's swap the conditions in is_netpoll_tx_blocked() and
convert netpoll_block_tx to a static key.
Before:
1.23 │ mov %gs:0x28,%rax
1.24 │ mov %rax,0x18(%rsp)
29.45 │ pushfq
0.50 │ pop %rax
0.47 │ test $0x200,%eax
│ ↓ je 1b4
0.49 │ 32: lea 0x980(%rsi),%rbx
After:
0.72 │ mov %gs:0x28,%rax
0.81 │ mov %rax,0x18(%rsp)
0.82 │ nop
2.77 │ 2a: lea 0x980(%rsi),%rbx
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260223230749.2376145-1-kuniyu@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This function is only called from tcp_v6_mtu_reduced() and can be
(auto)inlined by the compiler.
Note that inet6_csk_route_socket() is no longer (auto)inlined,
which is a good thing as it is slow path.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/2 grow/shrink: 2/0 up/down: 93/-129 (-36)
Function old new delta
tcp_v6_mtu_reduced 139 228 +89
inet6_csk_route_socket 486 490 +4
__pfx_inet6_csk_update_pmtu 16 - -16
inet6_csk_update_pmtu 113 - -113
Total: Before=25076512, After=25076476, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223153047.886683-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
tcp_v{4,6}_send_check() are only called from tcp_output.c
and should be made static so that the compiler does not need
to put an out of line copy of them.
Remove (struct inet_connection_sock_af_ops) send_check field
and use instead @net_header_len.
Move @net_header_len close to @queue_xmit for data locality
as both are used in TCP tx fast path.
$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/3 up/down: 0/-172 (-172)
Function old new delta
__tcp_transmit_skb 3426 3423 -3
tcp_v4_send_check 136 132 -4
mptcp_subflow_init 777 763 -14
__pfx_tcp_v6_send_check 16 - -16
tcp_v6_send_check 135 - -135
Total: Before=25143196, After=25143024, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Move tcp_v6_send_check() so that __tcp_transmit_skb() can inline it.
$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 0/0 grow/shrink: 1/0 up/down: 105/0 (105)
Function old new delta
__tcp_transmit_skb 3321 3426 +105
Total: Before=25143091, After=25143196, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Inline __tcp_v4_send_check(), like __tcp_v6_send_check().
Move tcp_v4_send_check() to tcp_output.c close to
its fast path caller (__tcp_transmit_skb()).
Note __tcp_v4_send_check() is still out-of-line for tcp4_gso_segment()
because it is called in an unlikely() section.
$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-9 (-9)
Function old new delta
__tcp_v4_send_check 130 121 -9
Total: Before=25143100, After=25143091, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223100729.3761597-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This function has a single caller in net/ipv6/udp.c.
Move it there so that the compiler can decide to (auto)inline
it if he prefers to. IBT glue is removed anyway.
With clang, we can see it was able to inline it and also
inlined one other helper at the same time.
UDPLITE removal will also help.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 840/-785 (55)
Function old new delta
__udp6_lib_rcv 1247 2087 +840
__pfx_udp6_csum_init 16 - -16
udp6_csum_init 769 - -769
Total: Before=25074399, After=25074454, chg +0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223093445.3696368-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After commit 6511882cdd82 ("mptcp: allocate fwd memory separately
on the rx and tx path") __lock_sock() can be static again.
Make sure __lock_sock() is not inlined, so that lock_sock_nested()
no longer needs a stack canary.
Add a noinline attribute on lock_sock_nested() so that calls
to lock_sock() from net/core/sock.c are not inlined,
none of them are fast path to deserve that:
- sockopt_lock_sock()
- sock_set_reuseport()
- sock_set_reuseaddr()
- sock_set_mark()
- sock_set_keepalive()
- sock_no_linger()
- sock_bindtoindex()
- sk_wait_data()
- sock_set_rcvbuf()
$ scripts/bloat-o-meter -t vmlinux.old vmlinux
add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-312 (-312)
Function old new delta
__lock_sock 192 188 -4
__lock_sock_fast 239 86 -153
lock_sock_nested 227 72 -155
Total: Before=24888707, After=24888395, chg -0.00%
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260223092716.3673939-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- purge error queues in socket destructors
- hci_sync: Fix CIS host feature condition
- L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
- L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
- L2CAP: Fix response to L2CAP_ECRED_CONN_REQ
- L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ
- L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ
- hci_qca: Cleanup on all setup failures
* tag 'for-net-2026-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix missing key size check for L2CAP_LE_CONN_REQ
Bluetooth: L2CAP: Fix not checking output MTU is acceptable on L2CAP_ECRED_CONN_REQ
Bluetooth: Fix CIS host feature condition
Bluetooth: L2CAP: Fix response to L2CAP_ECRED_CONN_REQ
Bluetooth: hci_qca: Cleanup on all setup failures
Bluetooth: purge error queues in socket destructors
Bluetooth: L2CAP: Fix result of L2CAP_ECRED_CONN_RSP when MTU is too short
Bluetooth: L2CAP: Fix invalid response to L2CAP_ECRED_RECONF_REQ
====================
Link: https://patch.msgid.link/20260223211634.3800315-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must
not be taken in IRQ context, only softirq is okay. A few drivers receive
the timestamp via a dedicated interrupt and complete the TX timestamp
from that handler. This will lead to a deadlock if the lock is already
write-locked on the same CPU.
Taking the lock can be avoided. The socket (pointed by the skb) will
remain valid until the skb is released. The ->sk_socket and ->file
member will be set to NULL once the user closes the socket which may
happen before the timestamp arrives.
If we happen to observe the pointer while the socket is closing but
before the pointer is set to NULL then we may use it because both
pointer (and the file's cred member) are RCU freed.
Drop the lock. Use READ_ONCE() to obtain the individual pointer. Add a
matching WRITE_ONCE() where the pointer are cleared.
Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de
Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Test L2CAP/ECFC/BV-26-C expect the response to L2CAP_ECRED_CONN_REQ with
and MTU value < L2CAP_ECRED_MIN_MTU (64) to be L2CAP_CR_LE_INVALID_PARAMS
rather than L2CAP_CR_LE_UNACCEPT_PARAMS.
Also fix not including the correct number of CIDs in the response since
the spec requires all CIDs being rejected to be included in the
response.
Link: https://github.com/bluez/bluez/issues/1868
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This fixes responding with an invalid result caused by checking the
wrong size of CID which should have been (cmd_len - sizeof(*req)) and
on top of it the wrong result was use L2CAP_CR_LE_INVALID_PARAMS which
is invalid/reserved for reconf when running test like L2CAP/ECFC/BI-03-C:
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 64
MPS: 64
Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reserved (0x000c)
Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)
Fiix L2CAP/ECFC/BI-04-C which expects L2CAP_RECONF_INVALID_MPS (0x0002)
when more than one channel gets its MPS reduced:
> ACL Data RX: Handle 64 flags 0x02 dlen 16
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 8
MTU: 264
MPS: 99
Source CID: 64
! Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Fix L2CAP/ECFC/BI-05-C when SCID is invalid (85 unconnected):
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 65
MPS: 64
! Source CID: 85
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - one or more Destination CIDs invalid (0x0003)
Fix L2CAP/ECFC/BI-06-C when MPS < L2CAP_ECRED_MIN_MPS (64):
> ACL Data RX: Handle 64 flags 0x02 dlen 14
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 2 len 6
MTU: 672
! MPS: 63
Source CID: 64
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Result: Reconfiguration failed - other unacceptable parameters (0x0004)
Fix L2CAP/ECFC/BI-07-C when MPS reduced for more than one channel:
> ACL Data RX: Handle 64 flags 0x02 dlen 16
LE L2CAP: Enhanced Credit Reconfigure Request (0x19) ident 3 len 8
MTU: 84
! MPS: 71
Source CID: 64
! Source CID: 65
< ACL Data TX: Handle 64 flags 0x00 dlen 10
LE L2CAP: Enhanced Credit Reconfigure Respond (0x1a) ident 2 len 2
! Result: Reconfiguration successful (0x0000)
Result: Reconfiguration failed - reduction in size of MPS not allowed for more than one channel at a time (0x0002)
Link: https://github.com/bluez/bluez/issues/1865
Fixes: 15f02b910562 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Code in tcp_v6_syn_recv_sock() after the call to tcp_v4_syn_recv_sock()
is done too late.
After tcp_v4_syn_recv_sock(), the child socket is already visible
from TCP ehash table and other cpus might use it.
Since newinet->pinet6 is still pointing to the listener ipv6_pinfo
bad things can happen as syzbot found.
Move the problematic code in tcp_v6_mapped_child_init()
and call this new helper from tcp_v4_syn_recv_sock() before
the ehash insertion.
This allows the removal of one tcp_sync_mss(), since
tcp_v4_syn_recv_sock() will call it with the correct
context.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+937b5bbb6a815b3e5d0b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/69949275.050a0220.2eeac1.0145.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260217161205.2079883-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Netfilter.
Current release - new code bugs:
- net: fix backlog_unlock_irq_restore() vs CONFIG_PREEMPT_RT
- eth: mlx5e: XSK, Fix unintended ICOSQ change
- phy_port: correctly recompute the port's linkmodes
- vsock: prevent child netns mode switch from local to global
- couple of kconfig fixes for new symbols
Previous releases - regressions:
- nfc: nci: fix false-positive parameter validation for packet data
- net: do not delay zero-copy skbs in skb_attempt_defer_free()
Previous releases - always broken:
- mctp: ensure our nlmsg responses to user space are zero-initialised
- ipv6: ioam: fix heap buffer overflow in __ioam6_fill_trace_data()
- fixes for ICMP rate limiting
Misc:
- intel: fix PCI device ID conflict between i40e and ipw2200"
* tag 'net-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
net: nfc: nci: Fix parameter validation for packet data
net/mlx5e: Use unsigned for mlx5e_get_max_num_channels
net/mlx5e: Fix deadlocks between devlink and netdev instance locks
net/mlx5e: MACsec, add ASO poll loop in macsec_aso_set_arm_event
net/mlx5: Fix misidentification of write combining CQE during poll loop
net/mlx5e: Fix misidentification of ASO CQE during poll loop
net/mlx5: Fix multiport device check over light SFs
bonding: alb: fix UAF in rlb_arp_recv during bond up/down
bnge: fix reserving resources from FW
eth: fbnic: Advertise supported XDP features.
rds: tcp: fix uninit-value in __inet_bind
net/rds: Fix NULL pointer dereference in rds_tcp_accept_one
octeontx2-af: Fix default entries mcam entry action
net/mlx5e: XSK, Fix unintended ICOSQ change
ipv6: icmp: icmpv6_xrlim_allow() optimization if net.ipv6.icmp.ratelimit is zero
ipv4: icmp: icmpv4_xrlim_allow() optimization if net.ipv4.icmp_ratelimit is zero
ipv6: icmp: remove obsolete code in icmpv6_xrlim_allow()
inet: move icmp_global_{credit,stamp} to a separate cache line
icmp: prevent possible overflow in icmp_global_allow()
selftests/net: packetdrill: add ipv4-mapped-ipv6 tests
...
|
|
icmp_global_credit was meant to be changed ~1000 times per second,
but if an admin sets net.ipv4.icmp_msgs_per_sec to a very high value,
icmp_global_credit changes can inflict false sharing to surrounding
fields that are read mostly.
Move icmp_global_credit and icmp_global_stamp to a separate
cacheline aligned group.
Fixes: b056b4cd9178 ("icmp: move icmp_global.credit and icmp_global.stamp to per netns storage")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260216142832.3834174-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
- Remove macros from proc handler converters
Replace the proc converter macros with "regular" functions. Though it
is more verbose than the macro version, it helps when debugging and
better aligns with coding-style.rst.
- General cleanup
Remove superfluous ctl_table forward declarations. Const qualify the
memory_allocation_profiling_sysctl and loadpin_sysctl_table arrays.
Add missing kernel doc to proc_dointvec_conv.
- Testing
This series was run through sysctl selftests/kunit test suite in
x86_64. And went into linux-next after rc4, giving it a good 3 weeks
of testing
* tag 'sysctl-7.00-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: replace SYSCTL_INT_CONV_CUSTOM macro with functions
sysctl: Replace unidirectional INT converter macros with functions
sysctl: Add kernel doc to proc_douintvec_conv
sysctl: Replace UINT converter macros with functions
sysctl: Add CONFIG_PROC_SYSCTL guards for converter macros
sysctl: clarify proc_douintvec_minmax doc
sysctl: Return -ENOSYS from proc_douintvec_conv when CONFIG_PROC_SYSCTL=n
sysctl: Remove unused ctl_table forward declarations
loadpin: Implement custom proc_handler for enforce
alloc_tag: move memory_allocation_profiling_sysctls into .rodata
sysctl: Add missing kernel-doc for proc_dointvec_conv
|
|
This is a recommendation from RFC 8981 and it was intended to be changed
by commit 969c54646af0 ("ipv6: Implement draft-ietf-6man-rfc4941bis")
but it only changed the sysctl documentation.
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260214172543.5783-1-fmancera@suse.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
It is unlikely that this function will be ever called
with isk->inet_num being not zero.
Perform the check on isk->inet_num inside the locked section
for complete safety.
Fixes: 9b115749acb24 ("ipv6: add ip6_sock_set_v6only")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Link: https://patch.msgid.link/20260216102202.3343588-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
On the receive path, __ioam6_fill_trace_data() uses trace->nodelen
to decide how much data to write for each node. It trusts this field
as-is from the incoming packet, with no consistency check against
trace->type (the 24-bit field that tells which data items are
present). A crafted packet can set nodelen=0 while setting type bits
0-21, causing the function to write ~100 bytes past the allocated
region (into skb_shared_info), which corrupts adjacent heap memory
and leads to a kernel panic.
Add a shared helper ioam6_trace_compute_nodelen() in ioam6.c to
derive the expected nodelen from the type field, and use it:
- in ioam6_iptunnel.c (send path, existing validation) to replace
the open-coded computation;
- in exthdrs.c (receive path, ipv6_hop_ioam) to drop packets whose
nodelen is inconsistent with the type field, before any data is
written.
Per RFC 9197, bits 12-21 are each short (4-octet) fields, so they
are included in IOAM6_MASK_SHORT_FIELDS (changed from 0xff100000 to
0xff1ffc00).
Fixes: 9ee11f0fff20 ("ipv6: ioam: Data plane support for Pre-allocated Trace")
Cc: stable@vger.kernel.org
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Link: https://patch.msgid.link/20260211040412.86195-1-qjx1298677004@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Pull rdma updates from Jason Gunthorpe:
"Usual smallish cycle. The NFS biovec work to push it down into RDMA
instead of indirecting through a scatterlist is pretty nice to see,
been talked about for a long time now.
- Various code improvements in irdma, rtrs, qedr, ocrdma, irdma, rxe
- Small driver improvements and minor bug fixes to hns, mlx5, rxe,
mana, mlx5, irdma
- Robusness improvements in completion processing for EFA
- New query_port_speed() verb to move past limited IBA defined speed
steps
- Support for SG_GAPS in rts and many other small improvements
- Rare list corruption fix in iwcm
- Better support different page sizes in rxe
- Device memory support for mana
- Direct bio vec to kernel MR for use by NFS-RDMA
- QP rate limiting for bnxt_re
- Remote triggerable NULL pointer crash in siw
- DMA-buf exporter support for RDMA mmaps like doorbells"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
RDMA/mlx5: Implement DMABUF export ops
RDMA/uverbs: Add DMABUF object type and operations
RDMA/uverbs: Support external FD uobjects
RDMA/siw: Fix potential NULL pointer dereference in header processing
RDMA/umad: Reject negative data_len in ib_umad_write
IB/core: Extend rate limit support for RC QPs
RDMA/mlx5: Support rate limit only for Raw Packet QP
RDMA/bnxt_re: Report QP rate limit in debugfs
RDMA/bnxt_re: Report packet pacing capabilities when querying device
RDMA/bnxt_re: Add support for QP rate limiting
MAINTAINERS: Drop RDMA files from Hyper-V section
RDMA/uverbs: Add __GFP_NOWARN to ib_uverbs_unmarshall_recv() kmalloc
svcrdma: use bvec-based RDMA read/write API
RDMA/core: add rdma_rw_max_sge() helper for SQ sizing
RDMA/core: add MR support for bvec-based RDMA operations
RDMA/core: use IOVA-based DMA mapping for bvec RDMA operations
RDMA/core: add bio_vec based RDMA read/write API
RDMA/irdma: Use kvzalloc for paged memory DMA address array
RDMA/rxe: Fix race condition in QP timer handlers
RDMA/mana_ib: Add device‑memory support
...
|