Age  Commit message  Author
2026-04-12  net: change sk_filter_trim_cap() to return a drop_reason by value  Eric Dumazet
Current return value can be replaced with the drop_reason, reducing kernel bloat:

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/11 up/down: 32/-603 (-571)
Function                                     old     new   delta
tcp_v6_rcv                                  3135    3167     +32
unix_dgram_sendmsg                          1731    1726      -5
netlink_unicast                              957     945     -12
netlink_dump                                1372    1359     -13
sk_filter_trim_cap                           882     858     -24
tcp_v4_rcv                                  3143    3111     -32
__pfx_tcp_filter                              32       -     -32
netlink_broadcast_filtered                  1633    1595     -38
sock_queue_rcv_skb_reason                    126      76     -50
tun_net_xmit                                1127    1074     -53
__sk_receive_skb                             690     632     -58
udpv6_queue_rcv_one_skb                      935     869     -66
udp_queue_rcv_one_skb                        919     853     -66
tcp_filter                                   154       -    -154
Total: Before=29722783, After=29722212, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-6-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  tcp: change tcp_filter() to return the reason by value  Eric Dumazet
sk_filter_trim_cap() will soon return the reason by value, do the same for tcp_filter().

Note: tcp_filter() is no longer inlined. Following patch will inline it again.

$ scripts/bloat-o-meter -t vmlinux.4 vmlinux.5
add/remove: 2/0 grow/shrink: 0/2 up/down: 186/-43 (143)
Function                                     old     new   delta
tcp_filter                                     -     154    +154
__pfx_tcp_filter                               -      32     +32
tcp_v4_rcv                                  3152    3143      -9
tcp_v6_rcv                                  3169    3135     -34
Total: Before=29722640, After=29722783, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: change sk_filter_reason() to return the reason by value  Eric Dumazet
sk_filter_trim_cap will soon return the reason by value, do the same for sk_filter_reason().

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-21 (-21)
Function                                     old     new   delta
sock_queue_rcv_skb_reason                    128     126      -2
tun_net_xmit                                1146    1127     -19
Total: Before=29722661, After=29722640, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: always set reason in sk_filter_trim_cap()  Eric Dumazet
sk_filter_trim_cap() will soon return the drop reason by value. Make sure *reason is cleared when no error is returned, to ease this conversion.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-7 (-7)
Function                                     old     new   delta
sk_filter_trim_cap                           889     882      -7
Total: Before=29722668, After=29722661, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: change sock_queue_rcv_skb_reason() to return a drop_reason  Eric Dumazet
Change sock_queue_rcv_skb_reason() to return the drop_reason directly instead of using a reference.

This is part of an effort to remove stack canaries and reduce bloat.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 3/7 up/down: 79/-301 (-222)
Function                                     old     new   delta
vsock_queue_rcv_skb                           50      79     +29
ipmr_cache_report                           1290    1315     +25
ip6mr_cache_report                          1322    1347     +25
packet_rcv_spkt                              329     327      -2
sock_queue_rcv_skb_reason                    166     128     -38
raw_rcv_skb                                  122      80     -42
ping_queue_rcv_skb                           109      61     -48
ping_rcv                                     215     162     -53
rawv6_rcv_skb                                278     224     -54
raw_rcv                                      591     527     -64
Total: Before=29722890, After=29722668, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260409145625.2306224-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
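The calling-convention change described in this series can be illustrated with a small userspace sketch. The enum and both function names below are hypothetical stand-ins, not the kernel's definitions; the only point is the contrast between passing the reason through a pointer and returning it directly.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's drop reason enum. */
enum drop_reason {
	DROP_NOT_DROPPED = 0,	/* packet was accepted */
	DROP_SOCKET_FILTER,
	DROP_RCVBUFF_FULL,
};

/*
 * Old style: status in the return value, reason through a pointer.
 * The caller has to keep a local variable and pass its address.
 */
int queue_packet_ptr(size_t len, size_t room, enum drop_reason *reason)
{
	if (len > room) {
		*reason = DROP_RCVBUFF_FULL;
		return -1;
	}
	*reason = DROP_NOT_DROPPED;	/* always set, even on success */
	return 0;
}

/*
 * New style: the reason is the return value, zero meaning "not dropped".
 * No address of a caller-side local is taken.
 */
enum drop_reason queue_packet_val(size_t len, size_t room)
{
	return len > room ? DROP_RCVBUFF_FULL : DROP_NOT_DROPPED;
}
```

With the by-value form a caller can write `reason = queue_packet_val(len, room); if (reason) ...` and keep the reason in a register, which is presumably where the size reductions reported by bloat-o-meter come from.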
2026-04-12  Merge branch 'add-support-for-pic64-hpsc-hx-mdio-controller'  Jakub Kicinski
Charles Perry says:

====================
Add support for PIC64-HPSC/HX MDIO controller

This series adds a driver for the two MDIO controllers of PIC64-HPSC/HX. The hardware supports C22 and C45 but only C22 is implemented for now.

This MDIO hardware is based on a Microsemi design supported in Linux by mdio-mscc-miim.c. However, the register interface is completely different with pic64hpsc, hence the need for a separate driver.

The documentation recommends an input clock of 156.25MHz and a prescaler of 39, which yields an MDIO clock of 1.95MHz.

This was tested on Microchip HB1301 evalkit which has a VSC8574 and a VSC8541. I've tested with bus frequencies of 0.6, 1.95 and 2.5 MHz.

This series also adds a PHY write barrier when disabling PHY interrupts as discussed in: https://lore.kernel.org/acvUqDgepCIScs8M@shell.armlinux.org.uk
====================

Link: https://patch.msgid.link/20260408131821.1145334-1-charles.perry@microchip.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: phy: add a PHY write barrier when disabling interrupts  Charles Perry
MDIO bus controllers are not required to wait for write transactions to complete before returning as synchronization is often achieved by polling status bits. This can cause issues when disabling interrupts since an interrupt could fire before the interrupt handler is unregistered and there's no status bit to poll. Add a phy_write_barrier() function and use it in phy_disable_interrupts() to fix this issue. The write barrier just reads an MII register and discards the value, which is enough to guarantee that previous writes have completed. Signed-off-by: Charles Perry <charles.perry@microchip.com> Link: https://patch.msgid.link/20260408131821.1145334-4-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
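The read-back barrier idea can be modelled in userspace. The following is a minimal sketch under stated assumptions: `struct fake_bus` and all functions are hypothetical, and the model simply assumes that the controller queues writes and serializes reads behind them. It is not the kernel's phy_write_barrier().

```c
#include <stdint.h>

#define NUM_REGS  32
#define QUEUE_LEN 8

/* Hypothetical model of an MDIO controller that posts writes. */
struct fake_bus {
	uint16_t regs[NUM_REGS];	/* PHY register file */
	struct {
		uint8_t reg;
		uint16_t val;
	} pending[QUEUE_LEN];		/* writes not yet applied */
	int npending;
};

/* Apply all queued writes to the register file, in order. */
static void bus_drain(struct fake_bus *b)
{
	for (int i = 0; i < b->npending; i++)
		b->regs[b->pending[i].reg % NUM_REGS] = b->pending[i].val;
	b->npending = 0;
}

/* Returns immediately; the write has not reached the PHY yet. */
void bus_write(struct fake_bus *b, uint8_t reg, uint16_t val)
{
	if (b->npending == QUEUE_LEN)
		bus_drain(b);		/* queue full: flush first */
	b->pending[b->npending].reg = reg;
	b->pending[b->npending].val = val;
	b->npending++;
}

/* Reads are serialized behind queued writes, so they flush the queue. */
uint16_t bus_read(struct fake_bus *b, uint8_t reg)
{
	bus_drain(b);
	return b->regs[reg % NUM_REGS];
}

/* Barrier: read any register and discard the value. */
void write_barrier(struct fake_bus *b)
{
	(void)bus_read(b, 1);	/* register number is arbitrary here */
}
```

In this model, a write that disables interrupts is only guaranteed to have taken effect once `write_barrier()` returns, which mirrors the ordering the commit message relies on.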
2026-04-12  net: mdio: add a driver for PIC64-HPSC/HX MDIO controller  Charles Perry
This adds an MDIO driver for PIC64-HPSC/HX. The hardware supports C22 and C45 but only C22 is implemented in this commit. This MDIO hardware is based on a Microsemi design supported in Linux by mdio-mscc-miim.c. However, the register interface is completely different with pic64hpsc, hence the need for a separate driver. The documentation recommends an input clock of 156.25MHz and a prescaler of 39, which yields an MDIO clock of 1.95MHz. The hardware supports an interrupt pin or a "TRIGGER" bit that can be polled to signal transaction completion. This commit uses polling. This was tested on Microchip HB1301 evalkit with a VSC8574 and a VSC8541. Signed-off-by: Charles Perry <charles.perry@microchip.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20260408131821.1145334-3-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
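The quoted clock figures can be checked with a short calculation. The divider relation used below, f_mdc = f_in / (2 * (prescaler + 1)), is an assumption: the commit message does not state it, but it reproduces the quoted 1.95MHz from 156.25MHz and a prescaler of 39. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Assumed divider relation (not stated in the commit message):
 *   f_mdc = f_in / (2 * (prescaler + 1))
 * The divisor is computed in 64 bits so that prescaler + 1 cannot wrap.
 */
uint32_t mdio_clock_hz(uint32_t input_hz, uint32_t prescaler)
{
	uint64_t divisor = 2ull * ((uint64_t)prescaler + 1);

	return (uint32_t)(input_hz / divisor);
}
```

Under that assumption, `mdio_clock_hz(156250000, 39)` evaluates to 1953125 Hz, which rounds to the 1.95MHz given in the message.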
2026-04-12  dt-bindings: net: document Microchip PIC64-HPSC/HX MDIO controller  Charles Perry
This MDIO hardware is based on a Microsemi design supported in Linux by mdio-mscc-miim.c. However, the register interface is completely different with pic64hpsc, hence the need for separate documentation. The hardware supports C22 and C45. The documentation recommends an input clock of 156.25MHz and a prescaler of 39, which yields an MDIO clock of 1.95MHz. The hardware supports an interrupt pin to signal transaction completion which is not strictly needed as the software can also poll a "TRIGGER" bit for this. Signed-off-by: Charles Perry <charles.perry@microchip.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20260408131821.1145334-2-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: phy: fix a return path in get_phy_c45_ids()  Charles Perry
The return value of phy_c45_probe_present() is stored in "ret", not "phy_reg", fix this. "phy_reg" always has a positive value if we reach this return path (since it would have returned earlier otherwise), which means that the original goal of the patch of not considering -ENODEV fatal wasn't achieved. Fixes: 17b447539408 ("net: phy: c45 scanning: Don't consider -ENODEV fatal") Signed-off-by: Charles Perry <charles.perry@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://patch.msgid.link/20260409133654.3203336-1-charles.perry@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  rtc: ntxec: fix OF node reference imbalance  Johan Hovold
The driver reuses the OF node of the parent multi-function device but fails to take another reference to balance the one dropped by the platform bus code when unbinding the MFD and deregistering the child devices. Fix this by using the intended helper for reusing OF nodes. Fixes: 435af89786c6 ("rtc: New driver for RTC in Netronix embedded controller") Cc: stable@vger.kernel.org # 5.13 Cc: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Johan Hovold <johan@kernel.org> Link: https://patch.msgid.link/20260407122717.2676774-1-johan@kernel.org Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-04-12  rtc: pic32: allow driver to be compiled with COMPILE_TEST  Brian Masney
This driver currently only supports builds against a PIC32 target. Now that commit ed65ae9f6c6b ("rtc: pic32: update include to use pic32.h from platform_data") is merged, it's possible to compile this driver on other architectures. To avoid future breakage of this driver, let's update the Kconfig so that it can be built with COMPILE_TEST enabled on all architectures. Signed-off-by: Brian Masney <bmasney@redhat.com> Link: https://patch.msgid.link/20260222-rtc-pic32-v1-1-3f8eb654a34d@redhat.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-04-12  rtc: ti-k3: Add support to resume from IO DDR low power mode  Akashdeep Kaur
Restore the RTC HW context which may be lost when system enters certain low power mode (IO+DDR mode). Check if the RTC registers are locked which would indicate loss of context (reset) and restore the context as needed. Signed-off-by: Akashdeep Kaur <a-kaur@ti.com> Reviewed-by: Vignesh Raghavendra <vigneshr@ti.com> Link: https://patch.msgid.link/20260313111740.1492519-1-a-kaur@ti.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-04-12  dt-bindings: net: dsa: nxp,sja1105: make spi-cpol optional for sja1110  Josua Mayer
Currently, the binding requires 'spi-cpha' for SJA1105 and 'spi-cpol' for SJA1110. However, the SJA1110 supports both SPI modes 0 and 2. Mode 2 (cpha=0, cpol=1) is used by the NXP LX2160 Bluebox 3. On the SolidRun i.MX8DXL HummingBoard Telematics, mode 0 is stable, while forcing mode 2 introduces CRC errors especially during bursts. Drop the requirement on spi-cpol for SJA1110. Fixes: af2eab1a8243 ("dt-bindings: net: nxp,sja1105: document spi-cpol/cpha") Signed-off-by: Josua Mayer <josua@solid-run.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://patch.msgid.link/20260409-imx8dxl-sr-som-v2-1-83ff20629ba0@solid-run.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  octeon_ep: Remove unnecessary semicolons in octep_oq_drop_rx()  Nobuhiro Iwamatsu
Remove unnecessary semicolons in octep_oq_drop_rx(). Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.x90@mail.toshiba> Link: https://patch.msgid.link/1775711291-13938-1-git-send-email-nobuhiro.iwamatsu.x90@mail.toshiba Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  Merge branch 'more-fixes-for-the-ipa-driver'  Jakub Kicinski
Luca Weiss says: ==================== More fixes for the IPA driver Two more fixes for the Qualcomm IPA driver. ==================== Link: https://patch.msgid.link/20260409-ipa-fixes-v1-0-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: ipa: Fix decoding EV_PER_EE for IPA v5.0+  Luca Weiss
Initially 'reg' and 'val' are assigned from HW_PARAM_2. But since IPA v5.0+ takes EV_PER_EE from HW_PARAM_4 (instead of NUM_EV_PER_EE from HW_PARAM_2), we not only need to re-assign 'reg' but also read the register value of that register into 'val' so that reg_decode() works on the correct value. Fixes: f651334e1ef5 ("net: ipa: add HW_PARAM_4 GSI register") Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=2 Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-ipa-fixes-v1-2-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: ipa: Fix programming of QTIME_TIMESTAMP_CFG  Luca Weiss
The 'val' variable gets overwritten multiple times, discarding previous values. Looking at the git log shows these should be combined with |= instead. Fixes: 9265a4f0f0b4 ("net: ipa: define even more IPA register fields") Link: https://sashiko.dev/#/patchset/20260403-milos-ipa-v1-0-01e9e4e03d3e%40fairphone.com?part=4 Signed-off-by: Luca Weiss <luca.weiss@fairphone.com> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Link: https://patch.msgid.link/20260409-ipa-fixes-v1-1-a817c30678ac@fairphone.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
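The class of bug described here, assigning with = where fields should be accumulated with |=, can be shown in isolation. The field layout and helper names below are hypothetical and do not reflect the real QTIME_TIMESTAMP_CFG register.

```c
#include <stdint.h>

/* Hypothetical field positions, 5 bits each. */
#define FIELD_A_SHIFT 0
#define FIELD_B_SHIFT 8
#define FIELD_C_SHIFT 16

static uint32_t encode(unsigned int shift, uint32_t v)
{
	return (v & 0x1f) << shift;
}

/* Buggy shape: each assignment discards the previously encoded field. */
uint32_t build_buggy(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t val;

	val = encode(FIELD_A_SHIFT, a);
	val = encode(FIELD_B_SHIFT, b);	/* overwrites field A */
	val = encode(FIELD_C_SHIFT, c);	/* overwrites field B */
	return val;
}

/* Fixed shape: fields are OR-ed together into one register value. */
uint32_t build_fixed(uint32_t a, uint32_t b, uint32_t c)
{
	uint32_t val;

	val = encode(FIELD_A_SHIFT, a);
	val |= encode(FIELD_B_SHIFT, b);
	val |= encode(FIELD_C_SHIFT, c);
	return val;
}
```

For inputs (1, 2, 3) the buggy variant yields 0x030000, keeping only the last field, while the fixed variant yields 0x030201.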
2026-04-12  Linux 7.0  v7.0  Linus Torvalds
2026-04-12  ppp: require CAP_NET_ADMIN in target netns for unattached ioctls  Taegu Ha
/dev/ppp open is currently authorized against file->f_cred->user_ns, while unattached administrative ioctls operate on current->nsproxy->net_ns. As a result, a local unprivileged user can create a new user namespace with CLONE_NEWUSER, gain CAP_NET_ADMIN only in that new user namespace, and still issue PPPIOCNEWUNIT, PPPIOCATTACH, or PPPIOCATTCHAN against an inherited network namespace. Require CAP_NET_ADMIN in the user namespace that owns the target network namespace before handling unattached PPP administrative ioctls. This preserves normal pppd operation in the network namespace it is actually privileged in, while rejecting the userns-only inherited-netns case. Fixes: 273ec51dd7ce ("net: ppp_generic - introduce net-namespace functionality v2") Signed-off-by: Taegu Ha <hataegu0826@gmail.com> Link: https://patch.msgid.link/20260409071117.4354-1-hataegu0826@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  Merge patch series "bpf: Fix OOB in pcpu_init_value and add a test"  Alexei Starovoitov
xulang <xulang@uniontech.com> says: ==================== Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes, and add a test case to reproduce the issue. The root cause is that pcpu_init_value() uses copy_map_value_long() which rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not 8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when the copy is performed. ==================== Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12  selftests/bpf: Add test for cgroup storage OOB read  Lang Xu
Add a test case to reproduce the out-of-bounds read issue when copying from a cgroup storage map to a pcpu map with a value_size not rounded up to 8 bytes.

The test creates:
1. A CGROUP_STORAGE map with 4-byte value (not 8-byte aligned)
2. A LRU_PERCPU_HASH map with 4-byte value (same size)

When a socket is created in the cgroup, the BPF program triggers bpf_map_update_elem() which calls copy_map_value_long(). This function rounds up the copy size to 8 bytes, but the cgroup storage buffer is only 4 bytes, causing an OOB read (before the fix).

Signed-off-by: Lang Xu <xulang@uniontech.com>
Link: https://lore.kernel.org/r/D63BF0DBFF1EA122+20260402074236.2187154-2-xulang@uniontech.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12  bpf: Fix OOB in pcpu_init_value  Lang Xu
An out-of-bounds read occurs when copying an element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes.

The issue happens when:
1. A CGROUP_STORAGE map is created with value_size not aligned to 8 bytes (e.g., 4 bytes)
2. A pcpu map is created with the same value_size (e.g., 4 bytes)
3. An element in the map from step 2 is updated with data from the map in step 1

pcpu_init_value() assumes that all sources are rounded up to 8 bytes and invokes copy_map_value_long() to copy the data. However, the assumption does not hold, since there are cases where the source may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data. The verifier verifies exactly the size that the source claims, not the size rounded up to 8 bytes by the kernel, so an OOB read happens when the source has only 4 bytes while the copy size (4) is rounded up to 8.

Fixes: d3bec0138bfb ("bpf: Zero-fill re-used per-cpu map element")
Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn>
Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/
Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com
Signed-off-by: Lang Xu <xulang@uniontech.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
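The size arithmetic behind the over-read can be reproduced in userspace. This is a sketch with hypothetical helper names; the commit message does not describe how the kernel fix is implemented, so `copy_value_exact()` is only one possible safe variant, not the actual change.

```c
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of 8, the same idea as round_up(n, 8). */
size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Bytes a rounded-up copy would read beyond a source of src_size bytes. */
size_t overread_bytes(size_t src_size, size_t value_size)
{
	size_t copy = round_up8(value_size);

	return copy > src_size ? copy - src_size : 0;
}

/*
 * One possible safe variant (an assumption, not the kernel fix):
 * copy exactly value_size bytes and zero the destination padding.
 * dst must provide round_up8(value_size) bytes.
 */
void copy_value_exact(void *dst, const void *src, size_t value_size)
{
	memcpy(dst, src, value_size);
	memset((char *)dst + value_size, 0,
	       round_up8(value_size) - value_size);
}
```

With a 4-byte source and a 4-byte value_size, `overread_bytes(4, 4)` is 4, matching the 4-byte OOB read reported for the CGROUP_STORAGE case.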
2026-04-12  Merge branch 'net-rds-fix-use-after-free-in-rds-ib-for-non-init-namespaces'  Jakub Kicinski
Allison Henderson says: ==================== net/rds: Fix use-after-free in RDS/IB for non-init namespaces This series fixes syzbot bug da8e060735ae02c8f3d1 https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1 The report finds a use-after-free bug where ib connections access an invalid network namespace after it has been freed. The stack is: rds_rdma_cm_event_handler_cmn rds_conn_path_drop rds_destroy_pending check_net() <-- use-after-free This is initially introduced in: d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net"). Here, we made RDS aware of the namespace by storing a net pointer in each connection. But it is not explicitly restricted to init_net in the case of ib. The RDS/TCP transport has its own pernet exit handler (rds_tcp_exit_net) that destroys connections when a namespace is torn down. But RDS/IB does not support more than the initial namespace and has no such handler. The initial namespace is statically allocated, and never torn down, so it always has at least one reference. Allowing non init namespaces that do not have a persistent reference means that when their refcounts drop to zero, they are released through cleanup_net(). Which would call any registered pernet clean up handlers if it had any, but since they don't in this case, the extra rds_connections remain with stale c_net pointers. Which are then accessed later causing the use-after-free bug. So, the simple fix is to disallow more than the initial namespace to be created in the case of ib connections. Fixes are ported from UEK patches found here: https://github.com/oracle/linux-uek/commit/8ed9a82376b7 Patch 1 is a prerequisite optimization to rds_ib_laddr_check() that avoids excessive rdma_bind_addr() calls during transport probing by first checking rds_ib_get_device(). This is needed because patch 2 adds a namespace check at the top of the same function. 
UEK: 8ed9a82376b7 ("rds: ib: Optimize rds_ib_laddr_check") https://github.com/oracle/linux-uek/commit/bd9489a08004 Patch 2 restricts RDS/IB to the initial network namespace. It adds checks in both rds_ib_laddr_check() and rds_set_transport() to reject IB use from non-init namespaces with -EPROTOTYPE. This prevents the use-after-free by ensuring IB connections cannot exist in namespaces that may be torn down. UEK: bd9489a08004 ("net/rds: Restrict use of RDS/IB to the initial network namespace") Questions, comments and feedback appreciated! ==================== Link: https://patch.msgid.link/20260408080420.540032-1-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net/rds: Restrict use of RDS/IB to the initial network namespace  Greg Jumper
Prevent using RDS/IB in network namespaces other than the initial one. The existing RDS/IB code will not work properly in non-initial network namespaces. Fixes: d5a8ac28a7ff ("RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net") Reported-by: syzbot+da8e060735ae02c8f3d1@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=da8e060735ae02c8f3d1 Signed-off-by: Greg Jumper <greg.jumper@oracle.com> Signed-off-by: Allison Henderson <achender@kernel.org> Link: https://patch.msgid.link/20260408080420.540032-3-achender@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net/rds: Optimize rds_ib_laddr_check  Håkon Bugge
rds_ib_laddr_check() creates a CM_ID and attempts to bind the address in question to it. This is done in order to qualify the allegedly local address as a usable IB/RoCE address.

In the field, ExaWatcher runs rds-ping to all ports in the fabric from all local ports, using all active ToS'es. In a full rack system, we have 14 cell servers and eight db servers. Typically, 6 ToS'es are used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo" interval.

Adding to this, each rds-ping invocation creates eight sockets and binds the local address to them:

socket(AF_RDS, SOCK_SEQPACKET, 0) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 4
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 5
bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 6
bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 7
bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 8
bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 9
bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0
socket(AF_RDS, SOCK_SEQPACKET, 0) = 10
bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0

So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are allocated, considering this full-rack system. After a CM_ID has been allocated, rdma_bind_addr() is called, with the port number being zero. This implies that the CMA will attempt to search for an un-used ephemeral port. Simplified, the algorithm is to start at a random position in the available port space, and then if needed, iterate until an un-used port is found.

The book-keeping of used ports uses the idr system, which again uses slab to allocate new struct idr_layer's. The size is 2092 bytes and slab tries to reduce the wasted space. Hence, it chooses an order:3 allocation, for which 15 idr_layer structs will fit and only 1388 bytes are wasted per the 32KiB order:3 chunk.

Although this order:3 allocation seems like a good space/speed trade-off, it does not resonate well with how it is used by the CMA. The combination of the randomized starting point in the port space (which has close to zero spatial locality) and the close proximity in time of the 4224 invocations of the rds-ping's, creates a memory hog for order:3 allocations.

These costly allocations may need reclaims and/or compaction. At worst, they may fail and produce a stack trace such as (from uek4):

[<ffffffff811a72d5>] __inc_zone_page_state+0x35/0x40
[<ffffffff811c2e97>] page_add_file_rmap+0x57/0x60
[<ffffffffa37ca1df>] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new]
[<ffffffff811c3de8>] rmap_walk+0xd8/0x340
[<ffffffff811e8860>] remove_migration_ptes+0x40/0x50
[<ffffffff811ea83c>] migrate_pages+0x3ec/0x890
[<ffffffff811afa0d>] compact_zone+0x32d/0x9a0
[<ffffffff811b00ed>] compact_zone_order+0x6d/0x90
[<ffffffff811b03b2>] try_to_compact_pages+0x102/0x270
[<ffffffff81190e56>] __alloc_pages_direct_compact+0x46/0x100
[<ffffffff8119165b>] __alloc_pages_nodemask+0x74b/0xaa0
[<ffffffff811d8411>] alloc_pages_current+0x91/0x110
[<ffffffff811e3b0b>] new_slab+0x38b/0x480
[<ffffffffa41323c7>] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new]
[<ffffffff811e42ab>] kmem_cache_alloc+0x1fb/0x250
[<ffffffff8131fdd6>] idr_layer_alloc+0x36/0x90
[<ffffffff8132029c>] idr_get_empty_slot+0x28c/0x3d0
[<ffffffff813204ad>] idr_alloc+0x4d/0xf0
[<ffffffffa051727d>] cma_alloc_port+0x4d/0xa0 [rdma_cm]
[<ffffffffa0517cbe>] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm]
[<ffffffffa09d8083>] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new]
[<ffffffffa05f892b>] rds_trans_get_preferred+0x5b/0xa0 [rds]
[<ffffffffa05f09f2>] rds_bind+0x212/0x280 [rds]
[<ffffffff815b4016>] SYSC_bind+0xe6/0x120
[<ffffffff815b4d3e>] SyS_bind+0xe/0x10
[<ffffffff816b031a>] system_call_fastpath+0x18/0xd4

To avoid these excessive calls to rdma_bind_addr(), we optimize rds_ib_laddr_check() by simply checking if the address in question has been used before. The rds_rdma module keeps track of addresses associated with IB devices, and the function rds_ib_get_device() is used to determine if the address already has been qualified as a valid local address. If not found, we call the legacy rds_ib_laddr_check(), now renamed to rds_ib_laddr_check_cm().

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  Merge branch 'net-hamradio-fix-missing-input-validation-in-bpqether-and-scc'  Jakub Kicinski
Mashiro Chen says: ==================== net: hamradio: fix missing input validation in bpqether and scc This series fixes two missing input validation bugs in the hamradio drivers. Both patches were reviewed by Joerg Reuter (hamradio maintainer). ==================== Link: https://patch.msgid.link/20260409024927.24397-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12  net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl  Mashiro Chen
The SIOCSCCSMEM ioctl copies a scc_mem_config from user space and assigns its bufsize field directly to scc->stat.bufsize without any range validation: scc->stat.bufsize = memcfg.bufsize; If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive interrupt handler later calls dev_alloc_skb(0) and immediately writes a KISS type byte via skb_put_u8() into a zero-capacity socket buffer, corrupting the adjacent skb_shared_info region. Reject bufsize values smaller than 16; this is large enough to hold at least one KISS header byte plus useful data. Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Acked-by: Joerg Reuter <jreuter@yaina.de> Link: https://patch.msgid.link/20260409024927.24397-3-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
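The validation described above amounts to a range check before the assignment. A minimal sketch follows; the struct, the function name and the choice of -EINVAL as the error code are assumptions for illustration, and only the threshold of 16 comes from the commit message.

```c
#include <errno.h>

#define MIN_BUFSIZE 16	/* threshold quoted in the commit message */

/* Hypothetical stand-in for the per-channel statistics block. */
struct chan_stat {
	unsigned int bufsize;
};

/*
 * Reject values that are too small to hold a KISS header byte plus data.
 * On rejection the stored bufsize is left untouched.
 */
int set_bufsize(struct chan_stat *stat, unsigned int requested)
{
	if (requested < MIN_BUFSIZE)
		return -EINVAL;	/* error code is an assumption */

	stat->bufsize = requested;
	return 0;
}
```

A request of 0 is refused here, so a later allocation sized from `bufsize` can never produce a zero-capacity buffer.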
2026-04-12  net: hamradio: bpqether: validate frame length in bpq_rcv()  Mashiro Chen
The BPQ length field is decoded as: len = skb->data[0] + skb->data[1] * 256 - 5; If the sender sets bytes [0..1] to values whose combined value is less than 5, len becomes negative. Passing a negative int to skb_trim() silently converts to a huge unsigned value, causing the function to be a no-op. The frame is then passed up to AX.25 with its original (untrimmed) payload, delivering garbage beyond the declared frame boundary. Additionally, a negative len corrupts the 64-bit rx_bytes counter through implicit sign-extension. Add a bounds check before pulling the length bytes: reject frames where len is negative or exceeds the remaining skb data. Acked-by: Joerg Reuter <jreuter@yaina.de> Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Link: https://patch.msgid.link/20260409024927.24397-2-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
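The length decoding and the bounds check can be sketched on a plain byte buffer. The function name and the exact upper bound (payload must fit in the bytes that follow the two length bytes) are assumptions; the decoding formula is the one quoted in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * data points at the 2-byte BPQ length field, avail is the number of
 * bytes available from there. Returns the payload length, or -1 if the
 * declared length is negative or larger than what follows the field.
 */
int bpq_payload_len(const uint8_t *data, size_t avail)
{
	int len;

	if (avail < 2)
		return -1;

	/* Same decoding as in the commit message. */
	len = data[0] + data[1] * 256 - 5;

	/* A negative int would turn into a huge value if cast to unsigned. */
	if (len < 0 || (size_t)len > avail - 2)
		return -1;

	return len;
}
```

For example, length bytes {3, 0} decode to -2 and are rejected, whereas without the check that value would reach a trim routine taking an unsigned length and be treated as larger than the buffer.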
2026-04-12  selftests/bpf: Fix reg_bounds to match new tnum-based refinement  Paul Chaignon
Commit efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") improved the bounds refinement to detect when the tnum and u64 range overlap in a single value (and the bounds can thus be set to that value). Eduard then noticed that it broke the slow-mode reg_bounds selftests because they don't have an equivalent logic and are therefore unable to refine the bounds as much as the verifier. The following test case illustrates this. ACTUAL TRUE1: scalar(u64=0xffffffff00000000,u32=0,s64=0xffffffff00000000,s32=0) EXPECTED TRUE1: scalar(u64=[0xfffffffe00000001; 0xffffffff00000000],u32=0,s64=[0xfffffffe00000001; 0xffffffff00000000],s32=0) [...] #323/1007 reg_bounds_gen_consts_s64_s32/(s64)[0xfffffffe00000001; 0xffffffff00000000] (s32)<op> S64_MIN:FAIL with the verifier logs: [...] 19: w0 = w6 ; R0=scalar(smin=0,smax=umax=0xffffffff, var_off=(0x0; 0xffffffff)) R6=scalar(smin=0xfffffffe00000001,smax=0xffffffff00000000, umin=0xfffffffe00000001,umax=0xffffffff00000000, var_off=(0xfffffffe00000000; 0x1ffffffff)) 20: w0 = w7 ; R0=0 R7=0x8000000000000000 21: if w6 == w7 goto pc+3 [...] from 21 to 25: [...] 25: w0 = w6 ; R0=0 R6=0xffffffff00000000 ; ^ ; unexpected refined value 26: w0 = w7 ; R0=0 R7=0x8000000000000000 27: exit When w6 == w7 is true, the verifier can deduce that the R6's tnum is equal to (0xfffffffe00000000; 0x100000000) and then use that information to refine the bounds: the tnum only overlap with the u64 range in 0xffffffff00000000. The reg_bounds selftest doesn't know about tnums and therefore fails to perform the same refinement. This issue happens when the tnum carries information that cannot be represented in the ranges, as otherwise the selftest could reach the same refined value using just the ranges. The tnum thus needs to represent non-contiguous values (ex., R6's tnum above, after the condition). The only way this can happen in the reg_bounds selftest is at the boundary between the 32 and 64bit ranges. 
We therefore only need to handle that case. This patch fixes the selftest refinement logic by checking if the u32 and u64 ranges overlap in a single value. If so, the ranges can be set to that value. We need to handle two cases: either they overlap in umin64... u64 values matching u32 range: xxx xxx xxx xxx |--------------------------------------| u64 range: 0 xxxxx UMAX64 or in umax64: u64 values matching u32 range: xxx xxx xxx xxx |--------------------------------------| u64 range: 0 xxxxx UMAX64 To detect the first case, we decrease umax64 to the maximum value that matches the u32 range. If that happens to be umin64, then umin64 is the only overlap. We proceed similarly for the second case, increasing umin64 to the minimum value that matches the u32 range. Note this is similar to how the verifier handles the general case using tnum, but we don't need to care about a single-value overlap in the middle of the range. That case is not possible when comparing two ranges. This patch also adds two test cases reproducing this bug as part of the normal test runs (without SLOW_TESTS=1). Fixes: efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") Reported-by: Eduard Zingerman <eddyz87@gmail.com> Closes: https://lore.kernel.org/bpf/4e6dd64a162b3cab3635706ae6abfdd0be4db5db.camel@gmail.com/ Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ada9UuSQi2SE2IfB@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
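The single-value overlap test described in this message can be sketched independently of the selftest. All function names are hypothetical and this is not the selftest's code; it assumes umin64 <= umax64 and umin32 <= umax32, and follows the two steps from the message: shrink umax64 to the largest value whose low 32 bits fit the u32 range, then grow umin64 to the smallest such value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest x <= max whose low 32 bits lie in [lo32, hi32]; false if none. */
bool max_match(uint64_t max, uint32_t lo32, uint32_t hi32, uint64_t *out)
{
	uint64_t hi = max >> 32;
	uint32_t low = (uint32_t)max;

	if (low >= lo32 && low <= hi32) {
		*out = max;
	} else if (low > hi32) {
		*out = (hi << 32) | hi32;
	} else {			/* low < lo32: borrow from upper half */
		if (hi == 0)
			return false;
		*out = ((hi - 1) << 32) | hi32;
	}
	return true;
}

/* Smallest x >= min whose low 32 bits lie in [lo32, hi32]; false if none. */
bool min_match(uint64_t min, uint32_t lo32, uint32_t hi32, uint64_t *out)
{
	uint64_t hi = min >> 32;
	uint32_t low = (uint32_t)min;

	if (low >= lo32 && low <= hi32) {
		*out = min;
	} else if (low < lo32) {
		*out = (hi << 32) | lo32;
	} else {			/* low > hi32: carry into upper half */
		if (hi == UINT32_MAX)
			return false;
		*out = ((hi + 1) << 32) | lo32;
	}
	return true;
}

/* True if the ranges overlap in exactly one value, at umin64 or umax64. */
bool single_overlap(uint64_t umin64, uint64_t umax64,
		    uint32_t umin32, uint32_t umax32, uint64_t *val)
{
	uint64_t m;

	if (max_match(umax64, umin32, umax32, &m) && m == umin64) {
		*val = umin64;		/* first case: overlap at umin64 */
		return true;
	}
	if (min_match(umin64, umin32, umax32, &m) && m == umax64) {
		*val = umax64;		/* second case: overlap at umax64 */
		return true;
	}
	return false;
}
```

Applied to the failing test case, a u64 range of [0xfffffffe00000001; 0xffffffff00000000] with u32=0 takes the second branch and yields 0xffffffff00000000, the value the verifier log shows for R6.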
2026-04-12  net: rose: reject truncated CLEAR_REQUEST frames in state machines  Mashiro Chen
All five ROSE state machines (states 1-5) handle ROSE_CLEAR_REQUEST by reading the cause and diagnostic bytes directly from skb->data[3] and skb->data[4] without verifying that the frame is long enough: rose_disconnect(sk, ..., skb->data[3], skb->data[4]); The entry-point check in rose_route_frame() only enforces ROSE_MIN_LEN (3 bytes), so a remote peer on a ROSE network can send a syntactically valid but truncated CLEAR_REQUEST (3 or 4 bytes) while a connection is open in any state. Processing such a frame causes a one- or two-byte out-of-bounds read past the skb data, leaking uninitialized heap content as the cause/diagnostic values returned to user space via getsockopt(ROSE_GETCAUSE). Add a single length check at the rose_process_rx_frame() dispatch point, before any state machine is entered, to drop frames that carry the CLEAR_REQUEST type code but are too short to contain the required cause and diagnostic fields. Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org> Link: https://patch.msgid.link/20260408172551.281486-1-mashiro.chen@mailbox.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
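The dispatch-point check can be sketched on a raw byte buffer. This is an illustration under assumptions: the function name is hypothetical, the frame type is assumed to sit in data[2], and the value 0x13 for ROSE_CLEAR_REQUEST is taken from memory of the kernel's ROSE header rather than from this commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROSE_MIN_LEN		3	/* minimum enforced at the entry point */
#define ROSE_CLEAR_REQUEST	0x13	/* assumed frame type code */
#define CLEAR_REQUEST_LEN	5	/* header + cause (data[3]) + diagnostic (data[4]) */

/*
 * Returns false for frames that must be dropped before any state machine
 * runs: too short in general, or a CLEAR_REQUEST without room for the
 * cause and diagnostic bytes.
 */
bool rose_frame_ok(const uint8_t *data, size_t len)
{
	if (len < ROSE_MIN_LEN)
		return false;

	if (data[2] == ROSE_CLEAR_REQUEST && len < CLEAR_REQUEST_LEN)
		return false;

	return true;
}
```

Doing the check once at the dispatch point means none of the five state handlers needs its own length test before reading data[3] and data[4].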
2026-04-12 i3c: mipi-i3c-hci: fix IBI payload length calculation for final status (Billy Tsai)
In DMA mode, the IBI status descriptor encodes the payload using CHUNKS (number of chunks) and DATA_LENGTH (valid bytes in the last chunk). All preceding chunks are implicitly full-sized. The current code accumulates full chunk sizes for non-final status descriptors, but for the final status descriptor it only adds DATA_LENGTH. This ignores the contribution of the preceding full chunks described by the same final status entry. As a result, the computed IBI payload length is truncated whenever the final status spans multiple chunks. For example, with a chunk size of 4 bytes, CHUNKS=2 and DATA_LENGTH=1 should result in a total payload size of 5 bytes, but the current code reports only 1 byte. Fix the calculation by adding the size of (CHUNKS - 1) full chunks plus DATA_LENGTH for the last chunk. Fixes: 9ad9a52cce28 ("i3c/master: introduce the mipi-i3c-hci driver") Signed-off-by: Billy Tsai <billy_tsai@aspeedtech.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260407-i3c-hci-dma-v2-1-a583187b9d22@aspeedtech.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
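The corrected arithmetic can be illustrated with a small stand-alone helper. This is a sketch under stated assumptions, not the driver's code: `ibi_final_status_len()` is a hypothetical name, and treating CHUNKS == 0 as an empty payload is a choice made for this sketch only.

```c
#include <stddef.h>

/*
 * Payload bytes described by a final IBI status descriptor:
 * (chunks - 1) full chunks plus data_length valid bytes in the last one.
 */
static size_t ibi_final_status_len(size_t chunk_size, unsigned int chunks,
				   unsigned int data_length)
{
	if (chunks == 0)
		return 0;	/* assumption of this sketch: nothing to add */

	return (size_t)(chunks - 1) * chunk_size + data_length;
}
```

With the values from the message (chunk size 4, CHUNKS=2, DATA_LENGTH=1) the helper yields 4 + 1 = 5 bytes, whereas adding DATA_LENGTH alone would give 1.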
2026-04-12 Merge branch 'net-enetc-improve-statistics-for-v1-and-add-statistics-for-v4' (Jakub Kicinski)
Wei Fang says: ==================== net: enetc: improve statistics for v1 and add statistics for v4 For ENETC v1, some standardized statistics were redundantly included in the unstructured statistics, so remove these duplicated entries. Previously, the unstructured statistics only contained eMAC data and did not include pMAC data; add pMAC statistics to ensure completeness. For ENETC v4, the driver previously reported MAC statistics only for the internal ENETC (Pseudo MAC). Extend the implementation to provide additional statistics for both the internal ENETC and the standalone ENETC. ==================== Link: https://patch.msgid.link/20260408055849.1314033-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 net: enetc: add unstructured counters for ENETC v4 (Wei Fang)
Like ENETC v1, ENETC v4 also has many non-standard counters, so these counters are added to improve statistical coverage. Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260408055849.1314033-6-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 net: enetc: add unstructured pMAC counters for ENETC v1 (Wei Fang)
The ENETC v1 has two MACs (eMAC and pMAC) to support preemption. The existing unstructured counters include the eMAC counters, but not the pMAC counters. So add pMAC counters to improve statistical coverage. Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260408055849.1314033-5-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 net: enetc: remove standardized counters from enetc_pm_counters (Wei Fang)
The standardized counters are already exposed via the get_pause_stats(), get_rmon_stats(), get_eth_ctrl_stats() and get_eth_mac_stats() interfaces. Keeping the same counters in enetc_pm_counters results in redundant output. Remove these standardized counters from enetc_pm_counters and rely on the existing statistics interfaces to report them. Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260408055849.1314033-4-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 net: enetc: show RX drop counters only for assigned RX rings (Wei Fang)
For ENETC v1, each SI provides 16 RBDCR registers for RX ring drop counters, but this does not imply that an SI actually owns 16 RX rings. The ENETC hardware supports a total of 16 RX rings, which are assigned to 3 SIs (1 PSI and 2 VSIs), so each SI is assigned fewer than 16 RX rings. The current implementation always reports 16 RX drop counters per SI, leading to redundant output for SIs with fewer RX rings. Update the logic to display drop counters only for the RX rings that are actually assigned to the SI. Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260408055849.1314033-3-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 net: enetc: add support for the standardized counters (Wei Fang)
ENETC v4 provides 64-bit counters for IEEE 802.3 basic and mandatory managed objects, the IETF Management Information Base (MIB) package (RFC2665), and Remote Network Monitoring (RMON) statistics. In addition, some ENETCs support preemption, so these ENETCs have two MACs: MAC 0 is the express MAC (eMAC), MAC 1 is the preemptible MAC (pMAC). Both MACs support these statistics. Signed-off-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20260408055849.1314033-2-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-04-12 selftests/bpf: Add tests for non-arena/arena operations (Emil Tsalapatis)
Add a selftest that ensures instructions with arena source and non-arena destination registers are accepted by the verifier. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20260412174546.18684-3-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Allow instructions with arena source and non-arena dest registers (Emil Tsalapatis)
The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR operation into the scalar register rather than the pointer register. Relax the verifier to allow operations between a source arena register and a destination non-arena register, marking the destination's value as a PTR_TO_ARENA. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Song Liu <song@kernel.org> Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 Merge branch 'bpf-add-the-missing-fsession' (Alexei Starovoitov)
Menglong Dong says: ==================== bpf: add the missing fsession Add the missing fsession attach type to the BPF docs, verifier log and bpftool. Changes since v2: - replace "FENTRY/FEXIT/FSESSION" with "Tracing" in the 1st patch - v2: https://lore.kernel.org/all/20260408062109.386083-1-dongml2@chinatelecom.cn/ Changes since v1: - add a missing FSESSION in bpf_check_attach_target() in the 1st patch - v1: https://lore.kernel.org/all/20260408031416.266229-1-dongml2@chinatelecom.cn/ ==================== Link: https://patch.msgid.link/20260412060346.142007-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpftool: add missing fsession to the usage and docs of bpftool (Menglong Dong)
Add the fsession attach type to the bpftool usage text in do_help(). Also add it to the bash completion and to bpftool-prog.rst. Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Quentin Monnet <qmo@kernel.org> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-4-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 docs/bpf: add missing fsession attach type to docs (Menglong Dong)
Add the fsession attach type to program_types.rst and drgn.rst. Acked-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-3-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: add missing fsession to the verifier log (Menglong Dong)
The fsession attach type is missing from the verifier log messages in check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id(). Update them so that the verifier log is accurate. Also update the corresponding selftests. Acked-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 Merge branch 'bpf-split-verifier-c' (Alexei Starovoitov)
Alexei Starovoitov says: ==================== v3->v4: Restore a few minor comments and undo a few function moves. v2->v3: Actually restore comments lost in patch 3 (instead of adding them to patch 4). v1->v2: Restore comments lost in patch 3. verifier.c is huge. Split it into logically independent pieces. No functional changes. The diff is impossible to review over email. 'git show' shows minimal actual changes. Only plenty of moved lines. Such a split may cause backport headaches. We should have split it long ago. Even after the split, verifier.c is still 20k lines, but a further split is harder. ==================== Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260412152936.54262-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Move BTF checking logic into check_btf.c (Alexei Starovoitov)
BTF validation logic is independent from the main verifier. Move it into check_btf.c. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Move backtracking logic to backtrack.c (Alexei Starovoitov)
Move precision propagation and backtracking logic to backtrack.c to reduce verifier.c size. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Move state equivalence logic to states.c (Alexei Starovoitov)
verifier.c is huge. Move is_state_visited() to states.c, so that all state equivalence logic is in one file. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Move check_cfg() into cfg.c (Alexei Starovoitov)
verifier.c is huge. Move check_cfg(), compute_postorder() and compute_scc() into cfg.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-12 bpf: Move compute_insn_live_regs() into liveness.c (Alexei Starovoitov)
verifier.c is huge. Move compute_insn_live_regs() into liveness.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>