path: root/include
Age | Commit message | Author
2026-06-11Merge tag 'nf-26-06-10' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:
1) Revalidate bridge ports, add missing NULL checks to fetch the bridge device by the port. From Florian Westphal.
2) Fix netdevice refcount leak in the error path of nft_fwd hardware offload function, also from Florian.
3) Unregister helper expectfn callback on conntrack helper module removal, otherwise a dangling pointer remains in place, from Weiming Shi.
4) Fix possible pointer infoleak in getsockopt() IPT_SO_GET_ENTRIES, from Kyle Zeng.
5) Validate that device MAC header is present before nf_syslog accesses it. From Xiang Mei.
6-8) Three patches to address a possible infoleak of stale stack data in three nf_tables expressions, due to a mismatch between the _init() and _eval() functions, which has been possible since 14fb07130c7d. From Davide Ornaghi and Florian Westphal.

netfilter pull request 26-06-10

* tag 'nf-26-06-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register
  netfilter: nft_fib: fix stale stack leak via the OIFNAME register
  netfilter: nft_exthdr: fix register tracking for F_PRESENT flag
  netfilter: nf_log: validate MAC header was set before dumping it
  netfilter: x_tables: avoid leaking percpu counter pointers
  netfilter: nf_conntrack: destroy stale expectfn expectations on unregister
  netfilter: nf_tables_offload: drop device refcount on error
  netfilter: revalidate bridge ports
====================
Link: https://patch.msgid.link/20260610161629.214092-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-06-11gpio: nomadik: remove dead DB8540 code from <gpio/gpio-nomadik.h>Ethan Nelson-Moore
DB8540 support was removed in commit b6d09f780761 ("pinctrl: nomadik: Drop U8540/9540 support"), but a couple small pieces of related code remained in <gpio/gpio-nomadik.h>. Remove them. Discovered while searching for CONFIG_* symbols referenced in code but not defined in any Kconfig file. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260610205007.44881-1-enelsonmoore@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-06-11ALSA: hda: Use the new helper for PCM instance refcountTakashi Iwai
The HD-audio core driver has some open-coded logic for managing the refcount for PCM instances, which can be replaced cleanly with the new helpers. Only a code cleanup, no functional changes. Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20260610154538.51076-4-tiwai@suse.de
2026-06-11ALSA: core: Use the new helper for the power refcountTakashi Iwai
Replace the open code for managing the power refcount in the snd_card object with the new helper functions. Only a code cleanup, no functional changes. Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20260610154538.51076-3-tiwai@suse.de
2026-06-11ALSA: Add simple refcount helper functionsTakashi Iwai
Many places open-code the same pattern of a refcount plus a wakeup sync at closing. Provide common helper functions to replace the open code:
- The refcount is kept in struct snd_refcount, which is initialized by snd_refcount_init().
- The user takes or drops a reference via snd_refcount_get() and snd_refcount_put().
- The user can wait until all usages are gone with snd_refcount_sync().
Note that atomic_t is used here instead of refcount_t because the current users allow reusing the refcount after a sync. The design of refcount_t prevents exactly this behavior, so it doesn't fit.
Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20260610154538.51076-2-tiwai@suse.de
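The get/put/sync pattern described in this commit can be sketched in userspace C roughly as follows. This is not the ALSA implementation: the `demo_*` names are hypothetical, C11 atomics stand in for the kernel's atomic_t, the counter is assumed to start with zero users, and the blocking wait is reduced to a polling check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; not the ALSA struct snd_refcount. */
struct demo_refcount {
	atomic_int users;
};

/* Assumption: the counter starts with zero users. */
static inline void demo_refcount_init(struct demo_refcount *r)
{
	atomic_init(&r->users, 0);
}

static inline void demo_refcount_get(struct demo_refcount *r)
{
	atomic_fetch_add(&r->users, 1);
}

/*
 * Returns true when the last user is gone. A kernel version would
 * wake up a waiter at this point instead of returning a flag.
 */
static inline bool demo_refcount_put(struct demo_refcount *r)
{
	int old = atomic_fetch_sub(&r->users, 1);

	assert(old > 0);	/* catches an unbalanced put */
	return old == 1;
}

/* Polling stand-in for the sync step: true once no users remain. */
static inline bool demo_refcount_idle(struct demo_refcount *r)
{
	return atomic_load(&r->users) == 0;
}
```

Because the counter is a plain atomic integer, taking a new reference after it has dropped to zero is allowed, which is the reuse-after-sync property the commit message gives as the reason for not using refcount_t.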
2026-06-11phy: lynx-10g: new driverVladimir Oltean
Introduce a driver for the networking lanes of the 10G Lynx SerDes block, present on the majority of Layerscape and QorIQ (Freescale/NXP) SoCs.

As with the 28G Lynx, the SerDes lanes come pre-initialized out of reset and the consumers use them that way outside the Generic PHY framework (for networking, the static configuration remains for the entire SoC lifetime, whereas for SATA and PCIe, the hardware reconfigures itself automatically for other link speeds).

The need for the Generic PHY framework comes specifically from networking use cases where a static lane configuration is not sufficient. For example, a network MAC is connected to an SFP cage, where various SFP or SFP+ modules can be connected. Each of them may require a different SerDes protocol (SGMII, 1000Base-X, 10GBase-R), which phylink + sfp-bus are responsible for figuring out.

The phylink drivers are:
- enetc
- felix
- dpaa_eth (fman_memac)
- dpaa2-eth
- dpaa2-switch
and they all need to reconfigure the SerDes for the requested link mode, using phy_set_mode_ext() (and phy_validate() to see if it is supported in the first place).

Note that SerDes 2 on LS1088A is exclusively non-networking, so there is currently no need for this driver there. Therefore we skip matching on its compatible string and do not probe on that device.
Co-developed-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20260610151952.2141019-16-vladimir.oltean@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2026-06-11phy: lynx-28g: move lane mode helpers to new core moduleVladimir Oltean
Do some preparation work for the introduction of the lynx-10g driver, which will share a common backbone with the 28G Lynx SerDes. This is just trivial stuff which can be moved without any surgery, and is easy to follow but otherwise pollutes more serious changes. The lane modes themselves are exported to a public header, because on the 10G Lynx, the hardware requires implementing a procedure called "RCW override". This requires coordination with drivers/soc/fsl/guts.c to tell it that a SerDes lane needs to be switched to a different protocol (enum lynx_lane_mode). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20260610151952.2141019-4-vladimir.oltean@nxp.com Signed-off-by: Vinod Koul <vkoul@kernel.org>
2026-06-11crypto: xilinx-trng - Replace crypto_drbg_ctr_df() with HMAC-SHA512Eric Biggers
This code is just trying to condition 48 bytes of random data. This can be done easily using HKDF-SHA512-Extract, saving 300 lines of code. This commit also fixes forward security (in this particular case) by clearing the entropy from memory after it's used. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-06-10netfilter: nf_conntrack: destroy stale expectfn expectations on unregisterWeiming Shi
NAT helpers such as nf_nat_h323 store a raw pointer to module text in exp->expectfn (e.g. ip_nat_q931_expect). nf_ct_helper_expectfn_unregister() only unlinks the callback descriptor and never walks the expectation table, so an expectation pending at module removal survives with a dangling exp->expectfn into freed module text. When the expected connection arrives, init_conntrack() invokes exp->expectfn(), now a stale pointer into the unloaded module. Reproduced on a KASAN build by loading the H.323 helpers, creating a Q.931 expectation, unloading nf_nat_h323, then connecting to the expected port: Oops: int3: 0000 [#1] SMP KASAN NOPTI RIP: 0010:0xffffffffa06102d1 init_conntrack.isra.0 (net/netfilter/nf_conntrack_core.c:1862) nf_conntrack_in (net/netfilter/nf_conntrack_core.c:2049) ipv4_conntrack_local (net/netfilter/nf_conntrack_proto.c:223) nf_hook_slow (net/netfilter/core.c:619) __ip_local_out (net/ipv4/ip_output.c:120) __tcp_transmit_skb (net/ipv4/tcp_output.c:1715) tcp_connect (net/ipv4/tcp_output.c:4374) tcp_v4_connect (net/ipv4/tcp_ipv4.c:345) __sys_connect (net/socket.c:2167) Modules linked in: nf_conntrack_h323 [last unloaded: nf_nat_h323] Reaching the dangling state requires CAP_SYS_MODULE in the initial user namespace to remove a NAT helper that still has live expectations, so this is a robustness fix; leaving an expectation pointing at freed text is wrong regardless. Add nf_ct_helper_expectfn_destroy(), which walks the expectation table and drops every expectation whose ->expectfn matches the descriptor being torn down. Call it from each NAT helper's exit path after the existing RCU grace period, so no expectation outlives the code it points at and no extra synchronize_rcu() is introduced. With the fix, the same reproducer runs to completion without the Oops. 
Fixes: f587de0e2feb ("[NETFILTER]: nf_conntrack/nf_nat: add H.323 helper port") Reported-by: Xiang Mei <xmei5@asu.edu> Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-06-10Merge tag 'wireless-next-2026-06-10' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Quite a few last updates, notably:
- b43: new support for an 11n device
- mt76:
  - mt792x broken usb transport detection
  - mt7921 regd improvements
  - mt7927 support
- iwlwifi:
  - more kunit tests
  - FW version updates
- ath12k: WDS support
- rtw89:
  - RTL8922AU support
  - USB 3 mode switch for performance
  - better monitor radiotap support
  - RTL8922DE preparations
- cfg80211/mac80211:
  - update UHR to D1.4, UHR DBE support
  - finally remove 5/10 MHz support
  - S1G rate reporting
  - multicast encapsulation offload

* tag 'wireless-next-2026-06-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (285 commits)
  b43: add RF power offset for N-PHY r8 + radio 2057 r8
  b43: add channel info table for N-PHY r8 + radio 2057 r8
  b43: add IPA TX gain table for N-PHY r8 + radio 2057 r8
  b43: support radio 2057 rev 8
  b43: route d11 corerev 22 to 24-bit indirect radio access
  b43: add d11 core revision 0x16 to id table
  b43: add firmware mappings for rev22
  rfkill: Replace strcpy() with memcpy()
  wifi: brcmfmac: flowring: simplify flow allocation
  wifi: brcm80211: change current_bss to value
  wifi: ath12k: enable IEEE80211_VHT_EXT_NSS_BW_CAPABLE when NSS ratio is reported
  wifi: ath12k: fix EAPOL TX failure caused by stale tcl_metadata bits
  wifi: ath: Update copyright in testmode_i.h
  wifi: ath10k: Update Qualcomm copyrights
  wifi: ath11k: Update Qualcomm copyrights
  wifi: ath12k: Update Qualcomm copyrights
  wifi: mt76: Drop unneeded mt76_register_debugfs_fops() return checks
  wifi: mt76: mt7921: assert sniffer on chanctx change
  wifi: mt76: mt7996: fix potential tx_retries underflow
  wifi: mt76: mt7925: fix potential tx_retries underflow
  ...
====================
Link: https://patch.msgid.link/20260610103637.179340-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-10bonding: 3ad: add lacp_strict configuration knobLouis Scalbert
When an 802.3ad (LACP) bonding interface has no slaves in the collecting/distributing state, the bonding master still reports carrier as up as long as at least 'min_links' slaves have carrier. In this situation, only one slave is effectively used for TX/RX, while traffic received on other slaves is dropped. Upper-layer daemons therefore consider the interface operational, even though traffic may be blackholed if the lack of LACP negotiation means the partner is not ready to deal with traffic. Introduce a configuration knob to control this behavior. It allows the bonding master to assert carrier only when at least 'min_links' slaves are in Collecting_Distributing state. The default mode preserves the existing behavior. This patch only introduces the knob; its behavior is implemented in the subsequent commit. Fixes: 655f8919d549 ("bonding: add min links parameter to 802.3ad") Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Link: https://patch.msgid.link/20260603150331.1919611-4-louis.scalbert@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
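The carrier rule described above can be sketched as a small predicate. This is only an illustration of the text, not the bonding driver's code: the function name and parameters are hypothetical, the knob is modelled as a plain boolean (the commit does not say which values it accepts), and corner cases such as a min_links of zero are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the carrier decision for an 802.3ad bond.
 *  - lacp_strict == false: existing behavior, carrier is up when at
 *    least min_links slaves have carrier.
 *  - lacp_strict == true: carrier is up only when at least min_links
 *    slaves are in the Collecting_Distributing state.
 */
static inline bool demo_bond_carrier(bool lacp_strict, int min_links,
				     int slaves_with_carrier,
				     int slaves_collecting_distributing)
{
	/* A slave without carrier cannot be collecting/distributing. */
	assert(slaves_collecting_distributing <= slaves_with_carrier);

	if (lacp_strict)
		return slaves_collecting_distributing >= min_links;
	return slaves_with_carrier >= min_links;
}
```

With two slaves that have carrier but no completed LACP negotiation, the non-strict call reports carrier up while the strict call reports it down, which is the blackholing case the commit message describes.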
2026-06-10jbd2: remove special jbd2 slabsMatthew Wilcox (Oracle)
When jbd2 was originally written, kmalloc() would not guarantee memory alignment for the requested objects. Since commit 59bb47985c1d in 2019, kmalloc has guaranteed natural alignment for power-of-two allocations. We can now remove the jbd2 special slabs and just use kmalloc() directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Tal Zussman <tz2294@columbia.edu> Link: https://patch.msgid.link/20260528171413.1088143-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-06-10ASoC: remove .debugfs_prefix from ComponentMark Brown
Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> says: Basically, we assume that snd_soc_register_component() (X) is used to register a Component. It requires a Component driver (A). The current Component also has .debugfs_prefix (B). Today, component->debugfs_prefix (B) can be set via component_driver->debugfs_prefix (A), but some drivers still set it via (B) directly. They therefore need to use snd_soc_component_initialize() (1) / snd_soc_component_add() (2) instead of (X), because they need access to component->debugfs_prefix (B). These functions (= 1, 2) should be encapsulated in soc-xxx.c, but cannot be because of the above drivers. This patch-set removes component->debugfs_prefix (B). The functions (= 1, 2) are not encapsulated yet. This is step 1; step 2 will be posted after this. Link: https://patch.msgid.link/87ldcxk5wz.wl-kuninori.morimoto.gx@renesas.com
2026-06-10ASoC: soc-component: remove .debugfs_prefix from ComponentKuninori Morimoto
All drivers are now setting .debugfs_prefix via Component driver. Remove it from Component. Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://patch.msgid.link/87cxy9k5vj.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-10ASoC: soc-component: remove CONFIG_DEBUG_FS for debugfs_prefixKuninori Morimoto
Both (A) and (B) have debugfs_prefix, but (B) is using CONFIG_DEBUG_FS (C)

(A)     struct snd_soc_component {
                ...
                const char *debugfs_prefix;
        };

(B)     struct snd_soc_component_driver {
                ...
(C)     ifdef CONFIG_DEBUG_FS
                const char *debugfs_prefix;
        endif
        };

Remove (C) which makes code cleanup difficult.
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://patch.msgid.link/87jyshk5wc.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-10sh: roll back Ecovec24/7724se Sound supportKuninori Morimoto
Due to a miscommunication, the Ecovec24/7724se Sound support was removed. We need to keep it for a while, until these boards support "DT-style". Roll back the Ecovec24/7724se "platform data style" support and the header it needs. Fixes: deadb855b694d ("sh: 7724se: remove FSI/AK4642/Simple-Audio-Card support") Fixes: 9cc93ebc85e71 ("sh: ecovec24: remove FSI/DA7210/Simple-Audio-Card support") Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Link: https://patch.msgid.link/87v7br43vk.wl-kuninori.morimoto.gx@renesas.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-10drm/atomic: track individual colorop updatesMelissa Wen
As we do for CRTC color mgmt properties, use color_mgmt_changed flag to track any value changes in the color pipeline of a given plane, so that drivers can update color blocks as soon as plane color pipeline or individual colorop values change. Since we're here, only announce and track changes to plane COLOR_PIPELINE prop if its value is actually changing. Fixes: 8c5ea1745f4c ("drm/colorop: Add BYPASS property") Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation") Fixes: 41651f9d42eb ("drm/colorop: Add 1D Curve subtype") Fixes: 3410108037d5 ("drm/colorop: Add multiplier type") Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline") Fixes: e5719e7f1900 ("drm/colorop: Add 3x4 CTM type") Fixes: 99a4e4f08abe ("drm/colorop: Add 1D Curve Custom LUT type") Fixes: 2afc3184f3b3 ("drm/plane: Add COLOR PIPELINE property") Reviewed-by: Harry Wentland <harry.wentland@amd.com> #v1 Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block") Signed-off-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Melissa Wen <melissa.srw@gmail.com> Link: https://patch.msgid.link/20260609110420.1298352-4-mwen@igalia.com
2026-06-10drm/colorop: make lut(1/3)d_interpolation props correctly behave as mutableMelissa Wen
As interpolation props are actually mutable props, any changes should be handled by drm_colorop_state. Move their enum and make them behave correctly as mutable. Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation") Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline") Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Reviewed-by: Alex Hung <alex.hung@amd.com> Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block") Signed-off-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Melissa Wen <melissa.srw@gmail.com> Link: https://patch.msgid.link/20260609110420.1298352-3-mwen@igalia.com
2026-06-10drm/colorop: Remove read-only comments from interpolation fieldsAlex Hung
The lut1d_interpolation and lut3d_interpolation fields and their associated properties were marked as read-only, but userspace can set them via drm_atomic_colorop_set_property(). Fixes: 7fa3ee8c0a79 ("drm/colorop: Define LUT_1D interpolation") Fixes: db971856bbe0 ("drm/colorop: Add 3D LUT support to color pipeline") Reviewed-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com> Signed-off-by: Alex Hung <alex.hung@amd.com> Fixes: 9ba25915efba ("drm/amd/display: Add support for sRGB EOTF in DEGAM block") Signed-off-by: Melissa Wen <mwen@igalia.com> Signed-off-by: Melissa Wen <melissa.srw@gmail.com> Link: https://patch.msgid.link/20260609110420.1298352-2-mwen@igalia.com
2026-06-10fanotify: allow reporting pidfds for reaped tasksAnonymeMeow
Fanotify used to refuse to report pidfds for reaped tasks by applying a pid_has_task() check before calling pidfd_prepare(). This prevented userspace from obtaining information about the task. Register the event pid with pidfs when creating the fanotify event if pidfd reporting was requested, so pidfd_prepare() can later create a pidfd for the reaped task. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/linux-fsdevel/20260528-schmuckvoll-heilen-garen-be77b4208671@brauner/ Signed-off-by: AnonymeMeow <anonymemeow@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> Link: https://patch.msgid.link/20260607003343.425939-3-anonymemeow@gmail.com Signed-off-by: Jan Kara <jack@suse.cz>
2026-06-10ALSA: timer: Manage timer object with krefTakashi Iwai
So far we've tried to address UAFs in ALSA timer code by applying locks at various places, but the fundamental problem is that the timer object may be released while the timer instance objects belonging to it are still present and accessing it. This patch is a more proper fix for that issue, namely refcounting the timer object to keep it alive. The basic implementation uses kref for the refcount of the timer object, and takes/releases the reference when assigning/releasing an instance, as well as when referring to it from ioctls or ALSA sequencer code. The reference from ioctl or ALSA sequencer is abstracted with snd_timeri_timer auto-cleanup. Note that this change assumes that the code already took the fix commit da3039e91d1f ("ALSA: timer: Forcibly close timer instances at closing"); otherwise the refcount may be unbalanced when the timer is freed while slave instances are still present. Signed-off-by: Takashi Iwai <tiwai@suse.de> Link: https://patch.msgid.link/20260609115100.806869-2-tiwai@suse.de
2026-06-10can: virtio: Fix comment in UAPI headerNathan Chancellor
When compile testing the UAPI headers with clang, there is a warning turned into an error for using a C++ style ('//') comment, which is explicitly forbidden for UAPI headers.

  In file included from <built-in>:1:
  ./usr/include/linux/virtio_can.h:29:35: error: // comments are not allowed in this language [-Werror,-Wcomment]
     29 | #define VIRTIO_CAN_MAX_DLEN 64 // this is like CANFD_MAX_DLEN
        |                                ^
  1 error generated.

Switch to a standard C style comment.
Fixes: 2b6b4bb7d96f ("can: virtio: Add virtio CAN driver") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <20260604-virtio_can-fix-uapi-comment-v1-1-199fa96ec5f0@kernel.org>
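For illustration, the flagged define with a C style comment would presumably look like the line below. The define and its comment text are taken from the compiler output quoted above; the exact wording of the merged fix is not shown on this page, and the `demo_dlen_valid()` helper is hypothetical and only exercises the constant.

```c
#include <assert.h>

/* C style comment, which UAPI header checks accept. */
#define VIRTIO_CAN_MAX_DLEN 64 /* this is like CANFD_MAX_DLEN */

/* Hypothetical helper, present only to use the constant. */
static inline int demo_dlen_valid(int len)
{
	return len >= 0 && len <= VIRTIO_CAN_MAX_DLEN;
}
```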
2026-06-10can: virtio: Add virtio CAN driverMatias Ezequiel Vara Larsen
Add virtio CAN driver based on Virtio 1.4 specification (see https://github.com/oasis-tcs/virtio-spec/tree/virtio-1.4). The driver implements a complete CAN bus interface over Virtio transport, supporting both CAN Classic and CAN-FD IDs. In terms of frames, it supports classic CAN and CAN FD. RTR frames are only supported with classic CAN.
Usage:
- "ip link set up can0" - start controller
- "ip link set down can0" - stop controller
- "candump can0" - receive frames
- "cansend can0 123#DEADBEEF" - send frames
Signed-off-by: Harald Mommer <harald.mommer@oss.qualcomm.com> Co-developed-by: Harald Mommer <harald.mommer@oss.qualcomm.com> Signed-off-by: Mikhail Golubev-Ciuchea <mikhail.golubev-ciuchea@oss.qualcomm.com> Co-developed-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Damir Shaikhutdinov <Damir.Shaikhutdinov@opensynergy.com> Reviewed-by: Francesco Valla <francesco@valla.it> Tested-by: Francesco Valla <francesco@valla.it> Signed-off-by: Matias Ezequiel Vara Larsen <mvaralar@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <ahXNb+KzuHYbS24+@fedora>
2026-06-10virtio_console: Fix spelling mistake "colums" -> "columns"Ethan Carter Edwards
There is a spelling mistake in a struct description. Fix it. Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <20260418-virtio-typo-v1-1-0df6f943a79d@ethancedwards.com>
2026-06-10virtio: add missing kernel-doc for map and vmap membersChristian Fontanez
Commit bee8c7c24b73 ("virtio: introduce map ops in virtio core") and commit b16060c5c7d5 ("virtio: introduce virtio_map container union") added 'map' and 'vmap' members to struct virtio_device but did not update the kernel-doc comment block. This caused 'make htmldocs' to emit warnings: ./include/linux/virtio.h:188 struct member 'map' not described in 'virtio_device' ./include/linux/virtio.h:188 struct member 'vmap' not described in 'virtio_device' Add the missing entries in struct-declaration order to match the existing convention in the file. After this patch, 'make htmldocs' no longer emits these warnings. Fixes: bee8c7c24b73 ("virtio: introduce map ops in virtio core") Fixes: b16060c5c7d5 ("virtio: introduce virtio_map container union") Reported-by: Luis Felipe Hernandez <luis.hernandez093@gmail.com> Signed-off-by: Christian Fontanez <christfontanez@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <20260519013321.32511-1-christfontanez@gmail.com>
2026-06-09bpf: Cancel special fields on map value recycleJustin Suess
Map update and delete paths currently call bpf_obj_free_fields() when a value is being replaced or recycled. That makes field destruction depend on the context of the update/delete operation. For tracing programs this can include NMI context, where referenced kptr destructors, uptr unpinning, and graph root destruction are not generally safe. Introduce bpf_obj_cancel_fields() for the reusable-value path. It only performs NMI-safe cleanup for timer, workqueue, and task_work fields. Fields that need full destruction are left attached to the recycled value and are destroyed by the final cleanup path instead. Switch array and hashtab update/delete/recycle paths to this cancel helper. Keep bpf_obj_free_fields() for final map destruction and for bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator destructors, so teardown continues to walk the normal and extra elements and fully destroy their fields. This deliberately relaxes the eager-free semantics of map update/delete for special fields. Programs that relied on a recycled map slot becoming empty immediately after update/delete were relying on behavior that cannot be implemented safely from every BPF execution context without offloading arbitrary destructors. There is a chance this change breaks programs making assumptions regarding the eager freeing of fields. If so, we can relax semantics to cancellation only when irqs_disabled() is true in the future. However, theoretically, map values that get reused eagerly already have weaker guarantees as parallel users can recreate freed fields before the new element becomes visible again. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Reject bpf_obj_drop() from tracing progsJustin Suess
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for program-allocated objects. When such an object contains NMI unsafe fields, tracing programs that can run from arbitrary instrumented context can reach that destruction from unsafe contexts, including NMI. NMI is likely one instance of this problem, and other instances would include possible unsafe reentrancy. Deferring bpf_obj_drop() is not appealing either: it would add delayed-free machinery to a release operation that otherwise has straightforward synchronous ownership semantics. Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs that may run from unsafe contexts unless every field in the object's BTF record is explicitly NMI safe. Do not reject sleepable BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI contexts that motivate the restriction. Note that while bpf_rb_root and bpf_list_head would be NMI safe on their own to free, the objects recursively held by them may not be; be conservative and just mark them as not NMI safe for now. Use a whitelist for the NMI-safe field set instead of listing only known NMI unsafe fields. Locks, async fields, unreferenced kptrs, and refcounts are known to be NMI safe because their destruction is either a no-op, simple state reset, or async cancellation. Referenced kptrs, percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future field type are rejected until audited for arbitrary tracing and NMI contexts. This is less susceptible to future changes in fields that were previously safe by exclusion, and to new fields being added without updating this check. Convert the existing recursive local-object drop success case to a syscall program in the same commit, since this verifier change makes the old tracing program form invalid. The test still exercises bpf_obj_drop() releasing a referenced task kptr from a safe program type. 
Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
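The whitelist approach described in this commit can be sketched as follows. The enum and function names are hypothetical and do not match the kernel's BTF field types; the point is only that the check lists the kinds known to be safe and rejects everything else by default, so a kind added later is rejected until someone audits it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical field kinds; the kernel's own enum differs. */
enum demo_field {
	DEMO_LOCK,
	DEMO_ASYNC,
	DEMO_KPTR_UNREF,
	DEMO_REFCOUNT,
	DEMO_KPTR_REF,
	DEMO_KPTR_PERCPU,
	DEMO_UPTR,
	DEMO_GRAPH_ROOT,
	DEMO_GRAPH_NODE,
};

/* Whitelist: only kinds listed here count as NMI safe. */
static inline bool demo_field_nmi_safe(enum demo_field f)
{
	switch (f) {
	case DEMO_LOCK:
	case DEMO_ASYNC:
	case DEMO_KPTR_UNREF:
	case DEMO_REFCOUNT:
		return true;
	default:
		/* Anything not listed, including kinds added later. */
		return false;
	}
}

/* An object may be dropped only if every field in its record is safe. */
static inline bool demo_record_nmi_safe(const enum demo_field *fields,
					size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!demo_field_nmi_safe(fields[i]))
			return false;
	return true;
}
```

A blacklist would instead return true from the default branch, which is the failure mode the commit message warns about when new field types are added without updating the check.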
2026-06-09net: guard timestamp cmsgs to real error queue skbsKyle Zeng
skb_is_err_queue() treats PACKET_OUTGOING as the sole marker for an skb from sk_error_queue. That assumption is not true for AF_PACKET sockets: outgoing packet taps are also delivered to packet sockets with skb->pkt_type == PACKET_OUTGOING, but their skb->cb is owned by AF_PACKET instead of struct sock_exterr_skb. If such an skb is received with timestamping enabled, the generic timestamp cmsg path can read AF_PACKET control-buffer state as sock_exterr_skb::opt_stats. With SO_RXQ_OVFL enabled, the packet drop counter overlaps opt_stats. An odd drop count makes the path emit SCM_TIMESTAMPING_OPT_STATS with skb->len and skb->data. For non-linear skbs this copies past the linear head and can trigger hardened usercopy or disclose adjacent heap contents. Keep skb_is_err_queue() local to net/socket.c, but make it verify that the PACKET_OUTGOING marker is paired with the sock_rmem_free destructor installed by sock_queue_err_skb(). AF_PACKET receive skbs use normal receive ownership and no longer pass as error-queue skbs, while legitimate sk_error_queue entries keep the PACKET_OUTGOING marker and sock_rmem_free ownership. Fixes: 8605330aac5a ("tcp: fix SCM_TIMESTAMPING_OPT_STATS for normal skbs") Signed-off-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20260607021819.49698-1-kylebot@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
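The tightened check can be modelled in a few lines. This is a sketch with mock types, not the code in net/socket.c: `struct demo_skb`, the `DEMO_PACKET_OUTGOING` value, and both destructor functions are hypothetical stand-ins for sk_buff, PACKET_OUTGOING, and sock_rmem_free.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PACKET_OUTGOING 4	/* illustrative marker value */

/* Mock of the two sk_buff fields the check looks at. */
struct demo_skb {
	int pkt_type;
	void (*destructor)(struct demo_skb *skb);
};

/* Stand-in for the destructor installed when queuing to the error queue. */
static inline void demo_sock_rmem_free(struct demo_skb *skb)
{
	skb->pkt_type = 0;
}

/* Stand-in for any other receive-side destructor. */
static inline void demo_other_destructor(struct demo_skb *skb)
{
	skb->pkt_type = -1;
}

/*
 * The marker alone is not enough: an outgoing packet tap delivered to a
 * packet socket carries the same pkt_type, so the destructor must match too.
 */
static inline bool demo_skb_is_err_queue(const struct demo_skb *skb)
{
	return skb->pkt_type == DEMO_PACKET_OUTGOING &&
	       skb->destructor == demo_sock_rmem_free;
}
```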
2026-06-09net: add retry mechanism to ndo_set_rx_mode_asyncStanislav Fomichev
When ndo_set_rx_mode_async returns an error, schedule a retry with exponential backoff (1s, 2s, 4s, 8s -- 15s total). Give up after the 4th retry and log an error via netdev_err(). This moves retry logic from individual drivers into the core stack. Timer callback does not hold a ref on dev. Safe because the timer can only be armed when dev is IFF_UP, and __dev_close_many runs timer_delete_sync before clearing IFF_UP. Unregister always closes IFF_UP devices first, so by the time dev can be freed the timer is dead and cannot be re-armed. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260608154014.227538-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
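The backoff schedule quoted above works out as 1 + 2 + 4 + 8 = 15 seconds. A minimal sketch of that arithmetic, with hypothetical function names and no timer or netdev state, might look like this:

```c
#include <assert.h>

/* From the commit text: give up after the 4th retry. */
#define DEMO_MAX_RETRIES 4

/*
 * Delay in seconds before retry number `retry` (0-based), or -1 when
 * the caller should give up and log an error instead.
 */
static inline int demo_rx_mode_retry_delay(int retry)
{
	assert(retry >= 0);
	if (retry >= DEMO_MAX_RETRIES)
		return -1;
	return 1 << retry;	/* 1, 2, 4, 8 */
}

/* Total time spent waiting if every retry fails. */
static inline int demo_rx_mode_total_wait(void)
{
	int total = 0;

	for (int i = 0; demo_rx_mode_retry_delay(i) > 0; i++)
		total += demo_rx_mode_retry_delay(i);
	return total;
}
```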
2026-06-09net: change ndo_set_rx_mode_async return type to intStanislav Fomichev
Change the return type of ndo_set_rx_mode_async from void to int to allow drivers to report failures back to the core stack. This is a prerequisite for adding retry logic in the core when drivers fail to program RX filters (e.g. bnxt VF when PF is unavailable). All existing implementations return 0 for now, maintaining current behavior. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260608154014.227538-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: mana: Cache MANA_QUERY_LINK_CONFIG result to avoid repeated HWC queriesErni Sri Satya Vennela
mana_query_link_cfg() sends an HWC command to firmware on every call, but the link speed and QoS values it returns only change when the driver explicitly calls mana_set_bw_clamp(). This function is called not only by userspace via ethtool get_link_ksettings, but also periodically by hv_netvsc through netvsc_get_link_ksettings and by the sysfs speed_show attribute via dev_attr_show, resulting in unnecessary HWC traffic every few minutes. Add a link_cfg_error field to mana_port_context to cache the query result. The field uses three states: 1 (not yet queried, initial value set during mana_probe_port), 0 (success, speed/max_speed are valid), or a negative errno for permanent errors like -EOPNOTSUPP when the hardware does not support the command. Transient errors and qos_unconfigured responses are not cached so that subsequent calls will retry. MANA is ops-locked because it implements net_shaper_ops, so the core already takes netdev_lock() around all ethtool_ops and net_shaper_ops entry points. Reuse that lock to serialize mana_query_link_cfg() and mana_set_bw_clamp(). This prevents a concurrent mana_set_bw_clamp() from racing with an in-flight query and publishing stale pre-clamp speed/max_speed. Invalidate the cache inside mana_set_bw_clamp() on success, so all current and future callers that change the link configuration automatically trigger a fresh query on the next mana_query_link_cfg() call. Also reset link_cfg_error during resume in mana_probe() under netdev_lock(), so that any query already in flight cannot later store 0 and silently overwrite the post-resume invalidation. Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com> Link: https://patch.msgid.link/20260606133301.2180073-1-ernis@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
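The three-state cache can be sketched as below. Only the meaning of link_cfg_error (1, 0, negative errno) comes from the commit text. The struct, the function names, and the fake firmware backend are hypothetical; the locking is left out; and treating only -EOPNOTSUPP as permanent is an assumption based on the single example the commit gives.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the cached part of the port context. */
struct demo_port {
	int link_cfg_error;	/* 1: not queried, 0: cached, <0: permanent */
	unsigned int speed;
};

/* Hypothetical firmware query: 0 and *speed filled in, or negative errno. */
typedef int (*demo_hw_query_fn)(unsigned int *speed);

/* Fake backend so the sketch can be exercised without hardware. */
static int demo_hw_calls;
static int demo_hw_result;

static inline int demo_fake_hw(unsigned int *speed)
{
	demo_hw_calls++;
	if (demo_hw_result == 0)
		*speed = 10000;
	return demo_hw_result;
}

static inline void demo_port_init(struct demo_port *p)
{
	p->link_cfg_error = 1;
	p->speed = 0;
}

static inline int demo_query_link_cfg(struct demo_port *p, demo_hw_query_fn hw)
{
	unsigned int speed;
	int err;

	if (p->link_cfg_error <= 0)
		return p->link_cfg_error;	/* cached success or permanent error */

	err = hw(&speed);
	if (err == 0) {
		p->speed = speed;
		p->link_cfg_error = 0;
	} else if (err == -EOPNOTSUPP) {
		p->link_cfg_error = err;	/* permanent: do not ask again */
	}
	/* Any other error is treated as transient and is not cached. */
	return err;
}

/* A successful bandwidth change invalidates the cache. */
static inline void demo_invalidate_link_cfg(struct demo_port *p)
{
	p->link_cfg_error = 1;
}
```

In the driver, the query and the invalidation both run under netdev_lock() according to the commit message; the sketch is single-threaded and omits that.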
2026-06-09net: mscc: ocelot: validate netdev belongs to switch in .netdev_to_port()David Yang
The .netdev_to_port() currently takes only a net_device and returns the port index, without verifying the netdev actually belongs to the switch being operated on. This can cause flower rule parsing to silently resolve to a wrong port on the local hardware. Update both implementations felix_netdev_to_port() and ocelot_netdev_to_port() to validate ownership. Also update the callers in ocelot_flower.c to pass through the ocelot context. Signed-off-by: David Yang <mmyangfl@gmail.com> Link: https://patch.msgid.link/20260606125247.305167-1-mmyangfl@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
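A rough sketch of the ownership check, assuming simplified types: `struct sw`, `struct netdev` and the `-EINVAL` return value are illustrative choices, not taken from the ocelot or felix code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical switch context. */
struct sw {
	int id;
};

/* Hypothetical port device that records which switch owns it. */
struct netdev {
	const struct sw *owner;
	int port;
};

/*
 * Return the port index only when dev belongs to the switch being
 * operated on. The error code is an assumption made for this sketch.
 */
static int netdev_to_port(const struct sw *sw, const struct netdev *dev)
{
	if (dev == NULL || dev->owner != sw)
		return -EINVAL;
	return dev->port;
}
```

Passing the switch context into the lookup lets it reject a foreign netdev, rather than returning a port number that happens to be valid on the local hardware.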
2026-06-09net: mana: Add support for PF device 0x00C1Haiyang Zhang
Update the device id table to include the new device id 0x00C1. This device's BAR layout is similar to VF's, update the function, mana_gd_init_registers(), accordingly. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/20260605212302.2135499-1-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09RDMA/mana_ib: Allocate interrupt contexts on EQsLong Li
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These interrupt contexts may be shared with Ethernet EQs when MSI-X vectors are limited. The driver now supports allocating dedicated MSI-X for each EQ. Indicate this capability through driver capability bits. The RDMA EQs pass use_msi_bitmap=false to share MSI-X vectors with Ethernet, while the capability flag advertises that the driver supports per-vPort EQ separation when hardware has sufficient vectors. Populate eq.irq on all RDMA EQs for consistency with the Ethernet path. Also relocate the GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE define to its numeric BIT(6) position among the other capability flags. Signed-off-by: Long Li <longli@microsoft.com> Acked-by: Leon Romanovsky <leon@kernel.org> Link: https://patch.msgid.link/20260605005717.2059954-7-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: mana: Allocate interrupt context for each EQ when creating vPortLong Li
Use GIC functions to create a dedicated interrupt context or acquire a shared interrupt context for each EQ when setting up a vPort. The caller now owns the GIC reference across the EQ create/destroy lifecycle: mana_create_eq() calls mana_gd_get_gic() before creating each EQ and mana_destroy_eq() calls mana_gd_put_gic() after destroying it. The msix_index invalidation is moved from mana_gd_deregister_irq() to the mana_gd_create_eq() error path so that mana_destroy_eq() can read the index before teardown. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/20260605005717.2059954-6-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: mana: Introduce GIC context with refcounting for interrupt managementLong Li
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with reference counting. This allows the driver to create an interrupt context on an assigned or unassigned MSI-X vector and share it across multiple EQ consumers. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/20260605005717.2059954-4-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
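The get/put pattern behind a reference-counted interrupt context might look roughly like the following userspace sketch. The names `struct gic`, `gic_table`, `gic_get()` and `gic_put()` are hypothetical, the table size is arbitrary, and the locking a real driver would need is left out.

```c
#include <stdlib.h>

#define SKETCH_MAX_MSIX 8 /* arbitrary size for this sketch */

/* Hypothetical interrupt context shared by all EQs on one MSI-X vector. */
struct gic {
	int msix_index;
	unsigned int refcount;
};

static struct gic *gic_table[SKETCH_MAX_MSIX];

/* Create the context on first use, otherwise share the existing one. */
static struct gic *gic_get(int msix_index)
{
	struct gic *gic;

	if (msix_index < 0 || msix_index >= SKETCH_MAX_MSIX)
		return NULL;

	gic = gic_table[msix_index];
	if (!gic) {
		gic = calloc(1, sizeof(*gic));
		if (!gic)
			return NULL;
		gic->msix_index = msix_index;
		gic_table[msix_index] = gic;
	}
	gic->refcount++;
	return gic;
}

/* Drop one reference; the last user tears the context down. */
static void gic_put(struct gic *gic)
{
	if (--gic->refcount > 0)
		return;
	gic_table[gic->msix_index] = NULL;
	free(gic);
}
```

With this shape, an Ethernet EQ and an RDMA EQ can both take a reference on the same vector, and the context stays alive until the second of them releases it.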
2026-06-09net: mana: Query device capabilities and configure MSI-X sharing for EQsLong Li
When querying the device, adjust the max number of queues to allow dedicated MSI-X vectors for each vPort. The per-vPort queue count is clamped towards MANA_DEF_NUM_QUEUES but will not exceed the hardware maximum reported by the device. MSI-X sharing among vPorts is enabled when there are not enough MSI-X vectors for dedicated allocation, or when the platform does not support dynamic MSI-X allocation (in which case all vectors are pre-allocated at probe time and sharing is always used). The msi_sharing flag is reset at the top of mana_gd_query_max_resources() so it is recomputed from current hardware state on each probe or resume cycle. Clamp apc->max_queues to gc->max_num_queues_vport in mana_init_port() so that on resume, if max_num_queues_vport has decreased due to fewer MSI-X vectors, num_queues is reduced accordingly before EQ allocation. A device reporting zero ports now results in a fatal probe error since the per-vPort MSI-X math requires at least one port. Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is used at GDMA device probe time for querying device capabilities. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/20260605005717.2059954-3-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09net: mana: Create separate EQs for each vPortLong Li
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ sharing among the vPorts and create dedicated EQs for each vPort. Move the EQ definition from struct mana_context to struct mana_port_context and update related support functions. Export mana_create_eq() and mana_destroy_eq() for use by the MANA RDMA driver. RSS QPs now take a vport reference via pd->vport_use_count to ensure EQs outlive all QP consumers. The vport must already be configured by a raw QP before an RSS QP can be created. EQs are only destroyed when the last QP (raw or RSS) on the PD releases its reference. Restrict each vport to a single RSS QP. The hardware only supports one steering configuration (indirection table / hash key) per vport, and mana_disable_vport_rx() on QP destroy disables RX globally for the vport. Previously, creating a second RSS QP would silently overwrite the first QP's steering config and destroy would blackhole all traffic. This is now explicitly rejected with -EBUSY. Existing applications (DPDK being the primary RDMA consumer) always create one RSS QP per vport, so no real-world flows are affected. Reject cross-port PD sharing for both raw and RSS QPs. Since EQs and vport configuration are per-port, a PD is bound to the port used by its first raw QP. Subsequent QPs on the same PD must use the same port or the creation fails with -EINVAL. Previously this was silently broken: with shared EQs it appeared to work, but with per-vPort EQs a cross-port PD would cause wrong-port EQ teardown and corruption. DPDK creates one PD per port so no existing flows are affected. Serialize mana_set_channels() and the async per-port queue reset handler against RDMA vport configuration to prevent RDMA from claiming the vport during the detach/attach window. A channel_changing flag is set under apc->vport_mutex before detach and checked by mana_cfg_vport() when called from the RDMA path, blocking RDMA from grabbing the vport during the entire window. 
When the port is down and RDMA already holds the vport, the channel change is rejected with -EBUSY. Signed-off-by: Long Li <longli@microsoft.com> Link: https://patch.msgid.link/20260605005717.2059954-2-longli@microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
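The QP creation rules from this commit can be restated as a small decision sketch. Everything here is illustrative: `struct pd`, `struct vport`, `create_raw_qp()` and `create_rss_qp()` are hypothetical, and returning -EINVAL for an RSS QP on a PD that no raw QP has bound yet is an assumption, since the message does not name that error.

```c
#include <errno.h>

#define PD_UNBOUND (-1)

/* Hypothetical protection domain, bound to the port of its first raw QP. */
struct pd {
	int port;
	unsigned int vport_use_count;
};

/* Hypothetical vport; hardware has one steering configuration per vport. */
struct vport {
	int has_rss_qp;
};

static int create_raw_qp(struct pd *pd, int port)
{
	if (pd->port != PD_UNBOUND && pd->port != port)
		return -EINVAL; /* cross-port PD sharing is rejected */
	pd->port = port;
	pd->vport_use_count++;
	return 0;
}

static int create_rss_qp(struct pd *pd, struct vport *vp, int port)
{
	/* Assumption: an unbound PD is reported the same way as a port mismatch. */
	if (pd->port == PD_UNBOUND || pd->port != port)
		return -EINVAL;
	if (vp->has_rss_qp)
		return -EBUSY; /* a second RSS QP would overwrite the steering config */
	vp->has_rss_qp = 1;
	pd->vport_use_count++; /* the RSS QP keeps the vport EQs alive */
	return 0;
}
```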
2026-06-09Merge tag 'trace-rv-v7.1-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier fixes from Steven Rostedt: - Fix reset ordering on per-task destruction Reset the task before dropping the slot instead of after, which was causing out-of-bound memory accesses. - Fix HA monitor synchronization and cleanup Ensure synchronous cleanup for HA monitors by running timer callbacks in RCU read-side critical sections and using synchronize_rcu() during destruction. - Avoid armed timers after tasks exit Add automatic cleanup for per-task HA monitors to prevent timers from firing after task exit. - Fix memory ordering for DA/HA monitors Fix race conditions during monitor start by using release-acquire semantics for the monitoring flag. - Fix initialization for DA/HA monitors Ensure monitors are not initialized relying on potentially corrupted state like the monitoring flag, that is not reset by all monitors type and may have an unknown state in monitors reusing the storage (per-task). - Fix memory safety in per-task and per-object monitors Prevent use-after-free and out-of-bounds access by synchronizing with in-flight tracepoint probes using tracepoint_synchronize_unregister() before freeing monitor storage or releasing task slots. - Adjust monitors for preemptible tracepoints Fix monitors that relied on tracepoints disabling preemption. Explicitly disable task migration when per-CPU monitors handle events to avoid accessing the wrong state and update the opid monitor logic. - Fix incorrect __user specifier usage Remove __user from a non-pointer variable in the extract_params() helper. - Fix bugs in the rv tool Ensure strings are NUL-terminated, fix substring matching in monitor searches, and improve cleanup and exit status handling. - Fix several bugs in rvgen Fix LTL literal stringification, subparsers' options handling, and suffix stripping in dot2k. 
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: verification/rvgen: Fix ltl2k writing True as a literal verification/rvgen: Fix options shared among commands verification/rvgen: Fix suffix strip in dot2k tools/rv: Fix cleanup after failed trace setup tools/rv: Fix substring match when listing container monitors tools/rv: Fix substring match bug in monitor name search tools/rv: Ensure monitor name and desc are NUL-terminated rv: Use 0 to check preemption enabled in opid rv: Prevent task migration while handling per-CPU events rv: Ensure synchronous cleanup for HA monitors rv: Add automatic cleanup handlers for per-task HA monitors rv: Do not rely on clean monitor when initialising HA rv: Fix monitor start ordering and memory ordering for monitoring flag rv: Ensure all pending probes terminate on per-obj monitor destroy rv: Prevent in-flight per-task handlers from using invalid slots rv: Reset per-task DA monitors before releasing the slot rv: Fix __user specifier usage in extract_params()
2026-06-09net: ncsi: Set ncsi_stop_dev() to inline while NET_NCSI not enabledMinda Chen
When NET_NCSI is not enabled, the ncsi_stop_dev() stub is static but not inline, so any file that includes the header without calling it causes a compile warning: linux/include/net/ncsi.h:63:13: warning: 'ncsi_stop_dev' defined but not used [-Wunused-function] static void ncsi_stop_dev(struct ncsi_dev *nd) Make ncsi_stop_dev() inline like the other stub functions to remove the warning. Signed-off-by: Minda Chen <minda.chen@starfivetech.com> Link: https://patch.msgid.link/20260605033607.37630-1-minda.chen@starfivetech.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
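The stub pattern involved can be sketched as a header fragment. This is a simplified illustration rather than the contents of include/net/ncsi.h, and the fragment treats `CONFIG_NET_NCSI` as an ordinary preprocessor symbol that the build may or may not define.

```c
/* Opaque type; only pointers to it are passed around in this sketch. */
struct ncsi_dev;

#ifdef CONFIG_NET_NCSI
/* Feature enabled: the real implementation lives in another file. */
void ncsi_stop_dev(struct ncsi_dev *nd);
#else
/*
 * Feature disabled: every includer gets an empty stub. With plain
 * 'static', a file that never calls the stub triggers
 * -Wunused-function; 'static inline' suppresses that warning.
 */
static inline void ncsi_stop_dev(struct ncsi_dev *nd)
{
	(void)nd;
}
#endif
```

Callers can then invoke ncsi_stop_dev() unconditionally. When the feature is disabled, the call compiles down to nothing.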
2026-06-10software node: allow passing reference args to PROPERTY_ENTRY_REF()Dmitry Torokhov
When dynamically creating software nodes and properties for subsequent use with software_node_register() current implementation of PROPERTY_ENTRY_REF() is not suitable because it creates a temporary instance of struct software_node_ref_args on stack which will later disappear, and software_node_register() only does shallow copy of properties. Fix this by allowing to pass address of reference arguments structure directly into PROPERTY_ENTRY_REF(), so that caller can manage lifetime of the object properly. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/aiTo4dvKu8pyimHA@google.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
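The lifetime problem comes from a shallow copy that keeps a pointer into caller-owned storage. The sketch below shows that hazard and the fix in userspace C with hypothetical types (`struct node`, `struct ref_args`, `struct prop`, `register_prop()`). It does not reproduce the PROPERTY_ENTRY_REF() macro or the software node API.

```c
#include <stddef.h>

struct node {
	const char *name;
};

/* Hypothetical reference arguments: target node plus argument count. */
struct ref_args {
	const struct node *node;
	unsigned int nargs;
};

/* Hypothetical property that points at its reference arguments. */
struct prop {
	const char *name;
	const struct ref_args *ref;
};

/* Shallow copy: duplicates the pointer, not the ref_args behind it. */
static struct prop register_prop(const struct prop *src)
{
	return *src;
}

static struct node gpio_node = { "gpio" };

/* Caller-managed storage that outlives the function building the property. */
static struct ref_args gpio_ref;

static struct prop build_prop(void)
{
	struct prop p;

	gpio_ref.node = &gpio_node;
	gpio_ref.nargs = 0;

	/* Pass the address of long-lived storage, not of a stack temporary. */
	p.name = "reset-gpios";
	p.ref = &gpio_ref;
	return register_prop(&p);
}
```

Had build_prop() pointed `p.ref` at a local `struct ref_args`, the registered copy would hold a dangling pointer as soon as the function returned.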
2026-06-09PCI/P2PDMA: Add Intel QAT, DSA, IAA devices to whitelistLukas Wunner
The first device on a PCI root bus determines whether the host bridge is whitelisted for P2PDMA. All Intel Xeon chips since Ice Lake (ICX, 2021) expose a device with ID 0x09a2 as first device. It is loosely associated with the IOMMU. All these Xeon chips support P2PDMA, so since the addition of the device with commit feaea1fe8b36 ("PCI/P2PDMA: Add Intel 3rd Gen Intel Xeon Scalable Processors to whitelist"), P2PDMA has been allowed on all new Xeons without the need to amend the whitelist: Xeons with Performance Cores: Sapphire Rapids (SPR, 2023) Emerald Rapids (EMR, 2023) Granite Rapids (GNR, 2024) Diamond Rapids (DMR, 2026) Xeons with Efficiency Cores: Sierra Forest (SRF, 2024) Clearwater Forest (CWF, 2026) However, these Xeons also expose accelerators as first device on a root bus of their own: QuickAssist Technology (QAT, crypto & compression accelerator) Data Streaming Accelerator (DSA, dma engine) In-Memory Analytics Accelerator (IAA, compression accelerator) Whitelist them for P2PDMA as well. Move their Device ID macros from the accelerator drivers to <linux/pci_ids.h> for reuse by P2PDMA code. Unfortunately the Device IDs vary across Xeon generations as additional features were added to the accelerators. This currently necessitates an amendment for each new Xeon chip. For future chips, this need shall be avoided by an ongoing effort to extend ACPI HMAT with PCIe P2PDMA characteristics (latency, bandwidth, ordering constraints). The PCI core will be able to look up in this BIOS-provided ACPI table whether P2PDMA is supported, instead of relying on a whitelist that needs to be amended continuously. Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> # QAT Cc: stable@vger.kernel.org Link: https://patch.msgid.link/6aac4922b5fe7070b11874427a9285e42ddd05a4.1780585518.git.lukas@wunner.de
2026-06-09hwmon: Add update_interval_us chip attributeFerdinand Schwenk
Some hardware monitoring chips support update intervals below one millisecond. The existing update_interval attribute uses millisecond granularity, which causes sub-millisecond steps to round to the same value and become inaccessible from userspace. Introduce update_interval_us, a companion chip-level attribute that expresses the same update interval in microseconds. Drivers implementing this attribute should also implement update_interval for compatibility with millisecond-based userspace interfaces. Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com> Link: https://lore.kernel.org/r/20260609-hwmon-ina238-update-interval-us-v2-v3-2-016b55567950@advastore.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
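A short arithmetic sketch shows why millisecond granularity hides sub-millisecond steps. The helper name `us_to_ms()` and the round-to-nearest rule are assumptions made for illustration; the commit does not say how a given driver rounds.

```c
/*
 * Convert a microsecond interval to the millisecond value that an
 * attribute with millisecond granularity could report. Rounding to
 * nearest is an assumption of this sketch.
 */
static long us_to_ms(long us)
{
	return (us + 500) / 1000;
}
```

With this rule, two hypothetical hardware steps of 140 us and 280 us both come out as 0 ms, so userspace cannot tell them apart or select one of them. Reporting the same interval in microseconds keeps the two steps distinct.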
2026-06-09svcrdma: wake sq waiters when the transport closesChuck Lever
Threads parked in svc_rdma_sq_wait() on sc_sq_ticket_wait or sc_send_wait can hang indefinitely in TASK_UNINTERRUPTIBLE state across transport teardown, pinning svc_xprt references and blocking svc_rdma_free(). The close path sets XPT_CLOSE before invoking xpo_detach and both wait_event predicates include an XPT_CLOSE term, but the predicates are re-evaluated only on wakeup. sc_sq_ticket_wait has no completion-driven wake path; it is advanced solely by the chained ticket handoff inside svc_rdma_sq_wait() itself. Without an explicit wake at close, parked threads never observe XPT_CLOSE, hold their svc_xprt_get reference forever, and svc_rdma_free() blocks on xpt_ref dropping to zero. Two close entry points reach this transport. Local teardown runs svc_rdma_detach() from svc_handle_xprt() -> svc_delete_xprt() -> xpo_detach() on a worker thread. A remote disconnect arrives at svc_rdma_cma_handler(), which calls svc_xprt_deferred_close(): that sets XPT_CLOSE and enqueues the transport but does not access either RDMA waitqueue, so a worker already parked in svc_rdma_sq_wait() never re-evaluates its predicate. With every worker parked on this transport, no thread is available to run the local teardown either, and the wake site there is unreachable. Introduce svc_rdma_xprt_deferred_close(), a thin svcrdma wrapper that calls svc_xprt_deferred_close() and then wakes both sc_sq_ticket_wait and sc_send_wait. Convert the svcrdma producers that called svc_xprt_deferred_close() directly: svc_rdma_cma_handler(), qp_event_handler(), svc_rdma_post_send_err(), svc_rdma_wc_send(), the sendto drop path, the rw completion error paths, and the recvfrom flush and read-list error paths. Wake both waitqueues from svc_rdma_detach() as well. The synchronous svc_xprt_close() path (backchannel ENOTCONN, device removal via svc_rdma_xprt_done) reaches detach without flowing through svc_xprt_deferred_close() and therefore does not invoke the new helper. 
Fixes: ccc89b9d1ed2 ("svcrdma: Add fair queuing for Send Queue access") Cc: stable@vger.kernel.org Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> [ cel: add svc_rdma_xprt_deferred_close() to complete the fix ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09SUNRPC: Return an error from xdr_buf_to_bvec() on overflowChuck Lever
xdr_buf_to_bvec() returns a slot count even when the caller's bvec budget is exhausted partway through the xdr_buf. Callers feed that count into iov_iter_bvec() and continue as if the conversion had succeeded, silently sending or writing fewer bytes than the data length declares. For an NFS WRITE the server reports the truncated transfer to the client as full success. The overflow represents an internal invariant violation: a higher layer reserved a bvec budget too small for the xdr_buf it then asked the encoder to convert. That is a server-side fault, not a media I/O failure and not a malformed client argument. Change xdr_buf_to_bvec() to return a signed int and have the overflow label return -ESERVERFAULT. Update the three callers to detect the negative return and fail the request: nfsd_vfs_write() folds the error into host_err, which nfserrno() translates to nfserr_serverfault for the WRITE reply; svc_udp_sendto() and svc_tcp_sendmsg() propagate the error out of the send path. Reported-by: Chris Mason <clm@meta.com> Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
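The change in return convention can be illustrated with a simplified converter that fails instead of truncating. The types and the function `segs_to_slots()` are hypothetical, and `SKETCH_ESERVERFAULT` is a local stand-in, because userspace `<errno.h>` does not provide the kernel-internal ESERVERFAULT.

```c
#include <stddef.h>

/* Local stand-in for the kernel-internal ESERVERFAULT errno. */
#define SKETCH_ESERVERFAULT 526

/* Hypothetical source segment and destination slot. */
struct seg {
	const char *base;
	size_t len;
};

struct slot {
	const char *base;
	size_t len;
};

/*
 * Fill at most 'budget' slots from the non-empty segments. Return the
 * slot count on success, or a negative error when the budget runs out
 * before every segment is converted, so callers cannot mistake a
 * truncated conversion for a complete one.
 */
static int segs_to_slots(struct slot *slots, unsigned int budget,
			 const struct seg *segs, unsigned int nsegs)
{
	unsigned int count = 0;
	unsigned int i;

	for (i = 0; i < nsegs; i++) {
		if (segs[i].len == 0)
			continue;
		if (count >= budget)
			return -SKETCH_ESERVERFAULT;
		slots[count].base = segs[i].base;
		slots[count].len = segs[i].len;
		count++;
	}
	return (int)count;
}
```

A caller checks for a negative return before building an iterator from the slots, and fails the request rather than sending fewer bytes than the declared length.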
2026-06-09Documentation: Add the RPC language description of NLM version 3Chuck Lever
In order to generate source code to encode and decode NLMv3 protocol elements, include a copy of the RPC language description of NLMv3 for xdrgen to process. The language description is derived from the Open Group's XNFS specification: https://pubs.opengroup.org/onlinepubs/9629799/chap10.htm#tagcjh_11_03 The C code committed here was generated from the new nlm3.x file using tools/net/sunrpc/xdrgen/xdrgen. The goals of replacing hand-written XDR functions with ones that are tool-generated are to improve memory safety and make XDR encoding and decoding less brittle to maintain. Parts of the NFSv4 protocol are still being extended actively. Tool-generated XDR code reduces the time it takes to get a working implementation of new protocol elements. The xdrgen utility derives both the type definitions and the encode/decode functions directly from protocol specifications, using names and symbols familiar to anyone who knows those specs. Unlike hand-written code that can inadvertently diverge from the specification, xdrgen guarantees that the generated code matches the specification exactly. We would eventually like xdrgen to generate Rust code as well, making the conversion of the kernel's NFS stacks to use Rust just a little easier for us. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09svcrdma: Defer send context release to xpo_release_ctxtChuck Lever
Send completion currently queues a work item to an unbound workqueue for each completed send context. Under load, the Send Completion handlers contend for the shared workqueue pool lock. Replace the workqueue with a per-transport lock-free list (llist). The Send completion handler appends the send_ctxt to sc_send_release_list and does no further teardown. The nfsd thread drains the list in xpo_release_ctxt between RPCs, performing DMA unmapping, chunk I/O resource release, and page release in a batch. This eliminates both the workqueue pool lock and the DMA unmap cost from the Send completion path. DMA unmapping can be expensive when an IOMMU is present in strict mode, as each unmap triggers a synchronous hardware IOTLB invalidation. Moving it to the nfsd thread, where that latency is harmless, avoids penalizing completion handler throughput. The nfsd threads absorb the release cost at a point where the client is no longer waiting on a reply, and natural batching amortizes the overhead when completions arrive faster than RPCs complete. A self-enqueue backstops drain on a quiescing transport. When svc_rdma_send_ctxt_put() observes that its llist_add() transitions sc_send_release_list from empty to non-empty, it sets XPT_DATA and calls svc_xprt_enqueue() so that svc_xprt_ready() schedules an nfsd thread. The thread enters svc_rdma_recvfrom(), finds no pending receive, clears XPT_DATA, and returns 0; svc_xprt_release() then runs xpo_release_ctxt and drains the list. Under steady load the foreground drain keeps the list non-empty between adds and no enqueue fires; only the trailing edge of a burst pays for a wakeup. Without this path, a Send completion arriving after the last xpo_release_ctxt on an idle connection would leave the send_ctxt's DMA mappings and reply pages pinned until the next RPC, send-context exhaustion, or transport close. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
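The self-enqueue rule depends on the list push reporting whether the list was empty beforehand. Below is a minimal userspace sketch using C11 atomics. The names `list_add()`, `list_del_all()`, `struct xprt`, `send_ctxt_put()` and `release_ctxts()` are hypothetical, and a counter stands in for setting XPT_DATA and calling svc_xprt_enqueue().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/* Lock-free push; returns true if the list was empty before the push. */
static bool list_add(struct lnode *n, struct lhead *h)
{
	struct lnode *old = atomic_load(&h->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
	return old == NULL;
}

/* Detach the whole list in one atomic exchange. */
static struct lnode *list_del_all(struct lhead *h)
{
	return atomic_exchange(&h->first, (struct lnode *)NULL);
}

/* Hypothetical transport with a release list and a wakeup counter. */
struct xprt {
	struct lhead release_list;
	unsigned int wakeups;
};

/* Completion side: append only, and wake a thread on empty -> non-empty. */
static void send_ctxt_put(struct xprt *x, struct lnode *ctxt)
{
	if (list_add(ctxt, &x->release_list))
		x->wakeups++;
}

/* Thread side: drain everything queued so far and count the batch. */
static unsigned int release_ctxts(struct xprt *x)
{
	struct lnode *node = list_del_all(&x->release_list);
	unsigned int n = 0;

	while (node) {
		struct lnode *next = node->next;

		n++; /* real code would unmap DMA and release pages here */
		node = next;
	}
	return n;
}
```

Only the push that finds the list empty pays for a wakeup. Later pushes before the next drain are picked up by the same batch.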
2026-06-09svcrdma: Release write chunk resources without re-queuingChuck Lever
Each RDMA Send completion triggers a cascade of work items on the svcrdma_wq unbound workqueue: ib_cq_poll_work (on ib_comp_wq, per-CPU) -> svc_rdma_send_ctxt_put -> queue_work [work item 1] -> svc_rdma_write_info_free -> queue_work [work item 2] Every transition through queue_work contends on the unbound pool's spinlock. Profiling an 8KB NFSv3 read/write workload over RDMA shows about 4% of total CPU cycles spent on this lock, with the cascading re-queue of write_info release contributing roughly 1%. The initial queue_work in svc_rdma_send_ctxt_put is needed to move release work off the CQ completion context (which runs on a per-CPU bound workqueue). However, once executing on svcrdma_wq, there is no need to re-queue for each write_info structure. svc_rdma_reply_chunk_release already calls svc_rdma_cc_release inline from the same svcrdma_wq context, and svc_rdma_recv_ctxt_put does the same from nfsd thread context. Release write chunk resources inline in svc_rdma_write_info_free, removing the intermediate svc_rdma_write_info_free_async work item and the wi_work field from struct svc_rdma_write_info. Reviewed-by: Mike Snitzer <snitzer@kernel.org> Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09SUNRPC: Remove dead rpcsec_gss_krb5 definitionsChuck Lever
The migration to crypto/krb5 eliminated the per-enctype function dispatch and direct crypto API usage, leaving behind a number of orphaned definitions. Remove the following from gss_krb5.h: - GSS_KRB5_K5CLENGTH, used only by removed key derivation - KG_TOK_MIC_MSG and KG_TOK_WRAP_MSG (Kerberos v1 token types; v1 support was dropped earlier) - KG2_TOK_INITIAL and KG2_TOK_RESPONSE (context establishment token types; no remaining users) - KG2_RESP_FLAG_ERROR and KG2_RESP_FLAG_DELEG_OK - enum sgn_alg and enum seal_alg (v1 algorithm constants) - All CKSUMTYPE_* definitions, now duplicated by KRB5_CKSUMTYPE_* in <crypto/krb5.h> - The KG_ error constants from gssapi_err_krb5.h, which have no remaining users - The ENCTYPE_* constant block, replaced by KRB5_ENCTYPE_* from <crypto/krb5.h> - KG_USAGE_SEAL/SIGN/SEQ (3DES usage constants) - KEY_USAGE_SEED_CHECKSUM/ENCRYPTION/INTEGRITY, duplicated by <crypto/krb5.h> - #include <crypto/skcipher.h>, no longer needed Remove the cksum[] field from struct krb5_ctx in gss_krb5_internal.h; no code reads or writes it after the key derivation removal. Switch gss_krb5_enctypes[] in gss_krb5_mech.c to the canonical KRB5_ENCTYPE_* names from <crypto/krb5.h>. Remove stale #include directives: - <crypto/skcipher.h> from gss_krb5_wrap.c - <linux/random.h> and <linux/crypto.h> from gss_krb5_seal.c Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09SUNRPC: Remove dead code from rpcsec_gss_krb5Chuck Lever
With all per-message crypto operations routed through crypto/krb5, a substantial body of code in rpcsec_gss_krb5 has no remaining callers. The internal key derivation functions (krb5_derive_key_v2, krb5_kdf_hmac_sha2, krb5_kdf_feedback_cmac) and the low-level crypto primitives (krb5_encrypt, gss_krb5_checksum, krb5_cbc_cts_ encrypt/decrypt, krb5_etm_checksum) are unreachable because their only call sites were the per-enctype function pointers removed in previous patches. Delete gss_krb5_keys.c entirely and strip the dead functions from gss_krb5_crypto.c. The KUnit test suite in gss_krb5_test.c exercised exactly these internal functions: RFC 3961 n-fold, RFC 3962 key derivation, RFC 6803 Camellia key derivation, and RFC 8009 AES-SHA2 key derivation, plus encryption self-tests that drove the now-removed encrypt routines. The corresponding test coverage is provided by the crypto/krb5 selftests in crypto/krb5/selftest.c. Remove the test file, the RPCSEC_GSS_KRB5_KUNIT_TEST Kconfig symbol, the .kunitconfig, and all VISIBLE_IF_KUNIT / EXPORT_SYMBOL_IF_KUNIT annotations. xdr_process_buf() walked xdr_buf segments through a per-segment callback and existed solely for the crypto routines in gss_krb5_crypto.c. With that file removed, xdr_process_buf() has no remaining callers. Its successor, xdr_buf_to_sg(), populates a scatterlist directly from an xdr_buf byte range and was introduced earlier in this series. With every consumer of struct gss_krb5_enctype removed, replace its remaining uses with the equivalent fields from struct krb5_enctype (key_len). Remove struct gss_krb5_enctype, the supported_gss_krb5_enctypes[] table, gss_krb5_lookup_enctype(), and the gk5e pointer from krb5_ctx. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>