path: root/include/linux
Age | Commit message | Author
2026-06-11 | platform/x86: asus-wmi: add keystone dongle support | Dariusz Figzał
The ASUS Keystone is a physical NFC-like dongle that slots into supported ASUS laptops. The EC fires WMI notify code 0xB4 on insert/remove events. Expose the current insert state via a sysfs attribute by querying WMI device ID 0x00120091 (DSTS). This devid does not follow the standard DSTS convention: PRESENCE_BIT (0x00010000) encodes the insert state rather than feature presence, and STATUS_BIT is never set. Presence of a keystone slot is detected by a successful DSTS call. Reviewed-by: Denis Benato <denis.benato@linux.dev> Signed-off-by: Dariusz Figzał <dariuszfigzal@gmail.com> Link: https://patch.msgid.link/20260610164942.74956-1-dariuszfigzal@gmail.com Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
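The decoding rule described in this commit can be sketched as a small userspace model. This is Python rather than the driver's C code, the function name is hypothetical, and only the PRESENCE_BIT value (0x00010000) is taken from the commit message.

```python
# Userspace model only; not the asus-wmi driver code.
PRESENCE_BIT = 0x00010000  # value quoted in the commit message


def keystone_inserted(dsts_value: int) -> bool:
    """Return True when the DSTS result reports the dongle as inserted.

    For devid 0x00120091 the commit says PRESENCE_BIT carries the insert
    state and STATUS_BIT is never set, so STATUS_BIT is not consulted.
    """
    return bool(dsts_value & PRESENCE_BIT)


# Hypothetical raw DSTS values, chosen only to exercise both branches.
for raw in (0x00000000, 0x00010000):
    print(hex(raw), keystone_inserted(raw))
```

The model deliberately ignores every other bit, which mirrors the point that this devid does not follow the usual DSTS convention.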
2026-06-11 | locking/percpu-rwsem: Extract __percpu_up_read() | Dmitry Ilvokhin
Move the percpu_up_read() slowpath out of the inline function into a new __percpu_up_read() to avoid binary size increase from adding a tracepoint to an inlined function. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com
2026-06-11 | RDMA/mlx5: Add support for rate limit in UD and UC QPs | Maher Sanalla
Rate limiting is currently supported only for raw packet QPs, where the packet pacing index is programmed into the SQC during SQ modify. Extend rate limit support to UD and UC QPs by setting the pacing index in the QPC during RTR2RTS and RTS2RTS transitions. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-3-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | net/mlx5: Add UD and UC packet pacing caps | Maher Sanalla
Add the needed capabilities in mlx5_ifc to support packet pacing for UC and UD QPs. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/20260524-packet-pacing-v1-1-3d79439f8d08@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2026-06-11 | gpio: nomadik: remove dead DB8540 code from <gpio/gpio-nomadik.h> | Ethan Nelson-Moore
DB8540 support was removed in commit b6d09f780761 ("pinctrl: nomadik: Drop U8540/9540 support"), but a couple small pieces of related code remained in <gpio/gpio-nomadik.h>. Remove them. Discovered while searching for CONFIG_* symbols referenced in code but not defined in any Kconfig file. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Reviewed-by: Linus Walleij <linusw@kernel.org> Link: https://patch.msgid.link/20260610205007.44881-1-enelsonmoore@gmail.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
2026-06-10 | Merge tag 'wireless-next-2026-06-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next | Jakub Kicinski
Johannes Berg says:
====================
Quite a few last updates, notably:
- b43: new support for an 11n device
- mt76:
  - mt792x broken usb transport detection
  - mt7921 regd improvements
  - mt7927 support
- iwlwifi:
  - more kunit tests
  - FW version updates
- ath12k: WDS support
- rtw89:
  - RTL8922AU support
  - USB 3 mode switch for performance
  - better monitor radiotap support
  - RTL8922DE preparations
- cfg80211/mac80211:
  - update UHR to D1.4, UHR DBE support
  - finally remove 5/10 MHz support
  - S1G rate reporting
  - multicast encapsulation offload
* tag 'wireless-next-2026-06-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (285 commits)
  b43: add RF power offset for N-PHY r8 + radio 2057 r8
  b43: add channel info table for N-PHY r8 + radio 2057 r8
  b43: add IPA TX gain table for N-PHY r8 + radio 2057 r8
  b43: support radio 2057 rev 8
  b43: route d11 corerev 22 to 24-bit indirect radio access
  b43: add d11 core revision 0x16 to id table
  b43: add firmware mappings for rev22
  rfkill: Replace strcpy() with memcpy()
  wifi: brcmfmac: flowring: simplify flow allocation
  wifi: brcm80211: change current_bss to value
  wifi: ath12k: enable IEEE80211_VHT_EXT_NSS_BW_CAPABLE when NSS ratio is reported
  wifi: ath12k: fix EAPOL TX failure caused by stale tcl_metadata bits
  wifi: ath: Update copyright in testmode_i.h
  wifi: ath10k: Update Qualcomm copyrights
  wifi: ath11k: Update Qualcomm copyrights
  wifi: ath12k: Update Qualcomm copyrights
  wifi: mt76: Drop unneeded mt76_register_debugfs_fops() return checks
  wifi: mt76: mt7921: assert sniffer on chanctx change
  wifi: mt76: mt7996: fix potential tx_retries underflow
  wifi: mt76: mt7925: fix potential tx_retries underflow
  ...
====================
Link: https://patch.msgid.link/20260610103637.179340-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-10 | jbd2: remove special jbd2 slabs | Matthew Wilcox (Oracle)
When jbd2 was originally written, kmalloc() would not guarantee memory alignment for the requested objects. Since commit 59bb47985c1d in 2019, kmalloc has guaranteed natural alignment for power-of-two allocations. We can now remove the jbd2 special slabs and just use kmalloc() directly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Tal Zussman <tz2294@columbia.edu> Link: https://patch.msgid.link/20260528171413.1088143-1-willy@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-06-10 | fanotify: allow reporting pidfds for reaped tasks | AnonymeMeow
Fanotify used to refuse to report pidfds for reaped tasks by applying a pid_has_task() check before calling pidfd_prepare(). This prevented userspace from obtaining information about the task. Register the event pid with pidfs when creating the fanotify event if pidfd reporting was requested, so pidfd_prepare() can later create a pidfd for the reaped task. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/linux-fsdevel/20260528-schmuckvoll-heilen-garen-be77b4208671@brauner/ Signed-off-by: AnonymeMeow <anonymemeow@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> Link: https://patch.msgid.link/20260607003343.425939-3-anonymemeow@gmail.com Signed-off-by: Jan Kara <jack@suse.cz>
2026-06-10 | virtio: add missing kernel-doc for map and vmap members | Christian Fontanez
Commit bee8c7c24b73 ("virtio: introduce map ops in virtio core") and commit b16060c5c7d5 ("virtio: introduce virtio_map container union") added 'map' and 'vmap' members to struct virtio_device but did not update the kernel-doc comment block. This caused 'make htmldocs' to emit warnings:
  ./include/linux/virtio.h:188 struct member 'map' not described in 'virtio_device'
  ./include/linux/virtio.h:188 struct member 'vmap' not described in 'virtio_device'
Add the missing entries in struct-declaration order to match the existing convention in the file. After this patch, 'make htmldocs' no longer emits these warnings.
Fixes: bee8c7c24b73 ("virtio: introduce map ops in virtio core") Fixes: b16060c5c7d5 ("virtio: introduce virtio_map container union") Reported-by: Luis Felipe Hernandez <luis.hernandez093@gmail.com> Signed-off-by: Christian Fontanez <christfontanez@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <20260519013321.32511-1-christfontanez@gmail.com>
2026-06-09 | bpf: Cancel special fields on map value recycle | Justin Suess
Map update and delete paths currently call bpf_obj_free_fields() when a value is being replaced or recycled. That makes field destruction depend on the context of the update/delete operation. For tracing programs this can include NMI context, where referenced kptr destructors, uptr unpinning, and graph root destruction are not generally safe. Introduce bpf_obj_cancel_fields() for the reusable-value path. It only performs NMI-safe cleanup for timer, workqueue, and task_work fields. Fields that need full destruction are left attached to the recycled value and are destroyed by the final cleanup path instead. Switch array and hashtab update/delete/recycle paths to this cancel helper. Keep bpf_obj_free_fields() for final map destruction and for bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator destructors, so teardown continues to walk the normal and extra elements and fully destroy their fields. This deliberately relaxes the eager-free semantics of map update/delete for special fields. Programs that relied on a recycled map slot becoming empty immediately after update/delete were relying on behavior that cannot be implemented safely from every BPF execution context without offloading arbitrary destructors. There is a chance this change breaks programs making assumptions regarding the eager freeing of fields. If so, we can relax semantics to cancellation only when irqs_disabled() is true in the future. However, theoretically, map values that get reused eagerly already have weaker guarantees as parallel users can recreate freed fields before the new element becomes visible again. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09 | bpf: Reject bpf_obj_drop() from tracing progs | Justin Suess
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for program-allocated objects. When such an object contains NMI unsafe fields, tracing programs that can run from arbitrary instrumented context can reach that destruction from unsafe contexts, including NMI. NMI is likely one instance of this problem, and other instances would include possible unsafe reentrancy. Deferring bpf_obj_drop() is not appealing either: it would add delayed-free machinery to a release operation that otherwise has straightforward synchronous ownership semantics. Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs that may run from unsafe contexts unless every field in the object's BTF record is explicitly NMI safe. Do not reject sleepable BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI contexts that motivate the restriction. Note that while bpf_rb_root and bpf_list_head would be NMI safe on their own to free, the objects recursively held by them may not be; be conservative and just mark them as not NMI safe for now. Use a whitelist for the NMI-safe field set instead of listing only known NMI unsafe fields. Locks, async fields, unreferenced kptrs, and refcounts are known to be NMI safe because their destruction is either a no-op, simple state reset, or async cancellation. Referenced kptrs, percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future field type are rejected until audited for arbitrary tracing and NMI contexts. This is less susceptible to future changes in fields that were previously safe by exclusion, and to new fields being added without updating this check. Convert the existing recursive local-object drop success case to a syscall program in the same commit, since this verifier change makes the old tracing program form invalid. The test still exercises bpf_obj_drop() releasing a referenced task kptr from a safe program type. 
Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
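The allow-list approach in this commit can be sketched as follows. This is a Python model, not verifier code: the field labels and the function name are hypothetical stand-ins for the BTF field types the message lists, and the sleepable exemption is reduced to a single flag.

```python
# Hypothetical labels for the field kinds the commit calls NMI safe:
# locks, async fields, unreferenced kptrs, and refcounts.
NMI_SAFE_FIELDS = frozenset({
    "spin_lock", "timer", "workqueue", "task_work",
    "unreferenced_kptr", "refcount",
})


def drop_allowed(record_fields, sleepable=False):
    """Model of the decision for a tracing program calling bpf_obj_drop().

    Anything not explicitly listed as safe is rejected, so a field type
    added in the future is refused until someone audits it.
    """
    if sleepable:
        # The commit does not reject sleepable tracing programs.
        return True
    return all(field in NMI_SAFE_FIELDS for field in record_fields)


print(drop_allowed(["spin_lock", "refcount"]))
print(drop_allowed(["spin_lock", "referenced_kptr"]))
print(drop_allowed(["some_future_field"]))
```

A deny-list would accept `some_future_field` by default; the allow-list refuses it, which is the property the commit message argues for.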
2026-06-09 | net: add retry mechanism to ndo_set_rx_mode_async | Stanislav Fomichev
When ndo_set_rx_mode_async returns an error, schedule a retry with exponential backoff (1s, 2s, 4s, 8s -- 15s total). Give up after the 4th retry and log an error via netdev_err(). This moves retry logic from individual drivers into the core stack. Timer callback does not hold a ref on dev. Safe because the timer can only be armed when dev is IFF_UP, and __dev_close_many runs timer_delete_sync before clearing IFF_UP. Unregister always closes IFF_UP devices first, so by the time dev can be freed the timer is dead and cannot be re-armed. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260608154014.227538-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
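The retry schedule quoted above (1s, 2s, 4s, 8s, 15s in total) is a plain doubling backoff. A minimal Python sketch of that arithmetic follows; the function and parameter names are hypothetical and nothing here reflects how the core stack arms its timer.

```python
def retry_delays(first_delay_s=1, max_retries=4):
    """Yield the delay in seconds before each retry, doubling every time."""
    delay = first_delay_s
    for _ in range(max_retries):
        yield delay
        delay *= 2


delays = list(retry_delays())
print(delays, sum(delays))  # prints [1, 2, 4, 8] 15
```

After the fourth delay has elapsed and the retry still fails, the commit message says the core gives up and logs an error.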
2026-06-09 | net: change ndo_set_rx_mode_async return type to int | Stanislav Fomichev
Change the return type of ndo_set_rx_mode_async from void to int to allow drivers to report failures back to the core stack. This is a prerequisite for adding retry logic in the core when drivers fail to program RX filters (e.g. bnxt VF when PF is unavailable). All existing implementations return 0 for now, maintaining current behavior. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20260608154014.227538-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-10 | software node: allow passing reference args to PROPERTY_ENTRY_REF() | Dmitry Torokhov
When dynamically creating software nodes and properties for subsequent use with software_node_register() current implementation of PROPERTY_ENTRY_REF() is not suitable because it creates a temporary instance of struct software_node_ref_args on stack which will later disappear, and software_node_register() only does shallow copy of properties. Fix this by allowing to pass address of reference arguments structure directly into PROPERTY_ENTRY_REF(), so that caller can manage lifetime of the object properly. Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/aiTo4dvKu8pyimHA@google.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-06-09 | PCI/P2PDMA: Add Intel QAT, DSA, IAA devices to whitelist | Lukas Wunner
The first device on a PCI root bus determines whether the host bridge is whitelisted for P2PDMA. All Intel Xeon chips since Ice Lake (ICX, 2021) expose a device with ID 0x09a2 as first device. It is loosely associated with the IOMMU. All these Xeon chips support P2PDMA, so since the addition of the device with commit feaea1fe8b36 ("PCI/P2PDMA: Add Intel 3rd Gen Intel Xeon Scalable Processors to whitelist"), P2PDMA has been allowed on all new Xeons without the need to amend the whitelist:
Xeons with Performance Cores:
- Sapphire Rapids (SPR, 2023)
- Emerald Rapids (EMR, 2023)
- Granite Rapids (GNR, 2024)
- Diamond Rapids (DMR, 2026)
Xeons with Efficiency Cores:
- Sierra Forest (SRF, 2024)
- Clearwater Forest (CWF, 2026)
However, these Xeons also expose accelerators as first device on a root bus of its own:
- QuickAssist Technology (QAT, crypto & compression accelerator)
- Data Streaming Accelerator (DSA, dma engine)
- In-Memory Analytics Accelerator (IAA, compression accelerator)
Whitelist them for P2PDMA as well. Move their Device ID macros from the accelerator drivers to <linux/pci_ids.h> for reuse by P2PDMA code. Unfortunately the Device IDs vary across Xeon generations as additional features were added to the accelerators. This currently necessitates an amendment for each new Xeon chip. For future chips, this need shall be avoided by an ongoing effort to extend ACPI HMAT with PCIe P2PDMA characteristics (latency, bandwidth, ordering constraints). The PCI core will be able to look up in this BIOS-provided ACPI table whether P2PDMA is supported, instead of relying on a whitelist that needs to be amended continuously.
Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> # QAT Cc: stable@vger.kernel.org Link: https://patch.msgid.link/6aac4922b5fe7070b11874427a9285e42ddd05a4.1780585518.git.lukas@wunner.de
2026-06-09 | hwmon: Add update_interval_us chip attribute | Ferdinand Schwenk
Some hardware monitoring chips support update intervals below one millisecond. The existing update_interval attribute uses millisecond granularity, which causes sub-millisecond steps to round to the same value and become inaccessible from userspace. Introduce update_interval_us, a companion chip-level attribute that expresses the same update interval in microseconds. Drivers implementing this attribute should also implement update_interval for compatibility with millisecond-based userspace interfaces. Signed-off-by: Ferdinand Schwenk <ferdinand.schwenk@advastore.com> Link: https://lore.kernel.org/r/20260609-hwmon-ina238-update-interval-us-v2-v3-2-016b55567950@advastore.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
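The loss of resolution that motivates update_interval_us can be shown with a little arithmetic. The sketch below is Python, the sample intervals are made up, and round-to-nearest is an assumption; a real driver may round differently, but any millisecond conversion collapses these values.

```python
def to_update_interval_ms(interval_us: int) -> int:
    # Assumption: round to the nearest millisecond.
    return (interval_us + 500) // 1000


# Hypothetical sub-millisecond steps a chip might offer.
samples_us = [140, 204, 332]

for us in samples_us:
    print(us, "us ->", to_update_interval_ms(us), "ms")
```

All three distinct microsecond settings map to the same millisecond value, so userspace cannot select between them through update_interval alone.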
2026-06-09 | svcrdma: wake sq waiters when the transport closes | Chuck Lever
Threads parked in svc_rdma_sq_wait() on sc_sq_ticket_wait or sc_send_wait can hang indefinitely in TASK_UNINTERRUPTIBLE state across transport teardown, pinning svc_xprt references and blocking svc_rdma_free(). The close path sets XPT_CLOSE before invoking xpo_detach and both wait_event predicates include an XPT_CLOSE term, but the predicates are re-evaluated only on wakeup. sc_sq_ticket_wait has no completion-driven wake path; it is advanced solely by the chained ticket handoff inside svc_rdma_sq_wait() itself. Without an explicit wake at close, parked threads never observe XPT_CLOSE, hold their svc_xprt_get reference forever, and svc_rdma_free() blocks on xpt_ref dropping to zero. Two close entry points reach this transport. Local teardown runs svc_rdma_detach() from svc_handle_xprt() -> svc_delete_xprt() -> xpo_detach() on a worker thread. A remote disconnect arrives at svc_rdma_cma_handler(), which calls svc_xprt_deferred_close(): that sets XPT_CLOSE and enqueues the transport but does not access either RDMA waitqueue, so a worker already parked in svc_rdma_sq_wait() never re-evaluates its predicate. With every worker parked on this transport, no thread is available to run the local teardown either, and the wake site there is unreachable. Introduce svc_rdma_xprt_deferred_close(), a thin svcrdma wrapper that calls svc_xprt_deferred_close() and then wakes both sc_sq_ticket_wait and sc_send_wait. Convert the svcrdma producers that called svc_xprt_deferred_close() directly: svc_rdma_cma_handler(), qp_event_handler(), svc_rdma_post_send_err(), svc_rdma_wc_send(), the sendto drop path, the rw completion error paths, and the recvfrom flush and read-list error paths. Wake both waitqueues from svc_rdma_detach() as well. The synchronous svc_xprt_close() path (backchannel ENOTCONN, device removal via svc_rdma_xprt_done) reaches detach without flowing through svc_xprt_deferred_close() and therefore does not invoke the new helper. 
Fixes: ccc89b9d1ed2 ("svcrdma: Add fair queuing for Send Queue access") Cc: stable@vger.kernel.org Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> [ cel: add svc_rdma_xprt_deferred_close() to complete the fix ] Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09 | SUNRPC: Return an error from xdr_buf_to_bvec() on overflow | Chuck Lever
xdr_buf_to_bvec() returns a slot count even when the caller's bvec budget is exhausted partway through the xdr_buf. Callers feed that count into iov_iter_bvec() and continue as if the conversion had succeeded, silently sending or writing fewer bytes than the data length declares. For an NFS WRITE the server reports the truncated transfer to the client as full success. The overflow represents an internal invariant violation: a higher layer reserved a bvec budget too small for the xdr_buf it then asked the encoder to convert. That is a server-side fault, not a media I/O failure and not a malformed client argument. Change xdr_buf_to_bvec() to return a signed int and have the overflow label return -ESERVERFAULT. Update the three callers to detect the negative return and fail the request: nfsd_vfs_write() folds the error into host_err, which nfserrno() translates to nfserr_serverfault for the WRITE reply; svc_udp_sendto() and svc_tcp_sendmsg() propagate the error out of the send path. Reported-by: Chris Mason <clm@meta.com> Fixes: 2eb2b9358181 ("SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly") Cc: stable@vger.kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
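The change in return convention can be modelled in a few lines. This is a Python sketch with hypothetical function names, not the SUNRPC code; the numeric value given to ESERVERFAULT is stated here as an assumption about the kernel-internal errno.

```python
ESERVERFAULT = 526  # assumed kernel-internal errno value


def segments_to_slots(segment_lengths, max_slots):
    """Return the number of slots used, or -ESERVERFAULT on overflow.

    Returning the partial count on overflow is the old behaviour the
    commit removes: callers could not tell it from success.
    """
    used = 0
    for length in segment_lengths:
        if length == 0:
            continue  # empty segments need no slot
        if used == max_slots:
            return -ESERVERFAULT
        used += 1
    return used


def send(segment_lengths, max_slots):
    """Caller side: treat a negative return as a failed request."""
    count = segments_to_slots(segment_lengths, max_slots)
    if count < 0:
        raise OSError(-count, "bvec budget exhausted")
    return count


print(segments_to_slots([100, 4096, 4096], max_slots=3))
print(segments_to_slots([100, 4096, 4096, 50], max_slots=3))
```

The second call overflows a three-slot budget and yields a negative value, which `send()` turns into an error instead of transmitting a truncated payload.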
2026-06-09 | Documentation: Add the RPC language description of NLM version 3 | Chuck Lever
In order to generate source code to encode and decode NLMv3 protocol elements, include a copy of the RPC language description of NLMv3 for xdrgen to process. The language description is derived from the Open Group's XNFS specification: https://pubs.opengroup.org/onlinepubs/9629799/chap10.htm#tagcjh_11_03 The C code committed here was generated from the new nlm3.x file using tools/net/sunrpc/xdrgen/xdrgen. The goals of replacing hand-written XDR functions with ones that are tool-generated are to improve memory safety and make XDR encoding and decoding less brittle to maintain. Parts of the NFSv4 protocol are still being extended actively. Tool-generated XDR code reduces the time it takes to get a working implementation of new protocol elements. The xdrgen utility derives both the type definitions and the encode/decode functions directly from protocol specifications, using names and symbols familiar to anyone who knows those specs. Unlike hand-written code that can inadvertently diverge from the specification, xdrgen guarantees that the generated code matches the specification exactly. We would eventually like xdrgen to generate Rust code as well, making the conversion of the kernel's NFS stacks to use Rust just a little easier for us. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09 | svcrdma: Defer send context release to xpo_release_ctxt | Chuck Lever
Send completion currently queues a work item to an unbound workqueue for each completed send context. Under load, the Send Completion handlers contend for the shared workqueue pool lock. Replace the workqueue with a per-transport lock-free list (llist). The Send completion handler appends the send_ctxt to sc_send_release_list and does no further teardown. The nfsd thread drains the list in xpo_release_ctxt between RPCs, performing DMA unmapping, chunk I/O resource release, and page release in a batch. This eliminates both the workqueue pool lock and the DMA unmap cost from the Send completion path. DMA unmapping can be expensive when an IOMMU is present in strict mode, as each unmap triggers a synchronous hardware IOTLB invalidation. Moving it to the nfsd thread, where that latency is harmless, avoids penalizing completion handler throughput. The nfsd threads absorb the release cost at a point where the client is no longer waiting on a reply, and natural batching amortizes the overhead when completions arrive faster than RPCs complete. A self-enqueue backstops drain on a quiescing transport. When svc_rdma_send_ctxt_put() observes that its llist_add() transitions sc_send_release_list from empty to non-empty, it sets XPT_DATA and calls svc_xprt_enqueue() so that svc_xprt_ready() schedules an nfsd thread. The thread enters svc_rdma_recvfrom(), finds no pending receive, clears XPT_DATA, and returns 0; svc_xprt_release() then runs xpo_release_ctxt and drains the list. Under steady load the foreground drain keeps the list non-empty between adds and no enqueue fires; only the trailing edge of a burst pays for a wakeup. Without this path, a Send completion arriving after the last xpo_release_ctxt on an idle connection would leave the send_ctxt's DMA mappings and reply pages pinned until the next RPC, send-context exhaustion, or transport close. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
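The deferred-release scheme, including the self-enqueue on the empty to non-empty transition, can be sketched with a single-threaded Python model. Class and method names are hypothetical, the list is an ordinary Python list rather than a lock-free llist, and the order of items within a batch is not modelled.

```python
class ReleaseList:
    """Stand-in for the per-transport release list."""

    def __init__(self):
        self._items = []

    def add(self, ctxt):
        """Append ctxt; report whether the list was empty beforehand."""
        was_empty = not self._items
        self._items.append(ctxt)
        return was_empty

    def drain(self):
        """Detach and return everything queued so far."""
        batch, self._items = self._items, []
        return batch


class Transport:
    def __init__(self):
        self.release_list = ReleaseList()
        self.enqueues = 0    # times a thread wakeup was requested
        self.released = []   # contexts whose resources were released

    def send_ctxt_put(self, ctxt):
        # Completion path: queue only, no teardown here.
        if self.release_list.add(ctxt):
            self.enqueues += 1  # backstop wakeup on empty -> non-empty

    def release_ctxt(self):
        # Service-thread path: release the whole batch at once.
        self.released.extend(self.release_list.drain())


t = Transport()
t.send_ctxt_put("ctxt-a")
t.send_ctxt_put("ctxt-b")   # list already non-empty: no extra wakeup
t.release_ctxt()
print(t.enqueues, len(t.released))
```

Only the first add after a drain requests a wakeup, which matches the message's point that a burst pays for one wakeup at its trailing edge rather than one per completion.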
2026-06-09 | svcrdma: Release write chunk resources without re-queuing | Chuck Lever
Each RDMA Send completion triggers a cascade of work items on the svcrdma_wq unbound workqueue:
  ib_cq_poll_work (on ib_comp_wq, per-CPU)
    -> svc_rdma_send_ctxt_put
    -> queue_work [work item 1]
    -> svc_rdma_write_info_free
    -> queue_work [work item 2]
Every transition through queue_work contends on the unbound pool's spinlock. Profiling an 8KB NFSv3 read/write workload over RDMA shows about 4% of total CPU cycles spent on this lock, with the cascading re-queue of write_info release contributing roughly 1%. The initial queue_work in svc_rdma_send_ctxt_put is needed to move release work off the CQ completion context (which runs on a per-CPU bound workqueue). However, once executing on svcrdma_wq, there is no need to re-queue for each write_info structure. svc_rdma_reply_chunk_release already calls svc_rdma_cc_release inline from the same svcrdma_wq context, and svc_rdma_recv_ctxt_put does the same from nfsd thread context. Release write chunk resources inline in svc_rdma_write_info_free, removing the intermediate svc_rdma_write_info_free_async work item and the wi_work field from struct svc_rdma_write_info.
Reviewed-by: Mike Snitzer <snitzer@kernel.org> Tested-by: Jonathan Flynn <jonathan.flynn@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09 | SUNRPC: Remove dead rpcsec_gss_krb5 definitions | Chuck Lever
The migration to crypto/krb5 eliminated the per-enctype function dispatch and direct crypto API usage, leaving behind a number of orphaned definitions. Remove the following from gss_krb5.h:
- GSS_KRB5_K5CLENGTH, used only by removed key derivation
- KG_TOK_MIC_MSG and KG_TOK_WRAP_MSG (Kerberos v1 token types; v1 support was dropped earlier)
- KG2_TOK_INITIAL and KG2_TOK_RESPONSE (context establishment token types; no remaining users)
- KG2_RESP_FLAG_ERROR and KG2_RESP_FLAG_DELEG_OK
- enum sgn_alg and enum seal_alg (v1 algorithm constants)
- All CKSUMTYPE_* definitions, now duplicated by KRB5_CKSUMTYPE_* in <crypto/krb5.h>
- The KG_ error constants from gssapi_err_krb5.h, which have no remaining users
- The ENCTYPE_* constant block, replaced by KRB5_ENCTYPE_* from <crypto/krb5.h>
- KG_USAGE_SEAL/SIGN/SEQ (3DES usage constants)
- KEY_USAGE_SEED_CHECKSUM/ENCRYPTION/INTEGRITY, duplicated by <crypto/krb5.h>
- #include <crypto/skcipher.h>, no longer needed
Remove the cksum[] field from struct krb5_ctx in gss_krb5_internal.h; no code reads or writes it after the key derivation removal. Switch gss_krb5_enctypes[] in gss_krb5_mech.c to the canonical KRB5_ENCTYPE_* names from <crypto/krb5.h>. Remove stale #include directives:
- <crypto/skcipher.h> from gss_krb5_wrap.c
- <linux/random.h> and <linux/crypto.h> from gss_krb5_seal.c
Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09 | SUNRPC: Remove dead code from rpcsec_gss_krb5 | Chuck Lever
With all per-message crypto operations routed through crypto/krb5, a substantial body of code in rpcsec_gss_krb5 has no remaining callers. The internal key derivation functions (krb5_derive_key_v2, krb5_kdf_hmac_sha2, krb5_kdf_feedback_cmac) and the low-level crypto primitives (krb5_encrypt, gss_krb5_checksum, krb5_cbc_cts_encrypt/decrypt, krb5_etm_checksum) are unreachable because their only call sites were the per-enctype function pointers removed in previous patches. Delete gss_krb5_keys.c entirely and strip the dead functions from gss_krb5_crypto.c. The KUnit test suite in gss_krb5_test.c exercised exactly these internal functions: RFC 3961 n-fold, RFC 3962 key derivation, RFC 6803 Camellia key derivation, and RFC 8009 AES-SHA2 key derivation, plus encryption self-tests that drove the now-removed encrypt routines. The corresponding test coverage is provided by the crypto/krb5 selftests in crypto/krb5/selftest.c. Remove the test file, the RPCSEC_GSS_KRB5_KUNIT_TEST Kconfig symbol, the .kunitconfig, and all VISIBLE_IF_KUNIT / EXPORT_SYMBOL_IF_KUNIT annotations. xdr_process_buf() walked xdr_buf segments through a per-segment callback and existed solely for the crypto routines in gss_krb5_crypto.c. With that file removed, xdr_process_buf() has no remaining callers. Its successor, xdr_buf_to_sg(), populates a scatterlist directly from an xdr_buf byte range and was introduced earlier in this series. With every consumer of struct gss_krb5_enctype removed, replace its remaining uses with the equivalent fields from struct krb5_enctype (key_len). Remove struct gss_krb5_enctype, the supported_gss_krb5_enctypes[] table, gss_krb5_lookup_enctype(), and the gk5e pointer from krb5_ctx. Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-06-09 | SUNRPC: Add helpers to convert xdr_buf byte ranges to scatterlists | Chuck Lever
The crypto/krb5 library accepts data in scatterlist form, but the GSS-API layer presents RPC payloads as struct xdr_buf. Bridge that gap with a pair of helper functions:
  xdr_buf_to_sg() - populate a caller-supplied scatterlist array from a byte range
  xdr_buf_to_sg_alloc() - populate a caller-supplied inline scatterlist, chaining to a heap-allocated overflow for large payloads
The inline array (typically stack-allocated at eight entries) covers the common case of small RPCs with no heap allocation on the encrypt/decrypt path. Only buffers spanning many pages incur a kmalloc for the chained extension. The segment-walking logic follows the same head, page array, tail traversal as xdr_process_buf(), but populates a scatterlist directly rather than invoking a per-segment callback. sg_next() traversal makes the walker safe for chained scatterlists. Once subsequent patches reroute all per-message crypto operations through crypto/krb5, xdr_process_buf() loses its last callers and is removed.
Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
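The head, page array, tail traversal mentioned in this commit can be illustrated with a simplified Python model. The function name, the tuple layout, the 4096-byte page size, and the zero page offset are all assumptions made for the sketch; it produces a flat list and does not model scatterlist chaining or the inline/overflow split.

```python
PAGE_SIZE = 4096  # assumed page size for the model


def range_to_segments(head_len, page_len, tail_len, offset, length):
    """Split [offset, offset + length) into (region, start, size) pieces.

    Regions are visited in head, pages, tail order, and page data is cut
    at PAGE_SIZE boundaries (page data is assumed to start at offset 0).
    """
    segments = []
    remaining = length

    # Head: at most one contiguous piece.
    if offset < head_len:
        take = min(head_len - offset, remaining)
        if take:
            segments.append(("head", offset, take))
        remaining -= take
        offset = 0
    else:
        offset -= head_len

    # Pages: one piece per page touched.
    if offset < page_len:
        pos = offset
        while remaining and pos < page_len:
            in_page = pos % PAGE_SIZE
            take = min(PAGE_SIZE - in_page, page_len - pos, remaining)
            segments.append((f"page[{pos // PAGE_SIZE}]", in_page, take))
            pos += take
            remaining -= take
        offset = 0
    else:
        offset -= page_len

    # Tail: at most one contiguous piece.
    if remaining and offset < tail_len:
        take = min(tail_len - offset, remaining)
        segments.append(("tail", offset, take))
        remaining -= take

    if remaining:
        raise ValueError("range extends past the end of the buffer")
    return segments


for segment in range_to_segments(100, 5000, 20, offset=50, length=5070):
    print(segment)
```

Each tuple corresponds to one entry a scatterlist would need, so a small RPC yields only a handful of entries while a payload spanning many pages yields one per page.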
2026-06-09 | docs: net: ethtool: document ops-locked drivers and op_needs_rtnl | Jakub Kicinski
Catch up various bits of documentation after the locking changes. Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-13-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09 | net: ethtool: optionally skip rtnl_lock on IOCTL path | Jakub Kicinski
Convert the IOCTL path similarly to how we converted Netlink. The device lookup gets a little hairy. We could take rtnl_lock unconditionally and drop it before calling the driver (this would avoid the reference + liveness check). But I think being able to make progress even if rtnl is dead-locked is quite useful. First extra concern is handling features. List all the cmds which modify features and always take rtnl_lock. We could fold this list into ethtool_ioctl_needs_rtnl() but seems cleaner to keep ethtool_ioctl_needs_rtnl() driver-related. If a driver changed features and we were not holding rtnl_lock - warn about it. It can only happen on buggy ops locked drivers (buggy because they should have set appropriate "I need rtnl for op X" bit). Second wrinkle is the PHY ID hack which drops the locks while sleeping. Convert its static "busy" variable which used to be protected by rtnl_lock to a field in struct ethtool_netdev_state. This feature is about identifying an adapter or a port within a system, so being able to blink multiple LEDs at the same time is likely not very useful in practice. But it's the simplest fix, we can add a mutex if someone thinks a system should only be ID'ing one port at a time. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-12-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09  net: ethtool: optionally skip rtnl_lock in ethnl_act_module_fw_flash()  Jakub Kicinski
Module firmware flashing reads SFF-8024 identifier bytes via .get_module_eeprom_by_page(). Other than that it modifies a bit in the netdev->ethtool struct. Both should be ops-locked at this point. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09  net: ethtool: optionally skip rtnl_lock on Netlink path for SET ops  Jakub Kicinski
Make ethtool not take rtnl_lock for SET commands when operation is performed on an ops-locked driver. cfg/cfg_pending are now ops-locked, since only ethtool modifies them. Some SET driver callbacks will still need rtnl_lock, most notably those which may end up calling netdev_update_features() or the qdisc layer (via netif_set_real_num_tx_queues()). Let drivers selectively opt back into the rtnl_lock with a new bitfield in ops. We need two helpers since Netlink and ioctl cmds have different values. Keep the helpers side by side in common.h to make sure they get updated together, even tho they will only get called from ioctl.c and netlink.c. SET commands which don't use ethnl_default_set_doit() are converted by subsequent commits. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
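The opt-back-in rule described above can be sketched in plain C. This is a minimal illustration, not the kernel's code: the `demo_*` names, the op list, and the bit layout are all hypothetical, since the log does not show the real fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical op identifiers; the real bitfield layout is not shown in the log. */
enum demo_op {
	DEMO_OP_SET_CHANNELS,
	DEMO_OP_SET_COALESCE,
	DEMO_OP_SET_FEATURES,
	DEMO_OP_MAX
};

struct demo_ops {
	bool ops_locked;	/* driver opted into the instance lock */
	uint32_t op_needs_rtnl;	/* bit N set: op N still wants rtnl_lock */
};

/* Decide whether the core must take rtnl_lock before calling the driver. */
static bool demo_needs_rtnl(const struct demo_ops *ops, enum demo_op op)
{
	assert(op < DEMO_OP_MAX);
	if (!ops->ops_locked)
		return true;	/* legacy drivers keep the old locking */
	return ops->op_needs_rtnl & (UINT32_C(1) << op);
}
```

A driver that is ops-locked but whose channel-setting callback ends up in the qdisc layer would set only that op's bit and run every other SET callback without rtnl_lock.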
2026-06-09  net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops  Jakub Kicinski
ethnl_default_doit() and ethnl_default_dump_one() are both used exclusively for GET callbacks (former to get info for a single device or get global strings). ops-locked devices don't need rtnl_lock for GET callbacks, stop taking it. Introduce an opt-out mechanism for devices which use phylink (fbnic) since phylink currently depends on rtnl_lock protection. Subsequent patches will add more exceptions, anyway. Practically the new helpers for judging if command needs rtnl_lock could also call netdev_need_ops_lock() but I find that it makes the code in the callers slightly less obvious. Add a helper for IOCTLs already, even tho it's unused so that we can keep them in sync as the series progresses. This is the first user-visible step of moving ethtool ops out from under rtnl. Subsequent patches do the same for SET ops, as well as the ioctl path. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09  net: ethtool: make dev->hwprov ops-protected  Jakub Kicinski
dev->hwprov tracks the active hwtstamp provider for the device. Make it ops protected (instance lock if the netdev driver opts into holding instance lock around callbacks, otherwise rtnl_lock).

hwprov is written and read in:
 - drivers/net/phy/phy_device.c
   phydev and ops protection don't currently mix, add a comment
 - net/ethtool/
   as of now holds both rtnl lock and ops lock, this one will soon only hold one lock or the other

read in:
 - net/core/dev_ioctl.c
   holds both rtnl lock and ops lock
 - net/core/timestamping.c
   RCU reader

The new netdev_ops_lock_dereference() helper does not have "compat" in the name. The name would be quite long and I think in this case it should be obvious that we need _a_ lock. netdev_lock_dereference() already exists and means dev->lock is always expected.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260605002912.3456868-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09  net: ethtool: relax ethnl_req_get_phydev() locking assertion  Jakub Kicinski
phydev <> netdev linking and lifecycle depends on rtnl_lock. We want to switch to instance locks for most ethtool ops. Let's add an assert that ops locked devices don't use phydev today. If one does we can either opt the phy ops out of being purely ops locked, or do deeper surgery to make phy locking ops-compatible. I don't think there's any fundamental challenge to make that work. Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20260605002912.3456868-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-09  hwmon: Support guard() and scoped_guard for subsystem locks  Guenter Roeck
Add support for guard() and scoped_guard() for the hwmon subsystem lock to simplify its use. Signed-off-by: Guenter Roeck <linux@roeck-us.net>
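For readers unfamiliar with the pattern, the kernel's guard() builds on the compiler's cleanup attribute: a lock taken at declaration is released automatically when the variable goes out of scope. The userspace sketch below shows only that mechanism. `struct demo_lock`, `DEMO_GUARD` and the other `demo_*` names are hypothetical stand-ins, not the hwmon API, and the attribute is a GCC/Clang extension.

```c
#include <assert.h>

struct demo_lock {
	int held;	/* 1 while the lock is taken */
};

static void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void demo_guard_exit(struct demo_lock **guard)
{
	demo_lock_release(*guard);
}

/* Take the lock now; release it when the enclosing scope ends. */
#define DEMO_GUARD(lockp)						\
	struct demo_lock *demo_guard_var				\
		__attribute__((cleanup(demo_guard_exit), unused)) =	\
		(demo_lock_acquire(lockp), (lockp))

static int demo_read_scaled(struct demo_lock *l, int raw)
{
	DEMO_GUARD(l);

	if (raw < 0)
		return -1;	/* early return still drops the lock */
	return raw * 2;
}
```

The benefit is that error paths no longer need a matching unlock before each return.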
2026-06-09  vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS  Jeff Layton
Commit e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") prevents the mount of any filesystem inside a container that doesn't have FS_USERNS_MOUNT set. This broke NFS mounts in our containerized environment. We have a daemon somewhat like systemd-mountfsd running in the init_ns. A process does a fsopen() inside the container and passes it to the daemon via unix socket. The daemon then vets that the request is for an allowed NFS server and performs the mount. This now fails because the fc->user_ns is set to the value in the container and NFS doesn't set FS_USERNS_MOUNT. We don't want to add FS_USERNS_MOUNT to NFS since that would allow the container to mount any NFS server (even malicious ones). Add a new FS_USERNS_DELEGATABLE flag, and enable it on NFS. Fixes: e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260129-twmount-v1-1-4874ed2a15c4@kernel.org Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-09  Merge tag 'ti-driver-soc-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux into soc/drivers  Arnd Bergmann
TI SoC driver updates for v7.2

TI K3 TISCI:
- ti_sci: Add BOARDCFG_MANAGED mode to support system suspend/resume cycles
- ti_sci: Add support for restoring IRQ and clock contexts during resume
- clk: keystone: sci-clk: Add clock restoration support

SoC Drivers:
- k3-socinfo: Add support for identifying AM62P silicon variants via NVMEM, along with corresponding dt-bindings update for nvmem-cells support
- k3-ringacc: Fix incorrect access mode for ring pop tail IO/proxy operations

Keystone Navigator (knav) Cleanup and Fixes:
- knav_qmss: Multiple code quality improvements
- knav_qmss_queue: Implement proper resource cleanup in the remove() path

General Cleanups:
- k3-ringacc: Use str_enabled_disabled() helper for consistency
- knav_qmss: Use %pe format specifier for PTR_ERR() printing

* tag 'ti-driver-soc-for-v7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/ti/linux:
  firmware: ti_sci: Add support for restoring clock context during resume
  clk: keystone: sci-clk: Add restore_context() operation
  firmware: ti_sci: Add support for restoring IRQs during resume
  firmware: ti_sci: Add BOARDCFG_MANAGED mode support
  soc: ti: k3-ringacc: Use str_enabled_disabled() helper
  soc: ti: knav_dma: Use IOMEM_ERR_PTR() in pktdma_get_regs()
  soc: ti: knav_dma: Remove dead check on unsigned args.args[0]
  soc: ti: knav_dma: Remove unused DMA_PRIO_MASK macro
  soc: ti: knav_qmss_acc: Fix kernel-doc Return: tag
  soc: ti: knav_qmss: Fix __iomem annotations and __be32 type
  soc: ti: knav_qmss: Use %pe to print PTR_ERR()
  soc: ti: knav_qmss: Fix kernel-doc Return: tags
  soc: ti: knav_qmss: Inline lockdep condition in for_each_handle_rcu
  soc: ti: knav_qmss: Rename global kdev to knav_qdev to fix -Wshadow
  soc: ti: knav_qmss: Remove remaining redundant ENOMEM printks
  soc: ti: knav_qmss_queue: Implement resource cleanup in remove()
  soc: ti: k3-ringacc: Fix access mode for k3_ringacc_ring_pop_tail_io/proxy
  soc: ti: knav_dma: fix all kernel-doc warnings in knav_dma.h
  soc: ti: k3-socinfo: Add support for AM62P variants via NVMEM
  dt-bindings: hwinfo: ti,k3-socinfo: Add nvmem-cells support

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09  Merge tag 'samsung-drivers-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into soc/drivers  Arnd Bergmann
Samsung SoC drivers for v7.2

Improve the Samsung Exynos (and Google GS101) ACPM (Alive Clock and Power Manager) firmware driver:
1. A few code improvements.
2. Add support for the protocol used to communicate with the Thermal Management Unit (TMU). This will allow implementing a thermal driver for newer Samsung Exynos and Google GS101 SoCs.

* tag 'samsung-drivers-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
  firmware: samsung: acpm: remove compile-testing stubs
  firmware: samsung: acpm: Add devm_acpm_get_by_phandle helper
  firmware: samsung: acpm: Add TMU protocol support
  firmware: samsung: acpm: Make acpm_ops const and access via pointer
  firmware: samsung: acpm: Drop redundant _ops suffix in acpm_ops members
  firmware: samsung: acpm: Annotate rx_data->cmd with __counted_by_ptr
  firmware: samsung: acpm: Consolidate transfer initialization helper
  firmware: samsung: acpm: Fix infinite loop on sequence number exhaustion
  firmware: samsung: acpm: Fix missing LKMM barriers in sequence allocator
  firmware: samsung: acpm: Fix false timeouts and Use-After-Free in polling
  firmware: samsung: acpm: Fix mailbox channel leak on probe error
  firmware: samsung: acpm: Fix cross-thread RX length corruption

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-06-09  fbdev: Do not export fbcon from fbdev  Thomas Zimmermann
There are no callers of fbcon outside fbdev. Move the declarations into the internal header. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09  fbdev: Wrap fbcon updates from vga-switcheroo in helper  Thomas Zimmermann
Handle console remapping in fbcon in fb_switch_output(). Vga-switcheroo invokes this functionality before switching physical outputs to a new graphics device. Open-coding fbcon state in vga-switcheroo exposed fbdev implementation details. Vga-switcheroo is used for switching physical outputs among graphics hardware. This functionality is only supported by DRM drivers. A later update will further move fb_switch_output() into DRM's fbdev emulation; thus fully decoupling vga-switcheroo from fbdev. v3: - remove Kconfig dependency related to fbcon (Geert) v2: - use '#if defined' (Helge) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09  fbdev: Wrap user-invoked calls to fb_set_var() in helper  Thomas Zimmermann
Handle fbcon during display updates in fb_set_var_from_user(). Check with fbcon if the mode change is possible, update hardware state and finally update fbcon. Update all callers. Only the FBIOPUT_VSCREENINFO ioctl currently does all steps. Other mode-changes callers in sysfs and driver code are missing fbcon-related steps. With the new helper, ps3fb and sh_mobile_lcdcfb no longer maintain fbcon state themselves. Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Helge Deller <deller@gmx.de>
2026-06-09  netconsole: do not dequeue pooled skbs that cannot satisfy len  Breno Leitao
find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb() fails. The pool is refilled by refill_skbs(), which always allocates buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK == 1502 bytes). netconsole, however, computes the requested length dynamically as total_len + np->dev->needed_tailroom If the egress device declares a non-zero needed_tailroom (e.g. some tunnel or hardware accelerator devices), the required length can exceed MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which immediately performs skb_put(skb, len), trips the tail > end check, and triggers skb_over_panic(). Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab allocator can still satisfy oversized requests when memory is available, so senders to devices with non-zero needed_tailroom keep working in the common case. Only the pool fallback is gated: when alloc_skb() failed and len exceeds the pool buffer size, skip the skb_dequeue() instead of burning a pre-allocated skb on a request that would later trip skb_over_panic(). Reserving pool entries for requests they can actually satisfy also keeps the panic path, which depends on the pool being primed, intact. When that drop happens, emit a rate-limited net_warn() so the user notices that netconsole is unable to push messages on the egress device. The warn is skipped under in_nmi() for the same reason schedule_work() is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and would risk recursing into the same nbcon console we are servicing. MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move them to include/linux/netpoll.h so netconsole can reference the same definition that refill_skbs() uses, keeping the two in sync by construction. The header now pulls in <linux/ip.h> and <linux/udp.h> explicitly so MAX_SKB_SIZE remains self-contained for any future user. 
Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20260604-netcons_fix_before_move-v3-2-ab055b3a6aa5@debian.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
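The gating rule can be modelled without any networking code. The sketch below is a simplified illustration under stated assumptions: the `demo_*` names are hypothetical, the header sizes are the usual 14/20/8 bytes for Ethernet/IPv4/UDP, and a 1460-byte chunk is assumed so that the total matches the 1502 bytes quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* 14 (Ethernet) + 20 (IPv4) + 8 (UDP) + 1460 (chunk) = 1502 bytes. */
enum {
	DEMO_MAX_UDP_CHUNK = 1460,
	DEMO_MAX_SKB_SIZE = 14 + 20 + 8 + DEMO_MAX_UDP_CHUNK
};

enum demo_source { DEMO_FROM_ALLOC, DEMO_FROM_POOL, DEMO_DROPPED };

/*
 * Model of the buffer lookup: the normal allocation is tried first and may
 * satisfy any length; the pool is only consulted when the request fits a
 * pool buffer, so an oversized request never consumes a pooled entry.
 */
static enum demo_source demo_find_buf(size_t len, bool alloc_ok, size_t pool_depth)
{
	if (alloc_ok)
		return DEMO_FROM_ALLOC;
	if (len > (size_t)DEMO_MAX_SKB_SIZE)
		return DEMO_DROPPED;	/* pool buffers are too small */
	return pool_depth ? DEMO_FROM_POOL : DEMO_DROPPED;
}
```

With a non-zero needed_tailroom pushing the request to 1503 bytes, a failed allocation is dropped rather than served from the pool, which leaves the pool entry available for a request it can actually hold.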
2026-06-09  Merge tag 'mtk-soc-for-v7.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux into soc/drivers  Krzysztof Kozlowski
MediaTek SoC driver updates

This adds subsys ID compatibility in MediaTek CMDQ, paving the way for adding support for the MT8196 SoC, and fixes the Multimedia System (MMSYS) routing masks for the MT8167 SoC.

* tag 'mtk-soc-for-v7.2' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/mediatek/linux:
  soc: mediatek: mtk-mmsys: Restore MT8167 routing masks lost during merge
  soc: mediatek: mtk-cmdq: Add cmdq_pkt_jump_rel_temp() for removing shift_pa
  soc: mediatek: Use pkt_write function pointer for subsys ID compatibility

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
2026-06-08  net/mlx5: Fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list  Dragos Tatulea
mlx5_query_nic_vport_mac_list() sizes its firmware command buffer using the PF's log_max_current_uc/mc_list capabilities. When querying a VF vport with a larger configured max (via devlink), the firmware response can overflow this buffer:

  BUG: KASAN: slab-out-of-bounds in mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
  Read of size 4 at addr ff1100013ffc8a12 by task kworker/u96:2/385

  CPU: 12 UID: 0 PID: 385 Comm: kworker/u96:2 Not tainted 7.0.0-rc6+ #1 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
  Workqueue: mlx5_esw_wq esw_vport_change_handler [mlx5_core]
  Call Trace:
   <TASK>
   dump_stack_lvl+0x69/0xa0
   print_report+0x176/0x4e4
   kasan_report+0xc8/0x100
   mlx5_query_nic_vport_mac_list+0x453/0x4c0 [mlx5_core]
   esw_update_vport_addr_list+0x2e3/0xda0 [mlx5_core]
   esw_vport_change_handle_locked+0xa1f/0x1060 [mlx5_core]
   esw_vport_change_handler+0x6a/0x90 [mlx5_core]
   process_one_work+0x87f/0x15e0
   worker_thread+0x62b/0x1020
   kthread+0x375/0x490
   ret_from_fork+0x4dc/0x810
   ret_from_fork_asm+0x11/0x20
   </TASK>

Fix by querying the vport's own HCA caps to size the buffer correctly. Refactor the function to allocate and return the MAC list internally, removing the caller's dependency on knowing the correct max.

Fixes: e16aea2744ab ("net/mlx5: Introduce access functions to modify/query vport mac lists")
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260604135849.458060-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
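The sizing mismatch is easy to see with numbers. In the sketch below the capability is a log2 value, so capacity is `1 << log_max`. The structure and function names are hypothetical, and the values 7 and 8 are made up for illustration; they do not come from the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the capability used to size the MAC list. */
struct demo_caps {
	unsigned int log_max_current_uc_list;
};

/* Number of list entries a buffer sized from these caps can hold. */
static size_t demo_list_capacity(const struct demo_caps *caps)
{
	assert(caps->log_max_current_uc_list < 16);
	return (size_t)1 << caps->log_max_current_uc_list;
}

/* True when a response with 'returned' entries fits a buffer sized from 'sizing_caps'. */
static bool demo_response_fits(const struct demo_caps *sizing_caps, size_t returned)
{
	return returned <= demo_list_capacity(sizing_caps);
}
```

If the PF advertises 2^7 entries and the VF vport was configured for 2^8, a 200-entry response overflows a PF-sized buffer but fits one sized from the vport's own caps.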
2026-06-08  mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX  JP Kobryn
compact_gap() returns 2 << order, which is used as watermark headroom in __compaction_suitable() and as a threshold in kswapd reclaim decisions. The computed value scales exponentially by order. For order-9 THP allocations this evaluates to 1024 pages, but the compaction free scanner's working set is bounded by COMPACT_CLUSTER_MAX (32 pages). The scanner stops isolating free pages once it matches the migration batch. The current gap over-reserves by 32x. On fragmented production hosts, kswapd will try to reclaim up to the gap, but it only reaches that threshold in 18% of attempts. As a result, reclaim continues in the majority of cases despite many lower-order free pages being available. The over-sized gap also causes 46% of order-9 compaction suitability checks to fail unnecessarily: the zone has sufficient free pages for the scanner to operate, but not enough to clear the inflated threshold. Cap compact_gap() at COMPACT_CLUSTER_MAX so the watermark headroom reflects the scanner's actual capacity. This function is used by two key heuristics. The first is when kswapd can stop high-order reclaim and downgrade to order-0 balancing, allowing kcompactd to be woken for the original higher allocation order. The second is zone suitability checking, where the smaller gap allows compaction to start sooner. Note that orders 0-4 are unaffected since their gap is already less than or equal to COMPACT_CLUSTER_MAX. 
A/B test on v6.13-based instagram production hosts (64GB, 60s measurement):

  Unpatched (43 hosts)
    pgscan_kswapd (mean/host):                      ~1.6M
    reclaim efficiency (steal/scan):                83.8%
    per-compaction success (success/stall):         2.1%
    THP success (alloc/alloc+fallback):             4.9%
    forced lru_add_drain (mean/host):               ~107K

  Patched (59 hosts)
    pgscan_kswapd (mean/host):                      ~449K
    reclaim efficiency (steal/scan):                91.0%
    per-compaction success (success/stall):         28.3%
    THP success (alloc/alloc+fallback):             17.2%
    forced lru_add_drain (mean/host):               ~64K

Additional tests were also performed using a workload of similar shape and based on mm-new at the time of testing. Across three 60s runs, the patch showed improvements consistent with the previous test: reduced kswapd reclaim and fewer THP fault fallbacks.

  Unpatched
    kswapd_shrink_node downgrade to order-0 (mean): 0
    thp_fault_fallback (mean):                      1217
    pgscan_kswapd (mean):                           6328
    pgsteal_kswapd (mean):                          5657

  Patched
    kswapd_shrink_node downgrade to order-0 (mean): 28
    thp_fault_fallback (mean):                      738
    pgscan_kswapd (mean):                           3773
    pgsteal_kswapd (mean):                          3243

Link: https://lore.kernel.org/20260604061725.13800-1-jp.kobryn@linux.dev
Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev>
Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
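The arithmetic behind the cap can be checked with a few lines of C. This is a sketch of the formula as described in the commit message, not the kernel function itself; the `demo_*` names are hypothetical and the constant 32 is the COMPACT_CLUSTER_MAX value quoted above.

```c
#include <assert.h>

#define DEMO_COMPACT_CLUSTER_MAX 32UL	/* value quoted in the commit message */

/* Uncapped gap: doubles with every order. */
static unsigned long demo_gap_old(unsigned int order)
{
	assert(order < 31);
	return 2UL << order;
}

/* Capped gap: never larger than the scanner's batch size. */
static unsigned long demo_gap_new(unsigned int order)
{
	unsigned long gap = demo_gap_old(order);

	return gap < DEMO_COMPACT_CLUSTER_MAX ? gap : DEMO_COMPACT_CLUSTER_MAX;
}
```

For order 9 the uncapped gap is 2 << 9 = 1024 pages and the capped gap is 32, which is the 32x over-reservation the message mentions. For order 4 both forms give 32, so orders 0-4 are unchanged.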
2026-06-08  mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device  Youngjun Park
Patch series "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device", v8.

Currently, in the uswsusp path, only the swap type value is retrieved at lookup time without holding a reference. If swapoff races after the type is acquired, subsequent slot allocations operate on a stale swap device.

Additionally, grabbing and releasing the swap device reference on every slot allocation is inefficient across the entire hibernation swap path.

This patch series addresses these issues:
- Patch 1: Fixes the swapoff race in uswsusp by pinning the swap device from the point it is looked up until the session completes.
- Patch 2: Removes the overhead of per-slot reference counting in alloc/free paths and cleans up the redundant SWP_WRITEOK check.

This patch (of 2):

Hibernation via uswsusp (/dev/snapshot ioctls) has a race window: after selecting the resume swap area but before user space is frozen, swapoff may run and invalidate the selected swap device.

Fix this by pinning the swap device with SWP_HIBERNATION while it is in use. The pin is exclusive, which is sufficient since hibernate_acquire() already prevents concurrent hibernation sessions.

The kernel swsusp path (sysfs-based hibernate/resume) uses find_hibernation_swap_type() which is not affected by the pin. It freezes user space before touching swap, so swapoff cannot race.

Introduce dedicated helpers:
- pin_hibernation_swap_type(): Look up and pin the swap device. Used by the uswsusp path.
- find_hibernation_swap_type(): Lookup without pinning. Used by the kernel swsusp path.
- unpin_hibernation_swap_type(): Clear the hibernation pin.

While a swap device is pinned, swapoff is prevented from proceeding.

Link: https://lore.kernel.org/20260323160822.1409904-1-youngjun.park@lge.com
Link: https://lore.kernel.org/20260323160822.1409904-2-youngjun.park@lge.com
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
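The pin/unpin protocol amounts to a small state machine. The following single-threaded model is only a sketch: the `demo_*` names and the specific error codes are assumptions for illustration, and the real helpers take locks and operate on swap types rather than on a plain struct.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one swap device. */
struct demo_swap_dev {
	bool active;			/* device is swapped on */
	bool hibernation_pinned;	/* exclusive hibernation pin */
};

/* Look up and pin: fails if the device is gone or already pinned. */
static int demo_pin(struct demo_swap_dev *d)
{
	if (!d->active)
		return -ENODEV;
	if (d->hibernation_pinned)
		return -EBUSY;
	d->hibernation_pinned = true;
	return 0;
}

static void demo_unpin(struct demo_swap_dev *d)
{
	d->hibernation_pinned = false;
}

/* swapoff refuses to proceed while the pin is held. */
static int demo_swapoff(struct demo_swap_dev *d)
{
	if (d->hibernation_pinned)
		return -EBUSY;
	d->active = false;
	return 0;
}
```

The point of the model is the ordering: once the pin succeeds, no later swapoff can invalidate the device until the session unpins it.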
2026-06-08  vmalloc: fix NULL pointer dereference in is_vm_area_hugepages()  Hui Zhu
find_vm_area() can return NULL if the given address is not a valid vmalloc area. Check the return value before dereferencing it to avoid a kernel crash. Link: https://lore.kernel.org/20260529014130.671291-1-hui.zhu@linux.dev Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08  userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks  Kiryl Shutsemau (Meta)
The VMA flags bitmap is a single word today: NUM_VMA_FLAG_BITS is BITS_PER_LONG, so on 32-bit vma_flags_t holds only 32 bits. (The bitmap type exists so this can grow past BITS_PER_LONG later; until it does, anything declared above the first word is out of range on 32-bit.) The bit enum nevertheless declares some bits unconditionally above BITS_PER_LONG -- VMA_UFFD_MINOR_BIT is 41, with VM_UFFD_MINOR == VM_NONE on 32-bit so no VMA actually carries the bit. __VMA_UFFD_FLAGS feeds VMA_UFFD_MINOR_BIT to mk_vma_flags() unconditionally. On 32-bit that becomes __set_bit(41, &one_long), a write one word past the end of the single-word bitmap. The compiler folds the out-of-bounds store with wraparound (1UL << (41 % 32) == bit 9) into the first word; bit 9 is already in __VMA_UFFD_FLAGS so the mask happens to come out right today, but it is an out-of-bounds write all the same, and any high-numbered bit whose mod-BITS_PER_LONG position is otherwise unused would silently OR an extra bit into the mask. Rather than feed bit numbers that may not exist on the current build to mk_vma_flags(), build the mask from whole per-mode masks that collapse to EMPTY_VMA_FLAGS when their feature is unavailable. Add mk_vma_flags_from_masks() for that, and define VMA_UFFD_MISSING / _WP / _MINOR alongside the VM_UFFD_* flags, gating VMA_UFFD_MINOR on the same config as VM_UFFD_MINOR (which implies 64BIT, where bit 41 fits). An out-of-range bit is then never materialised, on any arch, and the in-range fast path stays a compile-time constant. 
Link: https://lore.kernel.org/20260529172331.356655-7-kas@kernel.org Fixes: 9ea35a25d51b ("mm: introduce VMA flags bitmap type") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Suggested-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Assisted-by: Claude:claude-opus-4-8 Cc: David Hildenbrand <david@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
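The wraparound described in the message is plain modular arithmetic, which the sketch below reproduces for a modelled 32-bit word. The `demo_*` names are hypothetical, and the per-mode mask values (bits 9 and 12) are illustrative assumptions chosen so that bit 9 is already part of the mask, as the message states.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BITS_PER_LONG	32u	/* model a 32-bit build on any host */
#define DEMO_UFFD_MINOR_BIT	41u	/* bit number quoted in the commit message */

/* Illustrative per-mode masks; a mode that is compiled out contributes 0. */
#define DEMO_MASK_MISSING	(UINT32_C(1) << 9)
#define DEMO_MASK_WP		(UINT32_C(1) << 12)
#define DEMO_MASK_MINOR_32BIT	UINT32_C(0)

/* Where a wrapped single-word store of bit 'nr' would land. */
static uint32_t demo_wrapped_mask(unsigned int nr)
{
	return UINT32_C(1) << (nr % DEMO_BITS_PER_LONG);
}

/* Mask-based construction: only whole masks are combined, never bit numbers. */
static uint32_t demo_flags_from_masks(uint32_t missing, uint32_t wp, uint32_t minor)
{
	return missing | wp | minor;
}
```

Here 41 % 32 == 9, so the out-of-range bit aliases onto bit 9. Combining masks instead never produces a bit the build cannot represent, because the unavailable mode contributes an empty mask.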
2026-06-08  mm: delete stale comment about cachelines  Brendan Jackman
These comments have been wrong since commit a211c6550efc ("mm: page_alloc: defrag_mode kswapd/kcompactd watermarks") added NR_FREE_PAGES_BLOCKS. Since nobody has complained about it in the last year, it seems unlikely these comments were particularly useful anyway, so delete them. Link: https://lore.kernel.org/20260601-zone_stat_item-comment-v1-1-f452dd91d5eb@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08  mm/compaction: respect cpusets when checking retry suitability  fujunjie
should_compact_retry() handles COMPACT_SKIPPED by asking compaction_zonelist_suitable() whether reclaim can make a later compaction attempt worthwhile. That answer is used for the current allocation, so it should follow the same zone eligibility rules as the allocation itself. When cpusets are enabled, allocator slowpath decisions are marked with ALLOC_CPUSET. The allocation path, direct compaction and reclaim retry all skip zones rejected by __cpuset_zone_allowed(). compaction_zonelist_suitable() does not apply that filter. It only walks ac->zonelist/ac->nodemask, so it can return true because a zone that is not usable for the current allocation would pass __compaction_suitable(). That does not let the allocation use the disallowed zone. Later allocation and direct compaction paths still apply cpuset filtering. However, it can make should_compact_retry() retry based on memory that this allocation cannot use. Pass gfp_mask down and apply the same ALLOC_CPUSET check in compaction_zonelist_suitable(). This keeps the retry decision aligned with the zones that the allocation is allowed to use. A temporary debugfs probe was also used to call the old and new compaction_zonelist_suitable() predicates in the same two-node NUMA guest. The task was restricted to mems=0 while ac->nodemask covered nodes 0-1. After putting pressure on node0, node0 failed __compaction_suitable() for order-10 and node1 passed it, but node1 was rejected by __cpuset_zone_allowed(). In that state the old predicate returned true and the patched predicate returned false. 
Link: https://lore.kernel.org/tencent_F59F2BA2CC5779308E10DF54593C736D3E0A@qq.com Fixes: 435b3894e742 ("mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()") Signed-off-by: fujunjie <fujunjie1@qq.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
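The two-node scenario from the debugfs probe can be replayed with a toy predicate. This is a sketch under simplifying assumptions: one zone per node, a bitmask standing in for the cpuset's allowed memory nodes, and hypothetical `demo_*` names throughout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One zone per node, with the outcome of its suitability check precomputed. */
struct demo_zone {
	int node;
	bool compaction_suitable;
};

static bool demo_node_allowed(unsigned int mems_allowed, int node)
{
	assert(node >= 0 && node < 32);
	return mems_allowed & (1u << node);
}

/*
 * Retry predicate: with use_cpuset false this models the old behaviour
 * (every zone in the zonelist counts); with it true, zones the task may
 * not allocate from are skipped first.
 */
static bool demo_zonelist_suitable(const struct demo_zone *zones, size_t n,
				   bool use_cpuset, unsigned int mems_allowed)
{
	for (size_t i = 0; i < n; i++) {
		if (use_cpuset && !demo_node_allowed(mems_allowed, zones[i].node))
			continue;
		if (zones[i].compaction_suitable)
			return true;
	}
	return false;
}
```

With the task restricted to node 0, node 0 unsuitable and node 1 suitable, the unfiltered predicate returns true and the filtered one returns false, matching the result reported above.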
2026-06-08  mm: switch deferred split shrinker to list_lru  Johannes Weiner
The deferred split queue handles cgroups in a suboptimal fashion. The queue is per-NUMA node or per-cgroup, not the intersection. That means on a cgrouped system, a node-restricted allocation entering reclaim can end up splitting large pages on other nodes:

	alloc/unmap
	  deferred_split_folio()
	    list_add_tail(memcg->split_queue)
	    set_shrinker_bit(memcg, node, deferred_shrinker_id)

	for_each_zone_zonelist_nodemask(restricted_nodes)
	  mem_cgroup_iter()
	    shrink_slab(node, memcg)
	      shrink_slab_memcg(node, memcg)
	        if test_shrinker_bit(memcg, node, deferred_shrinker_id)
	          deferred_split_scan()
	            walks memcg->split_queue

The shrinker bit adds an imperfect guard rail. As soon as the cgroup has a single large page on the node of interest, all large pages owned by that memcg, including those on other nodes, will be split.

list_lru properly sets up per-node, per-cgroup lists. As a bonus, it streamlines a lot of the list operations and reclaim walks. It's used widely by other major shrinkers already. Convert the deferred split queue as well.

The list_lru per-memcg heads are instantiated on demand when the first object of interest is allocated for a cgroup, by calling folio_memcg_alloc_deferred(). Add calls to where splittable pages are created: anon faults, swapin faults, khugepaged collapse.

These calls create all possible node heads for the cgroup at once, so the migration code (between nodes) doesn't need any special care.
[akpm@linux-foundation.org: fix build with CONFIG_TRANSPARENT_HUGEPAGE=n] Link: https://lore.kernel.org/202605281620.lc3rtkBm-lkp@intel.com [hannes@cmpxchg.org: fix cgroup.memory=nokmem handling] Link: https://lore.kernel.org/ah9PGv12mqai84ES@cmpxchg.org Link: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08  mm: list_lru: introduce folio_memcg_list_lru_alloc()  Johannes Weiner
memcg_list_lru_alloc() is called every time an object that may end up on the list_lru is created. It needs to quickly check if the list_lru heads for the memcg already exist, and allocate them when they don't. Doing this with folio objects is tricky: folio_memcg() is not stable and requires either RCU protection or pinning the cgroup. But it's desirable to make the existence check lightweight under RCU, and only pin the memcg when we need to allocate list_lru heads and may block. In preparation for switching the THP shrinker to list_lru, add a helper function for allocating list_lru heads coming from a folio. Link: https://lore.kernel.org/20260527204757.2544958-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08  mm: list_lru: introduce caller locking for additions and deletions  Johannes Weiner
Locking is currently internal to the list_lru API. However, a caller might want to keep auxiliary state synchronized with the LRU state. For example, the THP shrinker uses the lock of its custom LRU to keep PG_partially_mapped and vmstats consistent. To allow the THP shrinker to switch to list_lru, provide normal and irqsafe locking primitives as well as caller-locked variants of the addition and deletion functions. Link: https://lore.kernel.org/20260527204757.2544958-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Liam R. Howlett (Oracle) <liam@infradead.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>