path: root/include/linux
Age  Commit message  Author
2026-06-15  Merge tag 'kbuild-7.2-1' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux Pull Kbuild / Kconfig updates from Nathan Chancellor: "Kbuild: - Remove broken module linking exclusion for BTF - Add documentation around how offset header files work - Include unstripped vDSO libraries in pacman packages - Bump minimum version of LLVM for building the kernel to 17.0.1 and clean up unnecessary workarounds - Use a context manager in run-clang-tools - Add dist macro value if present to release tag for RPM packages - Detect and report truncated buf_printf() output in modpost - Add __llvm_covfun and __llvm_covmap to section whitelist in modpost - Support Clang's distributed ThinLTO mode - Remove architecture specific configurations for AutoFDO and Propeller to ease individual architecture maintenance Kconfig: - Add kconfig-sym-check target to look for dangling Kconfig symbol references and invalid tristate literal values - Harden against potential NULL pointer dereference - Fix typo in Kconfig test comment" * tag 'kbuild-7.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: (31 commits) kconfig: tests: fix typo in comment kconfig: Remove the architecture specific config for Propeller kconfig: Remove the architecture specific config for AutoFDO modpost: Add __llvm_covfun and __llvm_covmap to section_white_list kconfig: add kconfig-sym-check static checker kbuild: Remove unnecessary 'T' modifier in cmd_ar_builtin_fixup kbuild: distributed build support for Clang ThinLTO kbuild: move vmlinux.a build rule to scripts/Makefile.vmlinux_a scripts: modpost: detect and report truncated buf_printf() output kbuild: rpm-pkg: append %{?dist} macro to Release tag run-clang-tools: run multiprocessing.Pool as context manager compiler-clang.h: Drop explicit version number from "all" diagnostic macro compiler-clang.h: Remove __cleanup -Wunused-variable workaround kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT x86/entry/vdso32: Remove conditional omission of 
'.cfi_offset eflags' x86/module: Revert "Deal with GOT based stack cookie load on Clang < 17" x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS scripts/Makefile.warn: Drop -Wformat handling for clang < 16 riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI ...
2026-06-15  platform/x86/intel/pmt: Unify header fetch and add ACPI source  David E. Box
Allow the PMT class to read discovery headers from either PCI MMIO or ACPI-provided entries, depending on the discovery source. The new source-aware fetch helper caches the canonical discovery header for both paths, capping PCI MMIO reads to the mapped resource size, while keeping the mapped PCI discovery table available for users such as crashlog.

Split intel_pmt_populate_entry() into source-specific resolvers:
- pmt_resolve_access_pci(): handles both ACCESS_LOCAL and ACCESS_BARID for PCI-backed devices and sets entry->pcidev. Same existing functionality.
- pmt_resolve_access_acpi(): handles only ACCESS_BARID for ACPI-backed devices, rejecting ACCESS_LOCAL which has no valid semantics without a physical discovery resource.

This maintains existing PCI behavior and makes no functional changes for PCI devices.

Assisted-by: GitHub-Copilot:claude-opus-4.7
Signed-off-by: David E. Box <david.e.box@linux.intel.com>
Link: https://patch.msgid.link/4b33b04ffaf0943b67d330f48b5d1dfcb6d1be5d.1781294741.git.david.e.box@linux.intel.com
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2026-06-15  Merge tag 'pull-dcache' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: - d_alloc_parallel() API change (Neil's with my changes) - NORCU fixes - Reorganization and simplification of dentry eviction logic - Simplifying rcu_read_lock() scopes in fs/dcache.c - Secondary roots work - getting rid of NFS fake root dentries and dealing with remaining shrink_dcache_for_umount() and shrink_dentry_list() races - making cursors NORCU (surprisingly easy) * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits) make cursors NORCU nfs: get rid of fake root dentries wind ->s_roots via ->d_sib instead of ->d_hash shrink_dentry_tree(): unify the calls of shrink_dentry_list() shrinking rcu_read_lock() scope in d_alloc_parallel() d_walk(): shrink rcu_read_lock() scope document dentry_kill() adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Document rcu_read_lock() use in select_collect2() Shift rcu_read_{,un}lock() inside fast_dput() simplify safety for lock_for_kill() slowpath fold lock_for_kill() and __dentry_kill() into common helper fold lock_for_kill() into shrink_kill() shrink_dentry_list(): start with removing from shrink list d_prune_aliases(): make sure to skip NORCU aliases kill d_dispose_if_unused() make to_shrink_list() return whether it has moved dentry to list select_collect(): ignore dentries on shrink lists if they have positive refcounts find_acceptable_alias(): skip NORCU aliases with zero refcount fix a race between d_find_any_alias() and final dput() of NORCU dentries ...
2026-06-15  Merge tag 'vfs-7.2-rc1.procfs' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs updates from Christian Brauner: - Revamp fs/filesystems.c The file was a mess with a hand-rolled linked list in desperate need of a cleanup. The filesystems list is now RCU-ified, /proc files can be marked permanent from outside fs/proc/, and the string emitted when reading /proc/filesystems is pre-generated and cached instead of pointer-chasing and printfing entry by entry on every read. The file is read frequently because libselinux reads it and is linked into numerous frequently used programs (even ones you would not suspect, like sed!). Scalability also improves since reference maintenance on open/close is bypassed. open+read+close cycle single-threaded (ops/s): before: 442732 after: 1063462 (+140%) open+read+close cycle with 20 processes (ops/s): before: 606177 after: 3300576 (+444%) A follow-up patch adds missing unlocks in some corner cases and tidies things up. - Relax the mount visibility check for subset=pid mounts When procfs is mounted with subset=pid, all static files become unavailable and only the dynamic pid information is accessible. In that case there is no point in imposing the full mount visibility restrictions on the mounter - everything that can be hidden in procfs is already inaccessible. These restrictions prevented procfs from being mounted inside rootless containers since almost all container implementations overmount parts of procfs to hide certain directories. As part of this /proc/self/net is only shown in subset=pid mounts for CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the SB_I_USERNS_VISIBLE superblock flag is replaced with an FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are recorded in a list, and the mount restrictions are finally documented. 
- Protect ptrace_may_access() with exec_update_lock in procfs Most uses of ptrace_may_access() in procfs should hold exec_update_lock to avoid TOCTOU issues with concurrent privileged execve() (like setuid binary execution). This fixes the easy cases - the owner and visibility checks and the FD link permission checks - with the gnarlier ones to follow later. * tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix ups and tidy ups to /proc/filesystems caching proc: protect ptrace_may_access() with exec_update_lock (FD links) proc: protect ptrace_may_access() with exec_update_lock (part 1) docs: proc: add documentation about mount restrictions proc: handle subset=pid separately in userns visibility checks proc: prevent reconfiguring subset=pid proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN fs: cache the string generated by reading /proc/filesystems sysfs: remove trivial sysfs_get_tree() wrapper fs: RCU-ify filesystems list fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED proc: allow to mark /proc files permanent outside of fs/proc/ namespace: record fully visible mounts in list
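The /proc/filesystems caching described above (render the listing once, reuse it on every read, rebuild only when the set of filesystems changes) can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel implementation: the names `fs_register()`, `fs_listing()`, `fs_cache` and `fs_render_count` are hypothetical, the real file also carries a `nodev` column, and the kernel uses RCU rather than a plain pointer.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FS 16

/* Hypothetical registry; the kernel keeps an RCU-protected list instead. */
const char *fs_names[MAX_FS];
size_t fs_count;
char *fs_cache;               /* pre-rendered listing, NULL when stale */
unsigned int fs_render_count; /* how many times the listing was rebuilt */

/* Register a name and invalidate the cached listing. Returns 0 on success. */
int fs_register(const char *name)
{
    if (fs_count == MAX_FS)
        return -1;
    fs_names[fs_count++] = name;
    free(fs_cache);
    fs_cache = NULL;
    return 0;
}

/* Return the listing, rebuilding it only if a registration made it stale. */
const char *fs_listing(void)
{
    size_t len = 1, off = 0, i;

    if (fs_cache)
        return fs_cache;

    for (i = 0; i < fs_count; i++)
        len += strlen(fs_names[i]) + 1; /* name plus '\n' */

    fs_cache = malloc(len);
    if (!fs_cache)
        return NULL;
    fs_cache[0] = '\0';
    for (i = 0; i < fs_count; i++)
        off += (size_t)snprintf(fs_cache + off, len - off, "%s\n", fs_names[i]);
    fs_render_count++;
    return fs_cache;
}
```

Repeated calls to `fs_listing()` between registrations return the same buffer without formatting anything, which is the effect the benchmark numbers in the message attribute to the cached string.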
2026-06-15  Merge tag 'vfs-7.2-rc1.misc' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Reduce pipe->mutex contention by pre-allocating pages outside the lock in anon_pipe_write(). anon_pipe_write() called alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex. Now up to 8 pages are pre-allocated before the mutex is taken, leftovers are recycled into the per-pipe tmp_page[] cache before unlock, and any remainder is released after unlock, keeping the allocator out of the critical section on both sides. On a writers x readers sweep with 64KB writes against a 1 MB pipe throughput improves 6-28% and average write latency drops 5-22%; under memory pressure - when the cost of holding the mutex across reclaim is highest - throughput improves 21-48% and latency drops 17-33%. The microbenchmark is added to selftests. - uaccess/sockptr: fix the ignored_trailing logic in copy_struct_to_user() to behave as documented and the usize check in copy_struct_from_sockptr() for user pointers, and add copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr() helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC). - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real inode backing a dentry via d_real_inode(). On overlayfs the inode attached to the dentry doesn't carry the underlying device information; this is used by the filesystem restriction BPF program that was merged into systemd. - docs: add guidelines for submitting new filesystems, motivated by the maintenance burden abandoned and untestable filesystems impose on VFS developers, blocking infrastructure work like folio conversions and iomap migration. Fixes: - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() and drop the now-redundant assignments in callers. 
This began as a one-line dma-buf fix for a path_noexec() warning; a pseudo filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo() callers were audited: the only visible effect is on dma-buf where SB_I_NOEXEC silences the warning. - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs, qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a device with a sector size > PAGE_SIZE crashed roughly half of them; the rest had the same missing error handling pattern. Plus a follow-up releasing the superblock buffer_head when setting the minix v3 block size fails. - mount: honour SB_NOUSER in the new mount API. - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by switching the process-group paths of send_sigio() and send_sigurg() from read_lock(&tasklist_lock) to RCU, matching the single-PID path. - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing delegated NFS mounts (fsopen() in a container with the mount performed by a privileged daemon) that broke when non-init s_user_ns was tied to FS_USERNS_MOUNT. - selftests/namespaces: fix a hang in nsid_test where an unreaped grandchild kept the TAP pipe write-end open, a waitpid(-1) race in listns_efault_test, and a false FAIL on kernels without listns() where the tests should SKIP. - filelock: fix the break_lease() stub signature for CONFIG_FILE_LOCKING=n. - init/initramfs_test: wait for the async initramfs unpacking before running; the test and do_populate_rootfs() share the parser state. - fs/coredump: reduce redundant log noise in validate_coredump_safety(). - iomap: pass the correct length to fserror_report_io() in __iomap_write_begin(). - backing-file: fix the backing_file_open() kerneldoc. Cleanups: - initramfs: refactor the cpio hex header parsing to use hex2bin() instead of the hand-rolled simple_strntoul() which is reverted, and extend the initramfs KUnit tests to cover header fields with 0x prefixes. 
- Replace __get_free_pages() and friends with kmalloc()/kzalloc() across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2, isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the do_mounts init code - part of the larger work of replacing page allocator calls with kmalloc(). - Use clear_and_wake_up_bit() in unlock_buffer() and journal_end_buffer_io_sync() instead of open-coding the sequence. - Drop unused VFS exports: unexport drop_super_exclusive(), remove start_removing_user_path_at(), and fold __start_removing_path() into start_removing_path(). - fs/read_write: narrow the __kernel_write() export with EXPORT_SYMBOL_FOR_MODULES(). - vfs: uapi: retire octal and hex constants in favor of (1 << n) for the O_ flags. Finding a free bit for a new flag across the architectures was needlessly hard with the mixed bases. - dcache: add extra sanity checks of dead dentries in dentry_free() via a new DENTRY_WARN_ONCE() that also prints d_flags. - iov_iter: use kmemdup_array() in dup_iter() to harden the allocation against multiplication overflow. - fs/pipe: write to ->poll_usage only once. - vfs: remove an always-taken if-branch in find_next_fd(). - dcache: use kmalloc_flex() for struct external_name in __d_alloc(). - namei: use QSTR() instead of QSTR_INIT() in path_pts(). - sync_file_range: delete dead S_ISLNK code. 
- Comment fixes: retire a stale comment in fget_task_next() and fix assorted spelling mistakes" * tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits) backing-file: fix backing_file_open() kerneldoc parameter iomap: pass the correct len to fserror_report_io in __iomap_write_begin vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags bpf: add bpf_real_inode() kfunc fs/read_write: Do not export __kernel_write() to the entire world libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() mount: honour SB_NOUSER in the new mount API fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling selftests/pipe: add pipe_bench microbenchmark fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write fs: retire stale comment in fget_task_next() fs: fix spelling mistakes in comment bfs: replace get_zeroed_page() with kzalloc() binfmt_misc: replace __get_free_page() with kmalloc() configfs: replace __get_free_pages() with kzalloc() fs/namespace: use __getname() to allocate mntpath buffer fs/select: replace __get_free_page() with kmalloc() ...
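The anon_pipe_write() change in this pull follows a general pattern: allocate before taking the lock, only move pointers while holding it, and free leftovers after dropping it. A minimal userspace sketch of that pattern follows, assuming a pthread mutex in place of pipe->mutex and malloc() in place of alloc_page(). `struct toy_pipe` and its helpers are hypothetical, and the recycling of leftovers into the per-pipe tmp_page[] cache is omitted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define PREALLOC_MAX 8   /* the message says up to 8 pages are pre-allocated */
#define PAGE_BYTES 4096
#define PIPE_SLOTS 16    /* hypothetical ring capacity */

struct toy_pipe {
    pthread_mutex_t mutex;
    void *bufs[PIPE_SLOTS];
    size_t nbufs;
};

void toy_pipe_init(struct toy_pipe *p)
{
    pthread_mutex_init(&p->mutex, NULL);
    p->nbufs = 0;
}

/* Try to append up to `want` pages; returns how many were actually queued. */
size_t toy_pipe_write(struct toy_pipe *p, size_t want)
{
    void *pre[PREALLOC_MAX];
    size_t npre = 0, used = 0, i;

    if (want > PREALLOC_MAX)
        want = PREALLOC_MAX;

    /* 1. Allocate outside the lock: this is the part that may be slow. */
    while (npre < want) {
        void *page = malloc(PAGE_BYTES);
        if (!page)
            break;
        pre[npre++] = page;
    }

    /* 2. The critical section only hands over already-allocated pages. */
    pthread_mutex_lock(&p->mutex);
    while (used < npre && p->nbufs < PIPE_SLOTS)
        p->bufs[p->nbufs++] = pre[used++];
    pthread_mutex_unlock(&p->mutex);

    /* 3. Release whatever was not consumed, again outside the lock. */
    for (i = used; i < npre; i++)
        free(pre[i]);

    return used;
}
```

The trade-off is some wasted allocation work when the pipe turns out to be full, in exchange for a critical section that never calls the allocator.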
2026-06-15  Merge tag 'vfs-7.2-rc1.xattr' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull simple_xattr updates from Christian Brauner: "This reworks the simple xattr api to make it more efficient and easier to use for all consumers. The simple_xattr hash table moves from the inode into a per-superblock cache, removing the per-inode overhead for the common case of few or no xattrs. The interface now passes struct simple_xattrs ** so lazy allocation is handled internally instead of by every caller, kernfs xattr operations on kernfs nodes shared between multiple superblocks are properly serialized, and tmpfs constructs "security.foo" xattr names with kasprintf() instead of kmalloc() plus two memcpy()s. A follow-up fix links kernfs nodes to their parent before the LSM init hook runs: with the per-sb cache kernfs_xattr_set() computes the cache via kernfs_root(kn), which faulted on a freshly allocated node when selinux_kernfs_init_security() called into it - reproducible as a NULL pointer dereference on the first cgroup mkdir on SELinux-enabled systems. On top of this bpffs gains support for trusted.* and security.* xattrs so that user space and BPF LSM programs can attach metadata - for example a content hash or a security label - to pinned objects and directories and inspect it uniformly like on other filesystems. The store is in-memory and non-persistent, living only for the lifetime of the mount like everything else in bpffs" * tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bpf: Add simple xattr support to bpffs kernfs: link kn to its parent before the LSM init hook simpe_xattr: use per-sb cache simple_xattr: change interface to pass struct simple_xattrs ** tmpfs: simplify constructing "security.foo" xattr names kernfs: fix xattr race condition with multiple superblocks
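The tmpfs cleanup mentioned above replaces an allocate-then-copy-twice sequence with a single formatted allocation. The userspace analogue below contrasts the two shapes, with malloc()/memcpy() standing in for kmalloc()/memcpy() and snprintf() standing in for kasprintf(). Both function names are hypothetical and neither is the kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Older shape: size the buffer by hand, then copy prefix and suffix. */
char *xattr_name_memcpy(const char *suffix)
{
    static const char prefix[] = "security.";
    size_t plen = sizeof(prefix) - 1;
    size_t slen = strlen(suffix);
    char *name = malloc(plen + slen + 1);

    if (!name)
        return NULL;
    memcpy(name, prefix, plen);
    memcpy(name + plen, suffix, slen + 1); /* includes the trailing NUL */
    return name;
}

/* Newer shape: let the formatter compute the length and build the string. */
char *xattr_name_printf(const char *suffix)
{
    int len = snprintf(NULL, 0, "security.%s", suffix);
    char *name;

    if (len < 0)
        return NULL;
    name = malloc((size_t)len + 1);
    if (!name)
        return NULL;
    snprintf(name, (size_t)len + 1, "security.%s", suffix);
    return name;
}
```

Both return the same string. The formatted version removes the manual length arithmetic, which is where off-by-one mistakes tend to hide.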
2026-06-15  Merge tag 'vfs-7.2-rc1.iomap' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Add the vfs infrastructure required to implement fs-verity support for XFS with a post-EOF merkle tree: fsverity generates and stores a zero-block hash, and iomap learns to verify data on buffered reads, to handle fsverity during writeback via the new IOMAP_F_FSVERITY flag, and to write fsverity metadata through iomap_fsverity_write(). - Skip the memset of the iomap in iomap_iter() once the iteration is done. In high-IOPS scenarios (4k randread NVMe polling via io_uring) the pointless memset wasted memory write bandwidth; this improves IOPS by about 5% on ext4 and xfs. - Add balance_dirty_pages_ratelimited() to iomap_zero_iter(), aligning it with iomap_write_iter(). This prepares for the exFAT iomap conversion where zeroing beyond valid_size can trigger large-scale zeroing operations that caused memory pressure without throttling. - Remove the over-strict inline data boundary check. If a filesystem provides a valid inline_data pointer and length there is no reason to require that inline data must not cross a page boundary. - Don't make REQ_POLLED imply REQ_NOWAIT, matching the earlier equivalent block layer fix: there are valid cases to poll for I/O completion without REQ_NOWAIT, and REQ_NOWAIT for file system writes is currently not supported as writes aren't idempotent. - Introduce IOMAP_F_ZERO_TAIL for filesystems that maintain a separate valid data length (exFAT, NTFS). For a write starting at or beyond valid_size, __iomap_write_begin() now zeroes only the tail portion of the block while preserving valid data before it, instead of leaving stale data in the page cache. The flag is also added to the iomap trace event strings. 
* tag 'vfs-7.2-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings iomap: introduce iomap_fsverity_write() for writing fsverity metadata iomap: teach iomap to read files with fsverity iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity fsverity: generate and store zero-block hash iomap: introduce IOMAP_F_ZERO_TAIL flag iomap: don't make REQ_POLLED imply REQ_NOWAIT iomap: remove over-strict inline data boundary check iomap: add dirty page control to iomap_zero_iter iomap: avoid memset iomap when iter is done
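The IOMAP_F_ZERO_TAIL behaviour described above (keep the bytes below valid_size, zero the rest of the block) comes down to a small piece of offset arithmetic. The helper below is a hypothetical userspace illustration of that arithmetic only. `zero_block_tail()` and its parameters are assumptions and do not mirror the signature of __iomap_write_begin().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * `block` holds block_size bytes that start at file offset block_start.
 * Bytes below valid_size are preserved; everything from valid_size to the
 * end of the block is zeroed. Returns the number of bytes zeroed.
 */
size_t zero_block_tail(unsigned char *block, size_t block_size,
                       unsigned long long block_start,
                       unsigned long long valid_size)
{
    size_t keep;

    if (valid_size <= block_start)
        keep = 0;                        /* whole block is past valid data */
    else if (valid_size - block_start >= block_size)
        keep = block_size;               /* whole block is valid data */
    else
        keep = (size_t)(valid_size - block_start);

    memset(block + keep, 0, block_size - keep);
    return block_size - keep;
}
```

For an 8-byte block starting at offset 16 with valid_size 19, the first three bytes survive and the remaining five are cleared, so no stale data past valid_size stays visible.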
2026-06-15  Merge tag 'vfs-7.2-rc1.eventpoll' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull eventpoll updates from Christian Brauner: - eventpoll clarity refactor The recent eventpoll UAF fixes (a6dc643c6931 and follow-ups) depended on invariants in fs/eventpoll.c that were nowhere documented and had to be reverse-engineered from the code: the lifetime relationships between struct eventpoll, struct epitem, and struct file, the three removal paths coordinating via epi_fget() pins and ep->mtx, the ovflist sentinel-encoded scan state machine, the POLLFREE release/acquire handshake, and the loop / path check globals serialized by epnested_mutex. The fixes were correct but the next person to touch this code would hit the same learning curve. This series codifies those invariants in source and tightens the surrounding structure. No functional changes intended: - Documentation: a top-of-file overview with field-protection tables for struct eventpoll and struct epitem, a section gathering the loop-check / path-check globals next to their declarations, labelled comments on the two sides of the POLLFREE handshake, refreshed comments on epi_fget() and ep_remove_file(), and a docblock on ep_clear_and_put() that names its two-pass structure as load-bearing. - Mechanical renames: ep_refcount_dec_and_test() -> ep_put() to pair with ep_get(), attach_epitem() -> ep_attach_file() for ep_remove_file() symmetry, the unused depth argument dropped from epoll_mutex_lock(), and the CONFIG_KCMP block relocated next to CONFIG_COMPAT so the hot-path code is contiguous. - Helper extraction: ep_insert() splits into ep_alloc_epitem() and ep_register_epitem(), ep_clear_and_put()'s two passes become ep_drain_pollwaits() and ep_drain_tree() so the ordering invariant is enforced by the call sequence rather than convention, the per-event delivery loop body becomes ep_deliver_event(), and the ep->mtx + epnested_mutex acquisition dance lifts out of do_epoll_ctl() into ep_ctl_lock() / ep_ctl_unlock(). 
- Sentinel and predicate cleanup: the EP_UNACTIVE_PTR overload is hidden behind named helpers (ep_is_scanning, epi_on_ovflist, ...), epi->next is renamed to epi->ovflist_next, and the boolean predicates return bool. - The per-CTL_ADD scratch state (tfile_check_list, path_count[], inserting_into) moves from file-scope globals into a stack-allocated struct ep_ctl_ctx plumbed through the loop / path check chain. Two follow-up fixes are included: missing kernel-doc for the new @ctx parameters, and restoring the EP_UNACTIVE_PTR sentinel for ctx->tfile_check_list - replacing it with NULL termination broke ep_remove_file()'s "never listed" check for the list tail, causing a syzbot-reported use-after-free. - io_uring related epoll cleanups One of the nastier things about epoll is how it allows nesting contexts inside each other, leading to the necessity of loop detection and the issues that have come with that. There is no reason to support nesting on the io_uring side, so contain the damage and disallow nested contexts from there: eventpoll gains a file based control interface and struct epoll_filefd is renamed to epoll_key. The io_uring side proper goes on top of this through the block tree. - Fix epoll_wait() reporting false negatives ep_events_available() checks ep->rdllist and ep_is_scanning() without a lock and can race with a concurrent scan such that neither check sees the events, causing epoll_wait() with a zero timeout to wrongly report no events even though events are available. A sequence lock closes the race and a reproducer is added to the eventpoll selftests. 
* tag 'vfs-7.2-rc1.eventpoll' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (25 commits) eventpoll: restore EP_UNACTIVE_PTR sentinel for ctx->tfile_check_list eventpoll: Fix epoll_wait() report false negative selftests/eventpoll: Add test for multiple waiters eventpoll: add missing kernel-doc for @ctx function parameters eventpoll: rename struct epoll_filefd to epoll_key eventpoll: add file based control interface eventpoll: export is_file_epoll() eventpoll: pass struct epoll_filefd through ep_find() and ep_insert() eventpoll: hoist CTL_ADD scratch state into struct ep_ctl_ctx eventpoll: use bool for predicate helpers eventpoll: rename epi->next and txlist for clarity eventpoll: wrap EP_UNACTIVE_PTR in typed sentinel helpers eventpoll: extract lock dance from do_epoll_ctl() into ep_ctl_lock() eventpoll: extract ep_deliver_event() from ep_send_events() eventpoll: split ep_clear_and_put() into drain helpers eventpoll: split ep_insert() into alloc + register stages eventpoll: relocate KCMP helpers near compat syscalls eventpoll: rename attach_epitem() to ep_attach_file() eventpoll: drop unused depth argument from epoll_mutex_lock() eventpoll: rename ep_refcount_dec_and_test() to ep_put() ...
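The false-negative fix above concerns epoll_wait() called with a zero timeout, that is, a pure readiness check that must not miss events already queued. The userspace sketch below shows what such a call looks like, using an eventfd as the event source. The helper name `zero_timeout_poll()` is hypothetical. It does not reproduce the race, which needs a concurrent scan on the same epoll instance.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Register an eventfd with a fresh epoll instance, optionally make it
 * readable, and poll once without blocking. Returns the number of ready
 * events reported by epoll_wait(), or -1 on setup failure.
 */
int zero_timeout_poll(int make_ready)
{
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event got;
    uint64_t one = 1;
    int n = -1;
    int epfd = epoll_create1(0);
    int efd = eventfd(0, EFD_NONBLOCK);

    if (epfd < 0 || efd < 0)
        goto cleanup;
    ev.data.fd = efd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto cleanup;
    if (make_ready && write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
        goto cleanup;

    /* timeout == 0: report what is ready right now and return immediately */
    n = epoll_wait(epfd, &got, 1, 0);

cleanup:
    if (efd >= 0)
        close(efd);
    if (epfd >= 0)
        close(epfd);
    return n;
}
```

A caller that treats a return value of 0 as "nothing pending" relies on exactly the guarantee the sequence lock restores.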
2026-06-15  Merge tag 'vfs-7.2-rc1.bh' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull buffer_head updates from Christian Brauner: "This removes b_end_io from struct buffer_head. Instead of setting bio->bi_end_io to end_bio_bh_io_sync() which then calls bh->b_end_io(), the new bh_submit() and __bh_submit() interfaces set bio->bi_end_io to the appropriate completion handler directly, replacing two indirect function calls in the completion path with one. It is also one fewer function pointer in the middle of a writable data structure that can be corrupted, it shrinks struct buffer_head from 104 to 96 bytes allowing roughly 7% more buffer_heads to be cached in the same amount of memory, and it removes some atomic operations as the buffer refcount is no longer incremented before calling the end_io handler. All in-tree users (fs/buffer.c itself, ext4, jbd2, ocfs2, gfs2, nilfs2, and md-bitmap) are converted, and submit_bh(), mark_buffer_async_write(), and end_buffer_write_sync() are removed" * tag 'vfs-7.2-rc1.bh' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (34 commits) buffer: Remove end_buffer_write_sync() buffer: Change calling convention for end_buffer_read_sync() buffer: Remove b_end_io buffer: Remove submit_bh() md-bitmap: Convert read_file_page and write_file_page to bh_submit() nilfs2: Convert nilfs_mdt_submit_block to bh_submit() nilfs2: Convert nilfs_gccache_submit_read_data to bh_submit() nilfs2: Convert nilfs_btnode_submit_block to bh_submit() buffer: Remove mark_buffer_async_write() gfs2: Convert gfs2_aspace_write_folio to bh_submit() gfs2: Remove use of b_end_io in gfs2_meta_read_endio() gfs2: Convert gfs2_dir_readahead to bh_submit() gfs2: Convert gfs2_metapath_ra to bh_submit() ocfs2: Convert ocfs2_write_super_or_backup to bh_submit() ocfs2: Convert ocfs2_read_blocks to bh_submit() ocfs2: Convert ocfs2_read_block to bh_submit() ocfs2: Convert ocfs2_write_block to bh_submit() jbd2: Convert jbd2_write_superblock() to bh_submit() jbd2: Convert journal commit to 
bh_submit() ext4: Convert ext4_commit_super() to bh_submit() ...
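The "roughly 7% more buffer_heads" figure quoted above is consistent with a simple packing estimate. The sketch below counts how many 104-byte versus 96-byte objects fit into one slab, assuming a 4096-byte slab with no per-slab metadata. That layout is an assumption made for illustration, not something the message states, and real slab geometry may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Objects of obj_size bytes that fit in slab_bytes, ignoring metadata. */
size_t objs_per_slab(size_t slab_bytes, size_t obj_size)
{
    return slab_bytes / obj_size;
}

/* Capacity gain in tenths of a percent, truncated toward zero. */
unsigned int gain_permille(size_t slab_bytes, size_t old_size, size_t new_size)
{
    size_t before = objs_per_slab(slab_bytes, old_size);
    size_t after = objs_per_slab(slab_bytes, new_size);

    return (unsigned int)((after - before) * 1000 / before);
}
```

Under these assumptions a 4096-byte slab holds 39 objects at 104 bytes and 42 at 96 bytes, a gain of about 7.6%. The raw size ratio 104/96 alone would suggest about 8.3%.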
2026-06-15  Merge tag 'vfs-7.2-rc1.writeback' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs writeback updates from Christian Brauner: - Fix a race between cgroup_writeback_umount() and inode_switch_wbs() When a container exits, a race between cgroup_writeback_umount() and inode_switch_wbs()/cleanup_offline_cgwb() can trigger "VFS: Busy inodes after unmount" followed by a use-after-free on percpu counters. There is a window between inode_prepare_wbs_switch() returning true (having passed the SB_ACTIVE check and grabbed the inode) and the subsequent wb_queue_isw() call: if cgroup_writeback_umount() observes the global isw_nr_in_flight counter as non-zero but flush_workqueue() finds nothing queued yet, it returns early - leaving a held inode reference that blocks evict_inodes() and a later iput() that hits freed percpu counters. The race is closed by covering the window from inode_prepare_wbs_switch() through wb_queue_isw() with an RCU read-side critical section and synchronizing in the umount path. On top of that the now-dead rcu_barrier() left over from the queue_rcu_work() era is removed, and the global synchronize_rcu()/flush_workqueue() pair is replaced with a per-sb in-flight counter plus pin/unpin/drain helpers so umount no longer serializes against switch activity on unrelated superblocks. Under cgroup writeback churn on a 16 vCPU guest this takes umount latency from ~92-138ms p50 down to ~5-8ms p50 and the cumulative cost of cgroup_writeback_umount() from ~62ms to ~4us per call. The initial race fix is kept separate and minimal so it backports cleanly to stable trees that still queue switches via queue_rcu_work(). - Improve write performance with RWF_DONTCACHE Dirty DONTCACHE pages are now tracked per bdi_writeback so that the writeback flusher can be kicked in a targeted fashion for IOCB_DONTCACHE writes instead of relying on global writeback, and the PG_dropbehind flag is preserved when a folio is split. 
* tag 'vfs-7.2-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking mm: track DONTCACHE dirty pages per bdi_writeback mm: preserve PG_dropbehind flag during folio split writeback: use a per-sb counter to drain inode wb switches at umount writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()
2026-06-15  Merge tag 'vfs-7.2-rc1.super' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs superblock updates from Christian Brauner:
"This retires sget(). CIFS plus the two ext4 KUnit tests (extents-test, mballoc-test) were the last in-tree callers, and all three convert cleanly to sget_fc(). That lets sget() and its prototype come out, taking ~60 lines that only existed to be kept in lockstep with sget_fc() on every publish-path change"

* tag 'vfs-7.2-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: retire sget()
  smb: client: convert cifs_smb3_do_mount() to sget_fc()
  ext4: convert mballoc KUnit test to sget_fc()
  ext4: convert extents KUnit test to sget_fc()
2026-06-15  Merge tag 'vfs-7.2-rc1.openat2' of ↵  Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull openat2 updates from Christian Brauner: "Features: - Add O_EMPTYPATH to openat(2)/openat2(2). To get an operable file descriptor from an O_PATH file descriptor it is possible to use openat(fd, ".", O_DIRECTORY) for directories, but other file types require going through open("/proc/<pid>/fd/<nr>") and thus depend on a functioning procfs. With O_EMPTYPATH an empty path string is accepted and LOOKUP_EMPTY is set at path resolution time, allowing to reopen the file behind the file descriptor directly. Selftests are included. - Add an OPENAT2_REGULAR flag for openat2(2) which refuses to open anything but regular files with the new EFTYPE error code. This implements the "ability to only open regular files" feature requested by userspace via uapi-group.org and protects services from being redirected to fifos, device nodes, and friends. All atomic_open implementations were audited for OPENAT2_REGULAR handling. Explicit checks were added to ceph, gfs2, nfs (v4), and cifs/smb - these are the filesystems whose atomic_open can encounter an existing non-regular file and would otherwise call finish_open() on it or return a misleading error code. The remaining implementations (9p, fuse, vboxsf, nfs v2/v3) only call finish_open() on freshly created files and use finish_no_open() for lookup hits, letting the VFS catch non-regular files via the do_open() safety net. Cleanups: - Migrate the openat2 selftests to the kselftest harness and move them under selftests/filesystems/. The tests were written in the early days of selftests' TAP support and the modern kselftest harness is much easier to follow and maintain. The contents of the tests are unchanged and the new emptypath tests are ported on top. - Make the LAST_XXX last-type constants private to fs/namei.c. 
The only user outside of fs/namei.c was ksmbd which only needs to know whether the last component is a regular one, so vfs_path_parent_lookup() now performs the LAST_NORM check internally. The ints are replaced with a dedicated enum last_type" * tag 'vfs-7.2-rc1.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: replace ints with enum last_type for LAST_XXX vfs: make LAST_XXX private to fs/namei.c selftests: openat2: port emptypath_test to kselftest harness kselftest/openat2: test for OPENAT2_REGULAR flag openat2: new OPENAT2_REGULAR flag support openat2: introduce EFTYPE error code selftest: add tests for O_EMPTYPATH vfs: add O_EMPTYPATH to openat(2)/openat2(2) selftests: openat2: migrate to kselftest harness selftests: openat2: switch from custom ARRAY_LEN to ARRAY_SIZE selftests: openat2: move helpers to header selftests: move openat2 tests to selftests/filesystems/
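For context on the O_EMPTYPATH motivation above, this is roughly what the procfs detour looks like today when turning an O_PATH descriptor for a non-directory into a readable one. It is a sketch of the existing workaround only. `first_byte_via_reopen()` is a hypothetical helper, and the new flag itself is not used because its numeric value is not given in the message.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 010000000 /* asm-generic value; fallback only, arch-specific */
#endif

/*
 * Open `path` with O_PATH, reopen it for reading through
 * /proc/self/fd/<nr>, and return the first byte of the file,
 * or -1 on any failure (including procfs not being mounted).
 */
int first_byte_via_reopen(const char *path)
{
    char link[64];
    unsigned char byte;
    int result = -1;
    int path_fd = open(path, O_PATH | O_CLOEXEC);
    int fd;

    if (path_fd < 0)
        return -1;

    snprintf(link, sizeof(link), "/proc/self/fd/%d", path_fd);
    fd = open(link, O_RDONLY | O_CLOEXEC);
    if (fd >= 0) {
        if (read(fd, &byte, 1) == 1)
            result = byte;
        close(fd);
    }
    close(path_fd);
    return result;
}
```

The second open() fails when procfs is not mounted, which is the dependency the message says O_EMPTYPATH removes by accepting an empty path string relative to the descriptor.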
2026-06-15Merge tag 'kernel-7.2-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc kernel updates from Christian Brauner: "Fixes - rhashtable: give each instance its own lockdep class syzbot reported a circular locking dependency between ht->mutex and fs_reclaim via the simple_xattrs rhashtable being torn down during inode eviction. The predicted deadlock cannot occur: rhashtable_free_and_destroy() cancels the deferred worker before taking ht->mutex and acquisitions on distinct rhashtables are on distinct mutexes. Lockdep flags a cycle anyway because every ht->mutex in the kernel shared the single static lockdep class from rhashtable_init_noprof(). The lockdep key is lifted to a per-call-site static key so every rhashtable instance gets its own class. - selftests/clone3: fix misuse of the libcap library interface in the cap_checkpoint_restore test and remove unused variables - selftests/pid_namespace: compute the pid_max test limits dynamically instead of hardcoding values below the kernel-enforced minimum of PIDS_PER_CPU_MIN * num_possible_cpus() which made the tests fail on machines with many possible CPUs - selftests: fix the Makefile TARGETS entry for nsfs which wasn't adjusted when the tests moved under filesystems/ Cleanups - ipc/sem.c: use unsigned int for nsops to match the declaration in syscalls.h" * tag 'kernel-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests/clone3: remove unused variables selftests/clone3: fix libcap interface usage ipc/sem.c: use unsigned int for nsops selftests: Fix Makefile target for nsfs rhashtable: give each instance its own lockdep class selftests/pid_namespace: compute pid_max test limits dynamically
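The per-call-site key idea from the rhashtable fix can be illustrated in plain userspace C. This is only a sketch of the general pattern, not the kernel code: `struct key_sketch`, `struct table_sketch`, `table_init_key()` and the `table_init()` macro are hypothetical names invented for illustration. The macro expands a `static` object at every place it is written, so each call site hands a different key address to the common init function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lockdep class key: only its address matters. */
struct key_sketch {
	char unused;
};

/* Hypothetical table type that remembers which key it was initialised with. */
struct table_sketch {
	const struct key_sketch *key;
};

/* Common init path: the caller supplies the key. */
static void table_init_key(struct table_sketch *t, const struct key_sketch *key)
{
	assert(t != NULL && key != NULL);
	t->key = key;
}

/*
 * Each expansion of this macro declares its own static key object, so two
 * different call sites never share a key. Calling the same site repeatedly
 * (for example in a loop) reuses that site's key.
 */
#define table_init(t)                              \
	do {                                       \
		static struct key_sketch site_key; \
		table_init_key((t), &site_key);    \
	} while (0)
```

With a single function-local static key inside the init function, every table would have shared one key, which is the situation the commit message describes for `rhashtable_init_noprof()`.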
2026-06-15Merge tag 'kernel-7.2-rc1.task_exec_state' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull task_exec_state updates from Christian Brauner: "This introduces a new per-task task_exec_state structure and relocates the dumpable mode and the user namespace captured at execve() from mm_struct onto it. It stays attached to the task for its full lifetime. __ptrace_may_access() and several /proc owner and visibility checks need to consult two pieces of state for any observable task, including zombies that have already gone through exit_mm(): the dumpable mode and the user namespace captured at execve(). Both live on mm_struct today, which exit_mm() clears from the task long before the task is reaped. A reader that races with do_exit() observes task->mm == NULL and either fails the check or falls back to init_user_ns - which denies legitimate access to non-dumpable zombies that were running in a nested user namespace. mm_struct loses ->user_ns and the dumpability bits in ->flags. MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and its exit_mm() snapshot are removed. task_exec_state is the privilege domain established by an execve(). Within a thread group it is shared via refcount; across thread groups each task has its own: - CLONE_VM siblings (thread-group members, io_uring workers) refcount-share the parent's exec_state. - Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns. - execve() in the child allocates a fresh instance and installs it under task_lock + exec_update_lock via task_exec_state_replace(). - Credential changes (setresuid, capset, ...) and prctl(PR_SET_DUMPABLE) update dumpability on the current task's exec_state, i.e., on the thread group's shared instance. On top of this exec_mmap() no longer tears down the old mm while holding exec_update_lock for writing and cred_guard_mutex. 
Neither lock is needed for that: exec_update_lock only exists to make the mm swap atomic with the later commit_creds() and all its readers operate on the new mm; none looks at the detached old mm. The cost was real: __mmput() runs exit_mmap() over the entire old address space and can block in exit_aio() waiting for in-flight AIO, so execve() of a large process blocked ptrace_attach() and every exec_update_lock reader for the duration of the teardown. The old mm is now stashed in bprm->old_mm and released from setup_new_exec() after both locks are dropped, with a backstop in free_bprm() for the error paths" * tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exec: free the old mm outside the exec locks exec_state: relocate dumpable information ptrace: add ptracer_access_allowed() exec: introduce struct task_exec_state sched/coredump: introduce enum task_dumpable
2026-06-15Merge tag 'vfs-7.2-rc1.casefold' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs casefolding updates from Christian Brauner: "This exposes the case folding behavior of local filesystems so that file servers - nfsd, ksmbd, and user space file servers - can report the actual behavior to clients instead of guessing. Filesystems report case-insensitive and case-nonpreserving behavior via new file_kattr flags in their fileattr_get implementations. fat, exfat, ntfs3, hfs, hfsplus, xfs, cifs, nfs, vboxsf, and isofs are wired up. Local filesystems that are not explicitly handled default to the usual POSIX behavior of case-sensitive and case-preserving. nfsd uses this to report case folding via NFSv3 PATHCONF and to implement the NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING attributes - both have been part of the NFS protocols for decades to support clients on non-POSIX systems - and ksmbd reports it via FS_ATTRIBUTE_INFORMATION. Exposing the information through the fileattr uapi covers user space file servers. The immediate motivation is interoperability: Windows NFS clients hard-require servers to report case-insensitivity for Win32 applications to work correctly, and a client that knows the server is case-insensitive can avoid issuing multiple LOOKUP/READDIR requests searching for case variants. The Linux NFS client already grew support for case-insensitive shares years ago in support of the Hammerspace NFS server - negative dentry caching must be disabled (a lookup for "FILE.TXT" failing must not cache a negative entry when "file.txt" exists) and directory change invalidation must drop cached case-folded name variants. Such servers often operate in multi-protocol environments where a single file service instance caters to both NFS and SMB clients, and nfsd needs to report case folding properly to participate as a first-class citizen there. 
A follow-up series brings fixes for the initial work: the nfsd case-info probe now uses kernel credentials, maps -ESTALE to NFS3ERR_STALE, and has its cost capped across READDIR entries; the nfs client avoids transiently zeroed case capability bits during the probe and skips the pathconf probe when neither field is consumed; the FS_CASEFOLD_FL semantics are clarified in the UAPI header; and the tools UAPI headers are synced" * tag 'vfs-7.2-rc1.casefold' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) nfsd: Cap case-folding probe cost across READDIR entries nfsd: Map -ESTALE from case probe to NFS3ERR_STALE nfsd: Use kernel credentials for case-info probe fs: Clarify FS_CASEFOLD_FL semantics in UAPI header nfs: Skip pathconf probe when neither field is consumed nfs: Avoid transient zeroed case capability bits during probe tools headers UAPI: Sync case-sensitivity flags from linux/fs.h ksmbd: Report filesystem case sensitivity via FS_ATTRIBUTE_INFORMATION nfsd: Implement NFSv4 FATTR4_CASE_INSENSITIVE and FATTR4_CASE_PRESERVING nfsd: Report export case-folding via NFSv3 PATHCONF isofs: Implement fileattr_get for case sensitivity vboxsf: Implement fileattr_get for case sensitivity nfs: Implement fileattr_get for case sensitivity cifs: Implement fileattr_get for case sensitivity xfs: Report case sensitivity in fileattr_get hfsplus: Report case sensitivity in fileattr_get hfs: Implement fileattr_get for case sensitivity ntfs3: Implement fileattr_get for case sensitivity exfat: Implement fileattr_get for case sensitivity fat: Implement fileattr_get for case sensitivity ...
2026-06-15Merge tag 'vfs-7.2-rc1.directory.delegations' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory delegations from Christian Brauner: "This contains the VFS prerequisites for supporting directory delegations in nfsd via CB_NOTIFY callbacks. The filelock core gains support for ignoring delegation breaks for directory change events together with an inode_lease_ignore_mask() helper, and fsnotify gains fsnotify_modify_mark_mask() and a FSNOTIFY_EVENT_RENAME data type. With this in place nfsd can request delegations on directories and set up inotify watches to trigger sending CB_NOTIFY events to clients instead of having every directory change break the delegation. New tracepoints are added to fsnotify() and to the start of break_lease(), and trace_break_lease_block() is passed the currently blocking lease instead of the new one. A follow-up fix moves the LEASE_BREAK_* flags out of #ifdef CONFIG_FILE_LOCKING to fix the build for CONFIG_FILE_LOCKING=n configurations" * tag 'vfs-7.2-rc1.directory.delegations' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: filelock: move LEASE_BREAK_* flags out of #ifdef CONFIG_FILE_LOCKING fsnotify: add FSNOTIFY_EVENT_RENAME data type fsnotify: add fsnotify_modify_mark_mask() fsnotify: new tracepoint in fsnotify() filelock: add an inode_lease_ignore_mask helper filelock: add a tracepoint to start of break_lease() filelock: add support for ignoring deleg breaks for dir change events filelock: pass current blocking lease to trace_break_lease_block() rather than "new_fl"
2026-06-15Merge tag 'vfs-7.2-rc1.inode' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This extends the lockless ->i_count handling. iput() could already decrement any value greater than one locklessly but acquiring a reference always required taking inode->i_lock. Now acquiring a reference is lockless as long as the count was already at least 1, i.e., only the 0->1 and 1->0 transitions take the lock. This avoids the lock for the common cases of nfs calling into the inode hash and btrfs using igrab(). Cleanup-wise icount_read_once() is added to line up with inode_state_read_once() and the open-coded ->i_count loads across the tree are converted, and ihold() is relocated and tidied up. On top of that some stale lock ordering annotations are retired from the inode hash code: iunique() no longer takes the hash lock since the inode hash became RCU-searchable and s_inode_list_lock is no longer taken under the hash lock either" * tag 'vfs-7.2-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: retire stale lock ordering annotations from inode hash fs: allow lockless ->i_count bumps as long as it does not transition 0->1 fs: relocate and tidy up ihold() fs: add icount_read_once() and stop open-coding ->i_count loads
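The rule described above (bump or drop the count locklessly unless the transition is 0->1 or 1->0) can be sketched with C11 atomics. This is a userspace illustration under stated assumptions, not the kernel's implementation: `ref_get_if_held()` and `ref_put_unless_last()` are hypothetical helper names, and the locked slow path that a caller would take when they return false is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Take a reference without a lock, but only if the count is already >= 1.
 * Returns false when the count is 0, in which case the caller would have to
 * fall back to a locked 0->1 transition (not shown).
 */
static bool ref_get_if_held(atomic_int *count)
{
	assert(count != NULL);
	int old = atomic_load_explicit(count, memory_order_relaxed);

	while (old >= 1) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak_explicit(count, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/*
 * Drop a reference without a lock, but only if this is not the last one.
 * Returns false when the count is 1 (or 0), leaving the 1->0 transition to a
 * locked slow path (not shown).
 */
static bool ref_put_unless_last(atomic_int *count)
{
	assert(count != NULL);
	int old = atomic_load_explicit(count, memory_order_relaxed);

	while (old > 1) {
		if (atomic_compare_exchange_weak_explicit(count, &old, old - 1,
							  memory_order_release,
							  memory_order_relaxed))
			return true;
	}
	return false;
}
```

The compare-and-swap loop is what keeps the 0->1 case out of the fast path: a plain atomic increment could resurrect an object whose count had already dropped to zero.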
2026-06-15Merge tag 'vfs-7.2-rc1.exportfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull exportfs updates from Christian Brauner: "This cleans up the exportfs support for block-style layouts that provide direct block device access: the operations for layout-based block device access are split out of struct export_operations into a separate header, ->commit_blocks() no longer takes a struct iattr argument, and the way support for layout-based block device access is detected is reworked. nfsd's blocklayout code also stops honoring loca_time_modify. This is preparation for supporting export of more than a single device per file system" * tag 'vfs-7.2-rc1.exportfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs,nfsd: rework checking for layout-based block device access support exportfs: don't pass struct iattr to ->commit_blocks exportfs: split out the ops for layout-based block device access nfsd/blocklayout: always ignore loca_time_modify
2026-06-14bpf: Raise maximum call chain depth to 16 framesAlexei Starovoitov
Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains that Rust-BPF requires and update selftests. Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14i3c: master: Make i3c_master_add_i3c_dev_locked() return voidAdrian Hunter
The return value of i3c_master_add_i3c_dev_locked() is not used by any caller, and callers are not in a position to recover from failures in this path. Change the function to return void. Amend the kernel-doc accordingly, fix some grammar and remove a stale paragraph. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260612080107.11606-6-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-14i3c: master: Export i3c_master_enec_disec_locked()Adrian Hunter
The existing i3c_master_enec_locked() wrapper always treats a NACKed ENEC CCC as a failure (M2 error). However, broadcasting ENEC to enable Hot-Join is legitimately useful even when no I3C devices are currently present on the bus, in which case the broadcast will be NACKed and should not be reported as an error. The underlying helper i3c_master_enec_disec_locked() already accepts a suppress_m2 flag that lets callers ignore such NACKs. Expose it so that a subsequent patch enabling Hot-Join events can issue ENEC with M2 suppression. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260608054312.10604-8-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-14i3c: master: Defer new-device registration out of DAA caller contextAdrian Hunter
Master drivers may invoke i3c_master_do_daa_ext() during resume to re-run Dynamic Address Assignment. As well as assigning addresses to any newly arrived devices, this restores the dynamic address of devices that lost it across system suspend, so it has to run as part of the controller's resume path. A side effect of i3c_master_do_daa_ext() today is that it also registers any newly discovered I3C devices with the driver model inline, via i3c_master_register_new_i3c_devs(). Doing that from the resume path is problematic: a hot-join-capable device may join the bus during this same DAA, and registering it immediately would push driver model work (probing, sysfs, etc.) into the controller's resume context, where the rest of the system is not yet fully resumed and the controller driver is still partway through its own resume sequence. Decouple discovery from registration: add a reg_work work item to struct i3c_master_controller and have i3c_master_do_daa_ext() queue it on master->wq (the freezable workqueue) instead of calling i3c_master_register_new_i3c_devs() directly. The worker performs the registration only when the controller is not shutting_down, and is cancelled alongside hj_work in i3c_master_shutdown(). Because wq is freezable, any newly observed devices end up being registered after the system has finished resuming. i3c_master_register() also routes its initial post-bus-init registration through reg_work, using flush_work() to keep probe-time behavior synchronous. This keeps a single registration code path and ensures the worker is the only writer of desc->dev. Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260608054312.10604-7-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-14i3c: master: Ensure Hot-Join operations are stopped on shutdownAdrian Hunter
System shutdown invokes each device's bus shutdown callback to quiesce hardware, but the I3C bus type does not currently implement one. As a result, on shutdown the controller's Hot-Join work and any in-flight i3c_master_do_daa() can keep running (or be newly triggered) while the rest of the system is being torn down. A similar window exists at i3c_master_unregister() time: cancel_work_sync() on hj_work prevents queued work from completing, but does not stop a fresh Hot-Join IBI from re-queueing the worker, nor a concurrent sysfs writer from toggling Hot-Join via i3c_set_hotjoin(). Introduce a single "shutting down" gate in the I3C core, set under the bus maintenance lock so it is observed by any in-progress DAA path before pending work is cancelled. Install an i3c_bus_type shutdown callback that engages this gate for master devices during system shutdown, and use the same gate in i3c_master_unregister() so both paths get identical guarantees. Once the gate is engaged, the Hot-Join worker, i3c_master_do_daa_ext() and i3c_set_hotjoin() all bail out cleanly, so Hot-Join IBIs that race with shutdown become no-ops, direct DAA callers see -ENODEV, and sysfs writers can no longer re-enable Hot-Join through ops->enable_hotjoin() while the controller is going away. No functional change for the steady-state runtime path; the new checks only take effect once the controller has been marked as shutting down. Note, this patch depends on patch "i3c: master: Consolidate Hot-Join DAA work in the core". Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260608054312.10604-5-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-14i3c: master: Consolidate Hot-Join DAA work in the coreAdrian Hunter
Three master drivers (dw-i3c-master, i3c-master-cdns, svc-i3c-master) each carry an essentially identical Hot-Join handler: a struct work_struct embedded in their private state, a work function that just calls i3c_master_do_daa() on the embedded i3c_master_controller, plus matching INIT_WORK()/cancel_work_sync() boilerplate in probe/remove (and shutdown for dw-i3c). The IBI/ISR paths then queue that work onto master->wq, which already lives in the core. Move this pattern into the I3C core: - Add struct work_struct hj_work to struct i3c_master_controller and initialise it in i3c_master_register() with a core-provided handler i3c_master_hj_work_fn() that performs i3c_master_do_daa(). - Cancel the work in i3c_master_unregister() so all controllers get correct teardown ordering against the workqueue for free. - Export i3c_master_queue_hotjoin() as the single entry point drivers call from their Hot-Join IBI handler. Convert the three existing users to the new API: drop their private hj_work fields, work functions, INIT_WORK() and cancel_work_sync() calls, and replace the queue_work(master->wq, &drv->hj_work) call sites with i3c_master_queue_hotjoin(&drv->base). The dw-i3c shutdown path still needs to flush pending Hot-Join work before tearing down the hardware, so it is updated to cancel master->base.hj_work directly. No functional change intended: the work is still queued on the same master->wq, runs the same i3c_master_do_daa(), and is cancelled at controller teardown. Future Hot-Join improvements now only need to be made in one place. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260608054312.10604-4-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-14i3c: master: Make hot-join workqueue freezable to block hot-join during suspendAdrian Hunter
The I3C master workqueue (master->wq) is used to defer work that needs thread context and the bus maintenance lock, most notably Hot Join processing (which calls i3c_master_do_daa() to assign dynamic addresses to newly joined devices). Currently the workqueue keeps running across system suspend, which can race with the suspend path: - do_daa() may execute after the controller has been suspended, issuing bus transactions on a powered-down or otherwise unusable controller. - New I3C devices can be enumerated and added to the bus mid-suspend, registering driver model objects at a point where the I3C subsystem and its consumers are not prepared to handle them. Mark the workqueue WQ_FREEZABLE so its workers are frozen for the duration of system suspend/hibernate and resumed afterwards. This naturally defers any pending or newly queued Hot Join work until the system (and the controller) is fully resumed, closing both races without adding explicit suspend/resume synchronization in the master drivers. Update the kerneldoc for struct i3c_master_controller::wq to reflect that the workqueue is freezable. Fixes: 3a379bbcea0af ("i3c: Add core I3C infrastructure") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260608054312.10604-2-adrian.hunter@intel.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
2026-06-13dpll: extend pin notifier with notification source IDGrzegorz Nitka
Extend the DPLL pin notification API to include a source identifier indicating where the notification originates. This allows notifier consumers to distinguish between notifications coming from an associated DPLL instance, a parent pin, or the pin itself. A new field, src_clock_id, is added to struct dpll_pin_notifier_info and is passed through all pin-related notification paths. Callers of dpll_pin_notify() are updated to provide a meaningful source identifier based on their context: - pin registration/unregistration uses the DPLL's clock_id, - pin-on-pin operations use the parent pin's clock_id, - pin changes use the pin's own clock_id. As introduced in the commit ("dpll: allow registering FW-identified pin with a different DPLL"), it is possible to share the same physical pin via firmware description (fwnode) with DPLL objects from different kernel modules. This means that a given pin can be registered multiple times. Drivers such as ICE (E825 devices) rely on this mechanism when listening for the event where a shared-fwnode pin appears, while avoiding reacting to events triggered by their own registration logic. This change only extends the notification metadata and does not alter existing semantics for drivers that do not use the new field. Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Link: https://patch.msgid.link/20260607183045.1213735-9-grzegorz.nitka@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-13io_uring: remove the per-ctx fallback task_work machineryJens Axboe
With the tctx fallback running its entries directly, the per-ctx fallback work has a single user left: moving local (DEFER_TASKRUN) task_work entries out of a ring that is going away. Both of its call sites are process context and don't hold ->uring_lock, the same conditions the deferred fallback work itself ran under - so run the entries in cancel mode right there instead, and rename the helper to io_cancel_local_task_work() to match what it now does. With that, ->fallback_llist, ->fallback_work, io_fallback_req_func() and __io_fallback_tw() can all go away, along with the fallback work flushing in the ring exit and cancel paths. Requests that get orphaned by an exiting task now run via the tctx fallback work, which the ring exit side implicitly waits on through the ctx refs those requests hold. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13io_uring: switch normal task_work to a mpscqJens Axboe
Like the local task_work list, the normal (tctx) task_work list is an llist, and hence needs the O(n) llist_reverse_order() pass before running entries in queue order. On top of that, capped runs - sqpoll processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the claimed-but-unprocessed leftovers carried in a separate retry_list, as they can't be pushed back to the shared list. Switch tctx->task_list to a mpscq, like what was done for the DEFER_TASKRUN paths as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13io_uring: switch local task_work to a mpscqJens Axboe
The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO ordered, and hence __io_run_local_work() has to restore the right running order with an O(n) llist_reverse_order() pass first. On top of that, a batch that gets capped by max_events needs the leftover entries parked on a separate ->retry_llist, as they can't be pushed back to the shared list. Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg retry loop, entries are popped in queue order with no reversal pass, capping a run simply leaves the remainder on the queue, and ->retry_llist goes away entirely. The consumer cursor, ->work_head, lives with the rest of the ->uring_lock protected state rather than next to the queue, so that popping entries doesn't dirty the producer side cacheline. For low amounts of task_work, this ends up being a bit more efficient than the existing scheme. As an example of that, doing multishot receives for 8 clients has the following task_work overhead: 1.02% sock-test [kernel.kallsyms] [k] io_req_local_work_add 0.88% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop 0.60% sock-test [kernel.kallsyms] [k] llist_reverse_order 0.14% sock-test [kernel.kallsyms] [k] __io_run_local_work 2.64% at ~46Gb/sec and after this change: 1.08% sock-test [kernel.kallsyms] [k] io_req_local_work_add 1.03% sock-test [kernel.kallsyms] [k] __io_run_local_work 2.11% at ~53Gb/sec which has less overhead even though that test run was faster. For a case of having 1024 clients on a single ring: 2.22% sock-test [kernel.kallsyms] [k] llist_reverse_order 0.84% sock-test [kernel.kallsyms] [k] __io_run_local_work_loop 0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add 0.02% sock-test [kernel.kallsyms] [k] __io_run_local_work 3.50% at ~24Gb/sec we start to see the llist reversing taking a considerable amount of time, and the total add+run task_work overhead is around 3.5%. 
After the change: 0.90% sock-test [kernel.kallsyms] [k] __io_run_local_work 0.42% sock-test [kernel.kallsyms] [k] io_req_local_work_add 1.32% at ~26Gb/sec most of that overhead is gone, and performance is better as well. Caleb Sander Mateos <csander@purestorage.com> reports that it improves the performance of a ublk 4kb workload by 4% [1], while testing v1 of this patchset. [1] https://lore.kernel.org/io-uring/CADUfDZr-MMYBaP-e+y9+xuRhuiunO2sBTUCmwZyd7AgT8sVtiQ@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-13io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queueJens Axboe
Local task_work is currently using llists for managing the work, but that's a LIFO type of list. This means that running this task_work needs to reverse the list first, to ensure fairness in running the queued items. Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC node-based queue algorithm, modified with an externally held consumer cursor and conditional stub reinsertion. See comments in the header. Producers are wait-free: a push is a single xchg() on the queue tail, which serializes concurrent producers and defines the FIFO order, plus a store linking the node to its predecessor. There are no cmpxchg retry loops, and pushing is safe from any context, including hardirq. The cost of linked list FIFO ordering is that a push publishes the node in two steps - the xchg() makes it visible as the new tail before the subsequent store links it into the chain that is reachable from the head. A consumer hitting that window gets a NULL from mpscq_pop() while mpscq_empty() reports false, and must retry later rather than treat the queue as empty. The window is two instructions wide, but a producer can get preempted inside it, so the consumer must not busy wait on it. The consumer side supports a single consumer at a time, with callers providing their own serialization. A stub node, which also defines the empty state (tail == stub), allows the consumer to detach the final node without racing against producer link stores: that node is only handed out once the stub has been cmpxchg'ed back in as the tail. This also guarantees that the previous tail returned by mpscq_push() cannot get freed before that push has linked it, making it always valid for comparisons. The consumer cursor is deliberately not part of the queue struct - the caller owns it and passes it to mpscq_pop(). This is done to keep the consumer and producer cachelines separate. 
The cursor is written for every popped entry, and keeping it on the same cacheline as ->tail would have the consumer invalidating the line that producers need for every push. Keeping it external lets the caller place it with its own consumer side data instead. Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
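The push and pop scheme described above can be sketched in userspace with C11 atomics. This is the classic Vyukov intrusive MPSC queue with an externally held consumer cursor, not the kernel's mpscq code: the names `node_sketch`, `queue_sketch`, `queue_init()`, `queue_push()` and `queue_pop()` are hypothetical, and the stub is re-pushed unconditionally when the last node is detached, whereas the commit message describes a cmpxchg-based conditional reinsertion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct node_sketch {
	_Atomic(struct node_sketch *) next;
	int value;
};

struct queue_sketch {
	_Atomic(struct node_sketch *) tail; /* producer end, hit by every push */
	struct node_sketch stub;            /* tail == &stub means empty */
};

/* The consumer cursor lives outside the queue struct and is owned by the caller. */
static void queue_init(struct queue_sketch *q, struct node_sketch **cursor)
{
	assert(q != NULL && cursor != NULL);
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->tail, &q->stub);
	*cursor = &q->stub;
}

/* Wait-free push: one exchange on the tail, then link the predecessor. */
static void queue_push(struct queue_sketch *q, struct node_sketch *n)
{
	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	struct node_sketch *prev =
		atomic_exchange_explicit(&q->tail, n, memory_order_acq_rel);
	/* Between the exchange above and this store, n is not yet reachable. */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. NULL means "empty" or "a push is still in flight". */
static struct node_sketch *queue_pop(struct queue_sketch *q,
				     struct node_sketch **cursor)
{
	struct node_sketch *cur = *cursor;
	struct node_sketch *next =
		atomic_load_explicit(&cur->next, memory_order_acquire);

	if (cur == &q->stub) {
		if (!next)
			return NULL;
		/* Skip over the stub to the first real node. */
		*cursor = cur = next;
		next = atomic_load_explicit(&cur->next, memory_order_acquire);
	}
	if (next) {
		*cursor = next;
		return cur;
	}
	/* cur has no successor: either it is the tail or a push is mid-flight. */
	if (cur != atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;
	/* Re-insert the stub behind cur so cur can be detached safely. */
	queue_push(q, &q->stub);
	next = atomic_load_explicit(&cur->next, memory_order_acquire);
	if (next) {
		*cursor = next;
		return cur;
	}
	return NULL;
}
```

A consumer that gets NULL here cannot tell an empty queue from a half-published push without also comparing the tail against the stub, which is the role the commit message assigns to `mpscq_empty()`.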
2026-06-12bpf: Fix setting retval to -EPERM for cgroup hooks not returning errnoXu Kuohai
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets the hook return value to -EPERM if it is not a valid errno. This is correct for errno-based hooks, which return 0 on success and negative errno on failure, but wrong for boolean and void LSM hooks. Boolean LSM hooks should only return true or false, and void LSM hooks have no return value at all. Fix it by skipping setting -EPERM for hooks not returning errno. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
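The decision this fix changes can be modelled as a small pure function. The sketch below is a simplified illustration, not the body of bpf_prog_run_array_cg(): `finalize_retval()`, `is_errno_value()`, `MAX_ERRNO_SKETCH` and the `hook_returns_errno` parameter are hypothetical names, and the errno range check assumes the usual convention that errno values lie in -4095..-1.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed upper bound for errno magnitudes, for illustration only. */
#define MAX_ERRNO_SKETCH 4095

static bool is_errno_value(int v)
{
	return v < 0 && v >= -MAX_ERRNO_SKETCH;
}

/*
 * prog_allowed:       false models a program that exited with 0.
 * retval:             the hook return value as it stands after the run.
 * hook_returns_errno: false for boolean and void hooks.
 */
static int finalize_retval(bool prog_allowed, int retval, bool hook_returns_errno)
{
	if (prog_allowed)
		return retval;
	/* Only errno-based hooks get -EPERM forced when no errno was set. */
	if (hook_returns_errno && !is_errno_value(retval))
		return -EPERM;
	return retval;
}
```

For a boolean hook, forcing -EPERM would turn a "false" result into a non-zero value that callers read as "true", which is why the override is skipped for hooks that do not return an errno.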
2026-06-12net: watchdog: fix refcount tracking racesEric Dumazet
Blamed commit converted the untracked dev_hold()/dev_put() calls in the watchdog code to use the tracked dev_hold_track()/dev_put_track() (which were later renamed/interfaced to netdev_hold() and netdev_put()). By introducing dev->watchdog_dev_tracker to store the reference tracking information without adding synchronization between netdev_watchdog_up() and dev_watchdog(), it enabled the race condition where this pointer could be overwritten or freed concurrently, leading to the list corruption crash syzbot reported: list_del corruption, ffff888114a18c00->next is NULL kernel BUG at lib/list_debug.c:52 ! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 UID: 0 PID: 91 Comm: kworker/u8:5 Not tainted syzkaller #0 PREEMPT(lazy) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026 Workqueue: events_unbound linkwatch_event RIP: 0010:__list_del_entry_valid_or_report.cold+0x22/0x2a lib/list_debug.c:52 Call Trace: <TASK> __list_del_entry_valid include/linux/list.h:132 [inline] __list_del_entry include/linux/list.h:246 [inline] list_move_tail include/linux/list.h:341 [inline] ref_tracker_free+0x1a7/0x6c0 lib/ref_tracker.c:329 netdev_tracker_free include/linux/netdevice.h:4491 [inline] netdev_put include/linux/netdevice.h:4508 [inline] netdev_put include/linux/netdevice.h:4504 [inline] netdev_watchdog_down net/sched/sch_generic.c:600 [inline] dev_deactivate_many+0x28c/0xfe0 net/sched/sch_generic.c:1363 dev_deactivate+0x109/0x1d0 net/sched/sch_generic.c:1397 linkwatch_do_dev net/core/link_watch.c:184 [inline] linkwatch_do_dev+0xd3/0x120 net/core/link_watch.c:166 __linkwatch_run_queue+0x3a5/0x810 net/core/link_watch.c:240 linkwatch_event+0x8f/0xc0 net/core/link_watch.c:314 process_one_work+0xa0e/0x1980 kernel/workqueue.c:3314 process_scheduled_works kernel/workqueue.c:3397 [inline] worker_thread+0x5ef/0xe50 kernel/workqueue.c:3478 kthread+0x370/0x450 kernel/kthread.c:436 ret_from_fork+0x69a/0xc80 arch/x86/kernel/process.c:158 
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 This patch has three coordinated parts: 1) Add dev->watchdog_lock and dev->watchdog_ref_held to serialize watchdog operations. 2) Remove netdev_watchdog_up() call from netif_carrier_on(): This ensures netdev_watchdog_up() is only called from process/BH context (via linkwatch workqueue dev_activate()), allowing us to use spin_lock_bh() for synchronization. 3) Synchronize watchdog up and watchdog timer: Protect netdev_watchdog_up() with tx_global_lock and watchdog_lock. Only allocate a new tracker in netdev_watchdog_up() if one is not already present. In dev_watchdog(), ensure we don't release the tracker if the timer was rescheduled either by dev_watchdog() itself or concurrently by netdev_watchdog_up(). Fixes: f12bf6f3f942 ("net: watchdog: add net device refcount tracker") Reported-by: syzbot+381d82bbf0253710b35d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6a26b751.c25708ab.1b19ef.0013.GAE@google.com/T/#u Tested-by: syzbot+3479efbc2821cb2a79f2@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260611152737.2580480-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-12tls: remove tls_toe and the related driverSabrina Dubroca
The tls_toe feature and its single user (chelsio chtls) have been unmaintained for multiple years. It also hooks into the core of the TCP implementation, and bypasses most of the networking stack. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/1f30e73275c07bf879f547589872d0916025a52e.1781165969.git.sd@queasysnail.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-12block: add configurable error injectionChristoph Hellwig
Add a new block error injection interface that allows injecting specific status codes for specific ranges. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Link: https://patch.msgid.link/20260611140703.2401204-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-12of: Respect #{iommu,msi}-cells in mapsRobin Murphy
So far our parsing of {iommu,msi}-map properties has always blindly assumed that the output specifiers will always have exactly 1 cell. This typically does happen to be the case, but is not actually enforced (and the PCI msi-map binding even explicitly states support for 0 or 1 cells) - as a result we've now ended up with dodgy DTs out in the field which depend on this behaviour to map a 1-cell specifier for a 2-cell provider, despite that being bogus per the bindings themselves. Since there is some potential use in being able to map at least single input IDs to multi-cell output specifiers (and properly support 0-cell outputs as well), add support for properly parsing and using the target nodes' #cells values, albeit with the unfortunate complication of still having to work around expectations of the old behaviour too. Since there are multi-cell output specifiers, the callers of of_map_id() may need to get the exact cell output value for further processing. Update of_map_id() to set args_count in the output to reflect the actual number of output specifier cells. Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com> Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com> Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-3-dc509dacb19a@oss.qualcomm.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-06-12of: Factor arguments passed to of_map_id() into a structCharan Teja Kalla
Change of_map_id() to take a pointer to struct of_phandle_args instead of passing target device node and translated IDs separately. Update all callers accordingly. Add an explicit filter_np parameter to of_map_id() and of_map_msi_id() to separate the filter input from the output. Previously, the target parameter served dual purpose: as an input filter (if non-NULL, only match entries targeting that node) and as an output (receiving the matched node with a reference held). Now filter_np is the explicit input filter and arg->np is the pure output. Previously, of_map_id() would call of_node_put() on the matched node when a filter was provided, making reference ownership inconsistent. Remove this internal of_node_put() call so that of_map_id() now always transfers ownership of the matched node reference to the caller via arg->np. Callers are now consistently responsible for releasing this reference with of_node_put(arg->np) when done. Acked-by: Frank Li <Frank.Li@nxp.com> Suggested-by: Rob Herring (Arm) <robh@kernel.org> Suggested-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Signed-off-by: Charan Teja Kalla <charan.kalla@oss.qualcomm.com> Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com> Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-2-dc509dacb19a@oss.qualcomm.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
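The reference-ownership rule described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `toy_node`, `toy_args`, `toy_map_id()` and the get/put helpers are hypothetical stand-ins invented for illustration, not the kernel's of_map_id() signature, which this log does not show. The model has a single candidate node instead of a parsed map property.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted device node. */
struct toy_node {
	int refcount;
};

/* Hypothetical stand-in for the output argument struct. */
struct toy_args {
	struct toy_node *np;	/* pure output: matched node, reference held */
};

static void toy_node_get(struct toy_node *np)
{
	np->refcount++;
}

static void toy_node_put(struct toy_node *np)
{
	if (np)
		np->refcount--;
}

/*
 * Model of a lookup with one candidate node. filter_np is input only:
 * if non-NULL, the candidate must be that node. On a match the function
 * takes a reference and hands it to the caller through arg->np; it never
 * drops that reference itself.
 */
static int toy_map_id(struct toy_node *candidate,
		      const struct toy_node *filter_np,
		      struct toy_args *arg)
{
	if (filter_np && filter_np != candidate)
		return -1;	/* filtered out; arg is left untouched */

	toy_node_get(candidate);
	arg->np = candidate;	/* the caller now owns this reference */
	return 0;
}
```

In this model a caller pairs every successful `toy_map_id()` with one `toy_node_put(arg.np)`, whether or not a filter was passed, which mirrors the consistent ownership the commit message describes.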
2026-06-12of: Add convenience wrappers for of_map_id()Robin Murphy
Since we now have quite a few users parsing "iommu-map" and "msi-map" properties, give them some wrappers to conveniently encapsulate the appropriate sets of property names. This will also make it easier to then change of_map_id() to correctly account for specifier cells. Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Reviewed-by: Frank Li <Frank.Li@nxp.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Vijayanand Jitta <vijayanand.jitta@oss.qualcomm.com> Link: https://patch.msgid.link/20260603-parse_iommu_cells-v16-1-dc509dacb19a@oss.qualcomm.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-06-12iommu: Avoid copying the user array twice in the full-array copy helperNicolin Chen
iommu_copy_struct_from_full_user_array() copies a whole user array into a kernel buffer. In the common case, where user entry_len equals destination entry size, it takes a fast path and copies the whole array with a single copy_from_user(). That fast path does not return, so it falls through into the item-by-item copy_struct_from_user() loop and copies every entry a second time. For an equal entry_len that loop is just a copy_from_user() of the same bytes, so the whole array is copied twice for no benefit. Return right after the bulk copy. The per-item loop then runs only on the slow path, where entry_len differs and each entry needs size adaption. Fixes: 4f2e59ccb698 ("iommu: Add iommu_copy_struct_from_full_user_array helper") Link: https://patch.msgid.link/r/6c9eca4ff584cb977661e97799ac6fe934e7f51c.1780521606.git.nicolinc@nvidia.com Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
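The control-flow bug described above can be reproduced in a small user-space model. This is a minimal sketch, not the kernel helper: `copy_full_array()`, `model_copy()` and the `bytes_copied` counter are hypothetical names, `memcpy()` stands in for copy_from_user(), and the slow path only truncates or zero-pads each entry, which is simpler than what copy_struct_from_user() does.

```c
#include <stddef.h>
#include <string.h>

/* Bytes moved so far, so that a redundant second copy is observable. */
static size_t bytes_copied;

static void model_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	bytes_copied += len;
}

/*
 * Copy `count` source entries of `src_len` bytes each into destination
 * slots of `dst_len` bytes each.
 */
static void copy_full_array(void *dst, size_t dst_len,
			    const void *src, size_t src_len, size_t count)
{
	size_t i;

	if (src_len == dst_len) {
		/* Fast path: one bulk copy covers every entry. */
		model_copy(dst, src, count * dst_len);
		return;	/* without this return, the loop below copies again */
	}

	/* Slow path: adapt each entry to the destination entry size. */
	for (i = 0; i < count; i++) {
		size_t n = src_len < dst_len ? src_len : dst_len;
		char *d = (char *)dst + i * dst_len;

		memset(d, 0, dst_len);	/* zero any tail the source lacks */
		model_copy(d, (const char *)src + i * src_len, n);
	}
}
```

Removing the `return` in the fast path makes `bytes_copied` double for equal entry sizes while leaving the destination contents unchanged, which is why the original bug cost time without corrupting data.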
2026-06-12spi: spi-mem: Fix spi_controller_mem_ops kdocMiquel Raynal
The secondary_op_tmpl kdoc line was removed accidentally; add it back. Reported-by: Michael Walle <mwalle@kernel.org> Closes: https://lore.kernel.org/linux-mtd/DJ56CDMRVFQ6.FOZRIQTF3VDW@kernel.org/T/#u Fixes: 38fbe4b3f66e ("spi: spi-mem: Add a no_cs_assertion capability") Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://patch.msgid.link/20260612-perso-fix-no-cs-assertion-kdoc-v1-1-626b2d6d0d9b@bootlin.com Signed-off-by: Mark Brown <broonie@kernel.org>
2026-06-12spi: spi-mem: Add a no_cs_assertion capabilityMark Brown
Merge tag 'mtd/spi-mem-cont-read-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux into spi-7.2 Miquel Raynal <miquel.raynal@bootlin.com> says: Aside from preparation changes in the SPI NAND core, the changes carried here focus on the shared spi-mem layer, which is enhanced in order to bring two new features: - The possibility to fill a primary and a secondary operation template in the direct mapping structure in order to support continuous reads in SPI NAND, which may require two different read operations. - SPI controllers may indicate possible CS instabilities over long transfers by setting a boolean. This capability is related to the previous one: the need for it arose while testing SPI NAND continuous reads with the Cadence QSPI controller, which cannot, under certain conditions, keep the CS asserted for the length of an eraseblock-large transfer.
2026-06-12Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', ↵Joerg Roedel
'rockchip', 'verisilicon', 'riscv', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
2026-06-12Merge branch 'slab/for-7.2/alloc_token' into slab/for-nextVlastimil Babka (SUSE)
Merge series "slab: support for compiler-assisted type-based slab cache partitioning" from Marco Elver. From the cover letter [6]: Rework the general infrastructure around RANDOM_KMALLOC_CACHES into more flexible KMALLOC_PARTITION_CACHES, with the former being a partitioning mode of the latter. Introduce a new mode, KMALLOC_PARTITION_TYPED, which leverages a feature available in Clang 22 and later, called "allocation tokens" via __builtin_infer_alloc_token() [1]. Unlike KMALLOC_PARTITION_RANDOM (formerly RANDOM_KMALLOC_CACHES), this mode deterministically assigns a slab cache to an allocation of type T, regardless of allocation site. The builtin __builtin_infer_alloc_token(<malloc-args>, ...) instructs the compiler to infer an allocation type from arguments commonly passed to memory-allocating functions and returns a type-derived token ID. The implementation passes kmalloc-args to the builtin: the compiler performs best-effort type inference, and then recognizes common patterns such as `kmalloc(sizeof(T), ...)`, `kmalloc(sizeof(T) * n, ...)`, but also `(T *)kmalloc(...)`. Where the compiler fails to infer a type the fallback token (default: 0) is chosen. Note: kmalloc_obj(..) APIs fix the pattern how size and result type are expressed, and therefore ensures there's not much drift in which patterns the compiler needs to recognize. Specifically, kmalloc_obj() and friends expand to `(TYPE *)KMALLOC(__obj_size, GFP)`, which the compiler recognizes via the cast to TYPE*. Clang's default token ID calculation is described as [1]: typehashpointersplit: This mode assigns a token ID based on the hash of the allocated type's name, where the top half ID-space is reserved for types that contain pointers and the bottom half for types that do not contain pointers. 
Separating pointer-containing objects from pointerless objects and data allocations can help mitigate certain classes of memory corruption exploits [2]: an attacker who gains a buffer overflow on a primitive buffer cannot use it to directly corrupt pointers or other critical metadata in an object residing in a different, isolated heap region.

It is important to note that heap isolation strategies offer a best-effort approach, and do not provide a 100% security guarantee, albeit achievable at relatively low performance cost. Note that this also does not prevent cross-cache attacks: while waiting for future features like SLAB_VIRTUAL [3] to provide physical page isolation, this feature should be deployed alongside SHUFFLE_PAGE_ALLOCATOR and init_on_free=1 to mitigate cross-cache attacks and page-reuse attacks as much as possible today.

With all that, my kernel (x86 defconfig) shows me a histogram of slab cache object distribution per /proc/slabinfo (after boot):

<slab cache>     <objs>  <hist>
kmalloc-part-15    1465  ++++++++++++++
kmalloc-part-14    2988  +++++++++++++++++++++++++++++
kmalloc-part-13    1656  ++++++++++++++++
kmalloc-part-12    1045  ++++++++++
kmalloc-part-11    1697  ++++++++++++++++
kmalloc-part-10    1489  ++++++++++++++
kmalloc-part-09     965  +++++++++
kmalloc-part-08     710  +++++++
kmalloc-part-07     100  +
kmalloc-part-06     217  ++
kmalloc-part-05     105  +
kmalloc-part-04    4047  ++++++++++++++++++++++++++++++++++++++++
kmalloc-part-03     183  +
kmalloc-part-02     283  ++
kmalloc-part-01     316  +++
kmalloc            1422  ++++++++++++++

The above /proc/slabinfo snapshot shows me there are 6673 allocated objects (slabs 00 - 07) that the compiler claims contain no pointers or whose type it was unable to infer, and 12015 objects that contain pointers (slabs 08 - 15). On the whole, this looks relatively sane.
Additionally, when I compile my kernel with -Rpass=alloc-token, which provides diagnostics where (after dead-code elimination) type inference failed, I see 186 allocation sites where the compiler failed to identify a type (down from 966 when I sent the RFC [4]). Some initial review confirms these are mostly variable sized buffers, but also include structs with trailing flexible length arrays. Link: https://clang.llvm.org/docs/AllocToken.html [1] Link: https://blog.dfsec.com/ios/2025/05/30/blasting-past-ios-18/ [2] Link: https://lwn.net/Articles/944647/ [3] Link: https://lore.kernel.org/all/20250825154505.1558444-1-elver@google.com/ [4] Link: https://discourse.llvm.org/t/rfc-a-framework-for-allocator-partitioning-hints/87434 [5] Link: https://lore.kernel.org/all/20260511200136.3201646-1-elver@google.com/ [6]
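The "typehashpointersplit" scheme quoted above can be sketched in a few lines of C. This is an illustrative model only: `toy_hash()` is FNV-1a, picked purely for the example and not Clang's actual hash; `toy_token()` and `NUM_PARTITIONS` are hypothetical names; and the assumption that token IDs map one-to-one onto 16 partition caches is taken from the slab numbering 00 - 15 in the histogram, not from the cover letter's code. The fallback token used when type inference fails is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PARTITIONS 16u	/* assumption: matches slabs 00 - 15 above */

/* FNV-1a over the type name; an illustrative choice, not Clang's hash. */
static uint32_t toy_hash(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (uint8_t)*s++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Derive a token ID from the type name alone, so every allocation of the
 * same type lands in the same partition regardless of allocation site.
 * Pointer-containing types use the top half of the ID space, pointerless
 * types the bottom half.
 */
static unsigned int toy_token(const char *type_name, bool has_pointers)
{
	unsigned int half = NUM_PARTITIONS / 2;
	unsigned int slot = toy_hash(type_name) % half;

	return has_pointers ? half + slot : slot;
}
```

Because the token depends only on the type, two call sites allocating the same struct share a partition, which is the deterministic behaviour that distinguishes this mode from the random, per-site assignment of KMALLOC_PARTITION_RANDOM.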
2026-06-12Merge tag 'thunderbolt-for-v7.2-rc1' of ↵Greg Kroah-Hartman
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-next Mika writes: thunderbolt: Changes for v7.2 merge window This includes following USB4/Thunderbolt changes for the v7.2 merge window: - Make the driver more compliant with the connection manager guide. - Improvements over Thunderbolt XDomain service handling. - USB4STREAM driver. - Split out PCIe bits into pci.c to allow the driver to work on non-PCIe hosts as well. - Various fixes and improvements. All these have been in linux-next with no reported issues. * tag 'thunderbolt-for-v7.2-rc1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: (41 commits) thunderbolt: debugfs: Fix sideband write size check thunderbolt: debugfs: Fix margining error counter buffer leak thunderbolt: test: Release third DP tunnel thunderbolt: Prevent XDomain delayed work use-after-free on disconnect thunderbolt: test: Add KUnit tests for property parser bounds checks thunderbolt: Add some more descriptive probe error messages thunderbolt: Require nhi->ops be valid thunderbolt: Separate out common NHI bits thunderbolt: Move pci_device out of tb_nhi thunderbolt: Increase Notification Timeout to 255 ms for USB4 routers thunderbolt: Increase timeout for Configuration Ready bit thunderbolt: Verify Router Ready bit is set after router enumeration thunderbolt: Verify PCIe adapter in detect state before tunnel setup thunderbolt: Activate path hops from source to destination thunderbolt: Fix lane bonding log when bonding not possible thunderbolt: Don't access path config space on Lane 1 adapters in tb_switch_reset_host() thunderbolt: Improve multi-display DisplayPort tunnel allocation docs: admin-guide: thunderbolt: Add instructions how to use USB4STREAM thunderbolt: Add support for USB4STREAM thunderbolt: Add support for ConfigFS ...
2026-06-12Merge tag 'kvm-x86-sev-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM SEV changes for 7.2 - Don't advertise support for unusable VM types, and account for VM types that are disabled by firmware, e.g. to mitigate security vulnerabilities. - Rewrite the SEV {en,de}crypt debug ioctls as they were riddled with bugs and unnecessarily complicated, and add comprehensive tests. - Clean up and deduplicate the SEV page pinning code. - Fix minor goofs related to writing back CPUID information after firmware rejects a CPUID page for an SNP vCPU.
2026-06-12Merge tag 'kvm-x86-generic-7.2' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 7.2 - Rename invalidate_begin() to invalidate_start() throughout KVM to follow the kernel's nomenclature, e.g. for mmu_notifiers. - Minor cleanups.
2026-06-11ipmr: Convert mr_table.cache_resolve_queue_len to u32.Kuniyuki Iwashima
mr_table.cache_resolve_queue_len is always updated under spin_lock_bh(&mfc_unres_lock). Let's convert it to u32. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260609222013.1550355-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-7.1-rc8). Conflicts: drivers/net/ethernet/wangxun/txgbe/txgbe_aml.c f67aead16e85 ("net: txgbe: rework service event handling") 57d39faed4c9 ("net: txgbe: improve functions of AML 40G devices") net/rds/info.c 512db8267b73 ("rds: mark snapshot pages dirty in rds_info_getsockopt()") 6e94eeb2a2a6 ("rds: convert to getsockopt_iter") Adjacent changes: include/net/sock.h 1ee90b77b727 ("net: guard timestamp cmsgs to real error queue skbs") f0de88303d5e ("net: make is_skb_wmem() available to modules") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-11Merge branch 'pm-cpufreq'Rafael J. Wysocki
Merge cpufreq updates for 7.2: - Fix a race between cpufreq suspend and CPU hotplug during system shutdown (Tianxiang Chen) - Avoid redundant target() calls for unchanged limits and fix a typo in a comment in the cpufreq core (Viresh Kumar) - Fix concurrency issues related to sysfs attributes access that affect cpufreq governors using the common governor code (Zhongqiu Han) - Simplify frequency limit handling in the conservative cpufreq governor (Lifeng Zheng) - Fix descriptions of the conservative governor freq_step tunable and the ondemand governor sampling_down_factor tunable in the cpufreq documentation (Pengjie Zhang) - Fix use-after-free and double free during _OSC evaluation in the PCC cpufreq driver (Yuho Choi) - Rework the handling of policy min and max frequency values in the cpufreq core to allow drivers to specify special initial values for the scaling_min_freq and scaling_max_freq sysfs attributes (Pierre Gondois) - Add cpufreq scaling support for Qualcomm Shikra SoC (Taniya Das, Imran Shaik). 
- Improve the warning message on HWP-disabled hybrid processors printed by the intel_pstate driver and sync policy->cur during CPU offline in it (Yohei Kojima, Fushuai Wang) - Drop cpufreq support for AMD Elan SC4* (Sean Young) - Minor fixes for cpufreq drivers (Krzysztof Kozlowski, Akashdeep Kaur, Hans Zhang, Guangshuo Li, Xueqin Luo) - Clean up dead dependencies on X86 in the cpufreq Kconfig (Julian Braha) * pm-cpufreq: (25 commits) cpufreq: Use policy->min/max init as QoS request cpufreq: Remove driver default policy->min/max init cpufreq: Set default policy->min/max values for all drivers cpufreq: Extract cpufreq_policy_init_qos() function cpufreq: Documentation: fix conservative governor freq_step description cpufreq: ti: Add EPROBE_DEFER for K3 SoCs cpufreq: qcom: Add cpufreq scaling support for Qualcomm Shikra SoC dt-bindings: cpufreq: Document Qualcomm Shikra SoC EPSS cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load cpufreq: governor: Fix data races on per-CPU idle/nice baselines cpufreq: intel_pstate: Improve warning message on HWP-disabled hybrid CPUs cpufreq: elanfreq: Drop support for AMD Elan SC4* cpufreq: clean up dead dependencies on X86 in Kconfig cpufreq: conservative: Simplify frequency limit handling cpufreq: Avoid redundant target() calls for unchanged limits cpufreq: Fix typo in comment cpufreq: intel_pstate: Sync policy->cur during CPU offline cpufreq: Documentation: fix sampling_down_factor range cpufreq: Fix hotplug-suspend race during reboot cpufreq: pcc: fix use-after-free and double free in _OSC evaluation ...
2026-06-11Merge tag 'net-7.1-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from IPsec and netfilter. This is relatively small, mostly because we are a bit behind our PW queue. I'm not aware of any pending regression. Current release - regressions: - netfilter: nf_tables_offload: drop device refcount on error Previous releases - regressions: - core: add pskb_may_pull() to skb_gro_receive_list() - xfrm: iptfs: preserve shared-frag marker in iptfs_consume_frags() - ipv6: fix a potential NPD in cleanup_prefix_route() - ipv4: fix use-after-free caused by the fqdir_pre_exit() flush - eth: - bnxt_en: fix NULL pointer dereference - emac: fix use-after-free during device removal - octeontx2-af: fix memory leak in rvu_setup_hw_resources() - tun: zero the whole vnet header in tun_put_user() - sit: reload inner IPv6 header after GSO offloads Previous releases - always broken: - core: fix double-free in netdev_nl_bind_rx_doit() - netfilter: nf_log: validate MAC header was set before dumping it - xfrm: iptfs: fix ABBA deadlock in iptfs_destroy_state() - tcp: restrict SO_ATTACH_FILTER to priv users - mctp: usb: fix race between urb completion and rx_retry cancellation - eth: - mlx5: fix slab-out-of-bounds in mlx5_query_nic_vport_mac_list - mvpp2: sync RX data at the hardware packet offset" * tag 'net-7.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) octeontx2-af: fix IP fragment flag corruption on custom KPU profile load ipv6: Fix a potential NPD in cleanup_prefix_route() net: txgbe: initialize PHY interface to 0 net: txgbe: distinguish module types by checking identifier net: txgbe: initialize module info buffer net: mvpp2: build skb from XDP-adjusted data on XDP_PASS net: mvpp2: refill RX buffers before XDP or skb use net: mvpp2: limit XDP frame size to the RX buffer net: mvpp2: sync RX data at the hardware packet offset netfilter: nft_meta_bridge: fix stale stack leak via IIFHWADDR register netfilter: 
nft_fib: fix stale stack leak via the OIFNAME register netfilter: nft_exthdr: fix register tracking for F_PRESENT flag netfilter: nf_log: validate MAC header was set before dumping it netfilter: x_tables: avoid leaking percpu counter pointers netfilter: nf_conntrack: destroy stale expectfn expectations on unregister netfilter: nf_tables_offload: drop device refcount on error netfilter: revalidate bridge ports rds: mark snapshot pages dirty in rds_info_getsockopt() ip6_vti: fix incorrect tunnel matching in vti6_tnl_lookup() ptp: ocp: fix resource freeing order ...
2026-06-11Merge back earlier thermal control material for 7.2Rafael J. Wysocki