summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-06-21Merge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: cuppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...
2026-06-21cpu: hotplug: Bound hotplug states sysfs outputBradley Morgan
states_show() adds CPU hotplug state names into a single sysfs buffer using sprintf(). With enough registered states, this can write past the end of the PAGE_SIZE buffer. Use sysfs_emit_at() so output is bounded. Fixes: 98f8cdce1db5 ("cpu/hotplug: Add sysfs state interface") Signed-off-by: Bradley Morgan <include@grrlz.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260619163719.12103-2-include@grrlz.net
2026-06-21cpu: hotplug: Preserve per instance callback errorsBradley Morgan
cpuhp_invoke_callback() unwinds earlier callbacks for the same hotplug state when one instance fails. The rollback path currently reuses ret, so a successful rollback can hide the original error and make the failed transition look successful. Keep the rollback result separate from the original error. Fixes: 724a86881d03 ("smp/hotplug: Callback vs state-machine consistency") Signed-off-by: Bradley Morgan <include@grrlz.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260619163719.12103-1-include@grrlz.net
2026-06-21Merge tag 'liveupdate-v7.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate updates from Mike Rapoport: "Kexec Handover (KHO): - make memory preservation compatible with deferred initialization of the memory map Live Update Orchestrator (LUO): - add LIVEUPDATE_SESSION_GET_NAME ioctl and parameter verification for LIVEUPDATE_IOCTL_CREATE_SESSION ioctl - documentation updates for liveupdate=on command line option, systemd support and the current compatibility status - remove the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the LUO Misc fixes: - reference count incoming File-Lifecycle-Bound (FLB) data so it cannot be freed while a subsystem is still using it - fixes for a TOCTOU race in luo_session_retrieve(), a use- after-free in the file finish and unpreserve paths, concurrent session mutations during reboot and serialization on preserve_context kexec - make sure ioctls for incoming LUO sessions are blocked for outgoing sessions and vice versa - make sure KHO scratch size is always aligned by CMA_MIN_ALIGNMENT_BYTES - fix memblock tests build issue introduced by KHO changes" * tag 'liveupdate-v7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: (36 commits) liveupdate: Document that retrieve failure is permanent docs: memfd_preservation: fix rendering of ABI documentation selftests/liveupdate: Add stress-files kexec test selftests/liveupdate: Add stress-sessions kexec test selftests/liveupdate: Test session and file limit removal liveupdate: Remove limit on the number of files per session liveupdate: Remove limit on the number of sessions liveupdate: defer session block allocation and physical address setting kho: add support for linked-block serialization liveupdate: Extract luo_session_deserialize_one helper liveupdate: Extract luo_file_deserialize_one helper liveupdate: register luo_ser as KHO subtree liveupdate: centralize state management into struct luo_ser liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd liveupdate: change file_set->count type to u64 for type safety liveupdate: Remove unused ser field from struct luo_session liveupdate: fix u-a-f in luo_file_unpreserve_files() and luo_file_finish() liveupdate: block session mutations during reboot liveupdate: fix TOCTOU race in luo_session_retrieve() liveupdate: skip serialization for context-preserving kexec ...
2026-06-21locking/rt: Fix the incorrect RCU protection in rt_spin_unlock()Thomas Gleixner
rt_spin_unlock() releases the RCU protection before unlocking the lock. That opens the door for the following UAF scenario: T1 T2 spin_lock(&p->lock); rcu_read_lock(); invalidate(p); p = rcu_dereference(ptr); rcu_assign_pointer(ptr, NULL); if (!p) return; spin_unlock(&p->lock); spin_lock(&p->lock) lock(&lock->lock); rcu_read_lock(); kfree_rcu(p); rcu_read_unlock(); .... spin_unlock(&p->lock) rcu_read_unlock(); // Ends grace period rcu_do_batch() kfree(p); UAF -> rt_mutex_cmpxchg_release(&lock->lock...) Regular spinlocks keep preemption disabled accross the unlock operation, which provides full RCU protection, but the RT substitution fails to resemble that. Same applies for the rwlock substitution. Move the rcu_read_unlock() invocation past the unlock operations to match the non-RT semantics. This makes it asymmetric vs. rt_xxx_lock(), but that's harmless as the caller needs to hold RCU read lock across the lock operation. The migrate_enable() call stays before the unlock operation because there is no per CPU operation in the unlock path which would require migration to be kept disabled. Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant") Reported-by: syzbot+000c800a02097aaa10ed@syzkaller.appspotmail.com Decoded-by: Jann Horn <jannh@google.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/87jyrud75z.ffs@fw13
2026-06-19sched/mmcid: Fix OOB clear_bit when CID is MM_CID_UNSET in fixup pathRik van Riel
In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and mm_cid.active is set, the CID is checked with cid_in_transit() before setting the transition bit. In per-CPU mode a newly forked or exec'd task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are assigned lazily on schedule-in. With cid_in_transit() the guard passes for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET | MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this to clear_bit() with MM_CID_UNSET as the bit number, triggering an out-of-bounds write. Symptoms: this is genuine memory corruption, but a bounded out-of-bounds write, not an arbitrary one. MM_CID_UNSET is the fixed sentinel BIT(31), so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid() strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET, mm_cidmask(mm)). The cid bitmap is embedded in the mm_struct slab object (after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus() bits wide, so clearing bit 31 is a deterministic OOB bit-clear at a fixed offset of 2^31 / 8 == 256 MiB past the bitmap base. The address is not attacker-influenced (fixed sentinel -> fixed offset) and the op only clears a single bit; what sits 256 MiB further along the direct map is whatever kernel object happens to live there, so this corrupts one bit of unpredictable kernel memory -- it is not an arbitrary-address or arbitrary-value write. It triggers only in per-CPU CID mode, when a CPU is running an active task of the target mm whose cid is still MM_CID_UNSET -- the fork()/execve() window before that task's next schedule-in assigns it a real CID -- and a per-CPU -> per-task fixup walks over it (the mode fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred max_cids recompute in mm_cid_work_fn()). In practice syzkaller surfaced it as a KASAN use-after-free reported in __schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined via mm_cid_schedout() -> mm_drop_cid(). Guard the transition-bit assignment against MM_CID_UNSET, in addition to the existing cid_in_transit() check, so the bit is only set on a genuine task-owned CID. A CPU-owned (MM_CID_ONCPU) CID of a running active task is handled by the cid_on_cpu(pcp->cid) branch above and never reaches this path, so excluding MM_CID_UNSET (and the already-transitioning case) is sufficient. Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions") Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Assisted-by: Claude:claude-opus-4-8 syzkaller Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616203818.1516263-1-riel@surriel.com
2026-06-19Merge tag 'mm-stable-2026-06-18-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "selftests/mm: clean up build output and verbosity" (Li Wang) Remove some noise from the MM selftests build - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts) Speed up the freeing of a batch of 0-order pages by first scanning them for coalescing opportunities. This is applicable to vfree() and to the releasing of frozen pages - "mm/damon: introduce DAMOS failed region quota charge ratio" (SeongJae Park) Address a DAMOS usability issue: The DAMOS quota often exhausts prematurely because it charges for all memory attempted, causing slow and inconsistent performance when actions fail on unreclaimable memory. To fix this, a new feature lets users set a smaller, flexible quota charge ratio (via a numerator and denominator) for failed regions. Since failed actions cause less overhead, reducing their quota cost ensures more predictable and efficient DAMOS processing - "selftests/cgroup: improve zswap tests robustness and support large page sizes" (Li Wang) Fix various spurious failures and improves the overall robustness of the cgroup zswap selftests - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga) Fix an issue in the mlock selftests on arm32 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao) Some maintenance work in the huge_memory code - "treewide: fixup gfp_t printks" (Brendan Jackman) Use the special vprintf() gfp_t conversion in various places - "mm: Fix vmemmap optimization accounting and initialization" (Muchun Song) Fix several bugs in the vmemmap optimization, mainly around incorrect page accounting and memmap initialization in the DAX and memory hotplug paths. It also fixes pageblock migratetype initialization and struct page initialization for ZONE_DEVICE compound pages - "mm/damon: repost non-hotfix reviewed patches in damon/next tree" A sprinkle of unrelated minor bugfixes for DAMON - "mm: remove page_mapped()" (David Hildenbrand) Remove this function from the tree, replacing it with folio_mapped() - "mm/damon: let DAMON be paused and resumed" (SeongJae Park) Allow DAMON to be paused and resumed without losing its current state - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad Usama Anjum) Simplify and speed up kasan by removing its ineffective tagging of stacks and page tables - "mm/damon/reclaim,lru_sort: monitor all system rams by default" (SeongJae Park) Simplify deployment on diverse hardware like NUMA systems by updating DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the physical address range covering all System RAM areas by default, replacing the overly restrictive behavior that only targeted the single largest memory block to save on negligible overhead - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae Park) Update some DAMON docs - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin) Switch zone->lock handling over to using the guard() mechanisms - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie) Fix a flaw where the mmap_miss counter over-credited page cache hits during fault-arounds and page-fault retries. This results in significant reduction of redundant synchronous mmap readahead I/O, drastically cutting down execution time and gigabytes read for sparse random or strided memory access workloads - "selftests/cgroup: Fix false positive failures in test_percpu_basic" (Li Wang) Fix a couple of false-positives in the cgroup kmem selftests - "mm/damon/reclaim: support monitoring intervals auto-tuning" (SeongJae Park) Add a new parameter to DAMON permitting DAMON_RECLAIM to automatically tune DAMON's sampling and aggregation intervals - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park) Change DAMON_STAT to provide the pid of its kdamond - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao) Remove large amounts of duplicated backtraces from the verbose-mode kmemleak output - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David Hildenbrand) Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to removing it entirely in a later series - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan) Prevent users from passing a non-power-of-2 value of `addr_unit', as this later results in undesirable behavior - "mm: document read_pages and simplify usage" (Frederick Mayle) - "tools/mm/page-types: Fix misc bugs" (Ye Liu) Fix three issues in tools/mm/page-types.c - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman) Implement several cleanups in the page allocator and related code - "mm, swap: swap table phase IV: unify allocation" (Kairui Song) Unify the allocation and charging of anon and shmem swap in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves performance - "mm/damon: introduce data attributes monitoring" (SeongJae Park( Extend DAMON to monitor general data attributes other than accesses - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra) Implement the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary - "mm/damon: documentation and comment fixes" (niecheng) - "remove mmap_action success, error hooks" (Lorenzo Stoakes) Eliminate custom hooks from mmap_action by removing the problematic success_hook which allowed drivers to improperly access uninitialized VMAs. It replaces the error_hook with a simple error-code field and updates the memory char driver accordingly - "mm/damon: minor improvements for code readability and tests" (SeongJae Park) - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym Shcherba) - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike Rapoport) - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and others) Clean up and slightly improves MGLRU's reclaim loop and dirty writeback handling. Large performance improvements are measured - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren Baghdasaryan) Use per-vma locks when reading /proc/pid/smaps and numa_maps similar to reduce contention on central mmap_lock - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()" (Ran Xiaokai) Some cleanup work in the THP code - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko) Fix a few build glitches in the memfd selftest code. - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel Butt) Resolve a 68% performance regression caused by NUMA-node cache thrashing around struct obj_stock_pcp by shrinking its existing fields and expanding it into a multi-slot array that caches up to five obj_cgroup pointers per CPU, allowing per-node variants of the same memcg to coexist within a single 64-byte cache line. - "zram: writeback fixes" (Sergey Senozhatsky) address a couple of unrelated zram writeback issues - "mm: switch THP shrinker to list_lru" (Johannes Weiner) Resolve NUMA-awareness issues and streamlines callsite interaction by refactoring and extending the list_lru API to completely replace the complex, open-coded deferred split queue for Transparent Huge Pages - "mm: improve large folio readahead for exec memory" (Usama Arif) Improve large-folio readahead on systems like 64K-page arm64 by preventing the mmap_miss check from permanently disabling target-oriented VM_EXEC readahead, and by generalizing the force_thp_readahead gate to support mappings with any usefully large maximum folio order under the cache cap. - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau) Fix a bunch of minor issues in the userfaultfd/pagemap, all of which were flagged by Sashiko review of proposed new material - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and vmemmap_check_pmd()" (Muchun Song) Provide generic versions of these two functions so the four arch-specific implementations can be removed. - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device" (Youngjun Park) Address a uswsusp-vs-swapoff race and reduces the swap device reference taking/releasing frequency. - "mm/hmm: A fix and a selftest" (Dev Jain) * tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries fs/proc/task_mmu: do not warn on seeing non-migration pmd entry lib/test_hmm: check alloc_page_vma() return value and handle OOM mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX mm/swap: remove redundant swap device reference in alloc/free mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device mm/filemap: use folio_next_index() for start vmalloc: fix NULL pointer dereference in is_vm_area_hugepages() sparc/mm: drop vmemmap_check_pmd helper and use generic code loongarch/mm: drop vmemmap_check_pmd helper and use generic code riscv/mm: drop vmemmap_pmd helpers and use generic code arm64/mm: drop vmemmap_pmd helpers and use generic code mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd() rust: page: mark Page::nid as inline userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks userfaultfd: gate must_wait writability check on pte_present() mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole() fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry() fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race ...
2026-06-19perf: Fix addr_filter_ranges lifetimePeter Zijlstra
Lee Jia Jie reported that since event::addr_filter_ranges is used under RCU, it should be RCU freed. Reported-by: Lee Jia Jie <jiajie.lee@starlabs.sg> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-06-18Merge tag 'trace-ring-buffer-v7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer updates from Steven Rostedt - Do not invalidate entire buffer for invalid sub-buffers For the persistent ring buffer, if one sub-buffer is found to be invalid, it invalidates the entire per CPU ring buffer. This can lose a lot of valuable data if there's some corruption with the writes to the buffer not syncing properly on a hard crash. Instead, if a sub-buffer is found to be invalid, simply zero it out and mark it for "missed events". When the persistent ring buffer is read and a sub-buffer that was cleared due to being invalid on boot up is discovered, the output will show "[LOST EVENTS]" to let the user know that events were missing at that location. Displaying the events from valid buffers can still be useful. - Add a test to be able to test corrupted sub-buffers If a persistent ring buffer is created as "ptraingtest" and the new config that adds the test is enabled, when a panic happens, the kernel will randomly corrupt one of the per CPU ring buffers. On boot up, the sub-buffers with the corruption should be cleared and flagged. When reading this buffer, the missed events should should [LOST EVENTS]. - Add commit number in the sub-buffer meta debug info The commit is used to know the content of a meta page. Add it to the buffer_meta file that is shown for each per CPU buffer. - Clean up the persistent ring buffer validation code Add some helper functions and make variable names more consistent. * tag 'trace-ring-buffer-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Better comment the use of RB_MISSED_EVENTS ring-buffer: Show persistent buffer dropped events in trace_pipe file ring-buffer: Show persistent buffer dropped events in trace file ring-buffer: Have dropped subbuffers be persistent across reboots ring-buffer: Cleanup buffer_data_page related code ring-buffer: Cleanup persistent ring buffer validation ring-buffer: Show commit numbers in buffer_meta file ring-buffer: Add persistent ring buffer invalid-page inject test ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
2026-06-18Merge tag 'trace-v7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: - Remove a redundant IS_ERR() check trace_pipe_open() already checks for IS_ERR() and does it again in the return path. Remove the return check. - Export seq_buf_putmem_hex() to allow kunit tests against them To add Kunit tests on seq_buf_putmem_hex(), it needs to be exported. - Replace strcat() and strcpy() with seq_buf() logic The code for synthetic events uses a series of strcat() and strcpy() which can be error prone. Replace them with seq_buf() logic that does all the necessary bound checking. - Add a lockdep rcu_is_watching() to trace_##event##_enabled() call The trace_##event##_enabled() is a static branch that is true if the "event" is enabled. But this can hide bugs if this logic is in a location where RCU is disabled and not "watching". It would only trigger if lockdep is enabled and the event is enabled. Add a "rcu_is_watching()" warning if lockdep is enabled in that helper function to trigger regardless if the event is enabled or not. - Remove the local variable in the trace_printk() macro For name space integrity, remove the _______STR variable in the trace_printk() macro for using the sizeof() macro directly. - Use guard()s for the trace_recursion_record.c file - Fix typo in a comment of eventfs_callback() kerneldoc - Use trace_call__##event() in events within trace_##event##_enabled() A couple of events are called within an if block guarded by trace_##event##_enabled(). That is a static key that is only enabled when the event is enabled. The trace_call_##event() calls the tracepoint code directly without adding a redundant static key for that check. - Allow perf to read synthetic events Currently, perf does not have the ability to enable a synthetic event. If it does, it will either cause a kernel warning or error with "No such device". Synthetic events are not much different than kprobes and perf can handle fine with a few modifications. - Replace printk(KERN_WARNING ...) with pr_warn() - Replace krealloc() on an array with krealloc_array() - Fix README file path name for synthetic events - Change tracing_map tracing_map_array to use a flexible array Instead of allocating a separate pointer to hold the pages field of tracing_map_array, allocate the pages field as a flexible array when allocating the structure. - Fold trace_iterator_increment() into trace_find_next_entry_inc() The function trace_iterator_increment() was only used by trace_find_next_entry_inc(). It's not big enough to be a helper function for one user. Fold it into its caller. - Make field_var_str field a flexible array of hist_elt_data Instead of allocating a separate pointer for the field_var_str array of the hist_elt_data structure, allocate it as a flexible array when allocating the structure. - Disable KCOV for trace_irqsoff.c Like trace_preemptirq.c, trace_irqsoff.c has code that will crash when KCOV is enabled on ARM. The irqsoff tracing can be called on ARM because the irqsoff tracing code can be run from early interrupt code and produce coverage unrelated to syscall inputs. - Fix warning in __unregister_ftrace_function() called by perf Perf calls unregister_ftrace_function() without checking if its ftrace_ops has already been unregistered. There's an error path where on clean up it will unregister the ftrace_ops even if it wasn't registered and causes a warning. * tag 'trace-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: perf/ftrace: Fix WARNING in __unregister_ftrace_function tracing: Disable KCOV instrumentation for trace_irqsoff.o tracing: Turn hist_elt_data field_var_str into a flexible array tracing: Move trace_iterator_increment() into trace_find_next_entry_inc() tracing: Simplify pages allocation for tracing_map logic tracing: Fix README path for synthetic_events tracing: Use krealloc_array() for trace option array growth tracing/branch: Use pr_warn() instead of printk(KERN_WARNING) tracing: Allow perf to read synthetic events HID: Use trace_call__##name() at guarded tracepoint call sites cpufreq: amd-pstate: Use trace_call__##name() at guarded tracepoint call site tracefs: Fix typo in a comment of eventfs_callback() kerneldoc tracing: Switch trace_recursion_record.c code over to use guard() tracing: Remove local variable for argument detection from trace_printk() tracepoint: Add lockdep rcu_is_watching() check to trace_##name##_enabled() tracing: Bound synthetic-field strings with seq_buf seq_buf: Export seq_buf_putmem_hex() and add KUnit tests tracing: Remove redundant IS_ERR() check in trace_pipe_open()
2026-06-17treewide: fix transposed "sign" typos and update spelling.txtShardul Deshpande
Several comments transpose the letters in "assigned" and "unsigned", spelling them with "sing" instead of "sign". Correct all of them. Of these, the misspelling of "assigned" is not yet flagged by checkpatch, so also add it to scripts/spelling.txt. The remaining matches of `grep -ri singed` are RISINGEDGE register and enum names, not typos. Link: https://lore.kernel.org/20260612181633.734458-1-iamsharduld@gmail.com Signed-off-by: Shardul Deshpande <iamsharduld@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-17Merge tag 'dma-mapping-7.2-2026-06-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - added checks for DMA attributes in the debug code, especially to ensure that mappings are created and released with matching attributes (Leon Romanovsky) - better default configuration for CMA on NUMA machines (Feng Tang) - code cleanup in dma benchmark tool (Rosen Penev) * tag 'dma-mapping-7.2-2026-06-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma: map_benchmark: turn dma_sg_map_param buf into a flexible array dma-contiguous: simplify numa cma area handling dma-contiguous: add kconfig option to setup numa cma area if not configured explicitly dma-debug: Ensure mappings are created and released with matching attributes dma-debug: Feed DMA attribute for unmapping flows too dma-debug: Record DMA attributes in debug entry dma-debug: Remove unused DMA attribute parameter ntb: Use consistent DMA attributes when freeing DMA mappings ntb: Store original DMA address for future release
2026-06-17Merge tag 'printk-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Add upper case flavor for printing MAC addresses (%p[mM][U]) and use it in the nintendo driver - Fix matching of hash_pointers= parameter modes - Fix size check of vsprintf() field_width and precision values - Add check of size returned by vsprintf() - Add KUnit test for restricted pointer printing (%pK) - Some code cleanup * tag 'printk-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: HID: nintendo: Use %pM format specifier for MAC addresses vsprintf: Add upper case flavour to %p[mM] lib/vsprintf: replace min_t/max_t with min/max printk: fix typos in comments lib/vsprintf: Require exact hash_pointers mode matches vsprintf: Add test for restricted kernel pointers vsprintf: Only export no_hash_pointers to test module lib/vsprintf: Limit the returning size to INT_MAX lib/vsprintf: Fix to check field_width and precision
2026-06-17Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - new virtio CAN driver - support for LoongArch architecture in fw_cfg - support for firmware notifications in vdpa/octeon_ep - support for VFs in virtio core - fixes, cleanups all over the place, notably: - vhost: fix vhost_get_avail_idx for a non empty ring fixing an significant old perf regression - READ_ONCE() annotations mean virtio ring is now free of KCSAN warnings * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (37 commits) can: virtio: Fix comment in UAPI header can: virtio: Add virtio CAN driver virtio: add num_vf callback to virtio_bus fw_cfg: Add support for LoongArch architecture vdpa/octeon_ep: fix IRQ-to-ring mapping in interrupt handler vdpa/octeon_ep: Add vDPA device event handling for firmware notifications vdpa/octeon_ep: Use 4 bytes for mailbox signature vdpa/octeon_ep: Fix PF->VF mailbox data address calculation vhost_task_create: kill unnecessary .exit_signal initialization vhost: remove unnecessary module_init/exit functions vdpa/mlx5: Use kvzalloc_flex() for MTT command memory vdpa_sim_net: switch to dynamic root device vdpa_sim_blk: switch to dynamic root device virtio-mem: Destroy mutex before freeing virtio_mem virtio-balloon: Destroy mutex before freeing virtio_balloon tools/virtio: fix build for kmalloc_obj API and missing stubs virtio_ring: Add READ_ONCE annotations for device-writable fields vduse: fix compat handling for VDUSE_IOTLB_GET_FD/VDUSE_VQ_GET_INFO tools/virtio: check mmap return value in vringh_test vhost/net: complete zerocopy ubufs only once ...
2026-06-17cpufreq: schedutil: Fix uncleared need_freq_update on the .adjust_perf() pathZhongqiu Han
The need_freq_update flag makes sugov_should_update_freq() return true regardless of the rate_limit_us throttling, and is cleared in sugov_update_next_freq(). sugov_update_single_freq() and sugov_update_shared() go through that helper, so the flag does not persist there. However, sugov_update_single_perf(), used by drivers implementing the .adjust_perf() callback (e.g. intel_pstate or amd-pstate in passive mode) calls cpufreq_driver_adjust_perf() directly and never goes through sugov_update_next_freq(), so the need_freq_update flag is not cleared in that path. Before commit 75da043d8f88 ("cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()"), this was effectively harmless because sugov_should_update_freq() still honored the rate limit even when need_freq_update was set. After that change, the flag forces sugov_should_update_freq() to always return true, so once set, it stays effective indefinitely on the .adjust_perf() path. As a result, cpufreq_driver_adjust_perf() gets called on every scheduler utilization update (with the runqueue lock held) rather than being throttled by rate_limit_us, even if the driver itself may skip redundant hardware updates. Clear need_freq_update at the end of the adjust_perf path as well. Fixes: 75da043d8f88 ("cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()") Signed-off-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Reviewed-by: Hongyan Xia <hongyan.xia@transsion.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Cc: All applicable <stable@vger.kernel.org> [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260616154733.2405236-1-zhongqiu.han@oss.qualcomm.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-06-17hrtimer: Correct CONFIG_NO_HZ_COMMON macro name in commentEthan Nelson-Moore
A comment in kernel/time/hrtimer.c incorrectly refers to CONFIG_NOHZ_COMMON instead of CONFIG_NO_HZ_COMMON. Correct it. Discovered while searching for CONFIG_* symbols referenced in code but not defined in any Kconfig file. Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260609034314.25029-1-enelsonmoore@gmail.com
2026-06-17posix-cpu-timers: Use u64 multiplication in update_rlimit_cpu()Zhan Xusheng
update_rlimit_cpu() converts the RLIMIT_CPU value to nanoseconds with u64 nsecs = rlim_new * NSEC_PER_SEC; On 32-bit kernels both rlim_new (unsigned long) and NSEC_PER_SEC (1000000000L) are 32-bit, so the multiplication is performed in unsigned long and truncated for rlim_new > 4 seconds before being widened to u64. The same file already casts to u64 for the matching computation in check_process_timers(): u64 softns = (u64)soft * NSEC_PER_SEC; As a result, the truncated value is installed into the CPUCLOCK_PROF expiry cache (nextevt), causing the process CPU timer to be programmed to fire prematurely for any RLIMIT_CPU soft limit >= 5 seconds. The actual SIGXCPU/SIGKILL decision in check_process_timers() already casts to u64 and is therefore correct, so limit enforcement is not broken; only the expiry-cache programming is wrong. Apply the same cast here so both paths convert rlim_cur identically. 64-bit kernels are unaffected. Fixes: 858cf3a8c599 ("timers/itimer: Convert internal cputime_t units to nsec") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616112017.1681372-1-zhanxusheng@xiaomi.com
2026-06-17timekeeping: Register default clocksource before taking tk_core.lockMikhail Gavrilov
Commit f24df84cbe05 ("time/jiffies: Register jiffies clocksource before usage") moved the jiffies clocksource registration into clocksource_default_clock(), so that it is registered lazily on the first call. __clocksource_register() acquires clocksource_mutex, but the first caller is timekeeping_init(), which invokes clocksource_default_clock() while holding tk_core.lock, a raw spinlock. Acquiring a sleeping mutex while holding a raw spinlock is invalid. The default clocksource only has to be registered before tk_setup_internals() consumes its mult/shift/maxadj. Neither clocksource_default_clock(), the ->enable() callback, nor the registration itself need tk_core.lock, so fetch and enable the clock before acquiring the lock. This preserves the "register before usage" ordering while keeping clocksource_mutex out of the raw spinlock section. clocksource_default_clock() has a second caller, clocksource_done_booting(), which invokes it with clocksource_mutex already held. That path avoids a recursive lock because timekeeping_init() has already run and set cs_jiffies_registered, so the registration is skipped there. This change does not alter that; it only fixes the invalid wait context in timekeeping_init(). Fixes: f24df84cbe05 ("time/jiffies: Register jiffies clocksource before usage") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reported-by: Breno Leitao <leitao@debian.org> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Breno Leitao <leitao@debian.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260616070914.65818-1-mikhail.v.gavrilov@gmail.com
2026-06-17Merge tag 'audit-pr-20260615' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit updates from Paul Moore: - Fix a recursive deadlock when duplicating executable file rules Avoid multiple lookups and attempted I_MUTEX_PARENT locks when moving watched files by passing the already resolved inodes through the audit code. - Fix removal of executable watch rules after the file is deleted Prior to this fix we were unable to remove an executable file watch where the file had been previously deleted due to a negative dentry check in the code that performs the lookup on the file watches. - Convert our basic "unsigned" type usage to "unsigned int". * tag 'audit-pr-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: fix recursive locking deadlock in audit_dupe_exe() audit: fix removal of dangling executable rules audit: use 'unsigned int' instead of 'unsigned'
2026-06-17Merge tag 'sched_ext-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: "Most of this continues the in-development sub-scheduler support, which lets a root BPF scheduler delegate to nested sub-schedulers. The dispatch-path building blocks landed in 7.1. A follow-up patchset in development will complete enqueue-path support for hierarchical scheduling. This cycle adds most of that infrastructure: - Topological CPU IDs (cids): a dense, topology-ordered CPU numbering where the CPUs of a core, LLC, or NUMA node form contiguous ranges, so a topology unit becomes a (start, length) slice. Raw CPU numbers are sparse and don't track topological closeness, which makes them clumsy for sharding work across sub-schedulers and awkward in BPF. - cmask: bitmaps windowed over a slice of cid space, so a sub-scheduler can track, for example, the idle cids of its shard without a full NR_CPUS cpumask. - A struct_ops variant that cid-form sub-schedulers register with, along with the cid-form kfuncs they call. - BPF arena integration, which sub-scheduler support is built on. The bpf-next additions let the kernel read and write the BPF scheduler's arena directly, turning it into a real kernel/BPF shared-memory channel. Shared state like the per-CPU cmask now lives there. - scx_qmap is reworked to exercise the new arena and cid interfaces. Additionally: - Exit-dump improvements: dump the faulting CPU first, expose the exit CPU to BPF and userspace, and normalize the dump header. - Misc kfuncs and cleanups: a task-ID lookup kfunc, __printf checking on the error and dump formatters, header reorganization, and assorted fixes" * tag 'sched_ext-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (59 commits) sched_ext: Add scx_arena_to_kaddr() / scx_kaddr_to_arena() sched_ext: Make scx_bpf_kick_cid() return s32 sched_ext: Add scx_cmask_test() and scx_cmask_for_each_cid() tools/sched_ext: Order single-cid cmask helpers as (cid, mask) sched_ext: Order single-cid cmask helpers as (cid, mask) selftests/sched_ext: Fix dsq_move_to_local check sched_ext: Guard BPF arena helper calls to fix 32-bit build sched_ext: idle: Fix errno loss in scx_idle_init() sched_ext: Convert ops.set_cmask() to arena-resident cmask sched_ext: Sub-allocator over kernel-claimed BPF arena pages sched_ext: Require an arena for cid-form schedulers sched_ext: Add cmask mask ops sched_ext: Track bits[] storage size in struct scx_cmask sched_ext: Rename scx_cmask.nr_bits to nr_cids tools/sched_ext: scx_qmap: Fix qa arena placement sched_ext: Mark !CONFIG_EXT_SUB_SCHED dummy stubs static inline sched_ext: Replace tryget_task_struct() with get_task_struct() sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn() sched_ext: Fix ops_cid layout assert sched_ext: Use offsetofend on both sides of the ops_cid layout assert ...
2026-06-17Merge tag 'cgroup-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Last cycle deferred css teardown on cgroup removal until the cgroup depopulated, so a css is not taken offline while tasks can still reference it. Disabling a controller through cgroup.subtree_control still had the same problem. This reworks the deferral from per-cgroup to per-css so that path is covered too. - New RDMA controller monitoring files: rdma.peak for per-device peak usage and rdma.events / rdma.events.local for resource-limit exhaustion. The max-limit parser was rewritten, fixing two input parsing bugs. - cpuset: fix a sched-domain leak on the domain-rebuild failure path and skip a redundant hardwall ancestor scan on v2. - Misc: pair the remaining lockless cgroup.max.* reads with WRITE_ONCE, assorted selftest robustness fixes, and doc path corrections. * tag 'cgroup-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (22 commits) cgroup: Migrate tasks to the root css when a controller is rebound docs: cgroup: Fix stale source file paths cgroup/cpuset: Free sched domains on rebuild guard failure cgroup: pair max limit READ_ONCE() with WRITE_ONCE() selftests/cgroup: enable memory controller in hugetlb memcg test cgroup/rdma: Drop unnecessary READ_ONCE() on event counters cgroup: Defer kill_css_finish() in cgroup_apply_control_disable() cgroup: Add per-subsys-css kill_css_finish deferral cgroup: Move populated counters to cgroup_subsys_state cgroup: Annotate unlocked nr_populated_* accesses with READ_ONCE/WRITE_ONCE cgroup: Inline cgroup_has_tasks() in cgroup.h cgroup/rdma: document rdma.peak, rdma.events and rdma.events.local cgroup/rdma: add rdma.events.local for per-cgroup allocation failure attribution cgroup/rdma: add rdma.events to track resource limit exhaustion cgroup/rdma: add rdma.peak for per-device peak usage tracking selftests/cgroup: check malloc return value in alloc_anon functions cgroup/cpuset: Skip hardwall ancestor scan in cpuset v2 in cpuset_current_node_allowed() selftests/cgroup: fix misleading debug message in test_cgfreezer_time_child selftests/cgroup: fix child process escaping to parent cleanup in test_cpucg_nice selftests/cgroup: Add NULL check after malloc in cgroup_util.c ...
2026-06-17Merge tag 'wq-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: - Continued progress toward making alloc_workqueue() unbound by default: more callers converted to WQ_PERCPU / system_percpu_wq / system_dfl_wq, and new warnings for queues that use neither WQ_PERCPU nor WQ_UNBOUND or the legacy system_wq / system_unbound_wq. - Misc: drop the now-trivial apply_wqattrs_lock()/unlock() wrappers, forbid the TEST_WORKQUEUE benchmark from being built-in, and fix a spurious pointer level in the worker debug-dump path. * tag 'wq-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: drm/bridge: anx7625: Add WQ_PERCPU add to alloc_workqueue wifi: ath6kl: fix invalid workqueue flags in ath6kl_usb_create() btrfs: Drop WQ_PERCPU from ordered_flags in btrfs_init_workqueues() workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present workqueue: Add warnings and fallback if system_{unbound}_wq is used workqueue: drop spurious '*' from print_worker_info() fn declaration workqueue: forbid TEST_WORKQUEUE from being built-in workqueue: drop apply_wqattrs_lock()/unlock() wrappers umh: replace use of system_unbound_wq with system_dfl_wq rapidio: rio: add WQ_PERCPU to alloc_workqueue users media: ddbridge: add WQ_PERCPU to alloc_workqueue users platform: cznic: turris-omnia-mcu: replace use of system_wq with system_percpu_wq media: synopsys: hdmirx: replace use of system_unbound_wq with system_dfl_wq virt: acrn: Add WQ_PERCPU to alloc_workqueue users
2026-06-17Merge tag 'modules-7.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull modules updates from Sami Tolvanen: - Add a missing return value check for module_extend_max_pages() to prevent a kernel oops on memory allocation failure. - Force sh_addr to 0 for architecture-specific module sections on arm, arm64, m68k, and riscv. This prevents non-zero section addresses when linking modules with ld.bfd -r, which may cause tools to misbehave and result in worse compressibility. - Replace pr_warn! with pr_warn_once! for set_param null pointer warnings in Rust abstractions, now that the _once variant is available. * tag 'modules-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: rust: module_param: add missing newline to pr_warn_once module: decompress: check return value of module_extend_max_pages() rust: module_param: use `pr_warn_once!` for null pointer warning module, riscv: force sh_addr=0 for arch-specific sections module, m68k: force sh_addr=0 for arch-specific sections module, arm64: force sh_addr=0 for arch-specific sections module, arm: force sh_addr=0 for arch-specific sections
2026-06-17Merge tag 'bpf-next-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "Major changes: - Recover from BPF arena page faults using a scratch page and add ptep_try_set() for lockless empty-slot installs on x86 and arm64. This allows BPF kfuncs to access arena pointers directly. The 'arena_direct_access' stable branch was created for this work and was pulled into sched-ext and bpf-next trees (Tejun Heo, Kumar Kartikeya Dwivedi) - Lift old restriction and support 6+ arguments in BPF programs and kfuncs on x86 and arm64 (Yonghong Song, Puranjay Mohan) Other features and fixes: - Add 24-bit BTF vlen and reclaim unused bits in the BTF UAPI to ease addition of new BTF kinds (Alan Maguire) - Raise the maximum BPF call chain depth from 8 to 16 frames (Alexei Starovoitov) - Refactor object relationship tracking in the verifier and fix a dynptr use-after-free bug (Amery Hung) - Harden the signed program loader and reject exclusive maps as inner maps (Daniel Borkmann) - Replace the verifier min/max bounds fields with a circular number (cnum) representation and improve 32->64 bit range refinements (Eduard Zingerman) - Introduce the arena library and runtime (libarena) with a buddy allocator, rbtree and SPMC queue data structures, ASAN support and a parallel test harness. Allow subprograms to return arena pointers and switch to a BTF type-tag based __arena annotation (Emil Tsalapatis) - Cache build IDs in the sleepable stackmap path and avoid faultable build ID reads under mm locks (Ihor Solodrai) - Introduce the tracing_multi link to attach a single BPF program to many kernel functions at once. Allow specifying the uprobe_multi target via FD (Jiri Olsa) - Extend the bpf_list family of kfuncs with bpf_list_add/del(), and bpf_list_is_first/is_last/empty() (Kaitao Cheng) - Extend the BPF syscall with common attributes support for prog_load, btf_load and map_create (Leon Hwang) - Wrap rhashtable as BPF map (Mykyta Yatsenko, Herbert Xu) - Add sleepable support for tracepoint programs and fix deadlocks in LRU map due to NMI reentry (Mykyta Yatsenko) - Fix OOB access in bpf_flow_keys, fix nullness analysis of inner arrays, enforce write checks for global subprograms (Nuoqi Gui) - Report the maximum combined stack depth and print a breakdown of instructions processed per subprogram (Paul Chaignon) - Add an XDP load-balancer benchmark and arm64 JIT support for stack arguments (Puranjay Mohan) - Add kfuncs to traverse over wakeup_sources (Samuel Wu) - Allow sleepable BPF programs to use LPM trie maps directly (Vlad Poenaru) - Many more fixes and cleanups across the verifier, BTF, sockmap, devmap, bpffs, security hooks, s390/riscv/loongarch JITs, rqspinlock, libbpf, bpftool, selftests" * tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (336 commits) selftests/bpf: Work around llvm stack overflow in crypto progs selftests/bpf: add test for bpf_msg_pop_data() overflow bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check sockmap: Fix use-after-free in udp_bpf_recvmsg() bpf, sockmap: keep sk_msg copy state in sync bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data() bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data() selftsets/bpf: Retry map update on helper_fill_hashmap() selftests/bpf: Add test for sleepable lsm_cgroup rejection selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket selftests/bpf: Avoid static LLVM linking for cross builds selftests/bpf: Use common CFLAGS for urandom_read selftests/bpf: Initialize operation name before use tools/bpf: build: Append extra cflags libbpf: Initialize CFLAGS before including Makefile.include bpftool: Append extra host flags bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS bpftool: Pass host flags to bootstrap libbpf selftests/bpf: correct CONFIG_PPC64 macro name in comment ...
2026-06-16block: invalidate cached plug timestamp after task switchUsama Arif
blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime and marks the task with PF_BLOCK_TS. That cache is only valid while the task keeps running; if the task is switched out, wall-clock time advances and the cached value must not be reused when the task runs again. The existing invalidation covers explicit plug flushes through __blk_flush_plug(), and the schedule() / rtmutex paths through sched_update_worker(). It does not cover in-kernel preemption paths such as preempt_schedule(), preempt_schedule_notrace(), and preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and return without calling sched_update_worker(). As a result, a task preempted while holding a plug with PF_BLOCK_TS set can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost then consumes that stale timestamp through ioc_now(), producing stale vnow values for throttle decisions, and through ioc_rqos_done(), inflating on-queue time and feeding false missed-QoS samples into vrate adjustment. Move the schedule-side invalidation to finish_task_switch(), which runs for the scheduled-in task after every actual context switch regardless of which schedule entry point was used. Keep __blk_flush_plug() as the explicit flush/finish-plug invalidation path, and remove only the PF_BLOCK_TS handling from sched_update_worker(). Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-3-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-16kernel/fork: clear PF_BLOCK_TS in copy_process()Usama Arif
PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is non-NULL, and blk_finish_plug() clears it via __blk_flush_plug() before NULLing the plug pointer. copy_process() breaks the invariant by inheriting PF_BLOCK_TS from the parent while resetting the child's plug to NULL. Clear PF_BLOCK_TS alongside that assignment so callers can rely on "PF_BLOCK_TS set implies current->plug != NULL" and dereference current->plug unguarded. Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-2-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-16Merge branch 'idle-time-acc' into featuresAlexander Gordeev
Heiko Carstens says: =================== This is supposed to improve s390 idle time accounting, and brings it back to the state it was before arch_cpu_idle_time() was removed from s390 [3]. In result all cpu time accounting is done by the s390 architecture backend again, instead of having a mix of architecure specific and common code accounting (common code: idle, s390 architecture: everything else). =================== Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-06-16Merge tag 'trace-latency-v7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing latency updates from Steven Rostedt: - Dump the stack to the buffer on timerlat uret threashold event Record the stack trace in the buffer for THREAD_URET as well as THREAD_CONTEXT when the threshold is hit. Otherwise, if the threshold was not hit at task wakeup, but was at task return, it will not produce a stack trace making it harder to debug. - Have osnoise trace prints print to all buffers The osnoise tracer is allowed to print to the main buffer. Add a osnoise_print() helper function and use trace_array_vprintk() to print osnoise output. * tag 'trace-latency-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/osnoise: Array printk init and cleanup tracing/osnoise: Dump stack on timerlat uret threshold event
2026-06-16Merge tag 'probes-v7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - BTF support for dereferencing pointers Add syntax to the parsing of eprobes to typecast structure pointer trace event fields, enabling BTF-based dereferencing instead of relying on manual offsets. - Improvements and robustness enhancements - Use flexible array for entry fetch code. Store probe entry fetch instructions in the probe_entry_arg allocation via a flexible array member to simplify memory allocation and lifetime management. - Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions Replace BUG_ON() calls with lockdep_assert_held() in uprobe buffer enable/disable paths to prevent kernel crashes and better verify lock ownership. - Ensure the uprobe buffer size is bigger than event size. Add a BUILD_BUG_ON() assertion to guarantee that the per-CPU uprobe working buffer size is always larger than the maximum probe event size. * tag 'probes-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/eprobes: Allow use of BTF names to dereference pointers tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions tracing: Use flexible array for entry fetch code tracing/probes: Ensure the uprobe buffer size is bigger than event size
2026-06-16Merge tag 'linux_kselftest-kunit-7.2-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit updates from Shuah Khan: "Fixes to tool and kunit core and new features to both to support JUnit XML (primitive) and backtrace suppression API: - Core support for suppressing warning backtraces - Parse and print the reason tests are skipped - Add (primitive) support for outputting JUnit XML - Don't write to stdout when it should be disabled - Add backtrace suppression self-tests - Suppress intentional warning backtraces in scaling unit tests - Add documentation for warning backtrace suppression API - Fix spelling mistakes in comments and messages - gen_compile_commands: Ignore libgcc.a - qemu_configs: Add or1k / openrisc configuration" * tag 'linux_kselftest-kunit-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit:tool: Don't write to stdout when it should be disabled kunit: tool: Add (primitive) support for outputting JUnit XML kunit: tool: Parse and print the reason tests are skipped kunit: Add documentation for warning backtrace suppression API drm: Suppress intentional warning backtraces in scaling unit tests kunit: Add backtrace suppression self-tests bug/kunit: Core support for suppressing warning backtraces kunit: Fix spelling mistakes in comments and messages kunit: qemu_configs: Add or1k / openrisc configuration gen_compile_commands: Ignore libgcc.a
2026-06-16Merge tag 'slab-for-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-06-15s390/idle: Provide arch specific kcpustat_field_idle()/kcpustat_field_iowait()Heiko Carstens
The former s390 specific arch_cpu_idle_time() implementation was removed, since its implementation was racy and reported idle time could go backwards [1]. However this removal was not necessary, since independently of the s390 architecture specific races there exists the iowait counter update race, which can also lead to reported idle time going backwards [2]. With Frederic Weisbecker's recent cpu idle time accounting refactoring kernel_cpustat got a sequence counter. Use this to implement s390 specific variants of kcpustat_field_idle() and kcpustat_field_iowait(). This is logically a revert of [1] and moves cpu idle time accounting back into s390 architecture code, which is also more precise than the dyntick idle time accounting by nohz/scheduler. For comparing cross cpu time stamps it is necessary to use the stcke instead of the stckf instruction in irq entry path. Furthermore this open-codes a sequence lock in assembler and C code, which is required to update the irq entry time stamp to the per cpu idle_data structure in a race free manner. [1] commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time() and corresponding code") [2] commit ead70b752373 ("timers/nohz: Add a comment about broken iowait counter update race") Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-06-15Merge tag 'sched-core-2026-06-14' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "SMP load-balancing updates: - A large series to introduce infrastructure for cache-aware load balancing, with the goal of co-locating tasks that share data within the same Last Level Cache (LLC) domain. By improving cache locality, the scheduler can reduce cache bouncing and cache misses, ultimately improving data access efficiency. Implemented by Chen Yu and Tim Chen, based on early prototype work by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and Shrikanth Hegde. - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde) Fair scheduler updates: - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing SMT awareness (Andrea Righi, K Prateek Nayak) - A series to optimize cfs_rq and sched_entity allocation for better data locality (Zecheng Li) - A preparatory series to change fair/cgroup scheduling to a single runqueue, without the final change (Peter Zijlstra) - Auto-manage ext/fair dl_server bandwidth (Andrea Righi) - Fix cpu_util runnable_avg arithmetic (Hongyan Xia) - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel) - Allow account_cfs_rq_runtime() to throttle current hierarchy (K Prateek Nayak) - Update util_est after updating util_avg during dequeue, to fix the util signal update logic, which reduces signal noise (Vincent Guittot) Scheduler topology updates: - Allow multiple domains to claim sched_domain_shared (K Prateek Nayak) - Add parameter to split LLC (Peter Zijlstra) Core scheduler updates: - Use trace_call__<tp>() to save a static branch (Gabriele Monaco) Scheduler statistics updates: - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation guard (Nicolas Pitre) Deadline scheduler updates: - Reject debugfs dl_server writes for offline CPUs (Andrea Righi) - Fix replenishment logic for non-deferred servers (Yuri Andriaccio) RT scheduling updates: - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt) - Update default bandwidth for real-time tasks to 1.0 (Yuri Andriaccio) Proxy scheduling updates: - A series to implement Optimized Donor Migration for Proxy Execution (John Stultz, Peter Zijlstra) - Various proxy scheduling cleanups and fixes (Peter Zijlstra, K Prateek Nayak) Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi, Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde, Peter Zijlstra, Liang Luo and Yiyang Chen" * tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits) sched/fair: Fix newidle vs core-sched sched/deadline: Use task_on_rq_migrating() helper sched/core: Combine separate 'else' and 'if' statements sched/fair: Fix cpu_util runnable_avg arithmetic sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() sched/fair: Call update_curr() before unthrottling the hierarchy sched/fair: Use throttled_csd_list for local unthrottle sched/fair: Convert cfs bandwidth throttling to use guards sched/fair: Allocate cfs_tg_state with percpu allocator sched/fair: Remove task_group->se pointer array sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state sched: restore timer_slack_ns when resetting RT policy on fork MAINTAINERS: Fix spelling mistake in Peter's name sched: Simplify ttwu_runnable() sched/proxy: Remove superfluous clear_task_blocked_in() sched/proxy: Remove PROXY_WAKING sched/proxy: Switch proxy to use p->is_blocked sched/proxy: Only return migrate when needed sched: Be more strict about p->is_blocked ...
2026-06-15Merge tag 'perf-core-2026-06-14' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core perf code updates: - Reveal PMU type in fdinfo (Chun-Tse Shao) Intel CPU PMU driver updates: - Fix various inaccurate hard-coded event configurations (Dapeng Mi) Intel uncore PMU driver updates (Zide Chen): - Fix discovery unit lookup bug for multi-die systems - Guard against invalid box control address - Fix PCI device refcount leak in UPI discovery - Defer ADL global PMON enable to enable_box() to save power - Fix uncore_die_to_cpu() for offline dies - Implement global init callback for GNR uncore AMD CPU PMU driver updates: - Always use the NMI latency mitigation (Sandipan Das) AMD uncore PMU driver updates: - Use Node ID to identify DF and UMC domains (Sandipan Das)" * tag 'perf-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (22 commits) perf/x86/amd/uncore: Use Node ID to identify DF and UMC domains perf: Reveal PMU type in fdinfo perf/x86/intel/uncore: Implement global init callback for GNR uncore perf/x86/intel/uncore: Fix uncore_die_to_cpu() for offline dies perf/x86/intel/uncore: Move die_to_cpu() to uncore.c perf/x86/intel/uncore: Defer ADL global PMON enable to enable_box() perf/x86/intel/uncore: Fix PCI device refcount leak in UPI discovery perf/x86/intel/uncore: Guard against invalid box control address perf/x86/intel/uncore: Fix discovery unit lookup for multi-die systems perf/x86/amd/core: Always use the NMI latency mitigation perf/x86/intel: Update event constraints and cache_extra_regsfor CWF perf/x86/intel: Update event constraints and cache_extra_regsfor SRF perf/x86/intel: Update event constraints and cache_extra_regsfor NVL perf/x86/intel: Update event constraints for PTL perf/x86/intel: Update event constraints and cache_extra_regsfor ARL perf/x86/intel: Update event constraints and cache_extra_regsfor LNL perf/x86/intel: Update event constraints and cache_extra_regsfor MTL perf/x86/intel: Update event constraints and cache_extra_regsfor ADL perf/x86/intel: Update event constraints for DMR perf/x86/intel: Update event constraints and cache_extra_regsfor SPR ...
2026-06-15Merge tag 'locking-core-2026-06-14' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Futex updates: - Optimize futex hash bucket access patterns (Peter Zijlstra) - Large series to address the robust futex unlock race for real, by Thomas Gleixner: "The robust futex unlock mechanism is racy in respect to the clearing of the robust_list_head::list_op_pending pointer because unlock and clearing the pointer are not atomic. The race window is between the unlock and clearing the pending op pointer. If the task is forced to exit in this window, exit will access a potentially invalid pending op pointer when cleaning up the robust list. That happens if another task manages to unmap the object containing the lock before the cleanup, which results in an UAF. In the worst case this UAF can lead to memory corruption when unrelated content has been mapped to the same address by the time the access happens. User space can't solve this problem without help from the kernel. This series provides the kernel side infrastructure to help it along: 1) Combined unlock, pointer clearing, wake-up for the contended case 2) VDSO based unlock and pointer clearing helpers with a fix-up function in the kernel when user space was interrupted within the critical section. ... with help by André Almeida: - Add a note about robust list race condition (André Almeida) - Add self-tests for robust release operations (André Almeida) Context analysis updates: - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche) - Bump required Clang version to 23 (Marco Elver) Guard infrastructure updates: - Series to remove NULL check from unconditional guards (Dmitry Ilvokhin) Lockdep updates: - Restore self-test migrate_disable() and sched_rt_mutex state on PREEMPT_RT (Karl Mehltretter) Membarriers updates: - Use per-CPU mutexes for targeted commands (Aniket Gattani) - Modernize membarrier_global_expedited with cleanup guards (Aniket Gattani) - Add rseq stress test for CFS throttle interactions (Aniket Gattani) percpu-rwsems updates: - Extract __percpu_up_read() to optimize inlining overhead (Dmitry Ilvokhin) Seqlocks updates: - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens) Lock tracing: - Add contended_release tracepoint to sleepable locks such as mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry Ilvokhin) MAINTAINERS updates: - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng) Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra, Dmitry Ilvokhin and Peter Zijlstra" * tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits) locking: Add contended_release tracepoint to sleepable locks locking/percpu-rwsem: Extract __percpu_up_read() tracing/lock: Remove unnecessary linux/sched.h include futex: Optimize futex hash bucket access patterns rust: sync: completion: Mark inline complete_all and wait_for_completion MAINTAINERS: Add RUST [SYNC] entry cleanup: Specify nonnull argument index selftests: futex: Add tests for robust release operations Documentation: futex: Add a note about robust list race condition x86/vdso: Implement __vdso_futex_robust_try_unlock() x86/vdso: Prepare for robust futex unlock support futex: Provide infrastructure to plug the non contended robust futex unlock race futex: Add robust futex unlock IP range futex: Add support for unlocking robust futexes futex: Cleanup UAPI defines x86: Select ARCH_MEMORY_ORDER_TSO uaccess: Provide unsafe_atomic_store_release_user() futex: Provide UABI defines for robust list entry modifiers futex: Move futex related mm_struct data into a struct futex: Make futex_mm_init() void ...
2026-06-15Merge tag 'timers-vdso-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull vdso updates from Thomas Gleixner: - Remove the redundant CONFIG_GENERIC_TIME_VSYSCALL after converting the remaining users over. - Rework and sanitize the MIPS VDSO handling, so it does not handle the time related VDSO if there is no VDSO capable clocksource available. Also stop mapping VDSO data pages unconditionally even if there is no usage possible. * tag 'timers-vdso-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: MIPS: VDSO: Fold MIPS_CLOCK_VSYSCALL into MIPS_GENERIC_GETTIMEOFDAY MIPS: VDSO: Gate microMIPS restriction on GCC version MIPS: VDSO: Fold MIPS_DISABLE_VDSO into MIPS_GENERIC_GETTIMEOFDAY clocksource/drivers/mips-gic-timer: Only use VDSO_CLOCKMODE_GIC when it is a available MIPS: csrc-r4k: Only use VDSO_CLOCKMODE_R4K when it is a available MIPS: VDSO: Only map the data pages when the vDSO is used MIPS: Introduce Kconfig MIPS_GENERIC_GETTIMEOFDAY vdso/datastore: Always provide symbol declarations MAINTAINERS: Add include/linux/vdso_datastore.h to vDSO block vdso/gettimeofday: Rename __arch_get_vdso_u_timens_data() vdso/treewide: Drop GENERIC_TIME_VSYSCALL vdso/vsyscall: Gate update_vsyscall() behind CONFIG_GENERIC_GETTIMEOFDAY riscv: vdso: Drop CONFIG_GENERIC_TIME_VSYSCALL guard around syscall fallbacks
2026-06-15Merge tag 'timers-ptp-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "Updates for NTP/timekeeping and PTP: - Expand timekeeping snapshot mechanisms The various snapshot functions are mostly used for PTP to collect "atomic" snapshots of various involved clocks. They lack support for the recently introduced AUX clocks and do not provide the underlying counter value (e.g. TSC) to user space. Exposing the counter value snapshot allows for better control and steering. Convert the hard wired ktime_get_snapshot() to take a clock ID, which allows the caller to select the clock ID to be captured along with CLOCK_MONONOTONIC_RAW. Additionally capture the underlying hardware counter value and the clock source ID of the counter. Expand the hardware based snapshot capture where devices provide a mechanism to snapshot the hardware PTP clock and the system counter (usually via PCI/PTM) to support AUX clocks and also provide the captured counter value back to the caller and not only the clock timestamps derived from it. - Add a new optional read_snapshot() callback to clocksources That is required to capture atomic snapshots from clocksources which are derived from TSC with a scaling mechanism (e.g. Hyper-V, KVMclock). The value pair is handed back in the snapshot structure to the callers, so they can do the necessary correlations in a more precise way. This touches usage sites of the affected functions and data structure all over the tree, but stays fully backwards compatible for the existing user space exposed interfaces. New PTP IOCTLs will provide access to the extended functionality in later kernel versions" * tag 'timers-ptp-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (28 commits) ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing x86/kvmclock: Implement read_snapshot() for kvmclock clocksource clocksource/hyperv: Implement read_snapshot() for TSC page clocksource timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshot ptp: Switch to ktime_get_snapshot_id() for pre/post timestamps timekeeping: Add support for AUX clock cross timestamping timekeeping: Remove system_device_crosststamp::sys_realtime ALSA: hda/common: Use system_device_crosststamp::sys_systime wifi: iwlwifi: Use system_device_crosststamp::sys_systime ptp: Use system_device_crosststamp::sys_systime timekeeping: Prepare for cross timestamps on arbitrary clock IDs timekeeping: Remove ktime_get_snapshot() virtio_rtc: Use provided clock ID for history snapshot net/mlx5: Use provided clock ID for history snapshot igc: Use provided clock ID for history snapshot ice/ptp: Use provided clock ID for history snapshot wifi: iwlwifi: Adopt PTP cross timestamps to core changes timekeeping: Add CLOCK ID to system_device_crosststamp timekeeping: Add system_counterval_t to struct system_device_crosststamp timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id() ...
2026-06-15Merge tag 'timers-nohz-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Thomas Gleixner: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. * tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: sched/cputime: Handle dyntick-idle steal time correctly sched/cputime: Handle idle irqtime gracefully sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case tick/sched: Consolidate idle time fetching APIs tick/sched: Account tickless idle cputime only when tick is stopped tick/sched: Remove unused fields tick/sched: Move dyntick-idle cputime accounting to cputime code tick/sched: Remove nohz disabled special case in cputime fetch tick/sched: Unify idle cputime accounting s390/time: Prepare to stop elapsing in dynticks-idle powerpc/time: Prepare to stop elapsing in dynticks-idle sched/cputime: Correctly support generic vtime idle time sched/cputime: Remove superfluous and error prone kcpustat_field() parameter sched/idle: Handle offlining first in idle loop tick/sched: Fix TOCTOU in nohz idle time fetch
2026-06-15Merge tag 'timers-core-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: "Updates for the time/timer core subsystem: - Harden the user space controllable hrtimer interfaces further to protect against unpriviledged DoS attempts by arming timers in the past. - Add per-capacity hierarchies to the timer migration code to prevent timer migration accross different capacity domains. This code has been disabled last minute as there is a pathological problem with SoCs which advertise a larger number of capacity domains. The problem is under investigation and the code won't be active before v7.3, but that turned out to be less intrusive than a full revert as it preserves the preparatory steps and allows people to work on the final resolution - Export time namespace functionality as a recent user can be built as a module. - Initialize the jiffies clocksource before using it. The recent hardening against time moving backward requires that the related members of struct clocksource have been initialized, otherwise it clamps the readout to 0, which makes time stand sill and causes boot delays. - Fix a more than twenty year old PID reference count leak in an error path of the POSIX CPU timer code. - The usual small fixes, improvements and cleanups all over the place" * tag 'timers-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (31 commits) posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path time/jiffies: Register jiffies clocksource before usage timers/migration: Temporarily disable per capacity hierarchies timers/migration: Turn tmigr_hierarchy level_list into a flexible array timers/migration: Deactivate per-capacity hierarchies under nohz_full timers/migration: Fix hotplug migrator selection target on asymetric capacity machines ntsync: Honour caller's time namespace for absolute MONOTONIC timeouts time/namespace: Export init_time_ns and do_timens_ktime_to_host() timers/migration: Update stale @online doc to @available timers: Fix flseep() typo in kernel-doc comment hrtimer: Fix the bogus return type of __hrtimer_start_range_ns() hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without() clocksource: Clean up clocksource_update_freq() functions alarmtimer: Remove stale return description from alarm_handle_timer() selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements scripts/timers: Add timer_migration_tree.py timers/migration: Handle capacity in connect tracepoints timers/migration: Split per-capacity hierarchies timers/migration: Track CPUs in a hierarchy timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness ...
2026-06-15Merge tag 'timers-clocksource-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Thomas Gleixner: "Updates for clocksource/clockevent drivers: - Add devm helpers for clocksources, which allows to simplify driver teardown and probe failure handling. - More module conversion work - Update the support for the ARM EL2 virtual timer including the required ACPI changes. - Add clockevent and clocksource support for the TI Dual Mode Timer - Fix the support for multiple watchdog instances in the TEGRA186 driver - Add D1 timer support to the SUN5I driver - The usual devicetree updates, cleanups and small fixes all over the place" * tag 'timers-clocksource-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (24 commits) clocksource: move NXP timer selection to drivers/clocksource clocksource/drivers/timer-tegra186: Reserve and service a kernel watchdog clocksource/drivers/timer-tegra186: Register all accessible watchdog timers clocksource/drivers/timer-tegra186: Correct num_wdts for Tegra186 and Tegra234 clocksource/drivers/timer-tegra186: Fix support for multiple watchdog instances clocksource/drivers/timer-ti-dm: Add clockevent support clocksource/drivers/timer-ti-dm: Add clocksource support clocksource/drivers/timer-ti-dm: Fix property name in comment dt-bindings: timer: arm,arch_timer: Fix requirements for interrupt description clocksource/drivers/arm_arch_timer: Default to EL2 virtual timer when running VHE ACPI: GTDT: Parse information related to the EL2 virtual timer ACPI: GTDT: Account for GTDTv3 size when walking the platform timer descriptors clocksource: Add devm_clocksource_register_*() helpers clocksource/drivers/sun5i: Add D1 hstimer support dt-bindings: timer: allwinner,sun5i-a13-hstimer: add H616 and D1 dt-bindings: timer: Add StarFive JHB100 clint dt-bindings: timer: renesas,rz-mtu3: document RZ/{T2H,N2H} dt-bindings: timer: renesas,rz-mtu3: Remove TCIU8 interrupt dt-bindings: timer: Remove sifive,fine-ctr-bits property clocksource/drivers/timer-of: Make the code compatible with modules ...
2026-06-15Merge tag 'irq-core-2026-06-13' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull interrupt core updates from Thomas Gleixner: - Rework of /proc/interrupt handling: /proc/interrupts was subject to micro optimizations for a long time, but most of the low hanging fruit was left on the table. This rework addresses the major time consuming issues: - Printing a long series of zeros one by one via a format string instead of counting subsequent zeros and emitting a string constant. - Simplify and cache the conditions whether interrupts should be printed - Use a proper iteration over the interrupt descriptor xarray instead of walking and testing one by one. - Provide helper functions for the architecture code to emit the architecture specific counters - Convert the counter structure in x86 to an array, which simplifies the output and add mechanisms to suppress unused architecture interrupts, which just occupy space for nothing. Adopt the new core mechanisms. This adjusts the gdb scripts related to interrupt counter statistics to work with the new mechanisms. - Prevent a string overflow in the /proc/irq/$N/ directory name creation code. * tag 'irq-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: x86/irq: Add missing 's' back to thermal event printout genirq/proc: Speed up /proc/interrupts iteration genirq/proc: Runtime size the chip name genirq: Expose irq_find_desc_at_or_after() in core code genirq: Add rcuref count to struct irq_desc genirq/proc: Increase default interrupt number precision to four genirq: Calculate precision only when required genirq: Cache the condition for /proc/interrupts exposure genirq/manage: Make NMI cleanup RT safe genirq: Expose nr_irqs in core code scripts/gdb: Update x86 interrupts to the array based storage x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_stats x86/irq: Suppress unlikely interrupt stats by default x86/irq: Make irqstats array based genirq/proc: Utilize irq_desc::tot_count to avoid evaluation genirq/proc: Avoid formatting zero counts in /proc/interrupts x86/irq: Optimize interrupts decimals printing genirq/proc: Size interrupt directory names for 10-digit interrupt numbers
2026-06-15Merge tag 'driver-core-7.2-rc1' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "Deferred probe: - Fix race where deferred probe timeout work could be permanently canceled by using mod_delayed_work() - Fix missing jiffies conversion in deferred_probe_extend_timeout() - Guard timeout extension with delayed_work_pending() to prevent premature firing - Use system_percpu_wq instead of the deprecated system_wq - Update deferred_probe_timeout documentation device: - Replace direct struct device bitfield access (can_match, dma_iommu, dma_skip_sync, dma_ops_bypass, state_synced, dma_coherent, of_node_reused, offline, offline_disabled) with flag-based accessors using bit operations - Reject devices with unregistered buses - Delete unused DEVICE_ATTR_PREALLOC() - Add low-level device attribute macros with const show/store callbacks, allowing device attributes to reside in read-only memory - Move core device attributes to read-only memory - Constify group array pointers in driver_add_groups() / driver_remove_groups(), struct bus_type, and struct device_driver device property: - Fix fwnode reference leak in fwnode_graph_get_endpoint_by_id() - Initialize all fields of fwnode_handle in fwnode_init() - Provide swnode_get()/swnode_put() wrappers around kobject_get/put() - Allow passing struct software_node_ref_args pointers directly to PROPERTY_ENTRY_REF() driver_override: - Migrate amba, cdx, vmbus, and rpmsg to the generic driver_override infrastructure, fixing a UAF from unsynchronized access to driver_override in bus match() callbacks - Remove the now-unused driver_set_override() firmware loader: - Fix recursive lock deadlock in device_cache_fw_images() when async work falls back to synchronous execution - Fix device reference leak in firmware_upload_register() platform: - Pass KBUILD_MODNAME through the platform driver registration macro to create module symlinks in sysfs for built-in drivers; move module_kset initialization to a pure_initcall and tegra cbb registration to core_initcall to ensure correct ordering - Pass THIS_MODULE implicitly through a coresight_init_driver() macro sysfs: - Upgrade OOB write detection in sysfs_kf_seq_show() from printk to WARN - Add return value clamping to sysfs_kf_read() Rust: - ACPI: Fix missing match data for PRP0001 by exporting acpi_of_match_device() - Auxiliary: Replace drvdata() with dedicated registration data on auxiliary_device. drvdata() exposed the driver's bus device private data beyond the driver's own scope, creating ordering constraints and forcing the data to outlive all registrations that access it. Registration data is instead scoped structurally to the Registration object, making lifecycle ordering enforced by construction rather than convention. - Rust-native device driver lifetimes (HRT): Allow Rust device drivers to carry a lifetime parameter on their bus device private data, tied to the device binding scope -- the interval during which a bus device is bound to a driver. Device resources like pci::Bar<'a> and IoMem<'a> can be stored directly in the driver's bus device private data with a lifetime bounded by the binding scope, so the compiler enforces at build time that they do not outlive the binding. This removes Devres indirection from every access site and eliminates try_access() failure paths in destructors. Bus driver traits use a Generic Associated Type (GAT) Data<'bound> to introduce the lifetime on the private data, rather than parameterizing the Driver trait itself. Auxiliary registration data, where the lifetime is not introduced by a trait callback but must be threaded through Registration, uses the ForLt trait (a type-level abstraction for types generic over a lifetime). Misc: - Fix DT overlayed devices not probing by reverting the broken treewide overlay fix and re-running fw_devlink consumer pickup when an overlay is applied to a bound device - Use root_device_register() for faux bus root device; add sanity check for failed bus init - Fix dev_has_sync_state() data race with READ_ONCE() and move it to base.h - Avoid spurious device_links warning when removing a device while its supplier is unbinding - Switch ISA bus to dynamic root device - Fix suspicious RCU usage in kernfs_put() - Remove devcoredump exit callback - Constify devfreq_event_class" * tag 'driver-core-7.2-rc1' of gitolite.kernel.org:pub/scm/linux/kernel/git/driver-core/driver-core: (81 commits) software node: allow passing reference args to PROPERTY_ENTRY_REF() driver core: platform: set mod_name in driver registration coresight: pass THIS_MODULE implicitly through a macro kernel: param: initialize module_kset in a pure_initcall soc/tegra: cbb: Move driver registration from pure_initcall to core_initcall firmware_loader: Fix recursive lock in device_cache_fw_images() driver core: Use system_percpu_wq instead of system_wq driver core: remove driver_set_override() rpmsg: use generic driver_override infrastructure Drivers: hv: vmbus: use generic driver_override infrastructure cdx: use generic driver_override infrastructure amba: use generic driver_override infrastructure rust: devres: add 'static bound to Devres<T> samples: rust: rust_driver_auxiliary: showcase lifetime-bound registration data rust: auxiliary: generalize Registration over ForLt rust: types: add `ForLt` trait for higher-ranked lifetime support gpu: nova-core: separate driver type from driver data samples: rust: rust_driver_pci: use HRT lifetime for Bar rust: io: make IoMem and ExclusiveIoMem lifetime-parameterized rust: pci: make Bar lifetime-parameterized ...
2026-06-15Merge tag 'pm-7.2-rc1' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Over a half of the changes here are cpufreq updates that include core modifications, fixes of the old-style governors, new hardware support in drivers, assorded driver fixes and cleanups, and the removal of one driver (AMD Elan SC4*). Apart from that, the intel_idle driver will now be able to avoid exposing redundant C-states if PC6 is disabled and there are new sysctl knobs for device suspend/resume watchdog timeouts, hibernation gets built-in LZ4 support for image compression and there is the usual collection of assorted fixes and cleanups. Specifics: - Fix a race between cpufreq suspend and CPU hotplug during system shutdown (Tianxiang Chen) - Avoid redundant target() calls for unchanged limits and fix a typo in a comment in the cpufreq core (Viresh Kumar) - Fix concurrency issues related to sysfs attributes access that affect cpufreq governors using the common governor code (Zhongqiu Han) - Simplify frequency limit handling in the conservative cpufreq governor (Lifeng Zheng) - Fix descriptions of the conservative governor freq_step tunable and the ondemand governor sampling_down_factor tunable in the cpufreq documentation (Pengjie Zhang) - Fix use-after-free and double free during _OSC evaluation in the PCC cpufreq driver (Yuho Choi) - Rework the handling of policy min and max frequency values in the cpufreq core to allow drivers to specify special initial values for the scaling_min_freq and scaling_max_freq sysfs attributes (Pierre Gondois) - Add cpufreq scaling support for Qualcomm Shikra SoC (Taniya Das, Imran Shaik). - Improve the warning message on HWP-disabled hybrid processors printed by the intel_pstate driver and sync policy->cur during CPU offline in it (Yohei Kojima, Fushuai Wang) - Drop cpufreq support for AMD Elan SC4* (Sean Young) - Minor fixes for cpufreq drivers (Krzysztof Kozlowski, Akashdeep Kaur, Hans Zhang, Guangshuo Li, Xueqin Luo) - Clean up dead dependencies on X86 in the cpufreq Kconfig (Julian Braha) - Allow the intel_idle driver to avoid exposing C-states that are redundant when PC6 is disabled (Artem Bityutskiy) - Fix memory leak and a potential race in the OPP core (Abdun Nihaal, Di Shen) - Mark Rust OPP methods as inline (Nicolás Antinori) - Fix misc device registration failure path in the PM QoS core (Yuho Choi) - Add sysctl interface for DPM watchdog timeouts (Tzung-Bi Shih) - Use complete() instead of complete_all() in device_pm_sleep_init() to avoid a false-positive warning from lockdep_assert_RT_in_threaded_ctx() when CONFIG_PROVE_RAW_LOCK_NESTING is enabled (Jiakai Xu) - Use a flexible array for CRC uncompressed buffers during hibernation image saving (Rosen Penev) - Make the LZ4 algorithm available for hibernation compression (l1rox3) - Move the preallocate_image() call during hibernation after the "prepare" phase of the "freeze" transition (Matthew Leach) - Fix a memory leak in rapl_add_package_cpuslocked() in the intel_rapl power capping driver and use sysfs_emit() in cpumask_show() in that driver (Sumeet Pawnikar, Yury Norov) - Fix ValueError when parsing incomplete device properties in the pm-graph utility (Gongwei Li)" * tag 'pm-7.2-rc1' of gitolite.kernel.org:pub/scm/linux/kernel/git/rafael/linux-pm: (40 commits) PM: dpm_watchdog: Add sysctl interface for DPM watchdog timeouts PM: QoS: Fix misc device registration unwind cpufreq: Use policy->min/max init as QoS request cpufreq: Remove driver default policy->min/max init cpufreq: Set default policy->min/max values for all drivers cpufreq: Extract cpufreq_policy_init_qos() function cpufreq: Documentation: fix conservative governor freq_step description cpufreq: ti: Add EPROBE_DEFER for K3 SoCs cpufreq: qcom: Add cpufreq scaling support for Qualcomm Shikra SoC dt-bindings: cpufreq: Document Qualcomm Shikra SoC EPSS powercap: intel_rapl: Use sysfs_emit() in cpumask_show() cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load cpufreq: governor: Fix data races on per-CPU idle/nice baselines PM: hibernate: Use flexible array for CRC uncompressed buffers powercap: intel_rapl: Fix memory leak in rapl_add_package_cpuslocked() PM: hibernate: make LZ4 available for hibernation compression PM: sleep: Use complete() in device_pm_sleep_init() opp: rust: mark OPP methods as inline cpufreq: intel_pstate: Improve warning message on HWP-disabled hybrid CPUs cpufreq: elanfreq: Drop support for AMD Elan SC4* ...
2026-06-15Merge tag 'rcu.release.v7.2' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Uladzislau Rezki: "Torture test updates: - Improve kvm-series.sh script by adding examples in its header comment - Lazy RCU is more fully tested now by replacing call_rcu_hurry() with call_rcu() and doing rcu_barrier() to motivate lazy callbacks during a stutter pause - Add more synonyms for the "--do-normal" group of torture.sh command-line arguments Misc changes: - Reduce stack usage of nocb_gp_wait() to address frame size warning when built with CONFIG_UBSAN_ALIGNMENT - The synchronize_rcu() call can detect the flood and latches a normal/default path temporary switching to wait_rcu_gp() path - Document using rcu_access_pointer() to fetch the old pointer for lockless cmpxchg() updates - Simplify some RCU code using clamp_val() - Fix a kerneldoc header comment typo in srcu_down_read_fast()" * tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux: rcu/nocb: reduce stack usage in nocb_gp_wait() rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks() rcu: Latch normal synchronize_rcu() path on flood rcu: Document rcu_access_pointer() feeding into cmpxchg() rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val() rcu: Simplify rcu_do_batch() by applying clamp() checkpatch: Undeprecate rcu_read_lock_trace() and rcu_read_unlock_trace() srcu: Fix kerneldoc header comment typo in srcu_down_read_fast() torture: Allow "norm" abbreviation for "normal" torture: Improve kvm-series.sh header comment torture: Add torture_sched_set_normal() for user-specified nice values rcutorture: Fully test lazy RCU
2026-06-14bpf: Add support to specify uprobe_multi target via file descriptorJiri Olsa
Allow uprobe_multi link to identify the target binary by an already opened file descriptor. Adding new BPF_F_UPROBE_MULTI_PATH_FD flag and the path_fd field for the attr.link_create.uprobe_multi struct. When the flag is set, we resolve the target from path_fd, without the flag, we keep the existing string path behavior. I don't see a use case for supporting O_PATH file descriptors, because we need to read the binary first to get probes offsets, so I'm using the CLASS(fd, f), which fails for O_PATH fds. Assisted-by: Codex:GPT-5.4 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14bpf: Use user_path_at for path resolution in uprobe_multiJiri Olsa
Resolve the uprobe_multi user path with user_path_at() instead of copying the string with strndup_user() and passing it to kern_path(). This removes the temporary allocation and keeps the lookup logic in one helper. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14bpf: Guard __get_user acesss with access_ok for uprobe_multi dataJiri Olsa
As reported by sashiko [1] we need to use access_ok to check the user space data bounds before we use __get-user to get it. [1] https://lore.kernel.org/bpf/20260610145235.CB1441F00893@smtp.kernel.org/ Fixes: 0b779b61f651 ("bpf: Add cookies support for uprobe_multi link") Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-15Merge tag 'vfs-7.2-rc1.procfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs updates from Christian Brauner: - Revamp fs/filesystems.c The file was a mess with a hand-rolled linked list in desperate need of a cleanup. The filesystems list is now RCU-ified, /proc files can be marked permanent from outside fs/proc/, and the string emitted when reading /proc/filesystems is pre-generated and cached instead of pointer-chasing and printfing entry by entry on every read. The file is read frequently because libselinux reads it and is linked into numerous frequently used programs (even ones you would not suspect, like sed!). Scalability also improves since reference maintenance on open/close is bypassed. open+read+close cycle single-threaded (ops/s): before: 442732 after: 1063462 (+140%) open+read+close cycle with 20 processes (ops/s): before: 606177 after: 3300576 (+444%) A follow-up patch adds missing unlocks in some corner cases and tidies things up. - Relax the mount visibility check for subset=pid mounts When procfs is mounted with subset=pid, all static files become unavailable and only the dynamic pid information is accessible. In that case there is no point in imposing the full mount visibility restrictions on the mounter - everything that can be hidden in procfs is already inaccessible. These restrictions prevented procfs from being mounted inside rootless containers since almost all container implementations overmount parts of procfs to hide certain directories. As part of this /proc/self/net is only shown in subset=pid mounts for CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the SB_I_USERNS_VISIBLE superblock flag is replaced with an FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are recorded in a list, and the mount restrictions are finally documented. - Protect ptrace_may_access() with exec_update_lock in procfs Most uses of ptrace_may_access() in procfs should hold exec_update_lock to avoid TOCTOU issues with concurrent privileged execve() (like setuid binary execution). This fixes the easy cases - the owner and visibility checks and the FD link permission checks - with the gnarlier ones to follow later. * tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix ups and tidy ups to /proc/filesystems caching proc: protect ptrace_may_access() with exec_update_lock (FD links) proc: protect ptrace_may_access() with exec_update_lock (part 1) docs: proc: add documentation about mount restrictions proc: handle subset=pid separately in userns visibility checks proc: prevent reconfiguring subset=pid proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN fs: cache the string generated by reading /proc/filesystems sysfs: remove trivial sysfs_get_tree() wrapper fs: RCU-ify filesystems list fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED proc: allow to mark /proc files permanent outside of fs/proc/ namespace: record fully visible mounts in list
2026-06-15Merge tag 'vfs-7.2-rc1.xattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull simple_xattr updates from Christian Brauner: "This reworks the simple xattr api to make it more efficient and easier to use for all consumers. The simple_xattr hash table moves from the inode into a per-superblock cache, removing the per-inode overhead for the common case of few or no xattrs. The interface now passes struct simple_xattrs ** so lazy allocation is handled internally instead of by every caller, kernfs xattr operations on kernfs nodes shared between multiple superblocks are properly serialized, and tmpfs constructs "security.foo" xattr names with kasprintf() instead of kmalloc() plus two memcpy()s. A follow-up fix links kernfs nodes to their parent before the LSM init hook runs: with the per-sb cache kernfs_xattr_set() computes the cache via kernfs_root(kn), which faulted on a freshly allocated node when selinux_kernfs_init_security() called into it - reproducible as a NULL pointer dereference on the first cgroup mkdir on SELinux-enabled systems. On top of this bpffs gains support for trusted.* and security.* xattrs so that user space and BPF LSM programs can attach metadata - for example a content hash or a security label - to pinned objects and directories and inspect it uniformly like on other filesystems. The store is in-memory and non-persistent, living only for the lifetime of the mount like everything else in bpffs" * tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bpf: Add simple xattr support to bpffs kernfs: link kn to its parent before the LSM init hook simpe_xattr: use per-sb cache simple_xattr: change interface to pass struct simple_xattrs ** tmpfs: simplify constructing "security.foo" xattr names kernfs: fix xattr race condition with multiple superblocks
2026-06-15Merge tag 'kernel-7.2-rc1.task_exec_state' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull task_exec_state updates from Christian Brauner: "This introduces a new per-task task_exec_state structure and relocates the dumpable mode and the user namespace captured at execve() from mm_struct onto it. It stays attached to the task for its full lifetime. __ptrace_may_access() and several /proc owner and visibility checks need to consult two pieces of state for any observable task, including zombies that have already gone through exit_mm(): the dumpable mode and the user namespace captured at execve(). Both live on mm_struct today, which exit_mm() clears from the task long before the task is reaped. A reader that races with do_exit() observes task->mm == NULL and either fails the check or falls back to init_user_ns - which denies legitimate access to non-dumpable zombies that were running in a nested user namespace. mm_struct loses ->user_ns and the dumpability bits in ->flags. MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and its exit_mm() snapshot are removed. task_exec_state is the privilege domain established by an execve(). Within a thread group it is shared via refcount; across thread groups each task has its own: - CLONE_VM siblings (thread-group members, io_uring workers) refcount-share the parent's exec_state. - Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns. - execve() in the child allocates a fresh instance and installs it under task_lock + exec_update_lock via task_exec_state_replace(). - Credential changes (setresuid, capset, ...) and prctl(PR_SET_DUMPABLE) update dumpability on the current task's exec_state, i.e., on the thread group's shared instance. On top of this exec_mmap() no longer tears down the old mm while holding exec_update_lock for writing and cred_guard_mutex. Neither lock is needed for that: exec_update_lock only exists to make the mm swap atomic with the later commit_creds() and all its readers operate on the new mm; none looks at the detached old mm. The cost was real: __mmput() runs exit_mmap() over the entire old address space and can block in exit_aio() waiting for in-flight AIO, so execve() of a large process blocked ptrace_attach() and every exec_update_lock reader for the duration of the teardown. The old mm is now stashed in bprm->old_mm and released from setup_new_exec() after both locks are dropped, with a backstop in free_bprm() for the error paths" * tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exec: free the old mm outside the exec locks exec_state: relocate dumpable information ptrace: add ptracer_access_allowed() exec: introduce struct task_exec_state sched/coredump: introduce enum task_dumpable