linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2026-06-16	block: invalidate cached plug timestamp after task switch	Usama Arif
	blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime and marks the task with PF_BLOCK_TS. That cache is only valid while the task keeps running; if the task is switched out, wall-clock time advances and the cached value must not be reused when the task runs again. The existing invalidation covers explicit plug flushes through __blk_flush_plug(), and the schedule() / rtmutex paths through sched_update_worker(). It does not cover in-kernel preemption paths such as preempt_schedule(), preempt_schedule_notrace(), and preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and return without calling sched_update_worker(). As a result, a task preempted while holding a plug with PF_BLOCK_TS set can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost then consumes that stale timestamp through ioc_now(), producing stale vnow values for throttle decisions, and through ioc_rqos_done(), inflating on-queue time and feeding false missed-QoS samples into vrate adjustment. Move the schedule-side invalidation to finish_task_switch(), which runs for the scheduled-in task after every actual context switch regardless of which schedule entry point was used. Keep __blk_flush_plug() as the explicit flush/finish-plug invalidation path, and remove only the PF_BLOCK_TS handling from sched_update_worker(). Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-3-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-16	kernel/fork: clear PF_BLOCK_TS in copy_process()	Usama Arif
	PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is non-NULL, and blk_finish_plug() clears it via __blk_flush_plug() before NULLing the plug pointer. copy_process() breaks the invariant by inheriting PF_BLOCK_TS from the parent while resetting the child's plug to NULL. Clear PF_BLOCK_TS alongside that assignment so callers can rely on "PF_BLOCK_TS set implies current->plug != NULL" and dereference current->plug unguarded. Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-2-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-06-16	Merge branch 'idle-time-acc' into features	Alexander Gordeev
	Heiko Carstens says: =================== This is supposed to improve s390 idle time accounting, and brings it back to the state it was before arch_cpu_idle_time() was removed from s390 [3]. In result all cpu time accounting is done by the s390 architecture backend again, instead of having a mix of architecure specific and common code accounting (common code: idle, s390 architecture: everything else). =================== Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-06-16	Merge tag 'trace-latency-v7.2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing latency updates from Steven Rostedt: - Dump the stack to the buffer on timerlat uret threashold event Record the stack trace in the buffer for THREAD_URET as well as THREAD_CONTEXT when the threshold is hit. Otherwise, if the threshold was not hit at task wakeup, but was at task return, it will not produce a stack trace making it harder to debug. - Have osnoise trace prints print to all buffers The osnoise tracer is allowed to print to the main buffer. Add a osnoise_print() helper function and use trace_array_vprintk() to print osnoise output. * tag 'trace-latency-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/osnoise: Array printk init and cleanup tracing/osnoise: Dump stack on timerlat uret threshold event
2026-06-16	Merge tag 'probes-v7.2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - BTF support for dereferencing pointers Add syntax to the parsing of eprobes to typecast structure pointer trace event fields, enabling BTF-based dereferencing instead of relying on manual offsets. - Improvements and robustness enhancements - Use flexible array for entry fetch code. Store probe entry fetch instructions in the probe_entry_arg allocation via a flexible array member to simplify memory allocation and lifetime management. - Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions Replace BUG_ON() calls with lockdep_assert_held() in uprobe buffer enable/disable paths to prevent kernel crashes and better verify lock ownership. - Ensure the uprobe buffer size is bigger than event size. Add a BUILD_BUG_ON() assertion to guarantee that the per-CPU uprobe working buffer size is always larger than the maximum probe event size. * tag 'probes-v7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/eprobes: Allow use of BTF names to dereference pointers tracing: Replace BUG_ON with lockdep_assert_held in uprobe_buffer functions tracing: Use flexible array for entry fetch code tracing/probes: Ensure the uprobe buffer size is bigger than event size
2026-06-16	Merge tag 'linux_kselftest-kunit-7.2-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull kunit updates from Shuah Khan: "Fixes to tool and kunit core and new features to both to support JUnit XML (primitive) and backtrace suppression API: - Core support for suppressing warning backtraces - Parse and print the reason tests are skipped - Add (primitive) support for outputting JUnit XML - Don't write to stdout when it should be disabled - Add backtrace suppression self-tests - Suppress intentional warning backtraces in scaling unit tests - Add documentation for warning backtrace suppression API - Fix spelling mistakes in comments and messages - gen_compile_commands: Ignore libgcc.a - qemu_configs: Add or1k / openrisc configuration" * tag 'linux_kselftest-kunit-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: kunit:tool: Don't write to stdout when it should be disabled kunit: tool: Add (primitive) support for outputting JUnit XML kunit: tool: Parse and print the reason tests are skipped kunit: Add documentation for warning backtrace suppression API drm: Suppress intentional warning backtraces in scaling unit tests kunit: Add backtrace suppression self-tests bug/kunit: Core support for suppressing warning backtraces kunit: Fix spelling mistakes in comments and messages kunit: qemu_configs: Add or1k / openrisc configuration gen_compile_commands: Ignore libgcc.a
2026-06-16	Merge tag 'slab-for-7.2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver) - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig) - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka) - Several fixups for the slabinfo tool (Xuewen Wang) * tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: do not limit zeroing to orig_size when only red zoning is enabled mm/slub: preserve original size in _kmalloc_nolock_noprof retry path mm: simplify the mempool_alloc_bulk API mm/slab: improve kmem_cache_alloc_bulk mm/slub: detach and reattach partial slabs in batch mm/slub: introduce helpers for node partial slab state mm/slub: use empty sheaf helpers for oversized sheaves tools/mm/slabinfo: remove redundant slab->partial assignment tools/mm/slabinfo: remove dead assignment in get_obj_and_str() tools/mm/slabinfo: Fix trace disable logic inversion MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR mm/slub: fix typo in sheaves comment mm, slab: simplify returning slab in __refill_objects_node() mm, slab: add an optimistic __slab_try_return_freelist() slab: fix kernel-docs for mm-api slab: improve KMALLOC_PARTITION_RANDOM randomness slab: support for compiler-assisted type-based slab cache partitioning mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-06-15	s390/idle: Provide arch specific kcpustat_field_idle()/kcpustat_field_iowait()	Heiko Carstens
	The former s390 specific arch_cpu_idle_time() implementation was removed, since its implementation was racy and reported idle time could go backwards [1]. However this removal was not necessary, since independently of the s390 architecture specific races there exists the iowait counter update race, which can also lead to reported idle time going backwards [2]. With Frederic Weisbecker's recent cpu idle time accounting refactoring kernel_cpustat got a sequence counter. Use this to implement s390 specific variants of kcpustat_field_idle() and kcpustat_field_iowait(). This is logically a revert of [1] and moves cpu idle time accounting back into s390 architecture code, which is also more precise than the dyntick idle time accounting by nohz/scheduler. For comparing cross cpu time stamps it is necessary to use the stcke instead of the stckf instruction in irq entry path. Furthermore this open-codes a sequence lock in assembler and C code, which is required to update the irq entry time stamp to the per cpu idle_data structure in a race free manner. [1] commit be76ea614460 ("s390/idle: remove arch_cpu_idle_time() and corresponding code") [2] commit ead70b752373 ("timers/nohz: Add a comment about broken iowait counter update race") Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
2026-06-15	Merge tag 'sched-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "SMP load-balancing updates: - A large series to introduce infrastructure for cache-aware load balancing, with the goal of co-locating tasks that share data within the same Last Level Cache (LLC) domain. By improving cache locality, the scheduler can reduce cache bouncing and cache misses, ultimately improving data access efficiency. Implemented by Chen Yu and Tim Chen, based on early prototype work by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and Shrikanth Hegde. - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde) Fair scheduler updates: - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing SMT awareness (Andrea Righi, K Prateek Nayak) - A series to optimize cfs_rq and sched_entity allocation for better data locality (Zecheng Li) - A preparatory series to change fair/cgroup scheduling to a single runqueue, without the final change (Peter Zijlstra) - Auto-manage ext/fair dl_server bandwidth (Andrea Righi) - Fix cpu_util runnable_avg arithmetic (Hongyan Xia) - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel) - Allow account_cfs_rq_runtime() to throttle current hierarchy (K Prateek Nayak) - Update util_est after updating util_avg during dequeue, to fix the util signal update logic, which reduces signal noise (Vincent Guittot) Scheduler topology updates: - Allow multiple domains to claim sched_domain_shared (K Prateek Nayak) - Add parameter to split LLC (Peter Zijlstra) Core scheduler updates: - Use trace_call__<tp>() to save a static branch (Gabriele Monaco) Scheduler statistics updates: - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation guard (Nicolas Pitre) Deadline scheduler updates: - Reject debugfs dl_server writes for offline CPUs (Andrea Righi) - Fix replenishment logic for non-deferred servers (Yuri Andriaccio) RT scheduling updates: - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt) - Update default bandwidth for real-time tasks to 1.0 (Yuri Andriaccio) Proxy scheduling updates: - A series to implement Optimized Donor Migration for Proxy Execution (John Stultz, Peter Zijlstra) - Various proxy scheduling cleanups and fixes (Peter Zijlstra, K Prateek Nayak) Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi, Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde, Peter Zijlstra, Liang Luo and Yiyang Chen" * tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits) sched/fair: Fix newidle vs core-sched sched/deadline: Use task_on_rq_migrating() helper sched/core: Combine separate 'else' and 'if' statements sched/fair: Fix cpu_util runnable_avg arithmetic sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() sched/fair: Call update_curr() before unthrottling the hierarchy sched/fair: Use throttled_csd_list for local unthrottle sched/fair: Convert cfs bandwidth throttling to use guards sched/fair: Allocate cfs_tg_state with percpu allocator sched/fair: Remove task_group->se pointer array sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state sched: restore timer_slack_ns when resetting RT policy on fork MAINTAINERS: Fix spelling mistake in Peter's name sched: Simplify ttwu_runnable() sched/proxy: Remove superfluous clear_task_blocked_in() sched/proxy: Remove PROXY_WAKING sched/proxy: Switch proxy to use p->is_blocked sched/proxy: Only return migrate when needed sched: Be more strict about p->is_blocked ...
2026-06-15	Merge tag 'perf-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull performance events updates from Ingo Molnar: "Core perf code updates: - Reveal PMU type in fdinfo (Chun-Tse Shao) Intel CPU PMU driver updates: - Fix various inaccurate hard-coded event configurations (Dapeng Mi) Intel uncore PMU driver updates (Zide Chen): - Fix discovery unit lookup bug for multi-die systems - Guard against invalid box control address - Fix PCI device refcount leak in UPI discovery - Defer ADL global PMON enable to enable_box() to save power - Fix uncore_die_to_cpu() for offline dies - Implement global init callback for GNR uncore AMD CPU PMU driver updates: - Always use the NMI latency mitigation (Sandipan Das) AMD uncore PMU driver updates: - Use Node ID to identify DF and UMC domains (Sandipan Das)" * tag 'perf-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (22 commits) perf/x86/amd/uncore: Use Node ID to identify DF and UMC domains perf: Reveal PMU type in fdinfo perf/x86/intel/uncore: Implement global init callback for GNR uncore perf/x86/intel/uncore: Fix uncore_die_to_cpu() for offline dies perf/x86/intel/uncore: Move die_to_cpu() to uncore.c perf/x86/intel/uncore: Defer ADL global PMON enable to enable_box() perf/x86/intel/uncore: Fix PCI device refcount leak in UPI discovery perf/x86/intel/uncore: Guard against invalid box control address perf/x86/intel/uncore: Fix discovery unit lookup for multi-die systems perf/x86/amd/core: Always use the NMI latency mitigation perf/x86/intel: Update event constraints and cache_extra_regsfor CWF perf/x86/intel: Update event constraints and cache_extra_regsfor SRF perf/x86/intel: Update event constraints and cache_extra_regsfor NVL perf/x86/intel: Update event constraints for PTL perf/x86/intel: Update event constraints and cache_extra_regsfor ARL perf/x86/intel: Update event constraints and cache_extra_regsfor LNL perf/x86/intel: Update event constraints and cache_extra_regsfor MTL perf/x86/intel: Update event constraints and cache_extra_regsfor ADL perf/x86/intel: Update event constraints for DMR perf/x86/intel: Update event constraints and cache_extra_regsfor SPR ...
2026-06-15	Merge tag 'locking-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "Futex updates: - Optimize futex hash bucket access patterns (Peter Zijlstra) - Large series to address the robust futex unlock race for real, by Thomas Gleixner: "The robust futex unlock mechanism is racy in respect to the clearing of the robust_list_head::list_op_pending pointer because unlock and clearing the pointer are not atomic. The race window is between the unlock and clearing the pending op pointer. If the task is forced to exit in this window, exit will access a potentially invalid pending op pointer when cleaning up the robust list. That happens if another task manages to unmap the object containing the lock before the cleanup, which results in an UAF. In the worst case this UAF can lead to memory corruption when unrelated content has been mapped to the same address by the time the access happens. User space can't solve this problem without help from the kernel. This series provides the kernel side infrastructure to help it along: 1) Combined unlock, pointer clearing, wake-up for the contended case 2) VDSO based unlock and pointer clearing helpers with a fix-up function in the kernel when user space was interrupted within the critical section. ... with help by André Almeida: - Add a note about robust list race condition (André Almeida) - Add self-tests for robust release operations (André Almeida) Context analysis updates: - Implement context analysis for 'struct rt_mutex'. (Bart Van Assche) - Bump required Clang version to 23 (Marco Elver) Guard infrastructure updates: - Series to remove NULL check from unconditional guards (Dmitry Ilvokhin) Lockdep updates: - Restore self-test migrate_disable() and sched_rt_mutex state on PREEMPT_RT (Karl Mehltretter) Membarriers updates: - Use per-CPU mutexes for targeted commands (Aniket Gattani) - Modernize membarrier_global_expedited with cleanup guards (Aniket Gattani) - Add rseq stress test for CFS throttle interactions (Aniket Gattani) percpu-rwsems updates: - Extract __percpu_up_read() to optimize inlining overhead (Dmitry Ilvokhin) Seqlocks updates: - Allow UBSAN_ALIGNMENT to fail optimizing (Heiko Carstens) Lock tracing: - Add contended_release tracepoint to sleepable locks such as mutexes, percpu-rwsems, rtmutexes, rwsems and semaphores (Dmitry Ilvokhin) MAINTAINERS updates: - MAINTAINERS: Add RUST [SYNC] entry (Boqun Feng) Misc updates and fixes by Randy Dunlap, YE WEI-HONG, Fabricio Parra, Dmitry Ilvokhin and Peter Zijlstra" * tag 'locking-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (36 commits) locking: Add contended_release tracepoint to sleepable locks locking/percpu-rwsem: Extract __percpu_up_read() tracing/lock: Remove unnecessary linux/sched.h include futex: Optimize futex hash bucket access patterns rust: sync: completion: Mark inline complete_all and wait_for_completion MAINTAINERS: Add RUST [SYNC] entry cleanup: Specify nonnull argument index selftests: futex: Add tests for robust release operations Documentation: futex: Add a note about robust list race condition x86/vdso: Implement __vdso_futex_robust_try_unlock() x86/vdso: Prepare for robust futex unlock support futex: Provide infrastructure to plug the non contended robust futex unlock race futex: Add robust futex unlock IP range futex: Add support for unlocking robust futexes futex: Cleanup UAPI defines x86: Select ARCH_MEMORY_ORDER_TSO uaccess: Provide unsafe_atomic_store_release_user() futex: Provide UABI defines for robust list entry modifiers futex: Move futex related mm_struct data into a struct futex: Make futex_mm_init() void ...
2026-06-15	Merge tag 'timers-vdso-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull vdso updates from Thomas Gleixner: - Remove the redundant CONFIG_GENERIC_TIME_VSYSCALL after converting the remaining users over. - Rework and sanitize the MIPS VDSO handling, so it does not handle the time related VDSO if there is no VDSO capable clocksource available. Also stop mapping VDSO data pages unconditionally even if there is no usage possible. * tag 'timers-vdso-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: MIPS: VDSO: Fold MIPS_CLOCK_VSYSCALL into MIPS_GENERIC_GETTIMEOFDAY MIPS: VDSO: Gate microMIPS restriction on GCC version MIPS: VDSO: Fold MIPS_DISABLE_VDSO into MIPS_GENERIC_GETTIMEOFDAY clocksource/drivers/mips-gic-timer: Only use VDSO_CLOCKMODE_GIC when it is a available MIPS: csrc-r4k: Only use VDSO_CLOCKMODE_R4K when it is a available MIPS: VDSO: Only map the data pages when the vDSO is used MIPS: Introduce Kconfig MIPS_GENERIC_GETTIMEOFDAY vdso/datastore: Always provide symbol declarations MAINTAINERS: Add include/linux/vdso_datastore.h to vDSO block vdso/gettimeofday: Rename __arch_get_vdso_u_timens_data() vdso/treewide: Drop GENERIC_TIME_VSYSCALL vdso/vsyscall: Gate update_vsyscall() behind CONFIG_GENERIC_GETTIMEOFDAY riscv: vdso: Drop CONFIG_GENERIC_TIME_VSYSCALL guard around syscall fallbacks
2026-06-15	Merge tag 'timers-ptp-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "Updates for NTP/timekeeping and PTP: - Expand timekeeping snapshot mechanisms The various snapshot functions are mostly used for PTP to collect "atomic" snapshots of various involved clocks. They lack support for the recently introduced AUX clocks and do not provide the underlying counter value (e.g. TSC) to user space. Exposing the counter value snapshot allows for better control and steering. Convert the hard wired ktime_get_snapshot() to take a clock ID, which allows the caller to select the clock ID to be captured along with CLOCK_MONONOTONIC_RAW. Additionally capture the underlying hardware counter value and the clock source ID of the counter. Expand the hardware based snapshot capture where devices provide a mechanism to snapshot the hardware PTP clock and the system counter (usually via PCI/PTM) to support AUX clocks and also provide the captured counter value back to the caller and not only the clock timestamps derived from it. - Add a new optional read_snapshot() callback to clocksources That is required to capture atomic snapshots from clocksources which are derived from TSC with a scaling mechanism (e.g. Hyper-V, KVMclock). The value pair is handed back in the snapshot structure to the callers, so they can do the necessary correlations in a more precise way. This touches usage sites of the affected functions and data structure all over the tree, but stays fully backwards compatible for the existing user space exposed interfaces. New PTP IOCTLs will provide access to the extended functionality in later kernel versions" * tag 'timers-ptp-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (28 commits) ptp: vmclock: Use hw_cycles from snapshot for precise TSC pairing x86/kvmclock: Implement read_snapshot() for kvmclock clocksource clocksource/hyperv: Implement read_snapshot() for TSC page clocksource timekeeping: Add clocksource read_snapshot() method and hw_cycles to snapshot ptp: Switch to ktime_get_snapshot_id() for pre/post timestamps timekeeping: Add support for AUX clock cross timestamping timekeeping: Remove system_device_crosststamp::sys_realtime ALSA: hda/common: Use system_device_crosststamp::sys_systime wifi: iwlwifi: Use system_device_crosststamp::sys_systime ptp: Use system_device_crosststamp::sys_systime timekeeping: Prepare for cross timestamps on arbitrary clock IDs timekeeping: Remove ktime_get_snapshot() virtio_rtc: Use provided clock ID for history snapshot net/mlx5: Use provided clock ID for history snapshot igc: Use provided clock ID for history snapshot ice/ptp: Use provided clock ID for history snapshot wifi: iwlwifi: Adopt PTP cross timestamps to core changes timekeeping: Add CLOCK ID to system_device_crosststamp timekeeping: Add system_counterval_t to struct system_device_crosststamp timekeeping: Add CLOCK_AUX support for ktime_get_snapshot_id() ...
2026-06-15	Merge tag 'timers-nohz-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Thomas Gleixner: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. * tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: sched/cputime: Handle dyntick-idle steal time correctly sched/cputime: Handle idle irqtime gracefully sched/cputime: Provide get_cpu_[idle\|iowait]_time_us() off-case tick/sched: Consolidate idle time fetching APIs tick/sched: Account tickless idle cputime only when tick is stopped tick/sched: Remove unused fields tick/sched: Move dyntick-idle cputime accounting to cputime code tick/sched: Remove nohz disabled special case in cputime fetch tick/sched: Unify idle cputime accounting s390/time: Prepare to stop elapsing in dynticks-idle powerpc/time: Prepare to stop elapsing in dynticks-idle sched/cputime: Correctly support generic vtime idle time sched/cputime: Remove superfluous and error prone kcpustat_field() parameter sched/idle: Handle offlining first in idle loop tick/sched: Fix TOCTOU in nohz idle time fetch
2026-06-15	Merge tag 'timers-core-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: "Updates for the time/timer core subsystem: - Harden the user space controllable hrtimer interfaces further to protect against unpriviledged DoS attempts by arming timers in the past. - Add per-capacity hierarchies to the timer migration code to prevent timer migration accross different capacity domains. This code has been disabled last minute as there is a pathological problem with SoCs which advertise a larger number of capacity domains. The problem is under investigation and the code won't be active before v7.3, but that turned out to be less intrusive than a full revert as it preserves the preparatory steps and allows people to work on the final resolution - Export time namespace functionality as a recent user can be built as a module. - Initialize the jiffies clocksource before using it. The recent hardening against time moving backward requires that the related members of struct clocksource have been initialized, otherwise it clamps the readout to 0, which makes time stand sill and causes boot delays. - Fix a more than twenty year old PID reference count leak in an error path of the POSIX CPU timer code. - The usual small fixes, improvements and cleanups all over the place" * tag 'timers-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (31 commits) posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path time/jiffies: Register jiffies clocksource before usage timers/migration: Temporarily disable per capacity hierarchies timers/migration: Turn tmigr_hierarchy level_list into a flexible array timers/migration: Deactivate per-capacity hierarchies under nohz_full timers/migration: Fix hotplug migrator selection target on asymetric capacity machines ntsync: Honour caller's time namespace for absolute MONOTONIC timeouts time/namespace: Export init_time_ns and do_timens_ktime_to_host() timers/migration: Update stale @online doc to @available timers: Fix flseep() typo in kernel-doc comment hrtimer: Fix the bogus return type of __hrtimer_start_range_ns() hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without() clocksource: Clean up clocksource_update_freq() functions alarmtimer: Remove stale return description from alarm_handle_timer() selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements scripts/timers: Add timer_migration_tree.py timers/migration: Handle capacity in connect tracepoints timers/migration: Split per-capacity hierarchies timers/migration: Track CPUs in a hierarchy timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness ...
2026-06-15	Merge tag 'timers-clocksource-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull clocksource updates from Thomas Gleixner: "Updates for clocksource/clockevent drivers: - Add devm helpers for clocksources, which allows to simplify driver teardown and probe failure handling. - More module conversion work - Update the support for the ARM EL2 virtual timer including the required ACPI changes. - Add clockevent and clocksource support for the TI Dual Mode Timer - Fix the support for multiple watchdog instances in the TEGRA186 driver - Add D1 timer support to the SUN5I driver - The usual devicetree updates, cleanups and small fixes all over the place" * tag 'timers-clocksource-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (24 commits) clocksource: move NXP timer selection to drivers/clocksource clocksource/drivers/timer-tegra186: Reserve and service a kernel watchdog clocksource/drivers/timer-tegra186: Register all accessible watchdog timers clocksource/drivers/timer-tegra186: Correct num_wdts for Tegra186 and Tegra234 clocksource/drivers/timer-tegra186: Fix support for multiple watchdog instances clocksource/drivers/timer-ti-dm: Add clockevent support clocksource/drivers/timer-ti-dm: Add clocksource support clocksource/drivers/timer-ti-dm: Fix property name in comment dt-bindings: timer: arm,arch_timer: Fix requirements for interrupt description clocksource/drivers/arm_arch_timer: Default to EL2 virtual timer when running VHE ACPI: GTDT: Parse information related to the EL2 virtual timer ACPI: GTDT: Account for GTDTv3 size when walking the platform timer descriptors clocksource: Add devm_clocksource_register_*() helpers clocksource/drivers/sun5i: Add D1 hstimer support dt-bindings: timer: allwinner,sun5i-a13-hstimer: add H616 and D1 dt-bindings: timer: Add StarFive JHB100 clint dt-bindings: timer: renesas,rz-mtu3: document RZ/{T2H,N2H} dt-bindings: timer: renesas,rz-mtu3: Remove TCIU8 interrupt dt-bindings: timer: Remove sifive,fine-ctr-bits property clocksource/drivers/timer-of: Make the code compatible with modules ...
2026-06-15	Merge tag 'irq-core-2026-06-13' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull interrupt core updates from Thomas Gleixner: - Rework of /proc/interrupt handling: /proc/interrupts was subject to micro optimizations for a long time, but most of the low hanging fruit was left on the table. This rework addresses the major time consuming issues: - Printing a long series of zeros one by one via a format string instead of counting subsequent zeros and emitting a string constant. - Simplify and cache the conditions whether interrupts should be printed - Use a proper iteration over the interrupt descriptor xarray instead of walking and testing one by one. - Provide helper functions for the architecture code to emit the architecture specific counters - Convert the counter structure in x86 to an array, which simplifies the output and add mechanisms to suppress unused architecture interrupts, which just occupy space for nothing. Adopt the new core mechanisms. This adjusts the gdb scripts related to interrupt counter statistics to work with the new mechanisms. - Prevent a string overflow in the /proc/irq/$N/ directory name creation code. * tag 'irq-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: x86/irq: Add missing 's' back to thermal event printout genirq/proc: Speed up /proc/interrupts iteration genirq/proc: Runtime size the chip name genirq: Expose irq_find_desc_at_or_after() in core code genirq: Add rcuref count to struct irq_desc genirq/proc: Increase default interrupt number precision to four genirq: Calculate precision only when required genirq: Cache the condition for /proc/interrupts exposure genirq/manage: Make NMI cleanup RT safe genirq: Expose nr_irqs in core code scripts/gdb: Update x86 interrupts to the array based storage x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_stats x86/irq: Suppress unlikely interrupt stats by default x86/irq: Make irqstats array based genirq/proc: Utilize irq_desc::tot_count to avoid evaluation genirq/proc: Avoid formatting zero counts in /proc/interrupts x86/irq: Optimize interrupts decimals printing genirq/proc: Size interrupt directory names for 10-digit interrupt numbers
2026-06-15	Merge tag 'driver-core-7.2-rc1' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Danilo Krummrich: "Deferred probe: - Fix race where deferred probe timeout work could be permanently canceled by using mod_delayed_work() - Fix missing jiffies conversion in deferred_probe_extend_timeout() - Guard timeout extension with delayed_work_pending() to prevent premature firing - Use system_percpu_wq instead of the deprecated system_wq - Update deferred_probe_timeout documentation device: - Replace direct struct device bitfield access (can_match, dma_iommu, dma_skip_sync, dma_ops_bypass, state_synced, dma_coherent, of_node_reused, offline, offline_disabled) with flag-based accessors using bit operations - Reject devices with unregistered buses - Delete unused DEVICE_ATTR_PREALLOC() - Add low-level device attribute macros with const show/store callbacks, allowing device attributes to reside in read-only memory - Move core device attributes to read-only memory - Constify group array pointers in driver_add_groups() / driver_remove_groups(), struct bus_type, and struct device_driver device property: - Fix fwnode reference leak in fwnode_graph_get_endpoint_by_id() - Initialize all fields of fwnode_handle in fwnode_init() - Provide swnode_get()/swnode_put() wrappers around kobject_get/put() - Allow passing struct software_node_ref_args pointers directly to PROPERTY_ENTRY_REF() driver_override: - Migrate amba, cdx, vmbus, and rpmsg to the generic driver_override infrastructure, fixing a UAF from unsynchronized access to driver_override in bus match() callbacks - Remove the now-unused driver_set_override() firmware loader: - Fix recursive lock deadlock in device_cache_fw_images() when async work falls back to synchronous execution - Fix device reference leak in firmware_upload_register() platform: - Pass KBUILD_MODNAME through the platform driver registration macro to create module symlinks in sysfs for built-in drivers; move module_kset initialization to a pure_initcall and tegra cbb registration to core_initcall to ensure correct ordering - Pass THIS_MODULE implicitly through a coresight_init_driver() macro sysfs: - Upgrade OOB write detection in sysfs_kf_seq_show() from printk to WARN - Add return value clamping to sysfs_kf_read() Rust: - ACPI: Fix missing match data for PRP0001 by exporting acpi_of_match_device() - Auxiliary: Replace drvdata() with dedicated registration data on auxiliary_device. drvdata() exposed the driver's bus device private data beyond the driver's own scope, creating ordering constraints and forcing the data to outlive all registrations that access it. Registration data is instead scoped structurally to the Registration object, making lifecycle ordering enforced by construction rather than convention. - Rust-native device driver lifetimes (HRT): Allow Rust device drivers to carry a lifetime parameter on their bus device private data, tied to the device binding scope -- the interval during which a bus device is bound to a driver. Device resources like pci::Bar<'a> and IoMem<'a> can be stored directly in the driver's bus device private data with a lifetime bounded by the binding scope, so the compiler enforces at build time that they do not outlive the binding. This removes Devres indirection from every access site and eliminates try_access() failure paths in destructors. Bus driver traits use a Generic Associated Type (GAT) Data<'bound> to introduce the lifetime on the private data, rather than parameterizing the Driver trait itself. Auxiliary registration data, where the lifetime is not introduced by a trait callback but must be threaded through Registration, uses the ForLt trait (a type-level abstraction for types generic over a lifetime). Misc: - Fix DT overlayed devices not probing by reverting the broken treewide overlay fix and re-running fw_devlink consumer pickup when an overlay is applied to a bound device - Use root_device_register() for faux bus root device; add sanity check for failed bus init - Fix dev_has_sync_state() data race with READ_ONCE() and move it to base.h - Avoid spurious device_links warning when removing a device while its supplier is unbinding - Switch ISA bus to dynamic root device - Fix suspicious RCU usage in kernfs_put() - Remove devcoredump exit callback - Constify devfreq_event_class" * tag 'driver-core-7.2-rc1' of gitolite.kernel.org:pub/scm/linux/kernel/git/driver-core/driver-core: (81 commits) software node: allow passing reference args to PROPERTY_ENTRY_REF() driver core: platform: set mod_name in driver registration coresight: pass THIS_MODULE implicitly through a macro kernel: param: initialize module_kset in a pure_initcall soc/tegra: cbb: Move driver registration from pure_initcall to core_initcall firmware_loader: Fix recursive lock in device_cache_fw_images() driver core: Use system_percpu_wq instead of system_wq driver core: remove driver_set_override() rpmsg: use generic driver_override infrastructure Drivers: hv: vmbus: use generic driver_override infrastructure cdx: use generic driver_override infrastructure amba: use generic driver_override infrastructure rust: devres: add 'static bound to Devres<T> samples: rust: rust_driver_auxiliary: showcase lifetime-bound registration data rust: auxiliary: generalize Registration over ForLt rust: types: add `ForLt` trait for higher-ranked lifetime support gpu: nova-core: separate driver type from driver data samples: rust: rust_driver_pci: use HRT lifetime for Bar rust: io: make IoMem and ExclusiveIoMem lifetime-parameterized rust: pci: make Bar lifetime-parameterized ...
2026-06-15	Merge tag 'pm-7.2-rc1' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Over a half of the changes here are cpufreq updates that include core modifications, fixes of the old-style governors, new hardware support in drivers, assorded driver fixes and cleanups, and the removal of one driver (AMD Elan SC4). Apart from that, the intel_idle driver will now be able to avoid exposing redundant C-states if PC6 is disabled and there are new sysctl knobs for device suspend/resume watchdog timeouts, hibernation gets built-in LZ4 support for image compression and there is the usual collection of assorted fixes and cleanups. Specifics: - Fix a race between cpufreq suspend and CPU hotplug during system shutdown (Tianxiang Chen) - Avoid redundant target() calls for unchanged limits and fix a typo in a comment in the cpufreq core (Viresh Kumar) - Fix concurrency issues related to sysfs attributes access that affect cpufreq governors using the common governor code (Zhongqiu Han) - Simplify frequency limit handling in the conservative cpufreq governor (Lifeng Zheng) - Fix descriptions of the conservative governor freq_step tunable and the ondemand governor sampling_down_factor tunable in the cpufreq documentation (Pengjie Zhang) - Fix use-after-free and double free during _OSC evaluation in the PCC cpufreq driver (Yuho Choi) - Rework the handling of policy min and max frequency values in the cpufreq core to allow drivers to specify special initial values for the scaling_min_freq and scaling_max_freq sysfs attributes (Pierre Gondois) - Add cpufreq scaling support for Qualcomm Shikra SoC (Taniya Das, Imran Shaik). - Improve the warning message on HWP-disabled hybrid processors printed by the intel_pstate driver and sync policy->cur during CPU offline in it (Yohei Kojima, Fushuai Wang) - Drop cpufreq support for AMD Elan SC4 (Sean Young) - Minor fixes for cpufreq drivers (Krzysztof Kozlowski, Akashdeep Kaur, Hans Zhang, Guangshuo Li, Xueqin Luo) - Clean up dead dependencies on X86 in the cpufreq Kconfig (Julian Braha) - Allow the intel_idle driver to avoid exposing C-states that are redundant when PC6 is disabled (Artem Bityutskiy) - Fix memory leak and a potential race in the OPP core (Abdun Nihaal, Di Shen) - Mark Rust OPP methods as inline (Nicolás Antinori) - Fix misc device registration failure path in the PM QoS core (Yuho Choi) - Add sysctl interface for DPM watchdog timeouts (Tzung-Bi Shih) - Use complete() instead of complete_all() in device_pm_sleep_init() to avoid a false-positive warning from lockdep_assert_RT_in_threaded_ctx() when CONFIG_PROVE_RAW_LOCK_NESTING is enabled (Jiakai Xu) - Use a flexible array for CRC uncompressed buffers during hibernation image saving (Rosen Penev) - Make the LZ4 algorithm available for hibernation compression (l1rox3) - Move the preallocate_image() call during hibernation after the "prepare" phase of the "freeze" transition (Matthew Leach) - Fix a memory leak in rapl_add_package_cpuslocked() in the intel_rapl power capping driver and use sysfs_emit() in cpumask_show() in that driver (Sumeet Pawnikar, Yury Norov) - Fix ValueError when parsing incomplete device properties in the pm-graph utility (Gongwei Li)" * tag 'pm-7.2-rc1' of gitolite.kernel.org:pub/scm/linux/kernel/git/rafael/linux-pm: (40 commits) PM: dpm_watchdog: Add sysctl interface for DPM watchdog timeouts PM: QoS: Fix misc device registration unwind cpufreq: Use policy->min/max init as QoS request cpufreq: Remove driver default policy->min/max init cpufreq: Set default policy->min/max values for all drivers cpufreq: Extract cpufreq_policy_init_qos() function cpufreq: Documentation: fix conservative governor freq_step description cpufreq: ti: Add EPROBE_DEFER for K3 SoCs cpufreq: qcom: Add cpufreq scaling support for Qualcomm Shikra SoC dt-bindings: cpufreq: Document Qualcomm Shikra SoC EPSS powercap: intel_rapl: Use sysfs_emit() in cpumask_show() cpufreq: governor: Fix stale prev_cpu_nice spike when enabling ignore_nice_load cpufreq: governor: Fix data races on per-CPU idle/nice baselines PM: hibernate: Use flexible array for CRC uncompressed buffers powercap: intel_rapl: Fix memory leak in rapl_add_package_cpuslocked() PM: hibernate: make LZ4 available for hibernation compression PM: sleep: Use complete() in device_pm_sleep_init() opp: rust: mark OPP methods as inline cpufreq: intel_pstate: Improve warning message on HWP-disabled hybrid CPUs cpufreq: elanfreq: Drop support for AMD Elan SC4* ...
2026-06-15	Merge tag 'rcu.release.v7.2' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux Pull RCU updates from Uladzislau Rezki: "Torture test updates: - Improve kvm-series.sh script by adding examples in its header comment - Lazy RCU is more fully tested now by replacing call_rcu_hurry() with call_rcu() and doing rcu_barrier() to motivate lazy callbacks during a stutter pause - Add more synonyms for the "--do-normal" group of torture.sh command-line arguments Misc changes: - Reduce stack usage of nocb_gp_wait() to address frame size warning when built with CONFIG_UBSAN_ALIGNMENT - The synchronize_rcu() call can detect the flood and latches a normal/default path temporary switching to wait_rcu_gp() path - Document using rcu_access_pointer() to fetch the old pointer for lockless cmpxchg() updates - Simplify some RCU code using clamp_val() - Fix a kerneldoc header comment typo in srcu_down_read_fast()" * tag 'rcu.release.v7.2' of gitolite.kernel.org:pub/scm/linux/kernel/git/rcu/linux: rcu/nocb: reduce stack usage in nocb_gp_wait() rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks() rcu: Latch normal synchronize_rcu() path on flood rcu: Document rcu_access_pointer() feeding into cmpxchg() rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val() rcu: Simplify rcu_do_batch() by applying clamp() checkpatch: Undeprecate rcu_read_lock_trace() and rcu_read_unlock_trace() srcu: Fix kerneldoc header comment typo in srcu_down_read_fast() torture: Allow "norm" abbreviation for "normal" torture: Improve kvm-series.sh header comment torture: Add torture_sched_set_normal() for user-specified nice values rcutorture: Fully test lazy RCU
2026-06-14	bpf: Add support to specify uprobe_multi target via file descriptor	Jiri Olsa
	Allow uprobe_multi link to identify the target binary by an already opened file descriptor. Adding new BPF_F_UPROBE_MULTI_PATH_FD flag and the path_fd field for the attr.link_create.uprobe_multi struct. When the flag is set, we resolve the target from path_fd, without the flag, we keep the existing string path behavior. I don't see a use case for supporting O_PATH file descriptors, because we need to read the binary first to get probes offsets, so I'm using the CLASS(fd, f), which fails for O_PATH fds. Assisted-by: Codex:GPT-5.4 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14	bpf: Use user_path_at for path resolution in uprobe_multi	Jiri Olsa
	Resolve the uprobe_multi user path with user_path_at() instead of copying the string with strndup_user() and passing it to kern_path(). This removes the temporary allocation and keeps the lookup logic in one helper. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-14	bpf: Guard __get_user acesss with access_ok for uprobe_multi data	Jiri Olsa
	As reported by sashiko [1] we need to use access_ok to check the user space data bounds before we use __get-user to get it. [1] https://lore.kernel.org/bpf/20260610145235.CB1441F00893@smtp.kernel.org/ Fixes: 0b779b61f651 ("bpf: Add cookies support for uprobe_multi link") Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-15	Merge tag 'vfs-7.2-rc1.procfs' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs updates from Christian Brauner: - Revamp fs/filesystems.c The file was a mess with a hand-rolled linked list in desperate need of a cleanup. The filesystems list is now RCU-ified, /proc files can be marked permanent from outside fs/proc/, and the string emitted when reading /proc/filesystems is pre-generated and cached instead of pointer-chasing and printfing entry by entry on every read. The file is read frequently because libselinux reads it and is linked into numerous frequently used programs (even ones you would not suspect, like sed!). Scalability also improves since reference maintenance on open/close is bypassed. open+read+close cycle single-threaded (ops/s): before: 442732 after: 1063462 (+140%) open+read+close cycle with 20 processes (ops/s): before: 606177 after: 3300576 (+444%) A follow-up patch adds missing unlocks in some corner cases and tidies things up. - Relax the mount visibility check for subset=pid mounts When procfs is mounted with subset=pid, all static files become unavailable and only the dynamic pid information is accessible. In that case there is no point in imposing the full mount visibility restrictions on the mounter - everything that can be hidden in procfs is already inaccessible. These restrictions prevented procfs from being mounted inside rootless containers since almost all container implementations overmount parts of procfs to hide certain directories. As part of this /proc/self/net is only shown in subset=pid mounts for CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the SB_I_USERNS_VISIBLE superblock flag is replaced with an FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are recorded in a list, and the mount restrictions are finally documented. - Protect ptrace_may_access() with exec_update_lock in procfs Most uses of ptrace_may_access() in procfs should hold exec_update_lock to avoid TOCTOU issues with concurrent privileged execve() (like setuid binary execution). This fixes the easy cases - the owner and visibility checks and the FD link permission checks - with the gnarlier ones to follow later. * tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix ups and tidy ups to /proc/filesystems caching proc: protect ptrace_may_access() with exec_update_lock (FD links) proc: protect ptrace_may_access() with exec_update_lock (part 1) docs: proc: add documentation about mount restrictions proc: handle subset=pid separately in userns visibility checks proc: prevent reconfiguring subset=pid proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN fs: cache the string generated by reading /proc/filesystems sysfs: remove trivial sysfs_get_tree() wrapper fs: RCU-ify filesystems list fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED proc: allow to mark /proc files permanent outside of fs/proc/ namespace: record fully visible mounts in list
2026-06-15	Merge tag 'vfs-7.2-rc1.xattr' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull simple_xattr updates from Christian Brauner: "This reworks the simple xattr api to make it more efficient and easier to use for all consumers. The simple_xattr hash table moves from the inode into a per-superblock cache, removing the per-inode overhead for the common case of few or no xattrs. The interface now passes struct simple_xattrs ** so lazy allocation is handled internally instead of by every caller, kernfs xattr operations on kernfs nodes shared between multiple superblocks are properly serialized, and tmpfs constructs "security.foo" xattr names with kasprintf() instead of kmalloc() plus two memcpy()s. A follow-up fix links kernfs nodes to their parent before the LSM init hook runs: with the per-sb cache kernfs_xattr_set() computes the cache via kernfs_root(kn), which faulted on a freshly allocated node when selinux_kernfs_init_security() called into it - reproducible as a NULL pointer dereference on the first cgroup mkdir on SELinux-enabled systems. On top of this bpffs gains support for trusted.* and security.* xattrs so that user space and BPF LSM programs can attach metadata - for example a content hash or a security label - to pinned objects and directories and inspect it uniformly like on other filesystems. The store is in-memory and non-persistent, living only for the lifetime of the mount like everything else in bpffs" * tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bpf: Add simple xattr support to bpffs kernfs: link kn to its parent before the LSM init hook simpe_xattr: use per-sb cache simple_xattr: change interface to pass struct simple_xattrs ** tmpfs: simplify constructing "security.foo" xattr names kernfs: fix xattr race condition with multiple superblocks
2026-06-15	Merge tag 'kernel-7.2-rc1.task_exec_state' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull task_exec_state updates from Christian Brauner: "This introduces a new per-task task_exec_state structure and relocates the dumpable mode and the user namespace captured at execve() from mm_struct onto it. It stays attached to the task for its full lifetime. __ptrace_may_access() and several /proc owner and visibility checks need to consult two pieces of state for any observable task, including zombies that have already gone through exit_mm(): the dumpable mode and the user namespace captured at execve(). Both live on mm_struct today, which exit_mm() clears from the task long before the task is reaped. A reader that races with do_exit() observes task->mm == NULL and either fails the check or falls back to init_user_ns - which denies legitimate access to non-dumpable zombies that were running in a nested user namespace. mm_struct loses ->user_ns and the dumpability bits in ->flags. MMF_DUMPABLE_BITS is reserved so the MMF_DUMP_FILTER_* layout exposed via /proc/<pid>/coredump_filter stays stable. task->user_dumpable and its exit_mm() snapshot are removed. task_exec_state is the privilege domain established by an execve(). Within a thread group it is shared via refcount; across thread groups each task has its own: - CLONE_VM siblings (thread-group members, io_uring workers) refcount-share the parent's exec_state. - Non-CLONE_VM clones (fork(), vfork() without CLONE_VM) allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns. - execve() in the child allocates a fresh instance and installs it under task_lock + exec_update_lock via task_exec_state_replace(). - Credential changes (setresuid, capset, ...) and prctl(PR_SET_DUMPABLE) update dumpability on the current task's exec_state, i.e., on the thread group's shared instance. On top of this exec_mmap() no longer tears down the old mm while holding exec_update_lock for writing and cred_guard_mutex. Neither lock is needed for that: exec_update_lock only exists to make the mm swap atomic with the later commit_creds() and all its readers operate on the new mm; none looks at the detached old mm. The cost was real: __mmput() runs exit_mmap() over the entire old address space and can block in exit_aio() waiting for in-flight AIO, so execve() of a large process blocked ptrace_attach() and every exec_update_lock reader for the duration of the teardown. The old mm is now stashed in bprm->old_mm and released from setup_new_exec() after both locks are dropped, with a backstop in free_bprm() for the error paths" * tag 'kernel-7.2-rc1.task_exec_state' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exec: free the old mm outside the exec locks exec_state: relocate dumpable information ptrace: add ptracer_access_allowed() exec: introduce struct task_exec_state sched/coredump: introduce enum task_dumpable
2026-06-14	bpf: Raise maximum call chain depth to 16 frames	Alexei Starovoitov
	Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains that Rust-BPF requires and update selftests. Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-13	posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error path	WenTao Liang
	In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference via get_pid() and stores it in timer.it.cpu.pid. If the subsequent posix_cpu_timer_set() call fails, the function returns immediately without calling posix_cpu_timer_del() to release the pid reference, causing a leak. Fix it by calling posix_cpu_timer_del() before the unlock-and-return on the error path, consistent with the other exit paths in the same function. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: WenTao Liang <vulab@iscas.ac.cn> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn
2026-06-13	time/jiffies: Register jiffies clocksource before usage	Thomas Gleixner
	Teddy reported that a XEN HVM has a long boot delay, which was bisected to the recent enhancements to the negative motion detection. It turned out that the jiffies clocksource is used in early boot before it is registered, which leaves the max_delta_raw field at zero. That causes the read out to be clamped to the max delta of 0, which means time is not making progress. Cure it by ensuring that it is initialized before its first usage in timekeeping_init(). Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Teddy Astie <teddy.astie@vates.tech> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13 Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
2026-06-12	bpf: Fix setting retval to -EPERM for cgroup hooks not returning errno	Xu Kuohai
	When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets the hook return value to -EPERM if it is not a valid errno. This is correct for errno-based hooks, which return 0 on success and negative errno on failure, but wrong for boolean and void LSM hooks. Boolean LSM hooks should only return true or false, and void LSM hooks have no return value at all. Fix it by skipping setting -EPERM for hooks not returning errno. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-12	bpf: Run generic devmap egress prog on private skb	Sun Jian
	Generic XDP devmap multi redirect uses skb_clone() for intermediate destinations and sends the last destination with the original skb. This can leave multiple destinations sharing the same packet data. This becomes visible after generic devmap egress-program support was added: a devmap egress program may mutate packet data, and another destination sharing the same data can observe that mutation. Native XDP broadcast redirect does not have this issue because xdpf_clone() copies the frame data for each destination. Generic XDP should provide the same per-destination isolation before running a devmap egress program. Fix this by making cloned skbs private before running the generic devmap egress program. Use skb_copy() instead of skb_unshare() so allocation failure does not consume the skb and the existing caller error paths keep their ownership semantics. Fixes: 2ea5eabaf04a ("bpf: devmap: Implement devmap prog execution for generic XDP") Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Link: https://lore.kernel.org/r/20260612114032.244616-2-sun.jian.kdev@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-11	Merge tag 'dma-mapping-7.1-2026-06-11' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "Three more fixes for the DMA-mapping code, related to PCI P2PDMA, DMA debug and DMA link ranges API (Li RongQing and Jason Gunthorpe)" * tag 'dma-mapping-7.1-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: iommu/dma: Do not try to iommu_map a 0 length region in swiotlb dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segments
2026-06-11	Merge branches 'pm-sleep', 'pm-powercap' and 'pm-tools'	Rafael J. Wysocki
	Merge updates related to system sleep support, two updates of the intel_rapl power capping driver, and a pm-graph utility fix for 7.2-rc1: - Add sysctl interface for DPM watchdog timeouts (Tzung-Bi Shih) - Use complete() instead of complete_all() in device_pm_sleep_init() to avoid a false-positive warning from lockdep_assert_RT_in_threaded_ctx() when CONFIG_PROVE_RAW_LOCK_NESTING is enabled (Jiakai Xu) - Use a flexible array for CRC uncompressed buffers during hibernation image saving (Rosen Penev) - Make the LZ4 algorithm available for hibernation compression (l1rox3) - Move the preallocate_image() call during hibernation after the "prepare" phase of the "freeze" transition (Matthew Leach) - Fix a memory leak in rapl_add_package_cpuslocked() in the intel_rapl power capping driver and use sysfs_emit() in cpumask_show() in that driver (Sumeet Pawnikar, Yury Norov) - Fix ValueError when parsing incomplete device properties in the pm-graph utility (Gongwei Li) * pm-sleep: PM: dpm_watchdog: Add sysctl interface for DPM watchdog timeouts PM: hibernate: Use flexible array for CRC uncompressed buffers PM: hibernate: make LZ4 available for hibernation compression PM: sleep: Use complete() in device_pm_sleep_init() PM: hibernate: call preallocate_image() after freeze prepare * pm-powercap: powercap: intel_rapl: Use sysfs_emit() in cpumask_show() powercap: intel_rapl: Fix memory leak in rapl_add_package_cpuslocked() * pm-tools: PM: tools: pm-graph: fix ValueError when parsing incomplete device properties
2026-06-11	Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-qos'	Rafael J. Wysocki
	Merge cpuidle updates, OPP (operating performance points) updates and a PM QoS update for 7.2-rc1: - Allow the intel_idle driver to avoid exposing C-states that are redundant when PC6 is disabled (Artem Bityutskiy) - Fix memory leak and a potential race in the OPP core (Abdun Nihaal, Di Shen) - Mark Rust OPP methods as inline (Nicolás Antinori) - Fix misc device registration failure path in the PM QoS core (Yuho Choi) * pm-cpuidle: intel_idle: Drop C-states redundant when PC6 is disabled intel_idle: Introduce a helper for checking PC6 intel_idle: Add constants for MSR_PKG_CST_CONFIG_CONTROL * pm-opp: opp: rust: mark OPP methods as inline OPP: of: Fix potential memory leak in opp_parse_supplies() OPP: Fix race between OPP addition and lookup * pm-qos: PM: QoS: Fix misc device registration unwind
2026-06-11	locking: Add contended_release tracepoint to sleepable locks	Dmitry Ilvokhin
	Add the contended_release trace event. This tracepoint fires on the holder side when a contended lock is released, complementing the existing contention_begin/contention_end tracepoints which fire on the waiter side. This enables correlating lock hold time under contention with waiter events by lock address. Add trace_contended_release()/trace_call__contended_release() calls to the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and RT-specific rwbase locks. Where possible, trace_contended_release() fires before the lock is released and before the waiter is woken. For some lock types, the tracepoint fires after the release but before the wake. Making the placement consistent across all lock types is not worth the added complexity. For reader/writer locks, the tracepoint fires for every reader releasing while a writer is waiting, not only for the last reader. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com
2026-06-11	locking/percpu-rwsem: Extract __percpu_up_read()	Dmitry Ilvokhin
	Move the percpu_up_read() slowpath out of the inline function into a new __percpu_up_read() to avoid binary size increase from adding a tracepoint to an inlined function. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com
2026-06-11	futex: Optimize futex hash bucket access patterns	Peter Zijlstra
	Breno reported significant c2c HITM in a futex hash heavy workload. It turns out that the hash bucket to private hash table reverse pointer (futex_hash_bucket::priv) was to blame. Notably when the hash buckets are heavily contended, the: 'fph = bh->priv;' load in futex_hash() will typically miss and consequently become quite expensive. Since this load in particular is quite superfluous, removing it is fairly straight forward. However, removing it does not in fact achieve anything much. The pain moves to the next user, notably: futex_hash_put(). Therefore rework the whole private hash refcounting to avoid needing this back pointer (and removing it). Instead of passing around 'struct futex_hash_bucket hb', pass around a new structure that contains it and the related 'struct futex_private_hash fph' pointer in tandem. Funnily this turns out to remove more code than it adds and significantly improves futex hash performance (as measured by 'perf bench futex hash'): SKL dual socket 112 threads: Baseline Patched shared (16k) 1571857 1641435 + 4.4% autosize (512) 646390 903371 +39.7% -b 256 464395 587014 +26.4% -b 512 715687 995943 +39.2% -b 1024 995085 1396328 +40.3% -b 2048 1293114 1668395 +29.0% -b 4096 2124438 2240228 + 5.5% Zen3 dual socket 256 threads: Baseline Patched shared (16k) 1275840 1381279 + 8.2% autosize (512) 1252745 1482179 +18.3% -b 256 856274 955455 +11.5% -b 512 1267490 1544010 +21.8% -b 1024 1424013 1625424 +14.1% -b 2048 1505181 1669342 +10.9% -b 4096 1465993 1688932 +15.2% AMD EPYC 9D64 (Zen4, single socket) 176 threads: Baseline Patched Delta shared (16k) 1,230,599 1,368,655 +11.2% autosize (1024) 1,285,440 1,556,946 +21.1% -b 256 1,341,471 1,520,303 +13.3% -b 512 1,438,330 1,599,319 +11.2% -b 1024 1,443,772 1,622,493 +12.4% -b 2048 1,472,108 1,643,975 +11.7% -b 4096 1,333,098 1,570,897 +17.8% Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Breno Leitao <leitao@debian.org> Tested-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260610135510.GB1430057@noisy.programming.kicks-ass.net
2026-06-11	sched/fair: Fix newidle vs core-sched	Aaron Lu
	While testing Prateek's throttle series, I noticed a panic issue when coresched is enabled and bisected to this patch. I fed the panic log and this patch to an agent and its analysis looks correct to me(cpu56 and cpu57 are siblings in a VM): cpu57 (holds core-wide lock) pick_next_task() [core scheduling] for_each_cpu_wrap(i, smt_mask, 57): i=57: pick_task(rq_57) pick_task_fair(rq_57) -> picks task A rq_57->core_pick = task A // task_rq(A) == rq_57 i=56: pick_task(rq_56) pick_task_fair(rq_56) cfs_rq->nr_queued == 0 goto idle sched_balance_newidle(rq_56) raw_spin_rq_unlock(rq_56) // core-wide lock released newidle_balance() pulls task A: rq_57 -> rq_56 // task_rq(A) == rq_56 now raw_spin_rq_lock(rq_56) // core-wide lock re-acquired return > 0 goto again pick_task_fair(rq_56) -> picks task A rq_56->core_pick = task A // first loop done // rq_57->core_pick is still task A (set before lock release) // but task_rq(A) == rq_56 now next = rq_57->core_pick // = task A put_prev_set_next_task(rq_57, prev, task A) __set_next_task_fair(rq_57, task A) hrtick_start_fair(rq_57, task A) WARN_ON_ONCE(task_rq(task A) != rq_57) // task_rq(A) == rq_56 IOW: by allowing pick_task_fair() to do newidle_balance and not returning RETRY_TASK, it can end up selecting the same task on two CPUs. Restore the previous state by never doing newidle when core scheduling is enabled. Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: "Aaron Lu" <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260603095108.GA1684319@bytedance.com
2026-06-10	bpf: Tighten cgroup storage cookie checks for prog arrays	Daniel Borkmann
	The fix in commit abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") is still incomplete. The prog-array compatibility check treats a program with no cgroup storage as compatible with any stored storage cookie. This allows a storage-less program to bridge a tail call chain between an entry program and a storage-using callee even though cgroup local storage at runtime still follows the caller's context, that is, A -> B(no storage) -> C(storage) path. Requiring exact cookie equality would break the legitimate case of a storage-less leaf program being tail called from a storage-using one. Instead, only accept a zero storage cookie if the program cannot perform tail calls itself. This keeps A -> B(no storage) working while rejecting the A -> B(no storage) -> C(storage) bridge. Fixes: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") Reported-by: Lin Ma <malin89@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260610105539.705887-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-10	vhost_task_create: kill unnecessary .exit_signal initialization	Oleg Nesterov
	The only reason for this janitorial change is that this initialization adds unnecessary noise to "git grep exit_signal". args.exit_signal has no effect with CLONE_THREAD, not to mention it is zero-initialized by the compiler anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <acAIm732QPFZs15C@redhat.com>
2026-06-09	bpf: Cancel special fields on map value recycle	Justin Suess
	Map update and delete paths currently call bpf_obj_free_fields() when a value is being replaced or recycled. That makes field destruction depend on the context of the update/delete operation. For tracing programs this can include NMI context, where referenced kptr destructors, uptr unpinning, and graph root destruction are not generally safe. Introduce bpf_obj_cancel_fields() for the reusable-value path. It only performs NMI-safe cleanup for timer, workqueue, and task_work fields. Fields that need full destruction are left attached to the recycled value and are destroyed by the final cleanup path instead. Switch array and hashtab update/delete/recycle paths to this cancel helper. Keep bpf_obj_free_fields() for final map destruction and for bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator destructors, so teardown continues to walk the normal and extra elements and fully destroy their fields. This deliberately relaxes the eager-free semantics of map update/delete for special fields. Programs that relied on a recycled map slot becoming empty immediately after update/delete were relying on behavior that cannot be implemented safely from every BPF execution context without offloading arbitrary destructors. There is a chance this change breaks programs making assumptions regarding the eager freeing of fields. If so, we can relax semantics to cancellation only when irqs_disabled() is true in the future. However, theoretically, map values that get reused eagerly already have weaker guarantees as parallel users can recreate freed fields before the new element becomes visible again. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09	bpf: Reject bpf_obj_drop() from tracing progs	Justin Suess
	bpf_obj_drop() runs bpf_obj_free_fields() synchronously for program-allocated objects. When such an object contains NMI unsafe fields, tracing programs that can run from arbitrary instrumented context can reach that destruction from unsafe contexts, including NMI. NMI is likely one instance of this problem, and other instances would include possible unsafe reentrancy. Deferring bpf_obj_drop() is not appealing either: it would add delayed-free machinery to a release operation that otherwise has straightforward synchronous ownership semantics. Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs that may run from unsafe contexts unless every field in the object's BTF record is explicitly NMI safe. Do not reject sleepable BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI contexts that motivate the restriction. Note that while bpf_rb_root and bpf_list_head would be NMI safe on their own to free, the objects recursively held by them may not be; be conservative and just mark them as not NMI safe for now. Use a whitelist for the NMI-safe field set instead of listing only known NMI unsafe fields. Locks, async fields, unreferenced kptrs, and refcounts are known to be NMI safe because their destruction is either a no-op, simple state reset, or async cancellation. Referenced kptrs, percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future field type are rejected until audited for arbitrary tracing and NMI contexts. This is less susceptible to future changes in fields that were previously safe by exclusion, and to new fields being added without updating this check. Convert the existing recursive local-object drop success case to a syscall program in the same commit, since this verifier change makes the old tracing program form invalid. The test still exercises bpf_obj_drop() releasing a referenced task kptr from a safe program type. Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09	Merge tag 'trace-rv-v7.1-rc6-2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier fixes from Steven Rostedt: - Fix reset ordering on per-task destruction Reset the task before dropping the slot instead of after, which was causing out-of-bound memory accesses. - Fix HA monitor synchronization and cleanup Ensure synchronous cleanup for HA monitors by running timer callbacks in RCU read-side critical sections and using synchronize_rcu() during destruction. - Avoid armed timers after tasks exit Add automatic cleanup for per-task HA monitors to prevent timers from firing after task exit. - Fix memory ordering for DA/HA monitors Fix race conditions during monitor start by using release-acquire semantics for the monitoring flag. - Fix initialization for DA/HA monitors Ensure monitors are not initialized relying on potentially corrupted state like the monitoring flag, that is not reset by all monitors type and may have an unknown state in monitors reusing the storage (per-task). - Fix memory safety in per-task and per-object monitors Prevent use-after-free and out-of-bounds access by synchronizing with in-flight tracepoint probes using tracepoint_synchronize_unregister() before freeing monitor storage or releasing task slots. - Adjust monitors for preemptible tracepoints Fix monitors that relied on tracepoints disabling preemption. Explicitly disable task migration when per-CPU monitors handle events to avoid accessing the wrong state and update the opid monitor logic. - Fix incorrect __user specifier usage Remove __user from a non-pointer variable in the extract_params() helper. - Fix bugs in the rv tool Ensure strings are NUL-terminated, fix substring matching in monitor searches, and improve cleanup and exit status handling. - Fix several bugs in rvgen Fix LTL literal stringification, subparsers' options handling, and suffix stripping in dot2k. * tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: verification/rvgen: Fix ltl2k writing True as a literal verification/rvgen: Fix options shared among commands verification/rvgen: Fix suffix strip in dot2k tools/rv: Fix cleanup after failed trace setup tools/rv: Fix substring match when listing container monitors tools/rv: Fix substring match bug in monitor name search tools/rv: Ensure monitor name and desc are NUL-terminated rv: Use 0 to check preemption enabled in opid rv: Prevent task migration while handling per-CPU events rv: Ensure synchronous cleanup for HA monitors rv: Add automatic cleanup handlers for per-task HA monitors rv: Do not rely on clean monitor when initialising HA rv: Fix monitor start ordering and memory ordering for monitoring flag rv: Ensure all pending probes terminate on per-obj monitor destroy rv: Prevent in-flight per-task handlers from using invalid slots rv: Reset per-task DA monitors before releasing the slot rv: Fix __user specifier usage in extract_params()
2026-06-09	bpf: Allow sleepable programs to use LPM trie maps directly	Vlad Poenaru
	The previous change relaxed the rcu_dereference annotations in lpm_trie.c so the trie walks no longer trip lockdep when reached from a sleepable BPF program holding only rcu_read_lock_trace(). By itself that only helps tries reached as the inner map of a map-of-maps, or from the classic-RCU syscall path: a sleepable program that references an LPM trie directly is still rejected at load time by check_map_prog_compatibility(), whose sleepable whitelist omits BPF_MAP_TYPE_LPM_TRIE: Sleepable programs can only use array, hash, ringbuf and local storage maps LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period into a Tasks Trace grace period before the node -- and the value embedded in it that trie_lookup_elem() returns to the program -- is released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies on for sleepable access, so a value handed to a sleepable reader cannot be freed while the program is still running under rcu_read_lock_trace(). The writer paths take trie->lock across the walk and never relied on the RCU read-side lock to keep nodes alive. Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these programs can use LPM tries directly. Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-3-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09	bpf: Allow LPM map access from sleepable BPF programs	Vlad Poenaru
	trie_lookup_elem() annotates its rcu_dereference_check() walks with only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c) resolves to "c \|\| rcu_read_lock_held()", this passes for XDP/NAPI and classic RCU readers but fails for sleepable BPF programs, which enter via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace(). trie_update_elem() and trie_delete_elem() have the same problem in a different form: they walk the trie with plain rcu_dereference(), which asserts rcu_read_lock_held() unconditionally. Both are reachable from sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem helpers, and from the syscall path under classic rcu_read_lock(). In the writer paths the trie is actually protected by trie->lock (an rqspinlock taken across the walk); we never relied on the RCU read-side lock to keep nodes alive there. A sleepable LSM hook that ends up touching an LPM trie therefore triggers lockdep on debug kernels: ============================= WARNING: suspicious RCU usage 7.1.0-... Tainted: G E ----------------------------- kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage! 1 lock held by net_tests/540: #0: (rcu_tasks_trace_srcu_struct){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x26/0x280 Call Trace: dump_stack_lvl lockdep_rcu_suspicious trie_lookup_elem bpf_prog_..._enforce_security_socket_connect bpf_trampoline_... security_socket_connect __sys_connect do_syscall_64 This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize against the trie's reclaim path -- but it spams the console once per distinct callsite on every debug kernel running a sleepable BPF LSM that touches an LPM trie, which is increasingly common. For the lookup path, switch the rcu_dereference_check() annotation from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all three contexts (classic, BH, Tasks Trace). Other map types already follow this convention. For trie_update_elem() and trie_delete_elem(), annotate the walks as rcu_dereference_protected(*p, 1) -- matching trie_free() in the same file -- since trie->lock is held across the walk. rqspinlock has no lockdep_map, so the predicate degenerates to '1' rather than lockdep_is_held(&trie->lock); the protection is real but not machine-verifiable. trie_get_next_key() also uses bare rcu_dereference() but is reachable only from the BPF syscall, which holds classic rcu_read_lock() before dispatching, so it is left untouched. Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") Cc: stable@vger.kernel.org Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-2-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09	PM: QoS: Fix misc device registration unwind	Yuho Choi
	cpu_latency_qos_init() registers cpu_dma_latency first and, when CONFIG_PM_QOS_CPU_SYSTEM_WAKEUP is enabled, registers cpu_wakeup_latency afterwards. The second registration overwrites the first return value. As a result, a failure to register cpu_dma_latency can be masked if the second registration succeeds. Conversely, if cpu_dma_latency succeeds and cpu_wakeup_latency fails, the function returns an error while leaving the first misc device registered. Return immediately on the first registration failure and deregister cpu_dma_latency if the second registration fails. Fixes: a4e6512a79d8 ("PM: QoS: Introduce a CPU system wakeup QoS limit") Signed-off-by: Yuho Choi <dbgh9129@gmail.com> Link: https://patch.msgid.link/20260608170748.82273-1-dbgh9129@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-06-09	bpf: Enforce write checks for BTF pointer helper access	Nuoqi Gui
	check_mem_reg() verifies both read and write access for global subprogram memory arguments. When the caller register is PTR_TO_BTF_ID, check_helper_mem_access() currently forwards the access to check_ptr_to_btf_access() as BPF_READ regardless of the requested access type. This lets a BTF-backed kernel object field pointer pass the caller-side writable memory check for a global subprogram argument. The callee is then validated with a generic writable PTR_TO_MEM argument and can store through it, even though an equivalent direct BTF field store is rejected with "only read is supported". Forward the requested access type to check_ptr_to_btf_access(). This enforces existing BTF write restrictions for global subprogram memory arguments as well. Fixes: 3e30be4288b3 ("bpf: Allow helpers access trusted PTR_TO_BTF_ID.") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-1-f449cd970669@mails.tsinghua.edu.cn Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-09	timers/migration: Temporarily disable per capacity hierarchies	Frederic Weisbecker
	Some workloads with different CPU capacities consume more power with timer migration than before. The recently introduced per capacity hierarchies were supposed to alleviate this problem. However it appears to also regress other types of workloads, especially when plenty of capacities live together in the same machine. Disable the feature until a reasonable solution is found. Fixes: 098cbaad8e57 ("timers/migration: Split per-capacity hierarchies") Reported-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260609123356.28449-1-frederic@kernel.org Closes: https://lore.kernel.org/all/3b79338f-6cfc-4722-8062-9103db2c8ad1@arm.com
2026-06-09	bpf: Validate BTF repeated field counts before expansion	Paul Moses
	btf_parse_struct_metas() walks user-supplied BTF during BPF_BTF_LOAD, and btf_repeat_fields() expands repeatable fields from array elements into the fixed BTF_FIELDS_MAX scratch array used by btf_parse_fields(). The remaining-capacity check performs the expanded field count calculation in u32. A malformed BTF can wrap that calculation, causing the check to pass even when the expanded field count exceeds the scratch array capacity. The following memcpy() can then write past the end of the array. Use checked addition and multiplication before copying repeated fields and reject impossible counts. Fixes: 797d73ee232d ("bpf: Check the remaining info_cnt before repeating btf fields") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260605234301.1109063-1-p@1g4.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-09	sched/deadline: Use task_on_rq_migrating() helper	Liang Luo
	Replace the open-coded "p->on_rq == TASK_ON_RQ_MIGRATING" comparisons in enqueue_task_dl() and dequeue_task_dl() with the existing task_on_rq_migrating() helper, consistent with the rest of the scheduler code. No functional change. Signed-off-by: Liang Luo <luoliang@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260608075500.387271-1-luoliang@kylinos.cn