linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
2026-04-10	sched_ext: Track @p's rq lock across set_cpus_allowed_scx -> ops.set_cpumask	Tejun Heo
	The SCX_CALL_OP_TASK call site passes rq=NULL incorrectly, leaving scx_locked_rq() unset. Pass task_rq(p) instead so update_locked_rq() reflects reality. v2: Add Fixes: tag (Andrea Righi). Fixes: 18853ba782be ("sched_ext: Track currently locked rq") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10	sched_ext: Add select_cpu kfuncs to scx_kfunc_ids_unlocked	Tejun Heo
	select_cpu_from_kfunc() has an extra scx_kf_allowed_if_unlocked() branch that accepts calls from unlocked contexts and takes task_rq_lock() itself - a "callable from unlocked" property encoded in the kfunc body rather than in set membership. That's fine while the runtime check is the authoritative gate, but the upcoming verifier-time filter uses set membership as the source of truth and needs it to reflect every context the kfunc may be called from. Add the three select_cpu kfuncs to scx_kfunc_ids_unlocked so their full set of callable contexts is captured by set membership. This follows the existing dual-set convention used by scx_bpf_dsq_move{,_vtime} and scx_bpf_dsq_move_set_{slice,vtime}, which are members of both scx_kfunc_ids_dispatch and scx_kfunc_ids_unlocked. While at it, add brief comments on each duplicate BTF_ID_FLAGS block (including the pre-existing dsq_move ones) explaining the dual membership. No runtime behavior change: the runtime check in select_cpu_from_kfunc() remains the authoritative gate until it is removed along with the rest of the scx_kf_mask enforcement in a follow-up. v2: Clarify dispatch-set comment to name scx_bpf_dsq_move*() explicitly so it doesn't appear to cover scx_bpf_sub_dispatch() (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10	sched_ext: Drop TRACING access to select_cpu kfuncs	Tejun Heo
	The select_cpu kfuncs - scx_bpf_select_cpu_dfl(), scx_bpf_select_cpu_and() and __scx_bpf_select_cpu_and() - take task_rq_lock() internally. Exposing them via scx_kfunc_set_idle to BPF_PROG_TYPE_TRACING is unsafe: arbitrary tracing contexts (kprobes, tracepoints, fentry, LSM) may run with @p's pi_lock state unknown. Move them out of scx_kfunc_ids_idle into a new scx_kfunc_ids_select_cpu set registered only for STRUCT_OPS and SYSCALL. Extracted from a larger verifier-time kfunc context filter patch originally written by Juntong Deng. Original-patch-by: Juntong Deng <juntong.deng@outlook.com> Cc: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-10	sched/eevdf: Clear buddies for preempt_short	Vincent Guittot
	next buddy should not prevent shorter slice preemption. Don't take buddy into account when checking if shorter slice entity can preempt and clear it if the entity with a shorter slice can preempt current. Test on snapdragon rb5: hackbench -T -p -l 16000000 -g 2 1> /dev/null & hackbench runs in cgroup /test-A cyclictest -t 1 -i 2777 -D 63 --policy=fair --mlock -h 20000 -q cyclictest runs in cgroup /test-B tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples \| 22679 22595 22686 Average (us) \| 84 94(-12%) 59( 37%) Median (P50) (us) \| 56 56( 0%) 56( 0%) 90th Percentile (us) \| 64 65(- 2%) 63( 3%) 99th Percentile (us) \| 1047 1273(-22%) 74( 94%) 99.9th Percentile (us) \| 2431 4751(-95%) 663( 86%) Maximum (us) \| 4694 8655(-84%) 3934( 55%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260410132321.2897789-1-vincent.guittot@linaro.org
2026-04-10	Merge branches 'for-next/misc', 'for-next/tlbflush', ↵	Catalin Marinas
	'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core * arm64/for-next/perf: : Perf updates perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc() perf/arm-cmn: Fix incorrect error check for devm_ioremap() perf: add NVIDIA Tegra410 C2C PMU perf: add NVIDIA Tegra410 CPU Memory Latency PMU perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU perf/arm_cspmu: Add arm_cspmu_acpi_dev_get perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU perf/arm_cspmu: nvidia: Rename doc to Tegra241 perf/arm-cmn: Stop claiming entire iomem region arm64: cpufeature: Use pmuv3_implemented() function arm64: cpufeature: Make PMUVer and PerfMon unsigned KVM: arm64: Read PMUVer as unsigned * arm64/for-next/read-once: : Fixes for __READ_ONCE() with CONFIG_LTO=y arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y arm64: Optimize __READ_ONCE() with CONFIG_LTO=y * for-next/misc: : Miscellaneous cleanups/fixes arm64: rsi: use linear-map alias for realm config buffer arm64: Kconfig: fix duplicate word in CMDLINE help text arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps arm64: kexec: Remove duplicate allocation for trans_pgd arm64: mm: Use generic enum pgtable_level arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0 arm64: remove ARCH_INLINE_* * for-next/tlbflush: : Refactor the arm64 TLB invalidation API and implementation arm64: mm: __ptep_set_access_flags must hint correct TTL arm64: mm: Provide level hint for flush_tlb_page() arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range() arm64: mm: More flags for __flush_tlb_range() arm64: mm: Refactor __flush_tlb_range() to take flags arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid() arm64: mm: Simplify __flush_tlb_range_limit_excess() arm64: mm: Simplify __TLBI_RANGE_NUM() macro arm64: mm: Re-implement the __flush_tlb_range_op macro in C arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range() arm64: mm: Push __TLBI_VADDR() into __tlbi_level() arm64: mm: Implicitly invalidate user ASID based on TLBI operation arm64: mm: Introduce a C wrapper for by-range TLB invalidation arm64: mm: Re-implement the __tlbi_level macro as a C function * for-next/ttbr-macros-cleanup: : Cleanups of the TTBR1_* macros arm64/mm: Directly use TTBRx_EL1_CnP arm64/mm: Directly use TTBRx_EL1_ASID_MASK arm64/mm: Describe TTBR1_BADDR_4852_OFFSET * for-next/kselftest: : arm64 kselftest updates selftests/arm64: Implement cmpbr_sigill() to hwcap test * for-next/feat_lsui: : Futex support using FEAT_LSUI instructions to avoid toggling PAN arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present arm64: Kconfig: Add support for LSUI KVM: arm64: Use CAST instruction for swapping guest descriptor arm64: futex: Support futex with FEAT_LSUI arm64: futex: Refactor futex atomic operation KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI KVM: arm64: Expose FEAT_LSUI to guests arm64: cpufeature: Add FEAT_LSUI * for-next/mpam: (40 commits) : Expose MPAM to user-space via resctrl: : - Add architecture context-switch and hiding of the feature from KVM. : - Add interface to allow MPAM to be exposed to user-space using resctrl. : - Add errata workaoround for some existing platforms. : - Add documentation for using MPAM and what shape of platforms can use resctrl arm64: mpam: Add initial MPAM documentation arm_mpam: Quirk CMN-650's CSU NRDY behaviour arm_mpam: Add workaround for T241-MPAM-6 arm_mpam: Add workaround for T241-MPAM-4 arm_mpam: Add workaround for T241-MPAM-1 arm_mpam: Add quirk framework arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl arm64: mpam: Select ARCH_HAS_CPU_RESCTRL arm_mpam: resctrl: Add empty definitions for assorted resctrl functions arm_mpam: resctrl: Update the rmid reallocation limit arm_mpam: resctrl: Add resctrl_arch_rmid_read() arm_mpam: resctrl: Allow resctrl to allocate monitors arm_mpam: resctrl: Add support for csu counters arm_mpam: resctrl: Add monitor initialisation and domain boilerplate arm_mpam: resctrl: Add kunit test for control format conversions arm_mpam: resctrl: Add support for 'MB' resource arm_mpam: resctrl: Wait for cacheinfo to be ready arm_mpam: resctrl: Add rmid index helpers arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT ... * for-next/hotplug-batched-tlbi: : arm64/mm: Enable batched TLB flush in unmap_hotplug_range() arm64/mm: Reject memory removal that splits a kernel leaf mapping arm64/mm: Enable batched TLB flush in unmap_hotplug_range() * for-next/bbml2-fixes: : Fixes for realm guest and BBML2_NOABORT arm64: mm: Remove pmd_sect() and pud_sect() arm64: mm: Handle invalid large leaf mappings correctly arm64: mm: Fix rodata=full block mapping support for realm guests * for-next/sysreg: : arm64 sysreg updates arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12 arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12 arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06 * for-next/generic-entry: : More arm64 refactoring towards using the generic entry code arm64: Check DAIF (and PMR) at task-switch time arm64: entry: Use split preemption logic arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode() arm64: entry: Consistently prefix arm64-specific wrappers arm64: entry: Don't preempt with SError or Debug masked entry: Split preemption from irqentry_exit_to_kernel_mode() entry: Split kernel mode logic from irqentry_{enter,exit}() entry: Move irqentry_enter() prototype later entry: Remove local_irq_{enable,disable}_exit_to_user() entry: Fix stale comment for irqentry_enter() * for-next/acpi: : arm64 ACPI updates ACPI: AGDI: fix missing newline in error message
2026-04-10	Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-sleep'	Rafael J. Wysocki
	Merge cpuidle updates, OPP (operating performance points) library updates, and updates related to system suspend and hibernation for 7.1-rc1: - Refine stopped tick handling in the menu cpuidle governor and rearrange stopped tick handling in the teo cpuidle governor (Rafael Wysocki) - Add Panther Lake C-states table to the intel_idle driver (Artem Bityutskiy) - Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha) - Simplify cpuidle_register_device() with guard() (Huisong Li) - Use performance level if available to distinguish between rates in OPP debugfs (Manivannan Sadhasivam) - Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar) - Return -ENODATA if the snapshot image is not loaded (Alberto Garcia) - Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric Biggers) * pm-cpuidle: cpuidle: Simplify cpuidle_register_device() with guard() cpuidle: clean up dead dependencies on CPU_IDLE in Kconfig intel_idle: Add Panther Lake C-states table cpuidle: governors: teo: Rearrange stopped tick handling cpuidle: governors: menu: Refine stopped tick handling * pm-opp: OPP: Move break out of scoped_guard in dev_pm_opp_xlate_required_opp() OPP: debugfs: Use performance level if available to distinguish between rates * pm-sleep: PM: hibernate: return -ENODATA if the snapshot image is not loaded PM: hibernate: x86: Remove inclusion of crypto/hash.h
2026-04-10	Merge branch 'pm-cpufreq'	Rafael J. Wysocki
	Merge cpufreq updates for 7.1-rc1: - Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa) - Update cpufreq-dt-platdev blocklist (Faruque Ansari) - Minor updates to driver and dt-bindings for Tegra (Thierry Reding, Rosen Penev) - Add MAINTAINERS entry for CPPC driver (Viresh Kumar) - Add support for new features: CPPC performance priority, Dynamic EPP, Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy, Mario Limonciello) - Fix sysfs files being present when HW missing and broken/outdated documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy) - Pass the policy to cpufreq_driver->adjust_perf() to avoid using cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which leads to a scheduling-while-atomic bug (K Prateek Nayak) - Clean up dead code in Kconfig for cpufreq (Julian Braha) - Remove max_freq_req update for pre-existing cpufreq policy and add a boost_freq_req QoS request to save the boost constraint instead of overwriting the last scaling_max_freq constraint (Pierre Gondois) - Embed cpufreq QoS freq_req objects in cpufreq policy so they all are allocated in one go along with the policy to simplify lifetime rules and avoid error handling issues (Viresh Kumar) - Use DMI max speed when CPPC is unavailable in the acpi-cpufreq scaling driver (Henry Tseng) - Switch policy_is_shared() in cpufreq to using cpumask_nth() instead of cpumask_weight() because the former is more efficient (Yury Norov) - Use sysfs_emit() in sysfs show functions for cpufreq governor attributes (Thorsten Blum) - Update intel_pstate to stop returning an error when "off" is written to its status sysfs attribute while the driver is already off (Fabio De Francesco) - Include current frequency in the debug message printed by __cpufreq_driver_target() (Pengjie Zhang) * pm-cpufreq: (38 commits) cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq/amd-pstate: Pass the policy to amd_pstate_update() cpufreq/amd-pstate-ut: Add a unit test for raw EPP cpufreq/amd-pstate: Add support for raw EPP writes cpufreq/amd-pstate: Add support for platform profile class cpufreq/amd-pstate: add kernel command line to override dynamic epp cpufreq/amd-pstate: Add dynamic energy performance preference Documentation: amd-pstate: fix dead links in the reference section cpufreq/amd-pstate: Cache the max frequency in cpudata Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count} Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file amd-pstate-ut: Add a testcase to validate the visibility of driver attributes amd-pstate-ut: Add module parameter to select testcases amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2() amd-pstate: Add sysfs support for floor_freq and floor_count amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF x86/cpufeatures: Add AMD CPPC Performance Priority feature. ...
2026-04-09	cgroup/rdma: fix swapped arguments in pr_warn() format string	cuitao
	The format string says "device %p ... rdma cgroup %p" but the arguments were passed as (cg, device), printing them in the wrong order. Signed-off-by: cuitao <cuitao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-09	bpf: Fix use-after-free in offloaded map/prog info fill	Jiayuan Chen
	When querying info for an offloaded BPF map or program, bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns() obtain the network namespace with get_net(dev_net(offmap->netdev)). However, the associated netdev's netns may be racing with teardown during netns destruction. If the netns refcount has already reached 0, get_net() performs a refcount_t increment on 0, triggering: refcount_t: addition on 0; use-after-free. Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains valid, they cannot prevent the netns refcount from reaching zero. Fix this by using maybe_get_net() instead of get_net(). maybe_get_net() uses refcount_inc_not_zero() and returns NULL if the refcount is already zero, which causes ns_get_path_cb() to fail and the caller to return -ENOENT -- the correct behavior when the netns is being destroyed. Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs") Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-09	bpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_taken	Daniel Borkmann
	When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from a comparison, and then, known-constant arithmetic is performed, adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw = ptr_reg->raw without clearing the negative reg->range sentinel values. This lets is_pkt_ptr_branch_taken() choose one branch direction and skip going through the other. Fix this by clearing negative pkt range values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown and both branches are properly verified. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-09	Merge tag 'dma-mapping-7.0-2026-04-09' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "A fix for DMA-mapping subsystem, which hides annoying, false-positive warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail Gavrilov)" * tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement
2026-04-09	sched/cache: Allow the user space to turn on and off cache aware scheduling	Chen Yu
	Provide a debugfs directory llc_balancing, and a knob named "enabled" under it to allow the user to turn off and on the cache aware scheduling at runtime. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/0aa56f7fc48db2f8f700cd1aa34dedd0ec88351b.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Enable cache aware scheduling for multi LLCs NUMA node	Chen Yu
	Introduce sched_cache_present to enable cache aware scheduling for multi LLCs NUMA node Cache-aware load balancing should only be enabled if there are more than 1 LLCs within 1 NUMA node. sched_cache_present is introduced to indicate whether this platform supports this topology. Test results: The first test platform is a 2 socket Intel Sapphire Rapids with 30 cores per socket. The DRAM interleaving is enabled in the BIOS so it essential has one NUMA node with two last level caches. There are 60 CPUs associated with each last level cache. The second test platform is a AMD Genoa. There are 4 Nodes and 32 CPUs per node. Each node has 2 CCXs and each CCX has 16 CPUs. hackbench/schbench/netperf/stream/stress-ng/chacha20 were launched on these two platforms. [TL;DR] Sappire Rapids: hackbench shows significant improvement when the number of different active threads is below the capacity of a LLC. schbench shows limitted wakeup latency improvement. ChaCha20-xiangshan(risc-v simulator) shows good throughput improvement. No obvious difference was observed in netperf/stream/stress-ng in Hmean. Genoa: Significant improvement is observed in hackbench when the active number of threads is lower than the number of CPUs within 1 LLC. On v2, Aaron reported improvement of hackbench/redis when system is underloaded. ChaCha20-xiangshan shows huge throughput improvement. Phoronix has tested v1 and shows good improvements in 30+ cases[3]. No obvious difference was observed in netperf/stream/stress-ng in Hmean. Detail: Due to length constraints, data without much difference with baseline is not presented. Sapphire Rapids: [hackbench pipe] ================ case load baseline(std%) compare%( std%) threads-pipe-10 1-groups 1.00 ( 1.22) +26.09 ( 1.10) threads-pipe-10 2-groups 1.00 ( 4.90) +22.88 ( 0.18) threads-pipe-10 4-groups 1.00 ( 2.07) +9.00 ( 3.49) threads-pipe-10 8-groups 1.00 ( 8.13) +3.45 ( 3.62) threads-pipe-16 1-groups 1.00 ( 2.11) +26.30 ( 0.08) threads-pipe-16 2-groups 1.00 ( 15.13) -1.77 ( 11.89) threads-pipe-16 4-groups 1.00 ( 4.37) +0.58 ( 7.99) threads-pipe-16 8-groups 1.00 ( 2.88) +2.71 ( 3.50) threads-pipe-2 1-groups 1.00 ( 9.40) +22.07 ( 0.71) threads-pipe-2 2-groups 1.00 ( 9.99) +18.01 ( 0.95) threads-pipe-2 4-groups 1.00 ( 3.98) +24.66 ( 0.96) threads-pipe-2 8-groups 1.00 ( 7.00) +21.83 ( 0.23) threads-pipe-20 1-groups 1.00 ( 1.03) +28.84 ( 0.21) threads-pipe-20 2-groups 1.00 ( 4.42) +31.90 ( 3.15) threads-pipe-20 4-groups 1.00 ( 9.97) +4.56 ( 1.69) threads-pipe-20 8-groups 1.00 ( 1.87) +1.25 ( 0.74) threads-pipe-4 1-groups 1.00 ( 4.48) +25.67 ( 0.78) threads-pipe-4 2-groups 1.00 ( 9.14) +4.91 ( 2.08) threads-pipe-4 4-groups 1.00 ( 7.68) +19.36 ( 1.53) threads-pipe-4 8-groups 1.00 ( 10.79) +7.20 ( 12.20) threads-pipe-8 1-groups 1.00 ( 4.69) +21.93 ( 0.03) threads-pipe-8 2-groups 1.00 ( 1.16) +25.29 ( 0.65) threads-pipe-8 4-groups 1.00 ( 2.23) -1.27 ( 3.62) threads-pipe-8 8-groups 1.00 ( 4.65) -3.08 ( 2.75) Note: The default number of fd in hackbench is changed from 20 to various values to ensure that threads fit within a single LLC, especially on AMD systems. Take "threads-pipe-8, 2-groups" for example, the number of fd is 8, and 2 groups are created. [schbench] The 99th percentile wakeup latency shows some improvements when the system is underload, while it does not bring much difference with the increasing of system utilization. 99th Wakeup Latencies Base (mean std) Compare (mean std) Change ========================================================================= thread=2 9.00(0.00) 9.00(1.73) 0.00% thread=4 7.33(0.58) 6.33(0.58) +13.64% thread=8 9.00(0.00) 7.67(1.15) +14.78% thread=16 8.67(0.58) 8.67(1.53) 0.00% thread=32 9.00(0.00) 7.00(0.00) +22.22% thread=64 9.33(0.58) 9.67(0.58) -3.64% thread=128 12.00(0.00) 12.00(0.00) 0.00% [chacha20 on simulated risc-v] baseline: Host time spent: 67861ms cache aware scheduling enabled: Host time spent: 54441ms Time reduced by 24% Genoa: [hackbench pipe] The default number of fd is 20, which exceed the number of CPUs in a LLC. So the fd is adjusted to 2, 4, 6, 8, 20 respectively. Exclude the result with large run-to-run variance, 10% ~ 50% improvement is observed when the system is underloaded: [hackbench pipe] ================ case load baseline(std%) compare%( std%) threads-pipe-2 1-groups 1.00 ( 2.89) +47.33 ( 1.20) threads-pipe-2 2-groups 1.00 ( 3.88) +39.82 ( 0.61) threads-pipe-2 4-groups 1.00 ( 8.76) +5.57 ( 13.10) threads-pipe-20 1-groups 1.00 ( 4.61) +11.72 ( 1.06) threads-pipe-20 2-groups 1.00 ( 6.18) +14.55 ( 1.47) threads-pipe-20 4-groups 1.00 ( 2.99) +10.16 ( 4.49) threads-pipe-4 1-groups 1.00 ( 4.23) +43.70 ( 2.14) threads-pipe-4 2-groups 1.00 ( 3.68) +8.45 ( 4.04) threads-pipe-4 4-groups 1.00 ( 17.72) +2.42 ( 1.14) threads-pipe-6 1-groups 1.00 ( 3.10) +7.74 ( 3.83) threads-pipe-6 2-groups 1.00 ( 3.42) +14.26 ( 4.53) threads-pipe-6 4-groups 1.00 ( 10.34) +10.94 ( 7.12) threads-pipe-8 1-groups 1.00 ( 4.21) +9.06 ( 4.43) threads-pipe-8 2-groups 1.00 ( 1.88) +3.74 ( 0.58) threads-pipe-8 4-groups 1.00 ( 2.78) +23.96 ( 1.18) [chacha20 on simulated risc-v] Host time spent: 54762ms Host time spent: 28295ms Time reduced by 48% Suggested-by: Libo Chen <libchen@purestorage.com> Suggested-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/71972e12ab4f08aff422b31e34df09bdbd94de84.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Respect LLC preference in task migration and detach	Tim Chen
	During load balancing, make can_migrate_task() consider a task's LLC preference. Prevent a task from being moved out of its preferred LLC. During the regular load balancing, if the task cannot be migrated due to LLC locality, the nr_balance_failed also should not be increased. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/53da65f3d59de31e1a1dc59a4093d8dd9d4dc206.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Handle moving single tasks to/from their preferred LLC	Tim Chen
	Cache aware scheduling mainly does two things: 1. Prevent task from migrating out of its preferred LLC if not nessasary. 2. Migrating task to their preferred LLC if nessasary. For 1: In the generic load balance, if the busiest runqueue has only one task, active balancing may be invoked to move it away. However, this migration might break LLC locality. Prevent regular load balance from migrating a task that prefers the current LLC. The load level and imbalance do not warrant breaking LLC preference per the can_migrate_llc() policy. Here, the benefit of LLC locality outweighs the power efficiency gained from migrating the only runnable task away. Before migration, check whether the task is running on its preferred LLC: Do not move a lone task to another LLC if it would move the task away from its preferred LLC or cause excessive imbalance between LLCs. For 2: On the other hand, if the migration type is migrate_llc_task, it means that there are tasks on the env->src_cpu that want to be migrated to their preferred LLC, launch the active load balance anyway. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/9b816d8c27fabf2a9c0e1f61a6b90afe8ec4ad52.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Add migrate_llc_task migration type for cache-aware balancing	Tim Chen
	Introduce a new migration type, migrate_llc_task, to support cache-aware load balancing. After identifying the busiest sched_group (having the most tasks preferring the destination LLC), mark migrations with this type. During load balancing, each runqueue in the busiest sched_group is examined, and the runqueue with the highest number of tasks preferring the destination CPU is selected as the busiest runqueue. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/b9df27c19cc5121ddb2a7d1be7f9d52fec1563dc.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Prioritize tasks preferring destination LLC during balancing	Tim Chen
	During LLC load balancing, first check for tasks that prefer the destination LLC and balance them to it before others. Mark source sched groups containing tasks preferring non local LLCs with the group_llc_balance flag. This ensures the load balancer later pulls or pushes these tasks toward their preferred LLCs. The priority of group_llc_balance is lower than that of group_overloaded and higher than that of all other group types. This is because group_llc_balance may exacerbate load imbalance, and if the LLC balancing attempt fails, the nr_balance_failed mechanism will trigger other group types to rebalance the load. The load balancer selects the busiest sched_group and migrates tasks to less busy groups to distribute load across CPUs. With cache-aware scheduling enabled, the busiest sched_group is the one with most tasks preferring the destination LLC. If the group has the llc_balance flag set, cache aware load balancing is triggered. Introduce the helper function update_llc_busiest() to identify the sched_group with the most tasks preferring the destination LLC. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/baa458f45eab3f602af090c6d6af63dc864f5ec6.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Check local_group only once in update_sg_lb_stats()	Tim Chen
	There is no need to check the local group twice for both group_asym_packing and group_smt_balance. Adjust the code to facilitate future checks for group types (cache-aware load balancing) as well. No functional changes are expected. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/99a57865c8ae1847087a5c00e92d24351cf3e5a8.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Count tasks prefering destination LLC in a sched group	Tim Chen
	During LLC load balancing, tabulate the number of tasks on each runqueue that prefer the LLC contains the env->dst_cpu in a sched group. For example, consider a system with 4 LLC sched groups (LLC0 to LLC3) balancing towards LLC3. LLC0 has 3 tasks preferring LLC3, LLC1 has 2, and LLC2 has 1. LLC0, having the most tasks preferring LLC3, is selected as the busiest source to pick tasks from. Within a source LLC, the total number of tasks preferring a destination LLC is computed by summing counts across all CPUs in that LLC. For instance, if LLC0 has CPU0 with 2 tasks and CPU1 with 1 task preferring LLC3, the total for LLC0 is 3. These statistics allow the load balancer to choose tasks from source sched groups that best match their preferred LLCs. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/3d8502a33a753c4384b368f97f64ee70b1cea0db.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Calculate the percpu sd task LLC preference	Tim Chen
	Calculate the number of tasks' LLC preferences for each runqueue. This statistic is computed during task enqueue and dequeue operations, and is used by the cache-aware load balancing. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/d15a64436d3acd19c5c53344c5e9d3d0b79b3233.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Introduce per CPU's tasks LLC preference counter	Tim Chen
	The lowest level of sched domain for each CPU is assigned an array where each element tracks the number of tasks preferring a given LLC, indexed from 0 to max_lid. Since each CPU has its dedicated sd, this implies that each CPU will have a dedicated task LLC preference counter. For example, sd->llc_counts[3] = 2 signifies that there are 2 tasks on this runqueue which prefer to run within LLC3. The load balancer can use this information to identify busy runqueues and migrate tasks to their preferred LLC domains. This array will be reallocated at runtime during sched domain rebuild. Introduce the buffer allocation mechanism, and the statistics will be calculated in the subsequent patch. Note: the LLC preference statistics of each CPU are reset on sched domain rebuild and may under count temporarily, until the CPU becomes idle and the count is cleared. This is a trade off to avoid complex data synchronization across sched domain builds. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/42e79eceb8cd6be8a032401d481d101913bc5703.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Track LLC-preferred tasks per runqueue	Tim Chen
	For each runqueue, track the number of tasks with an LLC preference and how many of them are running on their preferred LLC. This mirrors nr_numa_running and nr_preferred_running for NUMA balancing, and will be used by cache-aware load balancing in later patches. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/459a37102f3d74a4e09ea58401d2094ac731d044.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Assign preferred LLC ID to processes	Tim Chen
	With cache-aware scheduling enabled, each task is assigned a preferred LLC ID. This allows quick identification of the LLC domain where the task prefers to run, similar to numa_preferred_nid in NUMA balancing. Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/f2ceecba5858680349ad4ce9303a2121f0bb7272.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Make LLC id continuous	Tim Chen
	Introduce an index mapping between CPUs and their LLCs. This provides a roughly continuous per LLC index needed for cache-aware load balancing in later patches. The existing per_cpu llc_id usually points to the first CPU of the LLC domain, which is sparse and unsuitable as an array index. Using llc_id directly would waste memory. With the new mapping, CPUs in the same LLC share an approximate continuous id: per_cpu(llc_id, CPU=0...15) = 0 per_cpu(llc_id, CPU=16...31) = 1 per_cpu(llc_id, CPU=32...47) = 2 ... Note that the LLC IDs are allocated via bitmask, so the IDs may be reused during CPU offline->online transitions. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Originally-by: K Prateek Nayak <kprateek.nayak@amd.com> Co-developed-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/047ef46339e4db497b54a89940a7ebedf27fcf28.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Introduce helper functions to enforce LLC migration policy	Chen Yu
	Cache-aware scheduling aggregates threads onto their preferred LLC, mainly through load balancing. When the preferred LLC becomes saturated, more threads are still placed there, increasing latency. A mechanism is needed to limit aggregation so that the preferred LLC does not become overloaded. Introduce helper functions can_migrate_llc() and can_migrate_llc_task() to enforce the LLC migration policy: 1. Aggregate a task to its preferred LLC if both source and destination LLCs are not too busy, or if doing so will not leave the preferred LLC much more imbalanced than the non-preferred one (>20% utilization difference, a little higher than the default imbalance_pct(17%) of the LLC domain as hysteresis). Later this threshold will be turned into tunable debugfs. 2. Allow moving a task from overloaded preferred LLC to a non preferred LLC if this will not cause the non preferred LLC to become too imbalanced to cause a later migration back. 3. If both LLCs are too busy, let the generic load balance to spread the tasks. Further (hysteresis)action could be taken in the future to prevent tasks from being migrated into and out of the preferred LLC frequently (back and forth): the threshold for migrating a task out of its preferred LLC should be higher than that for migrating it into the LLC. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/d19b52589cdceaee5e625980959f4d1982d6d7c9.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Record per LLC utilization to guide cache aware scheduling ↵	Chen Yu
	decisions When a system becomes busy and a process's preferred LLC is saturated with too many threads, tasks within that LLC migrate frequently. These in LLC migrations introduce latency and degrade performance. To avoid this, task aggregation should be suppressed when the preferred LLC is overloaded, which requires a metric to indicate LLC utilization. Record per LLC utilization/cpu capacity during periodic load balancing. These statistics will be used in later patches to decide whether tasks should be aggregated into their preferred LLC. Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/a48151b3d57f2a42a5971aaead1b7f81e69229f4.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Limit the scan number of CPUs when calculating task occupancy	Chen Yu
	When NUMA balancing is enabled, the kernel currently iterates over all online CPUs to aggregate process-wide occupancy data. On large systems, this global scan introduces significant overhead. To reduce scan latency, limit the search to a subset of relevant CPUs: 1. The task's preferred NUMA node. 2. The node where the task is currently running. 3. The node that contains the task's current preferred LLC.. While focusing solely on the preferred NUMA node is ideal, a process-wide scan must remain flexible because the "preferred node" is a per-task attribute. Different threads within the same process may have different preferred nodes, causing the process-wide preference to migrate. Maintaining a mask that covers both the preferred and active running nodes ensures accuracy while significantly reducing the number of CPUs inspected. Future work may integrate numa_group to further refine task aggregation. Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/57ed5fcec9b242803fe4ea2ce6e7f3de6a6efc6b.1775065312.git.tim.c.chen@linux.intel.com
2026-04-09	sched/cache: Introduce infrastructure for cache-aware load balancing	Peter Zijlstra (Intel)
	Adds infrastructure to enable cache-aware load balancing, which improves cache locality by grouping tasks that share resources within the same cache domain. This reduces cache misses and improves overall data access efficiency. In this initial implementation, threads belonging to the same process are treated as entities that likely share working sets. The mechanism tracks per-process CPU occupancy across cache domains and attempts to migrate threads toward cache-hot domains where their process already has active threads, thereby enhancing locality. This provides a basic model for cache affinity. While the current code targets the last-level cache (LLC), the approach could be extended to other domain types such as clusters (L2) or node-internal groupings. At present, the mechanism selects the CPU within an LLC that has the highest recent runtime. Subsequent patches in this series will use this information in the load-balancing path to guide task placement toward preferred LLCs. In the future, more advanced policies could be integrated through NUMA balancing-for example, migrating a task to its preferred LLC when spare capacity exists, or swapping tasks across LLCs to improve cache affinity. Grouping of tasks could also be generalized from that of a process to be that of a NUMA group, or be user configurable. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/6269a53221b9439b9ca00d18a9d1946fb64d8cff.1775065312.git.tim.c.chen@linux.intel.com
2026-04-08	bpf: Remove static qualifier from local subprog pointer	Daniel Borkmann
	The local subprog pointer in create_jt() and visit_abnormal_return_insn() was declared static. It is unconditionally assigned via bpf_find_containing_subprog() before every use. Thus, the static qualifier serves no purpose and rather creates confusion. Just remove it. Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls") Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08	bpf: Fix ld_{abs,ind} failure path analysis in subprogs	Daniel Borkmann
	Usage of ld_{abs,ind} instructions got extended into subprogs some time ago via commit 09b28d76eac4 ("bpf: Add abnormal return checks."). These are only allowed in subprograms when the latter are BTF annotated and have scalar return types. The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 + exit) from legacy cBPF times. While the enforcement is on scalar return types, the verifier must also simulate the path of abnormal exit if the packet data load via ld_{abs,ind} failed. This is currently not the case. Fix it by having the verifier simulate both success and failure paths, and extend it in similar ways as we do for tail calls. The success path (r0=unknown, continue to next insn) is pushed onto stack for later validation and the r0=0 and return to the caller is done on the fall-through side. Fixes: 09b28d76eac4 ("bpf: Add abnormal return checks.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08	bpf: Propagate error from visit_tailcall_insn	Daniel Borkmann
	Commit e40f5a6bf88a ("bpf: correct stack liveness for tail calls") added visit_tailcall_insn() but did not check its return value. Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08	bpf: Make find_linfo widely available	Kumar Kartikeya Dwivedi
	Move find_linfo() as bpf_find_linfo() into core.c to allow for its use in the verifier in subsequent patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08	bpf: Extract bpf_get_linfo_file_line	Kumar Kartikeya Dwivedi
	Extract bpf_get_linfo_file_line as its own function so that the logic to obtain the file, line, and line number for a given program can be shared in subsequent patches. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08	Merge branch kvm-arm64/hyp-tracing into kvmarm-master/next	Marc Zyngier
	* kvm-arm64/hyp-tracing: (40 commits) : . : EL2 tracing support, adding both 'remote' ring-buffer : infrastructure and the tracing itself, courtesy of : Vincent Donnefort. From the cover letter: : : "The growing set of features supported by the hypervisor in protected : mode necessitates debugging and profiling tools. Tracefs is the : ideal candidate for this task: : : * It is simple to use and to script. : : * It is supported by various tools, from the trace-cmd CLI to the : Android web-based perfetto. : : * The ring-buffer, where are stored trace events consists of linked : pages, making it an ideal structure for sharing between kernel and : hypervisor. : : This series first introduces a new generic way of creating remote events and : remote buffers. Then it adds support to the pKVM hypervisor." : . tracing: selftests: Extend hotplug testing for trace remotes tracing: Non-consuming read for trace remotes with an offline CPU tracing: Adjust cmd_check_undefined to show unexpected undefined symbols tracing: Restore accidentally removed SPDX tag KVM: arm64: avoid unused-variable warning tracing: Generate undef symbols allowlist for simple_ring_buffer KVM: arm64: tracing: add ftrace dependency tracing: add more symbols to whitelist tracing: Update undefined symbols allow list for simple_ring_buffer KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing tracing: selftests: Add hypervisor trace remote tests KVM: arm64: Add selftest event support to nVHE/pKVM hyp KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote KVM: arm64: Add trace reset to the nVHE/pKVM hyp KVM: arm64: Sync boot clock with the nVHE/pKVM hyp KVM: arm64: Add trace remote for the nVHE/pKVM hyp KVM: arm64: Add tracing capability for the nVHE/pKVM hyp KVM: arm64: Support unaligned fixmap in the pKVM hyp KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08	perf/events: Replace READ_ONCE() with standard pgtable accessors	Anshuman Khandual
	Replace raw READ_ONCE() dereferences of pgtable entries with corresponding standard page table accessors pxdp_get() in perf_get_pgtable_size(). These accessors default to READ_ONCE() on platforms that don't override them. So there is no functional change on such platforms. However arm64 platform is being extended to support 128 bit page tables via a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not provide required single copy atomic access for 128 bit page table entries. Although pxdp_get() accessors can later be overridden on arm64 platform to extend required single copy atomicity support on 128 bit entries. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com
2026-04-08	sched/rt: Cleanup global RT bandwidth functions	Michal Koutný
	The commit 5f6bd380c7bdb ("sched/rt: Remove default bandwidth control") and followup changes made a few of the functions unnecessary, drop them for simplicity. Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-3-1e7d5ed6b249@suse.com
2026-04-08	sched/rt: Move group schedulability check to sched_rt_global_validate()	Michal Koutný
	The sched_rt_global_constraints() function is a remnant that used to set up global RT throttling but that is no more since commit 5f6bd380c7bdb ("sched/rt: Remove default bandwidth control") and the function ended up only doing schedulability check. Move the check into the validation function where it fits better. (The order of validations sched_dl_global_validate() and sched_rt_global_validate() shouldn't matter.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-2-1e7d5ed6b249@suse.com
2026-04-08	sched/rt: Skip group schedulable check with rt_group_sched=0	Michal Koutný
	The warning from the commit 87f1fb77d87a6 ("sched: Add RT_GROUP WARN checks for non-root task_groups") is wrong -- it assumes that only task_groups with rt_rq are traversed, however, the schedulability check would iterate all task_groups even when rt_group_sched=0 is disabled at boot time but some non-root task_groups exist. The schedulability check is supposed to validate: a) that children don't overcommit its parent, b) no RT task group overcommits global RT limit. but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups, therefore skip the validation altogether. Otherwise, writes to the global sched_rt_runtime_us knob will be rejected with incorrect validation error. This fix is immaterial with CONFIG_RT_GROUP_SCHED=n. Fixes: 87f1fb77d87a6 ("sched: Add RT_GROUP WARN checks for non-root task_groups") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-1-1e7d5ed6b249@suse.com
2026-04-08	sched/deadline: Use revised wakeup rule for dl_server	Peter Zijlstra
	John noted that commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server") unfixed the issue from commit a3a70caf7906 ("sched/deadline: Fix dl_server behaviour"). The issue in commit 115135422562 was for wakeups of the server after the deadline; in which case you have to start a new period. The case for a3a70caf7906 is wakeups before the deadline. Now, because the server is effectively running a least-laxity policy, it means that any wakeup during the runnable phase means dl_entity_overflow() will be true. This means we need to adjust the runtime to allow it to still run until the existing deadline expires. Use the revised wakeup rule for dl_defer entities. Fixes: 115135422562 ("sched/deadline: Fix 'stuck' dl_server") Reported-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net
2026-04-08	entry: Split kernel mode logic from irqentry_{enter,exit}()	Mark Rutland
	The generic irqentry code has entry/exit functions specifically for exceptions taken from user mode, but doesn't have entry/exit functions specifically for exceptions taken from kernel mode. It would be helpful to have separate entry/exit functions specifically for exceptions taken from kernel mode. This would make the structure of the entry code more consistent, and would make it easier for architectures to manage logic specific to exceptions taken from kernel mode. Move the logic specific to kernel mode out of irqentry_enter() and irqentry_exit() into new irqentry_enter_from_kernel_mode() and irqentry_exit_to_kernel_mode() functions. These are marked __always_inline and placed in irq-entry-common.h, as with irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so that they can be inlined into architecture-specific wrappers. The existing out-of-line irqentry_enter() and irqentry_exit() functions retained as callers of the new functions. The lockdep assertion from irqentry_exit() is moved into irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This was previously missing from irqentry_exit_to_user_mode() when called directly, and any new lockdep assertion failure relating from this change is a latent bug. Aside from the lockdep change noted above, there should be no functional change as a result of this change. [ tglx: Updated kernel doc ] Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com
2026-04-08	entry: Remove local_irq_{enable,disable}_exit_to_user()	Mark Rutland
	local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are never overridden by architecture code, and are always equivalent to local_irq_enable() and local_irq_disable(). These functions were added on the assumption that arm64 would override them to manage 'DAIF' exception masking, as described by Thomas Gleixner in these threads: https://lore.kernel.org/all/20190919150809.340471236@linutronix.de/ https://lore.kernel.org/all/alpine.DEB.2.21.1910240119090.1852@nanos.tec.linutronix.de/ In practice arm64 did not need to override either. Prior to moving to the generic irqentry code, arm64's management of DAIF was reworked in commit: 97d935faacde ("arm64: Unmask Debug + SError in do_notify_resume()") Since that commit, arm64 only masks interrupts during the 'prepare' step when returning to user mode, and masks other DAIF exceptions later. Within arm64_exit_to_user_mode(), the arm64 entry code is as follows: local_irq_disable(); exit_to_user_mode_prepare_legacy(regs); local_daif_mask(); mte_check_tfsr_exit(); exit_to_user_mode(); Remove the unnecessary local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() functions. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260407131650.3813777-3-mark.rutland@arm.com
2026-04-07	bpf: Allow overwriting referenced dynptr when refcnt > 1	Amery Hung
	The verifier currently does not allow overwriting a referenced dynptr's stack slot to prevent resource leak. This is because referenced dynptr holds additional resources that requires calling specific helpers to release. This limitation can be relaxed when there are multiple copies of the same dynptr. Whether it is the orignial dynptr or one of its clones, as long as there exists at least one other dynptr with the same ref_obj_id (to be used to release the reference), its stack slot should be allowed to be overwritten. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07	bpf: Clear delta when clearing reg id for non-{add,sub} ops	Daniel Borkmann
	When a non-{add,sub} alu op such as xor is performed on a scalar register that previously had a BPF_ADD_CONST delta, the else path in adjust_reg_min_max_vals() only clears dst_reg->id but leaves dst_reg->delta unchanged. This stale delta can propagate via assign_scalar_id_before_mov() when the register is later used in a mov. It gets a fresh id but keeps the stale delta from the old (now-cleared) BPF_ADD_CONST. This stale delta can later propagate leading to a verifier-vs- runtime value mismatch. The clear_id label already correctly clears both delta and id. Make the else path consistent by also zeroing the delta when id is cleared. More generally, this introduces a helper clear_scalar_id() which internally takes care of zeroing. There are various other locations in the verifier where only the id is cleared. By using the helper we catch all current and future locations. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07	bpf: Fix linked reg delta tracking when src_reg == dst_reg	Daniel Borkmann
	Consider the case of rX += rX where src_reg and dst_reg are pointers to the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first modifies the dst_reg in-place, and later in the delta tracking, the subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the post-{add,sub} value instead of the original source. This is problematic since it sets an incorrect delta, which sync_linked_regs() then propagates to linked registers, thus creating a verifier-vs-runtime mismatch. Fix it by just skipping this corner case. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07	bpf: Prefer vmlinux symbols over module symbols for unqualified kprobes	Andrey Grodzovsky
	When an unqualified kprobe target exists in both vmlinux and a loaded module, number_of_same_symbols() returns a count greater than 1, causing kprobe attachment to fail with -EADDRNOTAVAIL even though the vmlinux symbol is unambiguous. When no module qualifier is given and the symbol is found in vmlinux, return the vmlinux-only count without scanning loaded modules. This preserves the existing behavior for all other cases: - Symbol only in a module: vmlinux count is 0, falls through to module scan as before. - Symbol qualified with MOD:SYM: mod != NULL, unchanged path. - Symbol ambiguous within vmlinux itself: count > 1 is returned as-is. Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well") Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads") Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com> Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07	bpf: Retire rcu_trace_implies_rcu_gp()	Kumar Kartikeya Dwivedi
	RCU Tasks Trace grace period implies RCU grace period, and this guarantee is expected to remain in the future. Only BPF is the user of this predicate, hence retire the API and clean up all in-tree users. RCU Tasks Trace is now implemented on SRCU-fast and its grace period mechanism always has at least one call to synchronize_rcu() as it is required for SRCU-fast's correctness (it replaces the smp_mb() that SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07	workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value	Maninder Singh
	use NR_STD_WORKER_POOLS for irq_work_fns[] array definition. NR_STD_WORKER_POOLS is also 2, but better to use MACRO. Initialization loop for_each_bh_worker_pool() also uses same MACRO. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-07	Merge tag 'mm-hotfixes-stable-2026-04-06-15-27' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Eight hotfixes. All are cc:stable and seven are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: ocfs2: fix out-of-bounds write in ocfs2_write_end_inline mm/damon/stat: deallocate damon_call() failure leaking damon_ctx mm/vma: fix memory leak in __mmap_region() mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails mm: reinstate unconditional writeback start in balance_dirty_pages() liveupdate: propagate file deserialization failures mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()
2026-04-07	alarmtimer: Access timerqueue node under lock in suspend	Zhan Xusheng
	In alarmtimer_suspend(), timerqueue_getnext() is called under base->lock, but next->expires is read after the lock is released. This is safe because suspend freezes all relevant task contexts, but reading the node while holding the lock makes the code easier to reason about and not worry about a theoretical UAF. Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com
2026-04-07	bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooks	Jiayuan Chen
	bpf_lsm_task_to_inode() is called under rcu_read_lock() and bpf_lsm_inet_conn_established() is called from softirq context, so neither hook can be used by sleepable LSM programs. Fixes: 423f16108c9d8 ("bpf: Augment the set of sleepable LSM hooks") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>