summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-04-05kho: move alloc tag init to kho_init_{folio,pages}()Pratyush Yadav (Google)
Commit 8f1081892d62 ("kho: simplify page initialization in kho_restore_page()") cleaned up the page initialization logic by moving the folio and 0-order-page paths into separate functions. It missed moving the alloc tag initialization. Do it now to keep the two paths cleanly separated. While at it, touch up the comments to be a tiny bit shorter (mainly so it doesn't end up splitting into a multiline comment). This is purely a cosmetic change and there should be no change in behaviour. Link: https://lkml.kernel.org/r/20260213085914.2778107-1-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-05Merge tag 'sched-urgent-2026-04-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking again (Peter Zijlstra) - Fix avg_vruntime() usage in sched_debug (Peter Zijlstra) * tag 'sched-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/debug: Fix avg_vruntime() usage sched/fair: Fix zero_vruntime tracking fix
2026-04-04prctl: cfi: change the branch landing pad prctl()s to be more descriptivePaul Walmsley
Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the indirect branch tracking prctl()s with something more readable, and suggesting the use of the speculation control prctl()s as an exemplar, reimplement the prctl()s and related constants that control per-task forward-edge control flow integrity. This primarily involves two changes. First, the prctls are restructured to resemble the style of the speculative execution workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them easier to extend in the future. Second, the "indir_br_lp" abbrevation is expanded to "branch_landing_pads" to be less telegraphic. The kselftest and documentation is adjusted accordingly. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04prctl: rename branch landing pad implementation functions to be more explicitPaul Walmsley
Per Linus' comments about the unreadability of abbreviations such as "indir_br_lp", rename the three prctl() implementation functions to be more explicit. This involves renaming "indir_br_lp_status" in the function names to "branch_landing_pad_state". While here, add _prctl_ into the function names, following the speculation control prctl implementation functions. Link: https://lore.kernel.org/linux-riscv/CAHk-=whhSLGZAx3N5jJpb4GLFDqH_QvS07D+6BnkPWmCEzTAgw@mail.gmail.com/ Cc: Deepak Gupta <debug@rivosinc.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Paul Walmsley <pjw@kernel.org>
2026-04-04Merge tag 'v7.0-rc6' into irq/coreThomas Gleixner
to be able to merge the hyper-v patch related to randomness.
2026-04-04Merge tag 'amd-pstate-v7.1-2026-04-02' of ↵Rafael J. Wysocki
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux Pull amd-pstate new content for 7.1 (2026-04-02) from Mario Limonciello: "Add support for new features: * CPPC performance priority * Dynamic EPP * Raw EPP * New unit tests for new features Fixes for: * PREEMPT_RT * sysfs files being present when HW missing * Broken/outdated documentation" * tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (22 commits) MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer cpufreq: Pass the policy to cpufreq_driver->adjust_perf() cpufreq/amd-pstate: Pass the policy to amd_pstate_update() cpufreq/amd-pstate-ut: Add a unit test for raw EPP cpufreq/amd-pstate: Add support for raw EPP writes cpufreq/amd-pstate: Add support for platform profile class cpufreq/amd-pstate: add kernel command line to override dynamic epp cpufreq/amd-pstate: Add dynamic energy performance preference Documentation: amd-pstate: fix dead links in the reference section cpufreq/amd-pstate: Cache the max frequency in cpudata Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count} Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file amd-pstate-ut: Add a testcase to validate the visibility of driver attributes amd-pstate-ut: Add module parameter to select testcases amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2() amd-pstate: Add sysfs support for floor_freq and floor_count amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF x86/cpufeatures: Add AMD CPPC Performance Priority feature. amd-pstate: Make certain freq_attrs conditionally visible ...
2026-04-04module: Simplify warning on positive returns from module_init()Lucas De Marchi
It should now be rare to trigger this warning - it doesn't need to be so verbose. Make it follow the usual style in the module loading code. For the same reason, drop the dump_stack(). Suggested-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-04-04module: Override -EEXIST module returnLucas De Marchi
The -EEXIST errno is reserved by the module loading functionality. When userspace calls [f]init_module(), it expects a -EEXIST to mean that the module is already loaded in the kernel. If module_init() returns it, that is not true anymore. Override the error when returning to userspace: it doesn't make sense to change potentially long error propagation call chains just because it's will end up as the return of module_init(). Closes: https://lore.kernel.org/all/aKLzsAX14ybEjHfJ@orbyte.nwl.cc/ Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: Phil Sutter <phil@nwl.cc> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Lucas De Marchi <demarchi@kernel.org> [Sami: Fixed a typo.] Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-04-03Merge tag 'sched_ext-for-7.0-rc6-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "These are late but both fix subtle yet critical problems and the blast radius is limited strictly to sched_ext. - Fix stale direct dispatch state in ddsp_dsq_id which can cause spurious warnings in mark_direct_dispatch() on task wakeup - Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU configs which can lead to incorrectly dispatching migration- disabled tasks to remote CPUs" * tag 'sched_ext-for-7.0-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix stale direct dispatch state in ddsp_dsq_id sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCU
2026-04-03Merge branch 'for-7.0-fixes' into for-7.1Tejun Heo
Conflict in kernel/sched/ext.c between: 7e0ffb72de8a ("sched_ext: Fix stale direct dispatch state in ddsp_dsq_id") which clears ddsp state at individual call sites instead of dispatch_enqueue(), and sub-sched related code reorg and API updates on for-7.1. Resolved by applying the ddsp fix with for-7.1's signatures. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-03kernel: ksysfs: initialize kernel_kobj earlierBartosz Golaszewski
Software nodes depend on kernel_kobj which is initialized pretty late into the boot process - as a core_initcall(). Ahead of moving the software node initialization to driver_init() we must first make kernel_kobj available before it. Make ksysfs_init() visible in a new header - ksysfs.h - and call it in do_basic_setup() right before driver_init(). Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com> Link: https://patch.msgid.link/20260402-nokia770-gpio-swnodes-v5-1-d730db3dd299@oss.qualcomm.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-04-03sched_ext: Fix stale direct dispatch state in ddsp_dsq_idAndrea Righi
@p->scx.ddsp_dsq_id can be left set (non-SCX_DSQ_INVALID) triggering a spurious warning in mark_direct_dispatch() when the next wakeup's ops.select_cpu() calls scx_bpf_dsq_insert(), such as: WARNING: kernel/sched/ext.c:1273 at scx_dsq_insert_commit+0xcd/0x140 The root cause is that ddsp_dsq_id was only cleared in dispatch_enqueue(), which is not reached in all paths that consume or cancel a direct dispatch verdict. Fix it by clearing it at the right places: - direct_dispatch(): cache the direct dispatch state in local variables and clear it before dispatch_enqueue() on the synchronous path. For the deferred path, the direct dispatch state must remain set until process_ddsp_deferred_locals() consumes them. - process_ddsp_deferred_locals(): cache the dispatch state in local variables and clear it before calling dispatch_to_local_dsq(), which may migrate the task to another rq. - do_enqueue_task(): clear the dispatch state on the enqueue path (local/global/bypass fallbacks), where the direct dispatch verdict is ignored. - dequeue_task_scx(): clear the dispatch state after dispatch_dequeue() to handle both the deferred dispatch cancellation and the holding_cpu race, covering all cases where a pending direct dispatch is cancelled. - scx_disable_task(): clear the direct dispatch state when transitioning a task out of the current scheduler. Waking tasks may have had the direct dispatch state set by the outgoing scheduler's ops.select_cpu() and then been queued on a wake_list via ttwu_queue_wakelist(), when SCX_OPS_ALLOW_QUEUED_WAKEUP is set. Such tasks are not on the runqueue and are not iterated by scx_bypass(), so their direct dispatch state won't be cleared. Without this clear, any subsequent SCX scheduler that tries to direct dispatch the task will trigger the WARN_ON_ONCE() in mark_direct_dispatch(). Fixes: 5b26f7b920f7 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches") Cc: stable@vger.kernel.org # v6.12+ Cc: Daniel Hodges <hodgesd@meta.com> Cc: Patrick Somaru <patsomaru@meta.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-03Merge tag 'pm-7.0-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix a potential NULL pointer dereference in the energy model netlink interface and a potential double free in an error path in the common cpufreq governor management code: - Fix a NULL pointer dereference in the energy model netlink interface that may occur if a given perf domain ID is not recognized (Changwoo Min) - Avoid double free in the cpufreq_dbs_governor_init() error path when kobject_init_and_add() fails (Guangshuo Li)" * tag 'pm-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: governor: fix double free in cpufreq_dbs_governor_init() error path PM: EM: Fix NULL pointer dereference when perf domain ID is not found
2026-04-03bpf: Add helper and kfunc stack access size resolutionAlexei Starovoitov
The static stack liveness analysis needs to know how many bytes a helper or kfunc accesses through a stack pointer argument, so it can precisely mark the affected stack slots as stack 'def' or 'use'. Add bpf_helper_stack_access_bytes() and bpf_kfunc_stack_access_bytes() which resolve the access size for a given call argument. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03bpf: Move verifier helpers to headerAlexei Starovoitov
Move several helpers to header as preparation for the subsequent stack liveness patches. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03bpf: Add bpf_compute_const_regs() and bpf_prune_dead_branches() passesAlexei Starovoitov
Add two passes before the main verifier pass: bpf_compute_const_regs() is a forward dataflow analysis that tracks register values in R0-R9 across the program using fixed-point iteration in reverse postorder. Each register is tracked with a six-state lattice: UNVISITED -> CONST(val) / MAP_PTR(map_index) / MAP_VALUE(map_index, offset) / SUBPROG(num) -> UNKNOWN At merge points, if two paths produce the same state and value for a register, it stays; otherwise it becomes UNKNOWN. The analysis handles: - MOV, ADD, SUB, AND with immediate or register operands - LD_IMM64 for plain constants, map FDs, map values, and subprogs - LDX from read-only maps: constant-folds the load by reading the map value directly via bpf_map_direct_read() Results that fit in 32 bits are stored per-instruction in insn_aux_data and bitmasks. bpf_prune_dead_branches() uses the computed constants to evaluate conditional branches. When both operands of a conditional jump are known constants, the branch outcome is determined statically and the instruction is rewritten to an unconditional jump. The CFG postorder is then recomputed to reflect new control flow. This eliminates dead edges so that subsequent liveness analysis doesn't propagate through dead code. Also add runtime sanity check to validate that precomputed constants match the verifier's tracked state. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03bpf: Sort subprogs in topological order after check_cfg()Alexei Starovoitov
Add a pass that sorts subprogs in topological order so that iterating subprog_topo_order[] walks leaf subprogs first, then their callers. This is computed as a DFS post-order traversal of the CFG. The pass runs after check_cfg() to ensure the CFG has been validated before traversing and after postorder has been computed to avoid walking dead code. Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03bpf: Do register range validation earlyAlexei Starovoitov
Instead of checking src/dst range multiple times during the main verifier pass do them once. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260403024422.87231-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc6+Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-03sched/fair: Prevent negative lag increase during delayed dequeueVincent Guittot
Delayed dequeue feature aims to reduce the negative lag of a dequeued task while sleeping but it can happens that newly enqueued tasks will move backward the avg vruntime and increase its negative lag. When the delayed dequeued task wakes up, it has more neg lag compared to being dequeued immediately or to other tasks that have been dequeued just before theses new enqueues. Ensure that the negative lag of a delayed dequeued task doesn't increase during its delayed dequeued phase while waiting for its neg lag to diseappear. Similarly, we remove any positive lag that the delayed dequeued task could have gain during thsi period. Short slice tasks are particularly impacted in overloaded system. Test on snapdragon rb5: hackbench -T -p -l 16000000 -g 2 1> /dev/null & cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock -h 20000 -q The scheduling latency of cyclictest is: tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples | 115632 119733 119806 Average (us) | 364 64(-82%) 61(- 5%) Median (P50) (us) | 60 56(- 7%) 56( 0%) 90th Percentile (us) | 1166 62(-95%) 62( 0%) 99th Percentile (us) | 4192 73(-98%) 72(- 1%) 99.9th Percentile (us) | 8528 2707(-68%) 1300(-52%) Maximum (us) | 17735 14273(-20%) 13525(- 5%) Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260331162352.551501-1-vincent.guittot@linaro.org
2026-04-03sched/fair: Use sched_energy_enabled()Vincent Guittot
Use helper sched_energy_enabled() everywhere we want to test if EAS is enabled instead of mixing sched_energy_enabled() and direct call to static_branch_unlikely(). No functional change Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260327132013.2800517-1-vincent.guittot@linaro.org
2026-04-03sched: Handle blocked-waiter migration (and return migration)John Stultz
Add logic to handle migrating a blocked waiter to a remote cpu where the lock owner is runnable. Additionally, as the blocked task may not be able to run on the remote cpu, add logic to handle return migration once the waiting task is given the mutex. Because tasks may get migrated to where they cannot run, also modify the scheduling classes to avoid sched class migrations on mutex blocked tasks, leaving find_proxy_task() and related logic to do the migrations and return migrations. This was split out from the larger proxy patch, and significantly reworked. Credits for the original patch go to: Peter Zijlstra (Intel) <peterz@infradead.org> Juri Lelli <juri.lelli@redhat.com> Valentin Schneider <valentin.schneider@arm.com> Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-11-jstultz@google.com
2026-04-03sched: Move attach_one_task and attach_task helpers to sched.hJohn Stultz
The fair scheduler locally introduced attach_one_task() and attach_task() helpers, but these could be generically useful so move this code to sched.h so we can use them elsewhere. One minor tweak made to utilize guard(rq_lock)(rq) to simplifiy the function. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-10-jstultz@google.com
2026-04-03sched: Add logic to zap balance callbacks if we pick againJohn Stultz
With proxy-exec, a task is selected to run via pick_next_task(), and then if it is a mutex blocked task, we call find_proxy_task() to find a runnable owner. If the runnable owner is on another cpu, we will need to migrate the selected donor task away, after which we will pick_again can call pick_next_task() to choose something else. However, in the first call to pick_next_task(), we may have had a balance_callback setup by the class scheduler. After we pick again, its possible pick_next_task_fair() will be called which calls sched_balance_newidle() and sched_balance_rq(). This will throw a warning: [ 8.796467] rq->balance_callback && rq->balance_callback != &balance_push_callback [ 8.796467] WARNING: CPU: 32 PID: 458 at kernel/sched/sched.h:1750 sched_balance_rq+0xe92/0x1250 ... [ 8.796467] Call Trace: [ 8.796467] <TASK> [ 8.796467] ? __warn.cold+0xb2/0x14e [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] ? report_bug+0x107/0x1a0 [ 8.796467] ? handle_bug+0x54/0x90 [ 8.796467] ? exc_invalid_op+0x17/0x70 [ 8.796467] ? asm_exc_invalid_op+0x1a/0x20 [ 8.796467] ? sched_balance_rq+0xe92/0x1250 [ 8.796467] sched_balance_newidle+0x295/0x820 [ 8.796467] pick_next_task_fair+0x51/0x3f0 [ 8.796467] __schedule+0x23a/0x14b0 [ 8.796467] ? lock_release+0x16d/0x2e0 [ 8.796467] schedule+0x3d/0x150 [ 8.796467] worker_thread+0xb5/0x350 [ 8.796467] ? __pfx_worker_thread+0x10/0x10 [ 8.796467] kthread+0xee/0x120 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork+0x31/0x50 [ 8.796467] ? __pfx_kthread+0x10/0x10 [ 8.796467] ret_from_fork_asm+0x1a/0x30 [ 8.796467] </TASK> This is because if a RT task was originally picked, it will setup the rq->balance_callback with push_rt_tasks() via set_next_task_rt(). Once the task is migrated away and we pick again, we haven't processed any balance callbacks, so rq->balance_callback is not in the same state as it was the first time pick_next_task was called. To handle this, add a zap_balance_callbacks() helper function which cleans up the balance callbacks without running them. This should be ok, as we are effectively undoing the state set in the first call to pick_next_task(), and when we pick again, the new callback can be configured for the donor task actually selected. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-9-jstultz@google.com
2026-04-03sched: Add assert_balance_callbacks_empty helperJohn Stultz
With proxy-exec utilizing pick-again logic, we can end up having balance callbacks set by the preivous pick_next_task() call left on the list. So pull the warning out into a helper function, and make sure we check it when we pick again. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-8-jstultz@google.com
2026-04-03sched/locking: Add special p->blocked_on==PROXY_WAKING value for proxy ↵John Stultz
return-migration As we add functionality to proxy execution, we may migrate a donor task to a runqueue where it can't run due to cpu affinity. Thus, we must be careful to ensure we return-migrate the task back to a cpu in its cpumask when it becomes unblocked. Peter helpfully provided the following example with pictures: "Suppose we have a ww_mutex cycle: ,-+-* Mutex-1 <-. Task-A ---' | | ,-- Task-B `-> Mutex-2 *-+-' Where Task-A holds Mutex-1 and tries to acquire Mutex-2, and where Task-B holds Mutex-2 and tries to acquire Mutex-1. Then the blocked_on->owner chain will go in circles. Task-A -> Mutex-2 ^ | | v Mutex-1 <- Task-B We need two things: - find_proxy_task() to stop iterating the circle; - the woken task to 'unblock' and run, such that it can back-off and re-try the transaction. Now, the current code [without this patch] does: __clear_task_blocked_on(); wake_q_add(); And surely clearing ->blocked_on is sufficient to break the cycle. Suppose it is Task-B that is made to back-off, then we have: Task-A -> Mutex-2 -> Task-B (no further blocked_on) and it would attempt to run Task-B. Or worse, it could directly pick Task-B and run it, without ever getting into find_proxy_task(). Now, here is a problem because Task-B might not be runnable on the CPU it is currently on; and because !task_is_blocked() we don't get into the proxy paths, so nobody is going to fix this up. Ideally we would have dequeued Task-B alongside of clearing ->blocked_on, but alas, [the lock ordering prevents us from getting the task_rq_lock() and] spoils things." Thus we need more than just a binary concept of the task being blocked on a mutex or not. So allow setting blocked_on to PROXY_WAKING as a special value which specifies the task is no longer blocked, but needs to be evaluated for return migration *before* it can be run. This will then be used in a later patch to handle proxy return-migration. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-7-jstultz@google.com
2026-04-03sched: Fix modifying donor->blocked on without proper lockingJohn Stultz
Introduce an action enum in find_proxy_task() which allows us to handle work needed to be done outside the mutex.wait_lock and task.blocked_lock guard scopes. This ensures proper locking when we clear the donor's blocked_on pointer in proxy_deactivate(), and the switch statement will be useful as we add more cases to handle later in this series. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-6-jstultz@google.com
2026-04-03locking: Add task::blocked_lock to serialize blocked_on stateJohn Stultz
So far, we have been able to utilize the mutex::wait_lock for serializing the blocked_on state, but when we move to proxying across runqueues, we will need to add more state and a way to serialize changes to this state in contexts where we don't hold the mutex::wait_lock. So introduce the task::blocked_lock, which nests under the mutex::wait_lock in the locking order, and rework the locking to use it. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-5-jstultz@google.com
2026-04-03sched: Fix potentially missing balancing with Proxy ExecJohn Stultz
K Prateek pointed out that with Proxy Exec, we may have cases where we context switch in __schedule(), while the donor remains the same. This could cause balancing issues, since the put_prev_set_next() logic short-cuts if (prev == next). With proxy-exec prev is the previous donor, and next is the next donor. Should the donor remain the same, but different tasks are picked to actually run, the shortcut will have avoided enqueuing the sched class balance callback. So, if we are context switching, add logic to catch the same-donor case, and trigger the put_prev/set_next calls to ensure the balance callbacks get enqueued. Closes: https://lore.kernel.org/lkml/20ea3670-c30a-433b-a07f-c4ff98ae2379@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-4-jstultz@google.com
2026-04-03sched: Minimise repeated sched_proxy_exec() checkingJohn Stultz
Peter noted: Compilers are really bad (as in they utterly refuse) optimizing (even when marked with __pure) the static branch things, and will happily emit multiple identical in a row. So pull out the one obvious sched_proxy_exec() branch in __schedule() and remove some of the 'implicit' ones in that path. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260324191337.1841376-3-jstultz@google.com
2026-04-03sched: Make class_schedulers avoid pushing current, and get rid of ↵John Stultz
proxy_tag_curr() With proxy-execution, the scheduler selects the donor, but for blocked donors, we end up running the lock owner. This caused some complexity, because the class schedulers make sure to remove the task they pick from their pushable task lists, which prevents the donor from being migrated, but there wasn't then anything to prevent rq->curr from being migrated if rq->curr != rq->donor. This was sort of hacked around by calling proxy_tag_curr() on the rq->curr task if we were running something other then the donor. proxy_tag_curr() did a dequeue/enqueue pair on the rq->curr task, allowing the class schedulers to remove it from their pushable list. The dequeue/enqueue pair was wasteful, and additonally K Prateek highlighted that we didn't properly undo things when we stopped proxying, leaving the lock owner off the pushable list. After some alternative approaches were considered, Peter suggested just having the RT/DL classes just avoid migrating when task_on_cpu(). So rework pick_next_pushable_dl_task() and the rt pick_next_pushable_task() functions so that they skip over the first pushable task if it is on_cpu. Then just drop all of the proxy_tag_curr() logic. Fixes: be39617e38e0 ("sched: Fix proxy/current (push,pull)ability") Closes: https://lore.kernel.org/lkml/e735cae0-2cc9-4bae-b761-fcb082ed3e94@amd.com/ Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260324191337.1841376-2-jstultz@google.com
2026-04-02crash_dump/dm-crypt: don't print in arch-specific codeCoiby Xu
Patch series "kdump: Enable LUKS-encrypted dump target support in ARM64 and PowerPC", v5. CONFIG_CRASH_DM_CRYPT has been introduced to support LUKS-encrypted device dump target by addressing two challenges [1], - Kdump kernel may not be able to decrypt the LUKS partition. For some machines, a system administrator may not have a chance to enter the password to decrypt the device in kdump initramfs after the 1st kernel crashes - LUKS2 by default use the memory-hard Argon2 key derivation function which is quite memory-consuming compared to the limited memory reserved for kdump. To also enable this feature for ARM64 and PowerPC, we need to add a device tree property dmcryptkeys [2] as similar to elfcorehdr to pass the memory address of the stored info of dm-crypt keys to the kdump kernel. This patch (of 3): When the vmcore dumping target is not a LUKS-encrypted target, it's expected that there is no dm-crypt key thus no need to return -ENOENT. Also print more logs in crash_load_dm_crypt_keys. The benefit is arch-specific code can be more succinct. Link: https://lkml.kernel.org/r/20260225060347.718905-1-coxu@redhat.com Link: https://lkml.kernel.org/r/20260225060347.718905-2-coxu@redhat.com Link: https://lore.kernel.org/all/20250502011246.99238-1-coxu@redhat.com/ [1] Link: https://github.com/devicetree-org/dt-schema/pull/181 [2] Signed-off-by: Coiby Xu <coxu@redhat.com> Suggested-by: Will Deacon <will@kernel.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Arnaud Lefebvre <arnaud.lefebvre@clever-cloud.com> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Dave Young <dyoung@redhat.com> Cc: Kairui Song <ryncsn@gmail.com> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Staudt <tstaudt@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-02Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix register equivalence for pointers to packet (Alexei Starovoitov) - Fix incorrect pruning due to atomic fetch precision tracking (Daniel Borkmann) - Fix grace period wait for bpf_link-ed tracepoints (Kumar Kartikeya Dwivedi) - Fix use-after-free of sockmap's sk->sk_socket (Kuniyuki Iwashima) - Reject direct access to nullable PTR_TO_BUF pointers (Qi Tang) - Reject sleepable kprobe_multi programs at attach time (Varun R Mallya) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add more precision tracking tests for atomics bpf: Fix incorrect pruning due to atomic fetch precision tracking bpf: Reject sleepable kprobe_multi programs at attach time bpf: reject direct access to nullable PTR_TO_BUF pointers bpf: sockmap: Fix use-after-free of sk->sk_socket in sk_psock_verdict_data_ready(). bpf: Fix grace period wait for tracepoint bpf_link bpf: Fix regsafe() for pointers to packet
2026-04-02bpf: Simulate branches to prune based on range violationsHarishankar Vishwanathan
This patch fixes the invariant violations that can happen after we refine ranges & tnum based on an incorrectly-detected branch condition. For example, the branch is always true, but we miss it in is_branch_taken; we then refine based on the branch being false and end up with incoherent ranges (e.g. umax < umin). To avoid this, we can simulate the refinement on both branches. More specifically, this patch simulates both branches taken using regs_refine_cond_op and reg_bounds_sync. If the resulting register states are ill-formed on one of the branches, is_branch_taken can mark that branch as "never taken". On a more formal note, we can deduce a branch is not taken when regs_refine_cond_op or reg_bounds_sync returns an ill-formed state because the branch operators are sound (verified with Agni [1]). Soundness means that the verifier is guaranteed to produce sound outputs on the taken branches. On the non-taken branch (explored because of imprecision in the bounds), the verifier is free to produce any output. We use ill-formedness as a signal that the branch is dead and prune that branch. This patch moves the refinement logic for both branches from reg_set_min_max to their own function, simulate_both_branches_taken, which is called from is_scalar_branch_taken. As a result, reg_set_min_max now only runs sanity checks and has been renamed to reg_bounds_sanity_check_branches to reflect that. We have had five patches fixing specific cases of invariant violations in the past, all added with selftests: - commit fbc7aef517d8 ("bpf: Fix u32/s32 bounds when ranges cross min/max boundary") - commit efc11a667878 ("bpf: Improve bounds when tnum has a single possible value") - commit f41345f47fb2 ("bpf: Use tnums for JEQ/JNE is_branch_taken logic") - commit 00bf8d0c6c9b ("bpf: Improve bounds when s64 crosses sign boundary") - commit 6279846b9b25 ("bpf: Forget ranges when refining tnum after JSET") To confirm that this patch addresses all invariant violations, we have also reverted those five commits and verified that their related selftests don't cause any invariant violation warnings anymore. Those selftests still fail but only because of misdetected branches or less-precise bounds than expected. This demonstrates that the current patch is enough to avoid the invariant violation warning AND that the previous five patches are still useful to improve branch detection. In addition to the selftests, this change was also tested with the Cilium complexity test suite: all programs were successfully loaded and it didn't change the number of processed instructions. Link: https://github.com/bpfverif/agni [1] Reported-by: syzbot+c950cc277150935cc0b5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c950cc277150935cc0b5 Co-developed-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/a166b54a3cbbbdbcdf8a87f53045f1097176218b.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Exit early if reg_bounds_sync gets invalid inputsHarishankar Vishwanathan
In the subsequent commit, to prune dead branches we will rely on detecting ill-formed ranges using range_bounds_violations() (e.g., umin > umax) after refining register bounds using regs_refine_cond_op(). However, reg_bounds_sync() can sometimes "repair" ill-formed bounds, potentially masking a violation that was produced by regs_refine_cond_op(). This commit modifies reg_bounds_sync() to exit early if an invariant violation is already present in the input. This ensures ill-formed reg_states remain ill-formed after reg_bounds_sync(), allowing simulate_both_branches_taken() to correctly identify dead branches with a single check to range_bounds_violation(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/73127d628841c59cb7423d6bdcd204bf90bcdc80.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Use bpf_verifier_env buffers for reg_set_min_maxPaul Chaignon
In a subsequent patch, the regs_refine_cond_op and reg_bounds_sync functions will be called in is_branch_taken instead of reg_set_min_max, to simulate each branch's outcome. Since they will run before we branch out, these two functions will need to work on temporary registers for the two branches. This refactoring patch prepares for that change, by introducing the temporary registers on bpf_verifier_env and using them in reg_set_min_max. This change also allows us to save one fake_reg slot as we don't need to allocate an additional temporary buffer in case of a BPF_K condition. Finally, you may notice that this patch removes the check for "false_reg1 == false_reg2" in reg_set_min_max. That check was introduced in commit d43ad9da8052 ("bpf: Skip bounds adjustment for conditional jumps on same scalar register") to avoid an invariant violation. Given that "env->false_reg1 == env->false_reg2" doesn't make sense and invariant violations are addressed in a subsequent commit, this patch just removes the check. Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/260b0270052944a420e1c56e6a92df4d43cadf03.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Refactor reg_bounds_sanity_checkHarishankar Vishwanathan
This commit refactors reg_bounds_sanity_check to factor out the logic that performs the sanity check from the logic that does the reporting. Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/198ec3e69343e2c46dc9cbe2b1bc9be9ae2df5bd.1775142354.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()Michael Kelley
handle_percpu_devid_irq() is a version of handle_percpu_irq() but with the addition of a pointer to a per-CPU devid. However, handle_percpu_irq() invokes add_interrupt_randomness(), while handle_percpu_devid_irq() currently does not. Add the missing add_interrupt_randomness(), as it is needed when per-CPU interrupts with devid's are used in VMs for interrupts from the hypervisor. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260402202400.1707-2-mhklkml@zohomail.com
2026-04-02sched_ext: Fix is_bpf_migration_disabled() false negative on non-PREEMPT_RCUChangwoo Min
Since commit 8e4f0b1ebcf2 ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c"), the BPF prolog (__bpf_prog_enter) calls migrate_disable() only when CONFIG_PREEMPT_RCU is enabled, via rcu_read_lock_dont_migrate(). Without CONFIG_PREEMPT_RCU, the prolog never touches migration_disabled, so migration_disabled == 1 always means the task is truly migration-disabled regardless of whether it is the current task. The old unconditional p == current check was a false negative in this case, potentially allowing a migration-disabled task to be dispatched to a remote CPU and triggering scx_error in task_can_run_on_remote_rq(). Only apply the p == current disambiguation when CONFIG_PREEMPT_RCU is enabled, where the ambiguity with the BPF prolog still exists. Fixes: 8e4f0b1ebcf2 ("bpf: use rcu_read_lock_dont_migrate() for trampoline.c") Cc: stable@vger.kernel.org # v6.18+ Link: https://lore.kernel.org/lkml/20250821090609.42508-8-dongml2@chinatelecom.cn/ Signed-off-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-02sched_ext: Fix missing warning in scx_set_task_state() default caseSamuele Mariotti
In scx_set_task_state(), the default case was setting the warn flag, but then returning immediately. This is problematic because the only purpose of the warn flag is to trigger WARN_ONCE, but the early return prevented it from ever firing, leaving invalid task states undetected and untraced. To fix this, a WARN_ONCE call is now added directly in the default case. The fix addresses two aspects: - Guarantees the invalid task states are properly logged and traced. - Provides a distinct warning message ("sched_ext: Invalid task state") specifically for states outside the defined scx_task_state enum values, making it easier to distinguish from other transition warnings. This ensures proper detection and reporting of invalid states. Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-02tracing: Allow backup to save persistent ring buffer before it startsSteven Rostedt
When the persistent ring buffer was first introduced, it did not make sense to start tracing for it on the kernel command line. That's because if there was a crash, the start of events would invalidate the events from the previous boot that had the crash. But now that there's a "backup" instance that can take a snapshot of the persistent ring buffer when boot starts, it is possible to have the persistent ring buffer start events at boot up and not lose the old events. Update the code where the boot events start after all boot time instances are created. This will allow the backup instance to copy the persistent ring buffer from the previous boot, and allow the persistent ring buffer to start tracing new events for the current boot. reserve_mem=100M:12M:trace trace_instance=boot_mapped^@trace,sched trace_instance=backup=boot_mapped The above will create a boot_mapped persistent ring buffer and enabled the scheduler events. If there's a crash, a "backup" instance will be created holding the events of the persistent ring buffer from the previous boot, while the persistent ring buffer will once again start tracing scheduler events of the current boot. Now the user doesn't have to remember to start the persistent ring buffer. It will always have the events started at each boot. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260331163924.6ccb3896@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-02tracing: Remove the backup instance automatically after readMasami Hiramatsu (Google)
Since the backup instance is readonly, after reading all data via pipe, no data is left on the instance. Thus it can be removed safely after closing all files. This also removes it if user resets the ring buffer manually via 'trace' file. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502547711.1311542.12572973358010839400.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-02tracing: Make the backup instance non-reusableMasami Hiramatsu (Google)
Since there is no reason to reuse the backup instance, make it readonly (but erasable). Note that only backup instances are readonly, because other trace instances will be empty unless it is writable. Only backup instances have copy entries from the original. With this change, most of the trace control files are removed from the backup instance, including eventfs enable/filter etc. # find /sys/kernel/tracing/instances/backup/events/ | wc -l 4093 # find /sys/kernel/tracing/instances/boot_map/events/ | wc -l 9573 Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/177502546939.1311542.1826814401724828930.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-02ring-buffer: Enforce read ordering of trace_buffer cpumask and buffersVincent Donnefort
On CPU hotplug, if it is the first time a trace_buffer sees a CPU, a ring_buffer_per_cpu will be allocated and its corresponding bit toggled in the cpumask. Many readers check this cpumask to know if they can safely read the ring_buffer_per_cpu but they are doing so without memory ordering and may observe the cpumask bit set while having NULL buffer pointer. Enforce the memory read ordering by sending an IPI to all online CPUs. The hotplug path is a slow-path anyway and it saves us from adding read barriers in numerous call sites. Link: https://patch.msgid.link/20260401053659.3458961-1-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-04-02bpf: Fix incorrect pruning due to atomic fetch precision trackingDaniel Borkmann
When backtrack_insn encounters a BPF_STX instruction with BPF_ATOMIC and BPF_FETCH, the src register (or r0 for BPF_CMPXCHG) also acts as a destination, thus receiving the old value from the memory location. The current backtracking logic does not account for this. It treats atomic fetch operations the same as regular stores where the src register is only an input. This leads the backtrack_insn to fail to propagate precision to the stack location, which is then not marked as precise! Later, the verifier's path pruning can incorrectly consider two states equivalent when they differ in terms of stack state. Meaning, two branches can be treated as equivalent and thus get pruned when they should not be seen as such. Fix it as follows: Extend the BPF_LDX handling in backtrack_insn to also cover atomic fetch operations via is_atomic_fetch_insn() helper. When the fetch dst register is being tracked for precision, clear it, and propagate precision over to the stack slot. For non-stack memory, the precision walk stops at the atomic instruction, same as regular BPF_LDX. This covers all fetch variants. Before: 0: (b7) r1 = 8 ; R1=8 1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) mark_precise: frame0: regs=r2 stack= before 2: (b7) r2 = 0 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit After: 0: (b7) r1 = 8 ; R1=8 1: (7b) *(u64 *)(r10 -8) = r1 ; R1=8 R10=fp0 fp-8=8 2: (b7) r2 = 0 ; R2=0 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) ; R2=8 R10=fp0 fp-8=mmmmmmmm 4: (bf) r3 = r10 ; R3=fp0 R10=fp0 5: (0f) r3 += r2 mark_precise: frame0: last_idx 5 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r2 stack= before 4: (bf) r3 = r10 mark_precise: frame0: regs=r2 stack= before 3: (db) r2 = atomic64_fetch_add((u64 *)(r10 -8), r2) mark_precise: frame0: regs= stack=-8 before 2: (b7) r2 = 0 mark_precise: frame0: regs= stack=-8 before 1: (7b) *(u64 *)(r10 -8) = r1 mark_precise: frame0: regs=r1 stack= before 0: (b7) r1 = 8 6: R2=8 R3=fp8 6: (b7) r0 = 0 ; R0=0 7: (95) exit Fixes: 5ffa25502b5a ("bpf: Add instructions for atomic_[cmp]xchg") Fixes: 5ca419f2864a ("bpf: Add BPF_FETCH field / create atomic_fetch_add instruction") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260331222020.401848-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Reject sleepable kprobe_multi programs at attach timeVarun R Mallya
kprobe.multi programs run in atomic/RCU context and cannot sleep. However, bpf_kprobe_multi_link_attach() did not validate whether the program being attached had the sleepable flag set, allowing sleepable helpers such as bpf_copy_from_user() to be invoked from a non-sleepable context. This causes a "sleeping function called from invalid context" splat: BUG: sleeping function called from invalid context at ./include/linux/uaccess.h:169 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1787, name: sudo preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 0 Fix this by rejecting sleepable programs early in bpf_kprobe_multi_link_attach(), before any further processing. Fixes: 0dcac2725406 ("bpf: Add multi kprobe link") Signed-off-by: Varun R Mallya <varunrmallya@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260401191126.440683-1-varunrmallya@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: reject direct access to nullable PTR_TO_BUF pointersQi Tang
check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: 20b2aff4bc15 ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Migrate dynptr file to kmalloc_nolockMykyta Yatsenko
Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_nolock for bpf_dynptr_file_impl, continuing the migration away from bpf_mem_alloc now that kmalloc can be used from NMI context. freader_cleanup() runs before kfree_nolock() while the dynptr still holds exclusive access, so plain kfree_nolock() is safe — no concurrent readers can access the object. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-2-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02bpf: Migrate bpf_task_work to kmalloc_nolockMykyta Yatsenko
Replace bpf_mem_alloc/bpf_mem_free with kmalloc_nolock/kfree_rcu for bpf_task_work_ctx. Replace guard(rcu_tasks_trace)() with guard(rcu)() in bpf_task_work_irq(). The function only accesses ctx struct members (not map values), so tasks trace protection is not needed - regular RCU is sufficient since ctx is freed via kfree_rcu. The guard in bpf_task_work_callback() remains as tasks trace since it accesses map values from process context. Sleepable BPF programs hold rcu_read_lock_trace but not regular rcu_read_lock. Since kfree_rcu waits for a regular RCU grace period, the ctx memory can be freed while a sleepable program is still running. Add scoped_guard(rcu) around the pointer read and refcount tryget in bpf_task_work_acquire_ctx to close this race window. Since kfree_rcu uses call_rcu internally which is not safe from NMI context, defer destruction via irq_work when irqs are disabled. For the lost-cmpxchg path the ctx was never published, so kfree_nolock is safe. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260330-kmalloc_special-v2-1-c90403f92ff0@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-02cpufreq: Pass the policy to cpufreq_driver->adjust_perf()K Prateek Nayak
cpufreq_cpu_get() can sleep on PREEMPT_RT in presence of concurrent writer(s), however amd-pstate depends on fetching the cpudata via the policy's driver data which necessitates grabbing the reference. Since schedutil governor can call "cpufreq_driver->update_perf()" during sched_tick/enqueue/dequeue with rq_lock held and IRQs disabled, fetching the policy object using the cpufreq_cpu_get() helper in the scheduler fast-path leads to "BUG: scheduling while atomic" on PREEMPT_RT [1]. Pass the cached cpufreq policy object in sg_policy to the update_perf() instead of just the CPU. The CPU can be inferred using "policy->cpu". The lifetime of cpufreq_policy object outlasts that of the governor and the cpufreq driver (allocated when the CPU is onlined and only reclaimed when the CPU is offlined / the CPU device is removed) which makes it safe to be referenced throughout the governor's lifetime. Closes:https://lore.kernel.org/all/20250731092316.3191-1-spasswolf@web.de/ [1] Fixes: 1d215f0319c2 ("cpufreq: amd-pstate: Add fast switch function for AMD P-State") Reported-by: Bert Karwatzki <spasswolf@web.de> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Gary Guo <gary@garyguo.net> # Rust Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Zhongqiu Han <zhongqiu.han@oss.qualcomm.com> Link: https://lore.kernel.org/r/20260316081849.19368-3-kprateek.nayak@amd.com Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>