path: root/kernel
Age | Commit message | Author
2026-03-09 | tracing: Introduce trace remotes | Vincent Donnefort

A trace remote relies on ring-buffer remotes to read and control compatible tracing buffers, written by an entity such as firmware or a hypervisor.

Add a tracefs directory remotes/ that contains all instances of trace remotes. Each instance follows the same hierarchy as any other to ease support by existing user-space tools.

This currently does not provide any event support, which will come later.

Link: https://patch.msgid.link/20260309162516.2623589-6-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-09 | ring-buffer: Add non-consuming read for ring-buffer remotes | Vincent Donnefort

The remote is expected to swap pages only at the kernel's instruction (via the swap_reader_page() callback). This means we know at what point the ring-buffer geometry has changed. It is therefore possible to rearrange the kernel view of that ring-buffer to allow non-consuming reads.

Link: https://patch.msgid.link/20260309162516.2623589-5-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-09 | ring-buffer: Introduce ring-buffer remotes | Vincent Donnefort

Add ring-buffer remotes to support entities outside of the kernel (such as firmware or a hypervisor) that write events into a ring-buffer using the tracefs format.

Require a description of the ring-buffer pages (struct trace_buffer_desc) and callbacks (swap_reader_page and reset) to set up the ring-buffer on the kernel side. Expect the remote entity to maintain and update the meta-page.

Link: https://patch.msgid.link/20260309162516.2623589-4-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-09 | ring-buffer: Store bpage pointers into subbuf_ids | Vincent Donnefort

The subbuf_ids field points to a specific page of the ring-buffer based on its ID. As a preparation for the upcoming ring-buffer remote support, point this array to the buffer_page instead of the buffer_data_page.

Link: https://patch.msgid.link/20260309162516.2623589-3-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-09 | ring-buffer: Add page statistics to the meta-page | Vincent Donnefort

Add two fields pages_touched and pages_lost to the ring-buffer meta-page. Those fields are useful to get the number of used pages in the ring-buffer.

Link: https://patch.msgid.link/20260309162516.2623589-2-vdonnefort@google.com
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-09 | bpf: Always allow fmod_ret programs on syscalls | Viktor Malik

fmod_ret BPF programs can only be attached to selected functions. For convenience, the error injection list was originally used (along with functions prefixed with "security_"), which contains syscalls and several other functions. When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n), that list is empty and fmod_ret programs are effectively unavailable for most functions.

In such a case, at least enable fmod_ret programs on syscalls.

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/472310f9a5f4944ad03214e4d943a4830fd8eb76.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-09 | bpf: Always allow sleepable programs on syscalls | Viktor Malik

Sleepable BPF programs can only be attached to selected functions. For convenience, the error injection list was originally used, which contains syscalls and several other functions. When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n), that list is empty and sleepable tracing programs are effectively unavailable. In such a case, at least enable sleepable programs on syscalls. For a discussion of why syscalls were chosen, see [1].

To detect that a function is a syscall handler, we check for arch-specific prefixes for the most common architectures. Unfortunately, the prefixes are hard-coded in arch syscall code, so we need to hard-code them, too.

[1] https://lore.kernel.org/bpf/CAADnVQK6qP8izg+k9yV0vdcT-+=axtFQ2fKw7D-2Ei-V6WS5Dw@mail.gmail.com/

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/2704a8512746655037e3c02b471b31bd0d76c8db.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
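To make the prefix matching concrete, here is a minimal standalone C sketch of the idea; the helper name and the exact prefix list are illustrative assumptions, not the kernel's actual code:

    /* Hypothetical sketch: does a symbol name look like an arch syscall
     * handler? The prefix list below is assumed for illustration only. */
    #include <stdbool.h>
    #include <string.h>

    static bool looks_like_syscall_handler(const char *name)
    {
            static const char * const prefixes[] = {
                    "__x64_sys_", "__ia32_sys_", "__arm64_sys_", "__se_sys_",
            };

            for (size_t i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++)
                    if (!strncmp(name, prefixes[i], strlen(prefixes[i])))
                            return true;
            return false;
    }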
2026-03-09 | Merge branch 'for-7.0-fixes' into for-7.1 | Tejun Heo
2026-03-09 | sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer | zhidao su

scx_enable() uses double-checked locking to lazily initialize a static kthread_worker pointer. The fast path reads helper locklessly:

    if (!READ_ONCE(helper)) {    // lockless read -- no helper_mutex

The write side initializes helper under helper_mutex, but previously used a plain assignment:

    helper = kthread_run_worker(0, "scx_enable_helper");
    ^^^^^^ plain write -- KCSAN data race with READ_ONCE() above

Since READ_ONCE() on the fast path and the plain write on the initialization path access the same variable without a common lock, they constitute a data race. KCSAN requires that all sides of a lock-free access use READ_ONCE()/WRITE_ONCE() consistently.

Use a temporary variable to stage the result of kthread_run_worker(), and only WRITE_ONCE() into helper after confirming the pointer is valid. This avoids a window where a concurrent caller on the fast path could observe an ERR pointer via READ_ONCE(helper) before the error check completes.

Fixes: b06ccbabe250 ("sched_ext: Fix starvation of scx_enable() under fair-class saturation")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
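A minimal kernel-style sketch of the staged-publish pattern the patch describes, using the names from the commit message (surrounding code simplified away, not the exact kernel hunk):

    static struct kthread_worker *helper;
    static DEFINE_MUTEX(helper_mutex);

    static struct kthread_worker *scx_get_helper(void)
    {
            struct kthread_worker *w = READ_ONCE(helper);  /* lockless fast path */

            if (w)
                    return w;

            mutex_lock(&helper_mutex);
            if (!helper) {
                    /* stage in a local first ... */
                    w = kthread_run_worker(0, "scx_enable_helper");
                    /* ... and publish only after validating, so the fast
                     * path can never observe an ERR pointer */
                    if (!IS_ERR(w))
                            WRITE_ONCE(helper, w);
            }
            w = helper;
            mutex_unlock(&helper_mutex);
            return w;
    }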
2026-03-08 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.0-rc3 | Alexei Starovoitov

Cross-merge BPF and other fixes after downstream PR. No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-08 | locking/rwsem: Add context analysis | Peter Zijlstra

Add compiler context analysis annotations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260306101417.GT1282955@noisy.programming.kicks-ass.net
2026-03-08 | locking/rtmutex: Add context analysis | Peter Zijlstra

Add compiler context analysis annotations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121111213.851599178@infradead.org
2026-03-08 | locking/mutex: Add context analysis | Peter Zijlstra

Add compiler context analysis annotations.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121111213.745353747@infradead.org
2026-03-08 | locking/mutex: Remove the list_head from struct mutex | Matthew Wilcox (Oracle)

Instead of embedding a list_head in struct mutex, store a pointer to the first waiter. The list of waiters remains a doubly linked list so we can efficiently add to the tail of the list and remove from the front (or middle) of it. Some of the list manipulation becomes more complicated, but it's a reasonable tradeoff on the slow paths to shrink data structures which embed a mutex, like struct file.

Some of the debug checks have to be deleted because there's no equivalent to checking them in the new scheme (e.g. an empty waiter->list now means that it is the only waiter, not that the waiter is no longer on the list).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-4-willy@infradead.org
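The same first-waiter-pointer scheme is applied to semaphores and rwsems in the two commits below. A toy sketch of the idea (hypothetical types, not the kernel's actual code): the lock keeps a single pointer to the first waiter, the waiters link among themselves in a circular doubly linked list, and the tail stays reachable as first->list.prev:

    #include <linux/list.h>

    struct waiter {
            struct list_head list;
            /* task pointer etc. */
    };

    struct toy_lock {
            struct waiter *first;           /* NULL when there are no waiters */
    };

    static void add_waiter_tail(struct toy_lock *lock, struct waiter *w)
    {
            if (!lock->first) {
                    INIT_LIST_HEAD(&w->list);  /* sole waiter: a list of one */
                    lock->first = w;
            } else {
                    /* inserting before the first entry of a circular list
                     * appends at the tail */
                    list_add_tail(&w->list, &lock->first->list);
            }
    }

    static void remove_waiter(struct toy_lock *lock, struct waiter *w)
    {
            if (list_empty(&w->list)) {        /* w was the only waiter */
                    lock->first = NULL;
                    return;
            }
            if (lock->first == w)
                    lock->first = list_next_entry(w, list);
            list_del_init(&w->list);
    }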
2026-03-08 | locking/semaphore: Remove the list_head from struct semaphore | Matthew Wilcox (Oracle)

Instead of embedding a list_head in struct semaphore, store a pointer to the first waiter. The list of waiters remains a doubly linked list so we can efficiently add to the tail of the list and remove from the front (or middle) of the list. Some of the list manipulation becomes more complicated, but it's a reasonable tradeoff on the slow paths to shrink data structures which embed a semaphore.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-3-willy@infradead.org
2026-03-08 | locking/rwsem: Remove the list_head from struct rw_semaphore | Matthew Wilcox (Oracle)

Instead of embedding a list_head in struct rw_semaphore, store a pointer to the first waiter. The list of waiters remains a doubly linked list so we can efficiently add to the tail of the list and remove from the front (or middle) of the list. Some of the list manipulation becomes more complicated, but it's a reasonable tradeoff on the slow paths to shrink some core data structures like struct inode.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260305195545.3707590-2-willy@infradead.org
2026-03-07 | Revert "sched_ext: Use READ_ONCE() for the read side of dsq->nr update" | Tejun Heo

This reverts commit 9adfcef334bf9c6ef68eaecfca5f45d18614efe0.

dsq->nr is protected by dsq->lock and reading while holding the lock doesn't constitute a racy read.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: zhidao su <suzhidao@xiaomi.com>
2026-03-07 | Merge tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds

Pull timer fix from Ingo Molnar:

 "Make clock_adjtime() syscall timex validation slightly more permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes the ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME"

* tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Fix timex status validation for auxiliary clocks
2026-03-07 | Merge tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds

Pull scheduler fix from Ingo Molnar:

 "Fix a DL scheduler bug that may corrupt internal metrics during PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing"

* tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
2026-03-07 | Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | Linus Torvalds

Pull bpf fixes from Alexei Starovoitov:

 - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman)

 - Fix precision backtracking with linked registers (Eduard Zingerman)

 - Fix linker flags detection for resolve_btfids (Ihor Solodrai)

 - Fix race in update_ftrace_direct_add/del (Jiri Olsa)

 - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  resolve_btfids: Fix linker flags detection
  selftests/bpf: add reproducer for spurious precision propagation through calls
  bpf: collect only live registers in linked regs
  Revert "selftests/bpf: Update reg_bound range refinement logic"
  selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
  bpf: Fix u32/s32 bounds when ranges cross min/max boundary
  bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
  ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
2026-03-07 | sched_ext: Fix scx_bpf_reenqueue_local() silently reenqueuing nothing | Cheng-Yang Chou

ffa7ae0724e4 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()") introduced task_should_reenq() as a filter inside reenq_local(), requiring SCX_REENQ_ANY to be set in order to match any task. scx_bpf_dsq_reenq() handles this correctly by converting a bare reenq_flags=0 to SCX_REENQ_ANY, but scx_bpf_reenqueue_local() was not updated and continued to call reenq_local() with 0, causing it to silently reenqueue zero tasks.

Fix by passing SCX_REENQ_ANY directly.

Fixes: ffa7ae0724e4 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07 | Merge tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds

Pull tracing fixes from Steven Rostedt:

 - Fix possible NULL pointer dereference in trace_data_alloc()

   On the trace_data_alloc() error path, it can call trigger_data_free() with a NULL pointer. This used to be a kfree() but was changed to trigger_data_free() to clean up any partial initialization. The issue is that trigger_data_free() does not expect a NULL pointer. Have trigger_data_free() return safely on a NULL pointer.

 - Fix multiple events on the command line and bootconfig

   If multiple events are enabled on the command line separately and not grouped, only the last event gets enabled. That is:

     trace_event=sched_switch trace_event=sched_waking

   will only enable sched_waking, whereas:

     trace_event=sched_switch,sched_waking

   will enable both. The bootconfig makes it even worse as the second way is the more common method. The issue is that a temporary buffer is used to store the events to enable later in boot. Each time the cmdline callback is called, it overwrites what was previously there. Have the callback append the next value (delimited by a comma) if the temporary buffer already has content.

 - Fix command line trace_buffer_size if >= 2G

   The logic to allocate the trace buffer uses "int" for the size parameter in the command line code, causing overflow issues if more than 2G is specified.

* tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G
  tracing: Fix enabling multiple events on the kernel command line and bootconfig
  tracing: Add NULL pointer check to trigger_data_free()
2026-03-07 | sched_ext: Add SCX_TASK_REENQ_REASON flags | Tejun Heo

SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of p->scx.flags to communicate the reason during ops.enqueue():

 - NONE:  Not being reenqueued
 - KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends

More reasons will be added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Simplify task state handling | Tejun Heo

Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with unshifted values and then shifted when stored in scx_entity.flags. Simplify by defining them as pre-shifted values directly in scx_ent_flags and removing the separate scx_task_state enum. This removes the need for shifting when reading/writing state values.

scx_get_task_state() now returns the masked flags value directly. scx_set_task_state() accepts the pre-shifted state value. scx_dump_task() shifts down for display to maintain readable output.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
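An illustrative sketch of the pre-shifted encoding; the bit position, mask, and helper shapes below are assumptions for illustration, not the actual definitions in the sched_ext sources:

    #define SCX_TASK_STATE_SHIFT    8       /* assumed bit position */
    #define SCX_TASK_STATE_MASK     (0x3 << SCX_TASK_STATE_SHIFT)

    enum scx_ent_flags {
            /* states stored pre-shifted, no separate unshifted enum */
            SCX_TASK_STATE_NONE     = 0 << SCX_TASK_STATE_SHIFT,
            SCX_TASK_STATE_INIT     = 1 << SCX_TASK_STATE_SHIFT,
            SCX_TASK_STATE_READY    = 2 << SCX_TASK_STATE_SHIFT,
            SCX_TASK_STATE_ENABLED  = 3 << SCX_TASK_STATE_SHIFT,
    };

    static inline int scx_get_task_state(unsigned long flags)
    {
            return flags & SCX_TASK_STATE_MASK;     /* masked, not shifted */
    }

    static inline unsigned long
    scx_set_task_state(unsigned long flags, int state)
    {
            return (flags & ~SCX_TASK_STATE_MASK) | state;  /* pre-shifted */
    }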
2026-03-07 | sched_ext: Optimize schedule_dsq_reenq() with lockless fast path | Tejun Heo

schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue request. Add a lockless fast path to skip lock acquisition when the request is already pending with the required flags set.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
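Roughly, the fast path has this shape; the struct, field, and lock names below are hypothetical, and only the check-before-lock pattern is the point:

    static LIST_HEAD(deferred_reenq_list);
    static DEFINE_RAW_SPINLOCK(deferred_reenq_lock);

    struct reenq_req {
            struct list_head        node;
            unsigned int            flags;
            bool                    pending;
    };

    static void schedule_reenq_sketch(struct reenq_req *req, unsigned int flags)
    {
            /* lockless fast path: already queued with all required flags */
            if (READ_ONCE(req->pending) &&
                (READ_ONCE(req->flags) & flags) == flags)
                    return;

            raw_spin_lock(&deferred_reenq_lock);
            WRITE_ONCE(req->flags, req->flags | flags);
            if (!req->pending) {
                    WRITE_ONCE(req->pending, true);
                    list_add_tail(&req->node, &deferred_reenq_list);
            }
            raw_spin_unlock(&deferred_reenq_lock);
    }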
2026-03-07 | sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs | Tejun Heo

scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling.

Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and a deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task() | Tejun Heo

Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into nldsq_cursor_lost_task() to prepare for reuse.

As ->priv is only used to record dsq->seq for cursors, update INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq so that users don't have to read it manually. Move the scx_dsq_iter_flags enum earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV. bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Add per-CPU data to DSQs | Tejun Heo

Add a per-CPU data structure to dispatch queues. Each DSQ now has a percpu scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by future changes to implement per-CPU reenqueue tracking for user DSQs.

init_dsq() now allocates the percpu data and can fail, so it returns an error code. All callers are updated to handle failures. exit_dsq() is added to free the percpu data and is called from all DSQ cleanup paths. In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set afterwards.

v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea Righi).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
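A sketch of the per-CPU back-pointer setup described above; the field placement and helper names are assumptions for illustration:

    struct scx_dsq_pcpu {
            struct scx_dispatch_q   *dsq;   /* back-pointer to owning DSQ */
            /* per-CPU reenqueue-tracking fields are added later */
    };

    static int init_dsq_pcpu(struct scx_dispatch_q *dsq)
    {
            int cpu;

            /* alloc_percpu() needs GFP_KERNEL context, hence the call
             * before rcu_read_lock() in scx_bpf_create_dsq() */
            dsq->pcpu = alloc_percpu(struct scx_dsq_pcpu);
            if (!dsq->pcpu)
                    return -ENOMEM;

            for_each_possible_cpu(cpu)
                    per_cpu_ptr(dsq->pcpu, cpu)->dsq = dsq;
            return 0;
    }

    static void exit_dsq_pcpu(struct scx_dispatch_q *dsq)
    {
            free_percpu(dsq->pcpu);         /* NULL-safe */
    }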
2026-03-07 | sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq() | Tejun Heo

Add infrastructure to pass flags through the deferred reenqueue path. reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a deferred_reenq_local_flags field to accumulate flags from multiple scx_bpf_dsq_reenq() calls before processing. No flags are defined yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue | Tejun Heo

scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target any local DSQ, including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be expanded to support user DSQs by future changes.

scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future. Update compat.bpf.h with a compatibility shim and scx_qmap to test the new functionality.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Wrap deferred_reenq_local_node into a struct | Tejun Heo

Wrap the deferred_reenq_local_node list_head into struct scx_deferred_reenq_local. More fields will be added and this allows using a shorthand pointer to access them. No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Convert deferred_reenq_locals from llist to regular list | Tejun Heo

The deferred reenqueue local mechanism uses an llist (lockless list) for collecting schedulers that need their local DSQs re-enqueued. Convert to a regular list protected by a raw_spinlock.

The llist was used for its lockless properties, but the upcoming changes to support remote reenqueue require more complex list operations that are difficult to implement correctly with lockless data structures. A spinlock-protected regular list provides the necessary flexibility.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Relocate run_deferred() and its callees | Tejun Heo

Previously, both process_ddsp_deferred_locals() and reenq_local() required forward declarations. Reorganize so that only run_deferred() needs to be declared. Both callees are grouped right before run_deferred() for better locality. This reduces forward declaration clutter and will ease adding more to the run_deferred() path. No functional changes.

v2: Also relocate process_ddsp_deferred_locals() next to run_deferred() (Daniel Jordan).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Change find_global_dsq() to take CPU number instead of task | Tejun Heo

Change find_global_dsq() to take a CPU number directly instead of a task pointer. This prepares for callers where the CPU is available but the task is not. No functional changes.

v2: Rename tcpu to cpu in find_global_dsq() (Emil Tsalapatis).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Factor out pnode allocation and deallocation into helpers | Tejun Heo

Extract pnode allocation and deallocation logic into alloc_pnode() and free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares for adding more per-node initialization and cleanup in subsequent patches. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Wrap global DSQs in per-node structure | Tejun Heo

Global DSQs are currently stored as an array of scx_dispatch_q pointers, one per NUMA node. To allow adding more per-node data structures, wrap the global DSQ in scx_sched_pnode and replace global_dsqs with a pnode array. NUMA-aware allocation is maintained. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc section | Tejun Heo

Move the scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the end of the kfunc section, before __bpf_kfunc_end_defs(), for cleaner code organization. No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07 | sched_ext: Pass full dequeue flags to ops.quiescent() | Andrea Righi

ops.quiescent() is invoked with the same deq_flags as ops.dequeue(), so the BPF scheduler is able to distinguish sleep vs property changes in both callbacks. However, dequeue_task_scx() receives deq_flags as an int from the sched_class interface, so SCX flags at bit 32 and above (%SCX_DEQ_SCHED_CHANGE) are truncated. ops_dequeue() reconstructs the full u64 for ops.dequeue(), but ops.quiescent() is still called with the original int and can never see %SCX_DEQ_SCHED_CHANGE.

Fix this by constructing the full u64 deq_flags in dequeue_task_scx() (renaming the int parameter to core_deq_flags) and passing the complete flags to both ops_dequeue() and ops.quiescent().

Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07 | Merge branch 'for-7.0-fixes' into for-7.1 | Tejun Heo

Pull in 57ccf5ccdc56 ("sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags") which conflicts with ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics").

Signed-off-by: Tejun Heo <tj@kernel.org>

# Conflicts:
#   kernel/sched/ext.c
2026-03-07 | sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags | Tejun Heo

enqueue_task_scx() takes int enq_flags from the sched_class interface. SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are silently truncated when passed through activate_task(). extra_enq_flags was added as a workaround, storing high bits in rq->scx.extra_enq_flags and OR-ing them back in enqueue_task_scx(). However, the OR target is still the int parameter, so the high bits are lost anyway.

The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT, which is informational to the BPF scheduler: its loss means the scheduler doesn't know about preemption but doesn't cause incorrect behavior.

Fix by renaming the int parameter to core_enq_flags and introducing a u64 enq_flags local that merges both sources. All downstream functions already take u64 enq_flags.

Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class")
Cc: stable@vger.kernel.org # v6.12+
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
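The shape of the fix, per the commit message; the signature and downstream call are simplified sketches, not the exact kernel hunk:

    static void enqueue_task_scx(struct rq *rq, struct task_struct *p,
                                 int core_enq_flags)
    {
            /* zero-extend the 32-bit core flags and merge in the stashed
             * upper SCX bits; previously the OR target was still an int,
             * so bits 32+ were dropped */
            u64 enq_flags = (u32)core_enq_flags | rq->scx.extra_enq_flags;

            /* downstream helpers already take u64 enq_flags */
            do_enqueue_task(rq, p, enq_flags, -1);
    }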
2026-03-06 | bpf: collect only live registers in linked regs | Eduard Zingerman

Fix an inconsistency between func_states_equal() and collect_linked_regs():

 - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping.
 - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers().
 - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop.
 - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history.
 - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise.

The above might lead to the following scenario:

 - There is an instruction I with register rY known to be dead at I.
 - Instruction I is reached via two paths: first A, then B.
 - On path A:
   - There is an id link between registers rX and rY.
   - Checkpoint C is created at I.
   - Linked register set {rX, rY} is saved to the jump history.
   - rX is marked as precise at I, causing both rX and rY to be marked precise at C.
 - On path B:
   - There is no id link between registers rX and rY; otherwise register states are sub-states of those in C.
   - Because rY is dead at I, check_ids() returns true.
   - The current state is considered equal to checkpoint C, and propagate_precision() propagates a spurious precision mark for register rY along path B.
 - Depending on the program, this might hit verifier_bug() in backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call.

The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code.

Changes in tests:

 - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks.
 - exceptions_assert.c adjusted to match changes in the verifier log; R0 is dead after the conditional instruction and thus does not get a range.
 - precise.c adjusted to match changes in the verifier log; register r9 is dead after the comparison and its range is not important for the test.

Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-07 | padata: Remove cpu online check from cpu add and removal | Chuyi Zhou

During the CPU offline process, the dying CPU is cleared from the cpu_online_mask in takedown_cpu(). After this step, various CPUHP_*_DEAD callbacks are executed to perform cleanup jobs for the dead CPU, so the cpu online check in padata_cpu_dead() is unnecessary.

Similarly, when executing padata_cpu_online() during the CPUHP_AP_ONLINE_DYN phase, the CPU has already been set in the cpu_online_mask; that happens even earlier than the CPUHP_AP_ONLINE_IDLE stage.

Remove the unnecessary cpu online checks in __padata_add_cpu() and __padata_remove_cpu().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-03-06 | tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G | Calvin Owens

Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long.

It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org
Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used")
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
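For reference, the cmdline side roughly follows this shape once the int is gone (a simplified sketch modeled on the trace_buf_size= __setup callback; the default value is illustrative):

    static unsigned long trace_buf_size = 1441792UL;

    static int __init set_buf_size(char *str)
    {
            unsigned long buf_size;

            if (!str)
                    return 0;
            /* memparse() understands K/M/G suffixes; keeping the result
             * in an unsigned long avoids the int truncation at >= 2G */
            buf_size = memparse(str, &str);
            if (buf_size)
                    trace_buf_size = buf_size;
            return 1;
    }
    __setup("trace_buf_size=", set_buf_size);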
2026-03-06 | bpf: Fix u32/s32 bounds when ranges cross min/max boundary | Eduard Zingerman

Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations:

- s32 range crosses the U32_MAX/0 boundary, and the positive part of the s32 range overlaps with the u32 range:

    0                                                      U32_MAX
    |      [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]          |
    |----------------------------|----------------------------|
    |xxxxx s32 range xxxxxxxxx]                      [xxxxxxx|
    0                      S32_MAX S32_MIN                  -1

- s32 range crosses the U32_MAX/0 boundary, and the negative part of the s32 range overlaps with the u32 range:

    0                                                      U32_MAX
    |      [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx]          |
    |----------------------------|----------------------------|
    |xxxxxxxxx]                     [xxxxxxxxxxxx s32 range   |
    0      S32_MAX                S32_MIN                   -1

- No refinement if the ranges overlap in two intervals.

This helps in cases like the following program:

    call %[bpf_get_prandom_u32];
    w0 &= 0xffffffff;
    if w0 < 0x3 goto 1f;   // on fall-through u32 range [3..U32_MAX]
    if w0 s> 0x1 goto 1f;  // on fall-through s32 range [S32_MIN..1]
    if w0 s< 0x0 goto 1f;  // range can be narrowed to [S32_MIN..-1]
    r10 = 0;
1:  ...;

The reg_bounds.c selftest is updated to incorporate identical logic: refinement based on non-overflowing range halves,

    ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin, -1]) ∩ (y ∩ [smin, -1]))

Reported-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
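A toy C model of the refinement described above; the struct and function are hypothetical, and the real logic in __reg32_deduce_bounds() also refines the u32 side:

    #include <stdint.h>

    struct bounds32 {
            int32_t  smin, smax;    /* signed 32-bit range   */
            uint32_t umin, umax;    /* unsigned 32-bit range */
    };

    static void refine_s32_from_u32(struct bounds32 *r)
    {
            /* only interesting when the s32 range crosses the U32_MAX/0
             * boundary, i.e. smin < 0 <= smax */
            if (r->smin >= 0 || r->smax < 0)
                    return;

            if (r->umax <= (uint32_t)r->smax) {
                    /* u32 range fits entirely in the positive half
                     * [0, smax]: the negative half is infeasible */
                    r->smin = (int32_t)r->umin;
                    r->smax = (int32_t)r->umax;
            } else if (r->umin >= (uint32_t)r->smin) {
                    /* u32 range fits entirely in the negative half
                     * [smin, -1] (as unsigned: [(u32)smin, U32_MAX]) */
                    r->smin = (int32_t)r->umin;
                    r->smax = (int32_t)r->umax;
            }
            /* otherwise the ranges overlap in two intervals: no refinement */
    }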
2026-03-06 | cgroup: Don't expose dead tasks in cgroup | Sebastian Andrzej Siewior

Once a task exits, it has its state set to TASK_DEAD and is then removed from the cgroup it belonged to. The last step happens once the task gets out of its last schedule() invocation, and is delayed on PREEMPT_RT due to locking constraints. As a result it is possible to receive a pid via waitpid() for a task which is still listed in cgroup.procs of the cgroup it belonged to. This is something that systemd does not expect, and as a result it waits for the task's exit until a timeout occurs. This can also be reproduced on a !PREEMPT_RT kernel with a significant delay in do_exit() after exit_notify().

Hide tasks which have PF_EXITING set from the output; the flag is set before the parent is notified. Keeping zombies with live threads shouldn't break anything (suggested by Tejun).

Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Cc: stable@vger.kernel.org # v6.19+
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
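The fix boils down to a visibility filter of this shape in the cgroup.procs iterator (a simplified sketch with a hypothetical helper name, not the exact kernel hunk):

    static bool cgroup_task_visible(struct task_struct *task)
    {
            /* PF_EXITING is set before the parent is notified, so a task
             * already reapable via waitpid() never shows up in the list */
            return !(task->flags & PF_EXITING);
    }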
2026-03-06 | tracing: Fix enabling multiple events on the kernel command line and bootconfig | Andrei-Alexandru Tachici

Multiple events can be enabled on the kernel command line via a comma separator. But if they are specified one at a time, then only the last event is enabled. This is because the event names are saved in a temporary buffer, and each call by the init cmdline code will reset that buffer. This also affects names in the boot config file, as it may call the callback multiple times, for example with:

    kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss"

Change the cmdline callback function to append a comma and the next value if the temporary buffer already has content.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com
Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
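A sketch of the append behavior (simplified model of the trace_event= __setup callback; the buffer size is illustrative):

    static char bootup_event_buf[512] __initdata;

    static int __init set_trace_events(char *str)
    {
            /* append instead of overwrite, so repeated trace_event=
             * options and bootconfig values accumulate */
            if (bootup_event_buf[0])
                    strlcat(bootup_event_buf, ",", sizeof(bootup_event_buf));
            strlcat(bootup_event_buf, str, sizeof(bootup_event_buf));
            return 1;
    }
    __setup("trace_event=", set_trace_events);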
2026-03-06 | tracing: Add NULL pointer check to trigger_data_free() | Guenter Roeck

If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free().

The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y.

Cc: Miaoqian Lin <linmq006@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net
Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()")
Assisted-by: Gemini:gemini-3.1-pro
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
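The shape of the fix, making the free function NULL-safe like kfree() (a simplified sketch; the real function performs more teardown than shown):

    static void trigger_data_free(struct event_trigger_data *data)
    {
            if (!data)      /* tolerate the trigger_data_alloc() error path */
                    return;

            if (data->cmd_ops->set_filter)
                    data->cmd_ops->set_filter(NULL, data, NULL);
            /* ... remaining teardown ... */
            kfree(data);
    }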
2026-03-06 | sched_ext: Add basic building blocks for nested sub-scheduler dispatching | Tejun Heo

This is an early-stage partial implementation that demonstrates the core building blocks for nested sub-scheduler dispatching. While significant work remains in the enqueue path and other areas, this patch establishes the fundamental mechanisms needed for hierarchical scheduler operation.

The key building blocks introduced include:

 - Private stack support for ops.dispatch() to prevent stack overflow when walking down nested schedulers during dispatch operations
 - scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger dispatch operations on their direct child schedulers
 - Proper parent-child relationship validation to ensure dispatch requests are only made to legitimate child schedulers
 - Updated scx_dispatch_sched() to handle both nested and non-nested invocations with appropriate kf_mask handling

The qmap scheduler is updated to demonstrate the functionality by calling scx_bpf_sub_dispatch() on registered child schedulers when it has no tasks in its own queues.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06 | sched_ext: Add rhashtable lookup for sub-schedulers | Tejun Heo

Add rhashtable-based lookup for sub-schedulers indexed by cgroup_id to enable efficient scheduler discovery in preparation for multiple scheduler support. The hash table allows quick lookup of the appropriate scheduler instance when processing tasks from different cgroups.

This extends scx_link_sched() to register sub-schedulers in the hash table and scx_unlink_sched() to remove them. A new scx_find_sub_sched() function provides the lookup interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
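A sketch of the rhashtable indexing described above; the struct layout and params are assumptions for illustration, not the actual definitions in ext.c:

    struct scx_sched {
            u64                     cgrp_id;        /* lookup key */
            struct rhash_head       hash_node;
            /* ... */
    };

    static const struct rhashtable_params scx_sched_ht_params = {
            .key_len     = sizeof_field(struct scx_sched, cgrp_id),
            .key_offset  = offsetof(struct scx_sched, cgrp_id),
            .head_offset = offsetof(struct scx_sched, hash_node),
    };

    static struct scx_sched *scx_find_sub_sched(struct rhashtable *ht, u64 cgrp_id)
    {
            return rhashtable_lookup_fast(ht, &cgrp_id, scx_sched_ht_params);
    }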
2026-03-06 | sched_ext: Factor out scx_link_sched() and scx_unlink_sched() | Tejun Heo

Factor out scx_link_sched() and scx_unlink_sched() functions to reduce code duplication in the scheduler enable/disable paths. No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>