summaryrefslogtreecommitdiff
path: root/kernel/sched/ext.c
AgeCommit message (Collapse)Author
4 dayssched: Audit MOVE vs balance_callbacksPeter Zijlstra
The {DE,EN}QUEUE_MOVE flag indicates a task is allowed to change priority, which means there could be balance callbacks queued. Therefore audit all MOVE users and make sure they do run balance callbacks before dropping rq-lock. Fixes: 6455ad5346c9 ("sched: Move sched_class::prio_changed() into the change pattern") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260114130528.GB831285@noisy.programming.kicks-ass.net
2025-12-22sched_ext: Use the resched_cpu() to replace resched_curr() in the ↵Zqiang
bypass_lb_node() For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the preemptible per-CPU ktimer kthread context, this means that the following scenarios will occur(for x86 platform): cpu1 cpu2 ktimer kthread: ->scx_bypass_lb_timerfn ->bypass_lb_node ->for_each_cpu(cpu, resched_mask) migration/1: by preempt by migration/2: multi_cpu_stop() multi_cpu_stop() ->take_cpu_down() ->__cpu_disable() ->set cpu1 offline ->rq1 = cpu_rq(cpu1) ->resched_curr(rq1) ->smp_send_reschedule(cpu1) ->native_smp_send_reschedule(cpu1) ->if(unlikely(cpu_is_offline(cpu))) { WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu); return; } This commit therefore use the resched_cpu() to replace resched_curr() in the bypass_lb_node() to avoid send-ipi to offline CPUs. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19sched_ext: Fix some comments in ext.cZqiang
This commit update balance_scx() in the comments to balance_one(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-16sched_ext: fix uninitialized ret on alloc_percpu() failureLiang Jie
Smatch reported: kernel/sched/ext.c:5332 scx_alloc_and_add_sched() warn: passing zero to 'ERR_PTR' In scx_alloc_and_add_sched(), the alloc_percpu() failure path jumps to err_free_gdsqs without initializing @ret. That can lead to returning ERR_PTR(0), which violates the ERR_PTR() convention and confuses callers. Set @ret to -ENOMEM before jumping to the error path when alloc_percpu() fails. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202512141601.yAXDAeA9-lkp@intel.com/ Reported-by: Dan Carpenter <error27@gmail.com> Fixes: c201ea1578d3 ("sched_ext: Move event_stats_cpu into scx_sched") Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-15sched_ext: Remove unused code in the do_pick_task_scx()Zqiang
The kick_idle variable is no longer used, this commit therefore remove it and also remove associated code in the do_pick_task_scx(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12sched_ext: Fix missing post-enqueue handling in move_local_task_to_local_dsq()Tejun Heo
move_local_task_to_local_dsq() is used when moving a task from a non-local DSQ to a local DSQ on the same CPU. It directly manipulates the local DSQ without going through dispatch_enqueue() and was missing the post-enqueue handling that triggers preemption when SCX_ENQ_PREEMPT is set or the idle task is running. The function is used by move_task_between_dsqs() which backs scx_bpf_dsq_move() and may be called while the CPU is busy. Add local_dsq_post_enq() call to move_local_task_to_local_dsq(). As the dispatch path doesn't need post-enqueue handling, add SCX_RQ_IN_BALANCE early exit to keep consume_dispatch_q() behavior unchanged and avoid triggering unnecessary resched when scx_bpf_dsq_move() is used from the dispatch path. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-12sched_ext: Factor out local_dsq_post_enq() from dispatch_enqueue()Tejun Heo
Factor out local_dsq_post_enq() which performs post-enqueue handling for local DSQs - triggering resched_curr() if SCX_ENQ_PREEMPT is specified or if the current CPU is idle. No functional change. This will be used by the next patch to fix move_local_task_to_local_dsq(). Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-11sched_ext: Fix bypass depth leak on scx_enable() failureTejun Heo
scx_enable() calls scx_bypass(true) to initialize in bypass mode and then scx_bypass(false) on success to exit. If scx_enable() fails during task initialization - e.g. scx_cgroup_init() or scx_init_task() returns an error - it jumps to err_disable while bypass is still active. scx_disable_workfn() then calls scx_bypass(true/false) for its own bypass, leaving the bypass depth at 1 instead of 0. This causes the system to remain permanently in bypass mode after a failed scx_enable(). Failures after task initialization is complete - e.g. scx_tryset_enable_state() at the end - already call scx_bypass(false) before reaching the error path and are not affected. This only affects a subset of failure modes. Fix it by tracking whether scx_enable() called scx_bypass(true) in a bool and having scx_disable_workfn() call an extra scx_bypass(false) to clear it. This is a temporary measure as the bypass depth will be moved into the sched instance, which will make this tracking unnecessary. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/stable/286e6f7787a81239e1ce2989b52391ce%40kernel.org Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08sched/ext: Avoid null ptr traversal when ->put_prev_task() is called with ↵John Stultz
NULL next Early when trying to get sched_ext and proxy-exe working together, I kept tripping over NULL ptr in put_prev_task_scx() on the line: if (sched_class_above(&ext_sched_class, next->sched_class)) { Which was due to put_prev_task() passes a NULL next, calling: prev->sched_class->put_prev_task(rq, prev, NULL); put_prev_task_scx() already guards for a NULL next in the switch_class case, but doesn't seem to have a guard for sched_class_above() check. I can't say I understand why this doesn't trip usually without proxy-exec. And in newer kernels there are way fewer put_prev_task(), and I can't easily reproduce the issue now even with proxy-exec. But we still have one put_prev_task() call left in core.c that seems like it could trip this, so I wanted to send this out for consideration. tj: put_prev_task() can be called with NULL @next; however, when @p is queued, that doesn't happen, so this condition shouldn't currently be triggerable. The connection isn't straightforward or necessarily reliable, so add the NULL check even if it can't currently be triggered. Link: http://lkml.kernel.org/r/20251206022218.1541878-1-jstultz@google.com Signed-off-by: John Stultz <jstultz@google.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08sched_ext: Fix the memleak for sch->helper objectsZqiang
This commit use kthread_destroy_worker() to release sch->helper objects to fix the following kmemleak: unreferenced object 0xffff888121ec7b00 (size 128): comm "scx_simple", pid 1197, jiffies 4295884415 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 ad 4e ad de .............N.. ff ff ff ff 00 00 00 00 ff ff ff ff ff ff ff ff ................ backtrace (crc 587b3352): kmemleak_alloc+0x62/0xa0 __kmalloc_cache_noprof+0x28d/0x3e0 kthread_create_worker_on_node+0xd5/0x1f0 scx_enable.isra.210+0x6c2/0x25b0 bpf_scx_reg+0x12/0x20 bpf_struct_ops_link_create+0x2c3/0x3b0 __sys_bpf+0x3102/0x4b00 __x64_sys_bpf+0x79/0xc0 x64_sys_call+0x15d9/0x1dd0 do_syscall_64+0xf0/0x470 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Cc: stable@vger.kernel.org # v6.16+ Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-03Merge tag 'sched_ext-for-6.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext updates from Tejun Heo: - Improve recovery from misbehaving BPF schedulers. When a scheduler puts many tasks with varying affinity restrictions on a shared DSQ, CPUs scanning through tasks they cannot run can overwhelm the system, causing lockups. Bypass mode now uses per-CPU DSQs with a load balancer to avoid this, and hooks into the hardlockup detector to attempt recovery. Add scx_cpu0 example scheduler to demonstrate this scenario. - Add lockless peek operation for DSQs to reduce lock contention for schedulers that need to query queue state during load balancing. - Allow scx_bpf_reenqueue_local() to be called from anywhere in preparation for deprecating cpu_acquire/release() callbacks in favor of generic BPF hooks. - Prepare for hierarchical scheduler support: add scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() kfuncs, make scx_bpf_dsq_insert*() return bool, and wrap kfunc args in structs for future aux__prog parameter. - Implement cgroup_set_idle() callback to notify BPF schedulers when a cgroup's idle state changes. - Fix migration tasks being incorrectly downgraded from stop_sched_class to rt_sched_class across sched_ext enable/disable. Applied late as the fix is low risk and the bug subtle but needs stable backporting. - Various fixes and cleanups including cgroup exit ordering, SCX_KICK_WAIT reliability, and backward compatibility improvements. * tag 'sched_ext-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (44 commits) sched_ext: Fix incorrect sched_class settings for per-cpu migration tasks sched_ext: tools: Removing duplicate targets during non-cross compilation sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs sched_ext: Update comments replacing breather with aborting mechanism sched_ext: Implement load balancer for bypass mode sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR sched_ext: Add scx_cpu0 example scheduler sched_ext: Hook up hardlockup detector sched_ext: Make handle_lockup() propagate scx_verror() result sched_ext: Refactor lockup handlers into handle_lockup() sched_ext: Make scx_exit() and scx_vexit() return bool sched_ext: Exit dispatch and move operations immediately when aborting sched_ext: Simplify breather mechanism with scx_aborting flag sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode sched_ext: Refactor do_enqueue_task() local and global DSQ paths sched_ext: Use shorter slice in bypass mode sched_ext: Mark racy bitfields to prevent adding fields that can't tolerate races sched_ext: Minor cleanups to scx_task_iter ...
2025-12-01Merge tag 'sched-core-2025-12-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Scalability and load-balancing improvements: - Enable scheduler feature NEXT_BUDDY (Mel Gorman) - Reimplement NEXT_BUDDY to align with EEVDF goals (Mel Gorman) - Skip sched_balance_running cmpxchg when balance is not due (Tim Chen) - Implement generic code for architecture specific sched domain NUMA distances (Tim Chen) - Optimize the NUMA distances of the sched-domains builds of Intel Granite Rapids (GNR) and Clearwater Forest (CWF) platforms (Tim Chen) - Implement proportional newidle balance: a randomized algorithm that runs newidle balancing proportional to its success rate. (Peter Zijlstra) Scheduler infrastructure changes: - Implement the 'sched_change' scoped_guard() pattern for the entire scheduler (Peter Zijlstra) - More broadly utilize the sched_change guard (Peter Zijlstra) - Add support to pick functions to take runqueue-flags (Joel Fernandes) - Provide and use set_need_resched_current() (Peter Zijlstra) Fair scheduling enhancements: - Forfeit vruntime on yield (Fernand Sieber) - Only update stats for allowed CPUs when looking for dst group (Adam Li) CPU-core scheduling enhancements: - Optimize core cookie matching check (Fernand Sieber) Deadline scheduler fixes: - Only set free_cpus for online runqueues (Doug Berger) - Fix dl_server time accounting (Peter Zijlstra) - Fix dl_server stop condition (Peter Zijlstra) Proxy scheduling fixes: - Yield the donor task (Fernand Sieber) Fixes and cleanups: - Fix do_set_cpus_allowed() locking (Peter Zijlstra) - Fix migrate_disable_switch() locking (Peter Zijlstra) - Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() (Hao Jia) - Increase sched_tick_remote timeout (Phil Auld) - sched/deadline: Use cpumask_weight_and() in dl_bw_cpus() (Shrikanth Hegde) - sched/deadline: Clean up select_task_rq_dl() (Shrikanth Hegde)" * tag 'sched-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits) sched: Provide and use set_need_resched_current() sched/fair: Proportional newidle balance sched/fair: Small cleanup to update_newidle_cost() sched/fair: Small cleanup to sched_balance_newidle() sched/fair: Revert max_newidle_lb_cost bump sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals sched/fair: Enable scheduler feature NEXT_BUDDY sched: Increase sched_tick_remote timeout sched/fair: Have SD_SERIALIZE affect newidle balancing sched/fair: Skip sched_balance_running cmpxchg when balance is not due sched/deadline: Minor cleanup in select_task_rq_dl() sched/deadline: Use cpumask_weight_and() in dl_bw_cpus sched/deadline: Document dl_server sched/deadline: Fix dl_server stop condition sched/deadline: Fix dl_server time accounting sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked() sched/eevdf: Fix min_vruntime vs avg_vruntime sched/core: Add comment explaining force-idle vruntime snapshots sched/core: Optimize core cookie matching check sched/proxy: Yield the donor task ...
2025-12-01sched_ext: Fix incorrect sched_class settings for per-cpu migration tasksZqiang
When loading the ebpf scheduler, the tasks in the scx_tasks list will be traversed and invoke __setscheduler_class() to get new sched_class. however, this would also incorrectly set the per-cpu migration task's->sched_class to rt_sched_class, even after unload, the per-cpu migration task's->sched_class remains sched_rt_class. The log for this issue is as follows: ./scx_rustland --stats 1 [ 199.245639][ T630] sched_ext: "rustland" does not implement cgroup cpu.weight [ 199.269213][ T630] sched_ext: BPF scheduler "rustland" enabled 04:25:09 [INFO] RustLand scheduler attached bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/ { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }' Attaching 1 probe... migration/0:24->rt_sched_class+0x0/0xe0 migration/1:27->rt_sched_class+0x0/0xe0 migration/2:33->rt_sched_class+0x0/0xe0 migration/3:39->rt_sched_class+0x0/0xe0 migration/4:45->rt_sched_class+0x0/0xe0 migration/5:52->rt_sched_class+0x0/0xe0 migration/6:58->rt_sched_class+0x0/0xe0 migration/7:64->rt_sched_class+0x0/0xe0 sched_ext: BPF scheduler "rustland" disabled (unregistered from user space) EXIT: unregistered from user space 04:25:21 [INFO] Unregister RustLand scheduler bpftrace -e 'iter:task /strcontains(ctx->task->comm, "migration")/ { printf("%s:%d->%pS\n", ctx->task->comm, ctx->task->pid, ctx->task->sched_class); }' Attaching 1 probe... migration/0:24->rt_sched_class+0x0/0xe0 migration/1:27->rt_sched_class+0x0/0xe0 migration/2:33->rt_sched_class+0x0/0xe0 migration/3:39->rt_sched_class+0x0/0xe0 migration/4:45->rt_sched_class+0x0/0xe0 migration/5:52->rt_sched_class+0x0/0xe0 migration/6:58->rt_sched_class+0x0/0xe0 migration/7:64->rt_sched_class+0x0/0xe0 This commit therefore generate a new scx_setscheduler_class() and add check for stop_sched_class to replace __setscheduler_class(). Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20sched_ext: Fix scx_enable() crash on helper kthread creation failureSaket Kumar Bhaskar
A crash was observed when the sched_ext selftests runner was terminated with Ctrl+\ while test 15 was running: NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0 LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0 Call Trace: scx_enable.constprop.0+0x32c/0x12b0 (unreliable) bpf_struct_ops_link_create+0x18c/0x22c __sys_bpf+0x23f8/0x3044 sys_bpf+0x2c/0x6c system_call_exception+0x124/0x320 system_call_vectored_common+0x15c/0x2ec kthread_run_worker() returns an ERR_PTR() on failure rather than NULL, but the current code in scx_alloc_and_add_sched() only checks for a NULL helper. Incase of failure on SIGQUIT, the error is not handled in scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an error pointer. Error handling is fixed in scx_alloc_and_add_sched() to propagate PTR_ERR() into ret, so that scx_enable() jumps to the existing error path, avoiding random dereference on failure. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Cc: stable@vger.kernel.org # v6.16+ Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-17sched_ext: Use kvfree_rcu() to release per-cpu ksyncs objectZqiang
The free_kick_syncs_rcu() rcu-callback only invoke kvfree() to release per-cpu ksyncs object, this can use kvfree_rcu() replace call_rcu() to release per-cpu ksyncs object in the free_kick_syncs(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-17sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_workZqiang
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in the per-cpu irq_work/* task context and there is no rcu-read critical section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to initialize the per-cpu rq->scx.kick_cpus_irq_work in the init_sched_ext_class(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-14sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docsTejun Heo
With the buddy lockup detector, smp_processor_id() returns the detecting CPU, not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the locked CPU number from watchdog_hardlockup_check() as a parameter instead. Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and scx_rcu_cpu_stall() documenting their return value semantics. Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-13sched_ext: Fix possible deadlock in the deferred_irq_workfn()Zqiang
For PREEMPT_RT=y kernels, the deferred_irq_workfn() is executed in the per-cpu irq_work/* task context and not disable-irq, if the rq returned by container_of() is current CPU's rq, the following scenarios may occur: lock(&rq->__lock); <Interrupt> lock(&rq->__lock); This commit use IRQ_WORK_INIT_HARD() to replace init_irq_work() to initialize rq->scx.deferred_irq_work, make the deferred_irq_workfn() is always invoked in hard-irq context. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Update comments replacing breather with aborting mechanismAndrea Righi
Commit 5ebec443fb96a ("sched_ext: Exit dispatch and move operations immediately when aborting") replaced the breather mechanism with the scx_aborting flag. Update comments removing references to the breather mechanism to avoid confusion. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched/ext: convert scx_tasks_lock to raw spinlockEmil Tsalapatis
Update scx_task_locks so that it's safe to lock/unlock in a non-sleepable context in PREEMPT_RT kernels. scx_task_locks is (non-raw) spinlock used to protect the list of tasks under SCX. This list is updated during from finish_task_switch(), which cannot sleep. Regular spinlocks can be locked in such a context in non-RT kernels, but are sleepable under when CONFIG_PREEMPT_RT=y. Convert scx_task_locks into a raw spinlock, which is not sleepable even on RT kernels. Sample backtrace: <TASK> dump_stack_lvl+0x83/0xa0 __might_resched+0x14a/0x200 rt_spin_lock+0x61/0x1c0 ? sched_ext_dead+0x2d/0xf0 ? lock_release+0xc6/0x280 sched_ext_dead+0x2d/0xf0 ? srso_alias_return_thunk+0x5/0xfbef5 finish_task_switch.isra.0+0x254/0x360 __schedule+0x584/0x11d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? tick_nohz_idle_exit+0x7e/0x120 schedule_idle+0x23/0x40 cpu_startup_entry+0x29/0x30 start_secondary+0xf8/0x100 common_startup_64+0x13e/0x148 </TASK> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Implement load balancer for bypass modeTejun Heo
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well in most cases, there is a failure mode where a BPF scheduler can skew task placement severely before triggering bypass in highly over-saturated systems. If most tasks end up concentrated on a few CPUs, those CPUs can accumulate queues that are too long to drain in a reasonable time, leading to RCU stalls and hung tasks. Implement a simple timer-based load balancer that redistributes tasks across CPUs within each NUMA node. The balancer runs periodically (default 500ms, tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded CPUs to underloaded ones. When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks to avoid dropping and reacquiring the donor DSQ lock on each iteration, as donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load balancer timer function reads scx_bypass_depth locklessly to check whether bypass mode is active. Use WRITE_ONCE() when updating scx_bypass_depth to pair with the READ_ONCE() in the timer function. This has been tested on a 192 CPU dual socket AMD EPYC machine with ~20k runnable tasks running scx_cpu0. As scx_cpu0 queues all tasks to CPU0, almost all tasks end up on CPU0 creating severe imbalance. Without the load balancer, disabling the scheduler can lead to RCU stalls and hung tasks, taking a very long time to complete. With the load balancer, disable completes in about a second. The load balancing operation can be monitored using the sched_ext_bypass_lb tracepoint and disabled by setting bypass_lb_intv_us to 0. v2: Lock both rq and DSQ in bypass_lb_cpu() and use dispatch_dequeue_locked() to prevent races with dispatch_dequeue() (Andrea Righi). Cc: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed_by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Factor out abbreviated dispatch dequeue into ↵Tejun Heo
dispatch_dequeue_locked() move_task_between_dsqs() contains open-coded abbreviated dequeue logic when moving tasks between non-local DSQs. Factor this out into dispatch_dequeue_locked() which can be used when both the task's rq and dsq locks are already held. Add lockdep assertions to both dispatch_dequeue() and the new helper to verify locking requirements. This prepares for the load balancer which will need the same abbreviated dequeue pattern. Cc: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Factor out scx_dsq_list_node cursor initialization into ↵Tejun Heo
INIT_DSQ_LIST_CURSOR Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR macro in preparation for additional users. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Hook up hardlockup detectorTejun Heo
A poorly behaving BPF scheduler can trigger hard lockup. For example, on a large system with many tasks pinned to different subsets of CPUs, if the BPF scheduler puts all tasks in a single DSQ and lets all CPUs at it, the DSQ lock can be contended to the point where hardlockup triggers. Unfortunately, hardlockup can be the first signal out of such situations, thus requiring hardlockup handling. Hook scx_hardlockup() into the hardlockup detector to try kicking out the current scheduler in an attempt to recover the system to a good state. The handling strategy can delay watchdog taking its own action by one polling period; however, given that the only remediation for hardlockup is crash, this is likely an acceptable trade-off. v2: Add missing dummy scx_hardlockup() definition for !CONFIG_SCHED_CLASS_EXT (kernel test bot). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Cc: Douglas Anderson <dianders@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Make handle_lockup() propagate scx_verror() resultTejun Heo
handle_lockup() currently calls scx_verror() but ignores its return value, always returning true when the scheduler is enabled. Make it capture and return the result from scx_verror(). This prepares for hardlockup handling. Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Emil Tsalapatis <etsal@meta.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Refactor lockup handlers into handle_lockup()Tejun Heo
scx_rcu_cpu_stall() and scx_softlockup() share the same pattern: check if the scheduler is enabled under RCU read lock and trigger an error if so. Extract the common pattern into handle_lockup() helper. Add scx_verror() macro and use guard(rcu)(). This simplifies both handlers, reduces code duplication, and prepares for hardlockup handling. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Make scx_exit() and scx_vexit() return boolTejun Heo
Make scx_exit() and scx_vexit() return bool indicating whether the calling thread successfully claimed the exit. This will be used by the abort mechanism added in a later patch. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Exit dispatch and move operations immediately when abortingTejun Heo
62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced the breather mechanism to inject delays during bypass mode switching. It maintains operation semantics unchanged while reducing lock contention to avoid live-locks on large NUMA systems. However, the breather only activates when exiting the scheduler, so there's no need to maintain operation semantics. Simplify by exiting dispatch and move operations immediately when scx_aborting is set. In consume_dispatch_q(), break out of the task iteration loop. In scx_dsq_move(), return early before acquiring locks. This also fixes cases the breather mechanism cannot handle. When a large system has many runnable threads affinitized to different CPU subsets and the BPF scheduler places them all into a single DSQ, many CPUs can scan the DSQ concurrently for tasks they can run. This can cause DSQ and RQ locks to be held for extended periods, leading to various failure modes. The breather cannot solve this because once in the consume loop, there's no exit. The new mechanism fixes this by exiting the loop immediately. The bypass DSQ is exempted to ensure the bypass mechanism itself can make progress. v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Simplify breather mechanism with scx_aborting flagTejun Heo
The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the ops breather and eject BPF scheduler on softlockup") to prevent live-locks by injecting delays when CPUs are trapped in dispatch paths. Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup (unsigned long) with separate increment/decrement and cleanup operations. The breather is only activated when aborting, so tie it directly to the exit mechanism. Replace both variables with scx_aborting flag set when exit is claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to consolidate exit_kind claiming and breather enablement. This eliminates scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass(). The breather mechanism will be replaced by a different abort mechanism in a future patch. This simplification prepares for that change. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass modeTejun Heo
Bypass mode routes tasks through fallback dispatch queues. Originally a single global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node") changed this to per-node DSQs to resolve NUMA-related livelocks. Dan Schatzberg found per-node DSQs can still livelock when many threads are pinned to different small CPU subsets: each CPU must scan many incompatible tasks to find runnable ones, causing severe contention with high CPU counts. Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default idle CPU selection and direct dispatch handle most cases well. This introduces a failure mode when tasks concentrate on one CPU in over-saturated systems. If the BPF scheduler severely skews placement before triggering bypass, that CPU's queue may be too long to drain, causing RCU stalls. A load balancer in a future patch will address this. The bypass DSQ is separate from local DSQ to enable load balancing: local DSQs use rq locks, preventing efficient scanning and transfer across CPUs, especially problematic when systems are already contended. v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Refactor do_enqueue_task() local and global DSQ pathsTejun Heo
The local and global DSQ enqueue paths in do_enqueue_task() share the same slice refill logic. Factor out the common code into a shared enqueue label. This makes adding new enqueue cases easier. No functional changes. Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use shorter slice in bypass modeTejun Heo
There have been reported cases of bypass mode not making forward progress fast enough. The 20ms default slice is unnecessarily long for bypass mode where the primary goal is ensuring all tasks can make forward progress. Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically switch to it when entering bypass mode. Also make the bypass slice value tunable through the slice_bypass_us module parameter (adjustable between 100us and 100ms) to make it easier to test whether slice durations are a factor in problem cases. v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan). v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea). Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Fix unsafe locking in the scx_dump_state()Zqiang
For built with CONFIG_PREEMPT_RT=y kernels, the dump_lock will be converted sleepable spinlock and not disable-irq, so the following scenarios occur: inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage. irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes: (&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40 {IN-HARDIRQ-W} state was registered at: lock_acquire+0x1e1/0x510 _raw_spin_lock_nested+0x42/0x80 raw_spin_rq_lock_nested+0x2b/0x40 sched_tick+0xae/0x7b0 update_process_times+0x14c/0x1b0 tick_periodic+0x62/0x1f0 tick_handle_periodic+0x48/0xf0 timer_interrupt+0x55/0x80 __handle_irq_event_percpu+0x20a/0x5c0 handle_irq_event_percpu+0x18/0xc0 handle_irq_event+0xb5/0x150 handle_level_irq+0x220/0x460 __common_interrupt+0xa2/0x1e0 common_interrupt+0xb0/0xd0 asm_common_interrupt+0x2b/0x40 _raw_spin_unlock_irqrestore+0x45/0x80 __setup_irq+0xc34/0x1a30 request_threaded_irq+0x214/0x2f0 hpet_time_init+0x3e/0x60 x86_late_time_init+0x5b/0xb0 start_kernel+0x308/0x410 x86_64_start_reservations+0x1c/0x30 x86_64_start_kernel+0x96/0xa0 common_startup_64+0x13e/0x148 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&rq->__lock); <Interrupt> lock(&rq->__lock); *** DEADLOCK *** stack backtrace: CPU: 0 UID: 0 PID: 27 Comm: irq_work/0 Call Trace: <TASK> dump_stack_lvl+0x8c/0xd0 dump_stack+0x14/0x20 print_usage_bug+0x42e/0x690 mark_lock.part.44+0x867/0xa70 ? __pfx_mark_lock.part.44+0x10/0x10 ? string_nocheck+0x19c/0x310 ? number+0x739/0x9f0 ? __pfx_string_nocheck+0x10/0x10 ? __pfx_check_pointer+0x10/0x10 ? kvm_sched_clock_read+0x15/0x30 ? sched_clock_noinstr+0xd/0x20 ? local_clock_noinstr+0x1c/0xe0 __lock_acquire+0xc4b/0x62b0 ? __pfx_format_decode+0x10/0x10 ? __pfx_string+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_vsnprintf+0x10/0x10 lock_acquire+0x1e1/0x510 ? raw_spin_rq_lock_nested+0x2b/0x40 ? __pfx_lock_acquire+0x10/0x10 ? dump_line+0x12e/0x270 ? raw_spin_rq_lock_nested+0x20/0x40 _raw_spin_lock_nested+0x42/0x80 ? raw_spin_rq_lock_nested+0x2b/0x40 raw_spin_rq_lock_nested+0x2b/0x40 scx_dump_state+0x3b3/0x1270 ? finish_task_switch+0x27e/0x840 scx_ops_error_irq_workfn+0x67/0x80 irq_work_single+0x113/0x260 irq_work_run_list.part.3+0x44/0x70 run_irq_workd+0x6b/0x90 ? __pfx_run_irq_workd+0x10/0x10 smpboot_thread_fn+0x529/0x870 ? __pfx_smpboot_thread_fn+0x10/0x10 kthread+0x305/0x3f0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x40/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> This commit therefore use rq_lock_irqsave/irqrestore() to replace rq_lock/unlock() in the scx_dump_state(). Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11sched/proxy: Yield the donor taskFernand Sieber
When executing a task in proxy context, handle yields as if they were requested by the donor task. This matches the traditional PI semantics of yield() as well. This avoids scenario like proxy task yielding, pick next task selecting the same previous blocked donor, running the proxy task again, etc. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202510211205.1e0f5223-lkp@intel.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251106104022.195157-1-sieberf@amazon.com
2025-11-04sched_ext: Minor cleanups to scx_task_iterTejun Heo
- Use memset() in scx_task_iter_start() instead of zeroing fields individually. - In scx_task_iter_next(), move __scx_task_iter_maybe_relock() after the batch check which is simpler. - Update comment to reflect that tasks are removed from scx_tasks when dead (commit 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-04sched_ext: Move __SCX_DSQ_ITER_ALL_FLAGS BUILD_BUG_ON to the right placeTejun Heo
The BUILD_BUG_ON() which checks that __SCX_DSQ_ITER_ALL_FLAGS doesn't overlap with the private lnode bits was in scx_task_iter_start() which has nothing to do with DSQ iteration. Move it to bpf_iter_scx_dsq_new() where it belongs. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-03sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to ↵Tejun Heo
finish_task_switch() sched_ext_free() was called from __put_task_struct() when the last reference to the task is dropped, which could be long after the task has finished running. This causes cgroup-related problems: - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d during scheduler load, because the cgroup might be destroyed/unlinked while the zombie or dead task is still lingering on the scx_tasks list. - ops.cgroup_exit() could be called before ops.exit_task() is called on all member tasks, leading to incorrect exit ordering. Fix by moving it to finish_task_switch() to be called right after the final context switch away from the dying task, matching when sched_class->task_dead() is called. Rename it to sched_ext_dead() to match the new calling context. By calling sched_ext_dead() before cgroup_task_dead(), we ensure that: - Tasks visible on scx_tasks list have valid cgroups during scheduler load, as cgroup_mutex prevents cgroup destruction while the task is still linked. - All member tasks have ops.exit_task() called and are removed from scx_tasks before the cgroup can be destroyed and trigger ops.cgroup_exit(). This fix is made possible by the cgroup_task_dead() split in the previous patch. This also makes more sense resource-wise as there's no point in keeping scheduler side resources around for dead tasks. Reported-by: Dan Schatzberg <dschatzberg@meta.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhereTejun Heo
The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization: 1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task() when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling. 2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable. Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction. The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption. Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()Tejun Heo
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the BPF kfunc checks and calls reenq_local() to perform the actual work. This is a prep patch to allow reenq_local() to be called from other contexts. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29sched_ext: Split schedule_deferred() into locked and unlocked variantsTejun Heo
schedule_deferred() currently requires the rq lock to be held so that it can use scheduler hooks for efficiency when available. However, there are cases where deferred actions need to be scheduled from contexts that don't hold the rq lock. Split into schedule_deferred() which can be called from any context and just queues irq_work, and schedule_deferred_locked() which requires the rq lock and can optimize by using scheduler hooks when available. Update the existing call site to use the _locked variant. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29Merge branch 'for-6.18-fixes' into for-6.19Tejun Heo
Pull to receive f4fa7c25f632 ("sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()") which conflicts with changes for planned sub-sched changes.
2025-10-29sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()Andrea Righi
scx_bpf_cpuperf_set() has a typo where it dereferences the local variable @sch, instead of the global @scx_root pointer. Fix by dereferencing the correct variable. Fixes: 956f2b11a8a4f ("sched_ext: Drop kf_cpu_valid()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29Merge branch 'linus/master' into sched/core, to resolve conflictPeter Zijlstra
Conflicts: kernel/sched/ext.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-28sched_ext: Use SCX_TASK_READY test instead of tryget_task_struct() during ↵Tejun Heo
class switch ddf7233fcab6 ("sched/ext: Fix invalid task state transitions on class switch") added tryget_task_struct() test during scx_enable()'s class switching loop. The reason for the addition was to avoid enabling tasks which skipped prep in the previous loop due to being dead. While tryget_task_struct() does work for this purpose as tasks that fail tryget always will fail it, it's a bit roundabout. A more direct way is testing whether the task is in READY state. Switch to testing SCX_TASK_READY directly. Cc: Andrea Righi <arighi@nvidia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-24sched_ext: Fix scx_bpf_dsq_peek() with FIFO DSQsAndrea Righi
When removing a task from a FIFO DSQ, we must delete it from the list before updating dsq->first_task, otherwise the following lookup will just re-read the same task, leaving first_task pointing to removed entry. This issue only affects DSQs operating in FIFO mode, as priority DSQs correctly update the rbtree before re-evaluating the new first task. Remove the item from the list before refreshing the first task to guarantee the correct behavior in FIFO DSQs. Fixes: 44f5c8ec5b9ad ("sched_ext: Add lockless peek operation for DSQs") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()Tejun Heo
The find_user_dsq() function is called from contexts that are already under RCU read lock protection. Switch from rhashtable_lookup_fast() to rhashtable_lookup() to avoid redundant RCU locking. Requires: bee8a520eb84 ("rhashtable: Use rcu_dereference_all and rcu_dereference_all_check") Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22sched_ext: Rename pnt_seq to kick_syncTejun Heo
The pnt_seq field and related infrastructure were originally named for "pick next task sequence", reflecting their original implementation in scx_next_task_picked(). However, the sequence counter is now incremented in both put_prev_task_scx() and pick_task_scx() and its purpose is to synchronize kick operations via SCX_KICK_WAIT, not specifically to track pick_next_task events. Rename to better reflect the actual semantics: - pnt_seq -> kick_sync - scx_kick_pseqs -> scx_kick_syncs - pseqs variables -> ksyncs - Update comments to refer to "kick_sync sequence" instead of "pick_task sequence" This is a pure renaming with no functional changes. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22sched_ext: Fix SCX_KICK_WAIT to work reliablyTejun Heo
SCX_KICK_WAIT is used to synchronously wait for the target CPU to complete a reschedule and can be used to implement operations like core scheduling. This used to be implemented by scx_next_task_picked() incrementing pnt_seq, which was always called when a CPU picks the next task to run, allowing SCX_KICK_WAIT to reliably wait for the target CPU to enter the scheduler and pick the next task. However, commit b999e365c298 ("sched_ext: Replace scx_next_task_picked() with switch_class()") replaced scx_next_task_picked() with the switch_class() callback, which is only called when switching between sched classes. This broke SCX_KICK_WAIT because pnt_seq would no longer be reliably incremented unless the previous task was SCX and the next task was not. This fix leverages commit 4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()") which refactored the pick path making put_prev_task_scx() the natural place to track task switches for SCX_KICK_WAIT. The fix moves pnt_seq increment to put_prev_task_scx() and also increments it in pick_task_scx() to handle cases where the same task is re-selected, whether by BPF scheduler decision or slice refill. The semantics: If the current task on the target CPU is SCX, SCX_KICK_WAIT waits until the CPU enters the scheduling path. This provides sufficient guarantee for use cases like core scheduling while keeping the operation self-contained within SCX. v2: - Also increment pnt_seq in pick_task_scx() to handle same-task re-selection (Andrea Righi). - Use smp_cond_load_acquire() for the busy-wait loop for better architecture optimization (Peter Zijlstra). Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: http://lkml.kernel.org/r/228ebd9e6ed3437996dffe15735a9caa@honor.com Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-22sched_ext: Don't kick CPUs running higher classesTejun Heo
When a sched_ext scheduler tries to kick a CPU, the CPU may be running a higher class task. sched_ext has no control over such CPUs. A sched_ext scheduler couldn't have expected to get access to the CPU after kicking it anyway. Skip kicking when the target CPU is running a higher class. Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-21sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibilityTejun Heo
cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___suffix on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol. Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed. v2: Use ___v2 instead of ___new. Fixes: cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") Signed-off-by: Tejun Heo <tj@kernel.org>