linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2026-05-10	sched_ext: Fix ops->priv clobber on concurrent attach/detach	Andrea Righi
	Under heavy concurrent attach/detach operations, scx_claim_exit() can trigger a NULL pointer dereference. This can be reproduced running the reload_loop kselftests inside a virtme-ng session: $ vng -v -- ./tools/testing/selftests/sched_ext/runner -t reload_loop ... BUG: kernel NULL pointer dereference, address: 0000000000000400 RIP: 0010:scx_claim_exit+0x3b/0x120 Call Trace: <TASK> bpf_scx_unreg+0x45/0xb0 bpf_struct_ops_map_link_dealloc+0x39/0x50 bpf_link_release+0x18/0x20 __fput+0x10b/0x2e0 __x64_sys_close+0x47/0xa0 The underlying race (diagnosed by Tejun Heo) is a stomp of @ops->priv, not a missing NULL check: T2 unreg(K) T1 reg(K) ----------- --------- sch = ops->priv = sch_b800 scx_disable; flush_disable_work [scx_root_disable: scx_root=NULL, mutex_unlock, state=DISABLED] mutex_lock; state ok scx_alloc_and_add_sched: ops->priv = sch_a800 scx_root = sch_a800; init=0 state=ENABLED; mutex_unlock [flush returns] RCU_INIT_POINTER(ops->priv, NULL) <-- clobbers sch_a800 kobject_put(sch_b800) T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock window and starts a fresh attach on the same kdata, assigning sch_a800 to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone but state stays SCX_ENABLED, so all future attaches fail with -EBUSY permanently. The next bpf_scx_unreg() on that kdata then reads NULL @ops->priv and dereferences it in scx_claim_exit(). Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and scx_sub_enable_workfn(), after the existing state check and still under scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This rejects an attempt to reuse a kdata that is still bound to a previous scheduler instance, closing the race without changing the unreg side. Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10	cgroup/dmem: Return -ENOMEM on failed pool preallocation	Guopeng Zhang
	get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the locked lookup and creation path. If the fallback allocation fails too, pool remains NULL. Since the loop condition is while (!pool), the function can keep retrying instead of propagating the allocation failure to the caller. Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the loop exits through the existing common return path. The callers already handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the expected error path. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10	Merge branch 'for-7.1-fixes' into for-7.2	Tejun Heo
	Conflict between: [1] 41e3312861ea ("sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup") [2] c941d7391f25 ("sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN") in scx_root_enable_workfn()'s post-init block. [1] added a tid hash insertion under a scoped_guard() for scx_tasks_lock; [2] wraps the same region in task_rq_lock() for a DEAD recheck. A naive merge would invert the iter's outer/inner order. [3] f25ad1e3cbaa ("sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn()") was added to for-7.2 for a clean resolution: scx_task_iter_relock(iter, p) takes both scx_tasks_lock and @p's rq lock in iter order. Resolved by routing both sides through [3]'s dual-lock helper: the post-init region runs under a single scx_task_iter_relock() acquisition, with [2]'s state machine and [1]'s hash insert in sequence inside it. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10	sched_ext: Add scx_task_iter_relock() and use it in scx_root_enable_workfn()	Tejun Heo
	scx_root_enable_workfn()'s post-init block re-acquires scx_tasks_lock briefly via a scoped_guard() for the tid hash insertion. c941d7391f25 ("sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN") on for-7.1-fixes adds a post-init DEAD recheck that holds the task's rq lock across the state-machine updates in the same region. A naive merge would acquire scx_tasks_lock while the rq lock is held, inverting the iter's outer/inner order (scx_tasks_lock then rq lock). Add scx_task_iter_relock(iter, p), the counterpart to scx_task_iter_unlock(), that re-acquires scx_tasks_lock and, if @p is non-NULL, @p's rq lock. The locks are tracked in @iter so subsequent iteration releases them. Use it in scx_root_enable_workfn()'s post-init block and drop the now-redundant scoped_guard on the hash insertion. The post-init region now runs with both scx_tasks_lock and the task's rq lock held across the init failure check, the state-machine updates and the hash insert. v2: Move scx_task_iter_relock() earlier to ease the for-7.1-fixes merge. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-10	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc3	Alexei Starovoitov
	Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-10	sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths	Tejun Heo
	scx_fail_parent() leaves cgroup tasks at (state=NONE, sched=parent, sched_class=ext) until the parent itself is torn down by the scx_error() it raised. When the later root_disable iterates them, two paths trip on NONE. scx_disable_and_exit_task() re-enters the wrapper at NONE: the inner switch returns early but the trailing scx_set_task_sched(p, NULL) clobbers the parent sched left by scx_fail_parent(), and scx_set_task_state(p, NONE) wastes a write on an already-NONE task. switched_from_scx() then calls scx_disable_task(), which WARNs on non-ENABLED state and writes state=READY, producing a NONE -> READY transition the validation matrix rejects. Treat NONE as "nothing to do" in both paths. Add a NONE early-return at the top of scx_disable_and_exit_task() and a parallel NONE check in switched_from_scx() next to task_dead_and_done(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10	sched_ext: Close sub-sched init race with post-init DEAD recheck	Tejun Heo
	scx_sub_enable_workfn()'s init pass and scx_sub_disable() migration both drop the rq lock to call __scx_init_task() against the other sched. A TASK_DEAD @p can fall through sched_ext_dead() in that window. sched_ext_dead() runs ops.exit_task() on the sched @p was attached to, not on the sched whose init just completed, so the new allocation leaks. Reuse the DEAD signal set by sched_ext_dead(). After __scx_init_task() returns, take task_rq_lock(p) and check for DEAD; on hit, call scx_sub_init_cancel_task() against the sub sched the init ran for and drop @p; on miss, proceed as before. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10	sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN	Tejun Heo
	scx_root_enable_workfn() drops the iter rq lock for ops.init_task() and a TASK_DEAD @p can fall through sched_ext_dead() in that window. The race hits when sched_ext_dead() observes SCX_TASK_INIT (the intermediate state before @p->scx.sched is published) and dereferences NULL via SCX_HAS_OP(NULL, exit_task), or observes SCX_TASK_NONE during the unlocked init window and skips cleanup so exit_task() never runs. Add SCX_TASK_INIT_BEGIN. The enable path writes NONE -> INIT_BEGIN under the iter rq lock, then takes the rq lock again after init to walk INIT_BEGIN -> INIT -> READY. sched_ext_dead() that wins the rq-lock race observes INIT_BEGIN and sets DEAD without calling into ops; the post-init recheck unwinds via scx_sub_init_cancel_task(). scx_fork() runs single-threaded against sched_ext_dead() (the task is not on scx_tasks until scx_post_fork() adds it) so its INIT_BEGIN -> INIT walk needs no rq-lock pairing; it rolls back to NONE on ops.init_task() failure. The validation matrix grows the INIT_BEGIN row and the INIT_BEGIN -> DEAD edge; INIT now requires INIT_BEGIN as the predecessor. scx_sub_disable()'s migration writes INIT_BEGIN as a synthetic predecessor to satisfy the tightened verification. The sub-sched paths still race with sched_ext_dead() during the unlocked init window. This will be fixed by the next patch. Reported-by: zhidao su <suzhidao@xiaomi.com> Link: https://lore.kernel.org/all/20260429133155.3825247-1-suzhidao@xiaomi.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10	sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state	Tejun Heo
	SCX_TASK_OFF_TASKS marked tasks already through sched_ext_dead() so cgroup task iteration would skip them. This can be expressed better with a task state. Replace the flag with SCX_TASK_DEAD. scx_disable_and_exit_task() resets state to NONE on its way out, so sched_ext_dead() now sets DEAD after the wrapper returns. The validation matrix grows NONE -> DEAD, warns on DEAD -> NONE, and tightens READY's predecessor to INIT or ENABLED so the new DEAD value cannot silently transition to READY. Prepares for the following enable vs dead race fix. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10	sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into ↵	Tejun Heo
	scx_set_task_state() Prepare for the SCX_TASK_INIT_BEGIN/DEAD work that follows by collapsing the scx_init_task() helper. Move the SCX_TASK_RESET_RUNNABLE_AT setting into scx_set_task_state() on the INIT transition (it was set unconditionally at every INIT site through the scx_init_task() helper), inline scx_init_task() into scx_fork() and scx_root_enable_workfn(), and drop the helper. As a side effect, scx_sub_disable() migration sequence now also sets RESET_RUNNABLE_AT (it previously wrote INIT directly without going through scx_init_task()). The flag triggers a runnable_at reset on the next set_task_runnable(), which is harmless on a task that has just been moved between scheds. On root-enable, p->scx.flags is written without the task's rq lock. The task isn't visible to scx yet, and a follow-up patch restores the lock-held write. v2: Note p->scx.flags rq-lock relaxation on root-enable path. (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-10	sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work	Tejun Heo
	Cleanups in preparation for the state-machine work that follows: - Convert three sub-sched call sites that open-code rcu_assign_pointer(p->scx.sched, ...) to scx_set_task_sched(). - Move scx_get_task_state()/scx_set_task_state() above the SCX task iter section so scx_task_iter_next_locked() can use them without a forward declaration. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-09	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Alexei Starovoitov: - Fix sk_local_storage diag dump via netlink (Amery Hung) - Fix off-by-one in arena direct-value access (Junyoung Jang) - Reject TCP_NODELAY in bpf-tcp congestion control (KaFai Wan) - Fix type confusion in bpf__sock() (Kuniyuki Iwashima) - Reject TX-only AF_XDP sockets (Linpu Yu) - Don't run arg-tracking analysis twice on main subprog (Paul Chaignon) - Fix NULL pointer dereference in bpf_sk_storage_clone and fib lookup (Weiming Shi) tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix off-by-one boundary validation in arena direct-value access xskmap: reject TX-only AF_XDP sockets bpf: Don't run arg-tracking analysis twice on main subprog bpf: Free reuseport cBPF prog after RCU grace period. bpf: tcp: Fix type confusion in sol_tcp_sockopt(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp6_sock(). bpf: tcp: Fix type confusion in bpf_skc_to_tcp_sock(). mptcp: bpf: Fix type confusion in bpf_mptcp_sock_from_subflow() selftest: bpf: Add test for bpf_tcp_sock() and RAW socket. bpf: tcp: Fix type confusion in bpf_tcp_sock(). tools/headers: Regenerate stddef.h to fix BPF selftests bpf: Fix sk_local_storage diag dumping uninitialized special fields bpf: Fix NULL pointer dereference in bpf_skb_fib_lookup() sockmap: Fix sk_psock_drop() race vs sock_map_{unhash,close,destroy}(). bpf: Fix NULL pointer dereference in bpf_sk_storage_clone and diag paths selftests/bpf: Verify bpf-tcp-cc rejects TCP_NODELAY selftests/bpf: Test TCP_NODELAY in TCP hdr opt callbacks bpf: Reject TCP_NODELAY in bpf-tcp-cc bpf: Reject TCP_NODELAY in TCP header option callbacks
2026-05-09	bpf: Fix off-by-one boundary validation in arena direct-value access	Junyoung Jang
	BPF_MAP_TYPE_ARENA accepts BPF_PSEUDO_MAP_VALUE offsets at exactly the end of the arena mapping (off == arena_size). The boundary check in arena_map_direct_value_addr() uses `>` instead of `>=`, which incorrectly allows a one-past-end pointer to be accepted. Change the condition to `>=` to correctly reject offsets that fall outside the valid arena user_vm range. Fixes: 317460317a02 ("bpf: Introduce bpf_arena.") Signed-off-by: Junyoung Jang <graypanda.inzag@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260426172505.1947915-1-graypanda.inzag@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09	bpf: Don't run arg-tracking analysis twice on main subprog	Paul Chaignon
	Because subprog 0, the main subprog, is considered a global function, we end up running the arg-tracking dataflow analysis twice on it. That results in slightly longer verification but mostly in more verbose verifier logs. This patch fixes it by keeping only the iteration over global subprogs. When running over all of Cilium's programs with BPF_LOG_LEVEL2, this reduces verbosity by ~20% on average. Fixes: bf0c571f7feb6 ("bpf: introduce forward arg-tracking dataflow analysis") Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/e4d7b53d4963ef520541a782f5fc8108a168877c.1778176504.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-09	sched_ext: Fix ops_cid layout assert	Tejun Heo
	ca1d48a86fab ("sched_ext: Use offsetofend on both sides of the ops_cid layout assert") replaced sizeof() with offsetofend() to dodge 32-bit PPC trailing padding, but the resulting check is tautological: with CID_OFFSET_MATCH(priv, priv) already enforcing offsetof(priv) equality and @priv being the same type in both structs, the two offsetofends are equal by construction. The original protection - catching a stray field added past @priv in sched_ext_ops_cid - is gone. Anchor on a zero-size __end[] marker appended after @priv. Its offset sits flush after @priv regardless of trailing struct padding; if a field is inserted past @priv, __end shifts and the assert fires. Closes: https://lore.kernel.org/all/20260508215211.0C03AC2BCB0@smtp.kernel.org/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
2026-05-08	Merge tag 'timers-urgent-2026-05-09' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix CPU hotplug activation race in the timer migration code, by Frederic Weisbecker" * tag 'timers-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Fix another hotplug activation race
2026-05-08	Merge tag 'sched-urgent-2026-05-09' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix spurious failures in rseq self-tests (Mark Brown) - Fix rseq rseq::cpu_id_start ABI regression due to TCMalloc's creative use of the supposedly read-only field The fix is to introduce a new ABI variant based on a new (larger) rseq area registration size, to keep the TCMalloc use of rseq backwards compatible on new kernels (Thomas Gleixner) - Fix wakeup_preempt_fair() for not waking up task (Vincent Guittot) - Fix s64 mult overflow in vruntime_eligible() (Zhan Xusheng) * tag 'sched-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix wakeup_preempt_fair() for not waking up task sched/fair: Fix overflow in vruntime_eligible() selftests/rseq: Expand for optimized RSEQ ABI v2 rseq: Reenable performance optimizations conditionally rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode selftests/rseq: Validate legacy behavior selftests/rseq: Make registration flexible for legacy and optimized mode selftests/rseq: Skip tests if time slice extensions are not available rseq: Revert to historical performance killing behaviour rseq: Don't advertise time slice extensions if disabled rseq: Protect rseq_reset() against interrupts rseq: Set rseq::cpu_id_start to 0 on unregistration selftests/rseq: Don't run tests with runner scripts outside of the scripts
2026-05-08	Merge tag 'perf-urgent-2026-05-09' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix deadlock in the perf_mmap() failure path (Peter Zijlstra) - Intel ACR (Auto Counter Reload) fixes (Dapeng Mi): - Fix validation and configuration of ACR masks - Fix ACR rescheduling bug causing stale masks - Disable the PMI on ACR-enabled hardware - Enable ACR on Panther Cover uarch too * tag 'perf-urgent-2026-05-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/intel: Enable auto counter reload for DMR perf/x86/intel: Disable PMI for self-reloaded ACR events perf/x86/intel: Always reprogram ACR events to prevent stale masks perf/x86/intel: Improve validation and configuration of ACR masks perf/core: Fix deadlock in perf_mmap() failure path
2026-05-08	dma-debug: Ensure mappings are created and released with matching attributes	Leon Romanovsky
	The DMA API expects that callers use the same attributes when mapping and unmapping. Add tracking to verify this and catch mismatches. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260501-dma-attrs-debug-v2-6-8dbac75cd501@nvidia.com
2026-05-08	dma-debug: Feed DMA attribute for unmapping flows too	Leon Romanovsky
	There are multiple unmapping flows which didn't provide DMA attributes, which limited DMA debug code to compare the mapping and unmapping attributes. Let's fix it. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260501-dma-attrs-debug-v2-5-8dbac75cd501@nvidia.com
2026-05-08	dma-debug: Record DMA attributes in debug entry	Leon Romanovsky
	To enable reliable comparison of DMA attributes between map and unmap operations, store the attribute value in dma_debug_entry. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260501-dma-attrs-debug-v2-4-8dbac75cd501@nvidia.com
2026-05-08	dma-debug: Remove unused DMA attribute parameter	Leon Romanovsky
	debug_dma_alloc_pages() always receives a DMA attribute value of 0, because dma_alloc_pages() never receives any attributes from its callers. As preparation for upcoming patches, remove this unused attribute from the debug routine. Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260501-dma-attrs-debug-v2-3-8dbac75cd501@nvidia.com
2026-05-08	workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path	Breno Leitao
	For WQ_UNBOUND workqueues, alloc_and_link_pwqs() allocates wq->cpu_pwq via alloc_percpu() and then calls apply_workqueue_attrs_locked(). On failure it returns the error directly, bypassing the enomem: label which holds the only free_percpu(wq->cpu_pwq) in this function. The caller's error path kfree()s wq without touching wq->cpu_pwq, leaking one percpu pointer table (nr_cpu_ids * sizeof(void *) bytes) per failed call. If kmemleak is enabled, we can see: unreferenced object (percpu) 0xc0fffa5b121048 (size 8): comm "insmod", pid 776, jiffies 4294682844 backtrace (crc 0): pcpu_alloc_noprof+0x665/0xac0 __alloc_workqueue+0x33f/0xa20 alloc_workqueue_noprof+0x60/0x100 Route the error through the existing enomem: cleanup and any error before this one. Cc: stable@kernel.org Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	workqueue: Release PENDING in __queue_work() drain/destroy reject path	Breno Leitao
	The caller of __queue_work() owns WORK_STRUCT_PENDING, won via test_and_set_bit() in queue_work_on()/__queue_delayed_work(). The state machine documented above __queue_work() requires that owner to either hand the token to a pwq (insert_work() -> set_work_pwq()), hand it to a timer, or release it via set_work_pool_and_clear_pending(). try_to_grab_pending() relies on this: when it observes "PENDING && off-queue" it busy-loops, trusting the current owner to make progress. The (__WQ_DESTROYING \| __WQ_DRAINING) early-return path violates that contract. It WARN_ONCE()s and bare-returns, leaving work->data with PENDING set, WORK_STRUCT_PWQ clear, and work->entry empty. The path is reachable without explicit API abuse: queue_delayed_work() arms a timer with PENDING set; if drain_workqueue() runs while the timer is still pending, delayed_work_timer_fn() -> __queue_work() in softirq context hits the WARN, current is not a wq worker so is_chained_work() is false, and the work is silently dropped with PENDING leaked. Mirror what clear_pending_if_disabled() already does on its analogous reject path: unpack the off-queue data and call set_work_pool_and_clear_pending() to release the token before returning. I was able to reproduce this by queueing several slow works on a max_active=1 wq, arm a delayed_work whose timer fires while drain_workqueue() is blocked, then call cancel_delayed_work_sync(). Without this patch the cancel livelocks at 100% CPU; with it the cancel returns immediately. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	sched_ext: Use offsetofend on both sides of the ops_cid layout assert	Tejun Heo
	sizeof() includes trailing struct pad, offsetofend() doesn't. On 32-bit PPC, sched_ext_ops_cid tail-pads 4 bytes past @priv and the assert trips. Use offsetofend() on both sides. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605081637.DbH4SZ1E-lkp@intel.com/ Fixes: 7e655ed7b953 ("sched_ext: Add bpf_sched_ext_ops_cid struct_ops type") Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work	Zqiang
	For built with PREEMPT_RT kernels, the scx_disable_irq_workfn() is called from per-cpu irq_work kthreads context, this means that when call the scx_dump_state() in the scx_disable_irq_workfn() to output current->comm/pid, it always output current irq_work kthread's comm/pid. this commit therefore use the IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work to make scx_disable_irq_workfn() is called from hardirq context. Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07	sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings	Tejun Heo
	W=1 with CONFIG_EXT_SUB_SCHED=n flags 'err_msg' uninitialized and 'err_free_lb_resched' unused. Initialize err_msg and gate the label. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07	sched_ext: Drop unused scx_find_sub_sched() stub	Tejun Heo
	scx_find_sub_sched()'s only caller, scx_bpf_sub_dispatch(), is gated on CONFIG_EXT_SUB_SCHED. When CONFIG_EXT_SUB_SCHED=n the caller compiles out and the stub becomes dead code, tripping -Wunused-function on randconfigs. Drop the stub. Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/all/202605080556.42PXw8U9-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07	cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in ↵	Chen Wandun
	cpuset_current_node_allowed() Since prepare_alloc_pages() unconditionally adds __GFP_HARDWALL for the fast path when cpusets are enabled, the __GFP_HARDWALL check in cpuset_current_node_allowed() causes the PF_EXITING escape path to be skipped on the first allocation attempt. This makes it unreachable in the common case, so dying tasks can get stuck in direct reclaim or even trigger OOM while trying to exit, despite being allowed to allocate from any node. Move the PF_EXITING check before __GFP_HARDWALL so that dying tasks can allocate memory from any node to exit quickly, even when cpusets are enabled. Also update the function comment to reflect the actual behavior of prepare_alloc_pages() and the corrected check ordering. Signed-off-by: Chen Wandun <chenwandun@lixiang.com> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-07	umh: replace use of system_unbound_wq with system_dfl_wq	Marco Crivellari
	This patch continues the effort to refactor workqueue APIs, which has begun with the changes introducing new workqueues and a new alloc_workqueue flag: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") The point of the refactoring is to eventually alter the default behavior of workqueues to become unbound by default so that their workload placement is optimized by the scheduler. Before that to happen, workqueue users must be converted to the better named new workqueues with no intended behaviour changes: system_wq -> system_percpu_wq system_unbound_wq -> system_dfl_wq This way the old obsolete workqueues (system_wq, system_unbound_wq) can be removed in the future. Cc: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2026-05-07	sched_ext: Move scx_error() out of scx_link_sched()'s lock region	Tejun Heo
	scx_link_sched() holds scx_sched_lock. The scx_error() calls inside take the same lock through scx_claim_exit() and deadlock. Move them out of the guard. Fixes: 6b4576b09714 ("sched_ext: Reject sub-sched attachment to a disabled parent") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-06	sched/fair: Fix wakeup_preempt_fair() for not waking up task	Vincent Guittot
	Make sure to only call pick_next_entity() on an non-empty cfs_rq. The assumption that p is always enqueued and not delayed, is only true for wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and the cfs might become empty. Test if there are still queued tasks before trying again to determine if p could be the next one to be picked. There are at least 2 cases: When cfs becomes idle, it tries to pull tasks but if those pulled tasks are delayed, they will be dequeued when attached to cfs. attach_tasks() -> attach_task() -> wakeup_preempt(rq, p, 0); A misfit task running on cfs A triggers a load balance to be pulled on a better cpu, the load balance on cfs B starts an active load balance to pulled the running misfit task. If there is a delayed dequeue task on cfs A, it can be pulled instead of the previously running misfit task. attach_one_task() -> attach_task() -> wakeup_preempt(rq, p, 0); Fixes: ac8e69e69363 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org
2026-05-06	sched/fair: Fix overflow in vruntime_eligible()	Zhan Xusheng
	Zhan Xusheng reported running into sporadic a s64 mult overflow in vruntime_eligible(). When constructing a worst case scenario: If you have cgroups, then you can have an entity of weight 2 (per calc_group_shares()), and its vlag should then be bounded by: (slice+TICK_NSEC) * NICE_0_LOAD, which is around 44 bits as per the comment on entity_key(). The other extreme is 100NICE_0_LOAD, thus you get: {key, weight}[] := { puny: { (slice + TICK_NSEC) NICE_0_LOAD, 2 }, max: { 0, 100NICE_0_LOAD }, } The avg_vruntime() would end up being very close to 0 (which is zero_vruntime), so no real help making that more accurate. vruntime_eligible(puny) ends up with: avg = 2 puny.key (+ 0) load = 2 + 100 * NICE_0_LOAD avg >= puny.key * load And that is: (slice + TICK_NSEC) * NICE_0_LOAD * NICE_0_LOAD * 100, which will overflow s64. Zhan suggested using __builtin_mul_overflow(), however after staring at compiler output for various architectures using godbolt, it seems that using an __int128 multiplication often results in better code. Specifically, a number of architectures already compute the __int128 product to determine the overflow. Eg. arm64 already has the 'smulh' instruction used. By explicitly doing an __int128 multiply, it will emit the 'mul; smulh' pattern, which modern cores can fuse (armv8-a clang-22.1.0). x86_64 has less branches (no OF handling). Since Linux has ARCH_SUPPORTS_INT128 to gate __int128 usage, also provide the __builtin_mul_overflow() variant as a fallback. [peterz: Changelog and __int128 bits] Fixes: 556146ce5e94 ("sched/fair: Avoid overflow in enqueue_entity()") Reported-by: Zhan Xusheng <zhanxusheng1024@gmail.com> Closes: https://patch.msgid.link/20260415145742.10359-1-zhanxusheng%40xiaomi.com Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260505103155.GN3102924%40noisy.programming.kicks-ass.net
2026-05-06	rseq: Reenable performance optimizations conditionally	Thomas Gleixner
	Due to the incompatibility with TCMalloc the RSEQ optimizations and extended features (time slice extensions) have been disabled and made run-time conditional. The original RSEQ implementation, which TCMalloc depends on, registers a 32 byte region (ORIG_RSEG_SIZE). This region has a 32 byte alignment requirement. The extension safe newer variant exposes the kernel RSEQ feature size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered RSEQ region is aligned to the next power of two of the feature size. The kernel currently has a feature size of 33 bytes, which means the alignment requirement is 64 bytes. The TCMalloc RSEQ region is embedded into a cache line aligned data structure starting at offset 32 bytes so that bytes 28-31 and the cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with the top-most bit (63 set) to check whether the kernel has overwritten cpu_id_start with an actual CPU id value, which is guaranteed to not have the top most bit set. As this is part of their performance tuned magic, it's a pretty safe assumption, that TCMalloc won't use a larger RSEQ size. This allows the kernel to declare that registrations with a size greater than the original size of 32 bytes, which is the cases since time slice extensions got introduced, as RSEQ ABI v2 with the following differences to the original behaviour: 1) Unconditional updates of the user read only fields (CPU, node, MMCID) are removed. Those fields are only updated on registration, task migration and MMCID changes. 2) Unconditional evaluation of the criticial section pointer is removed. It's only evaluated when user space was interrupted and was scheduled out or before delivering a signal in the interrupted context. 3) The read/only requirement of the ID fields is enforced. When the kernel detects that userspace manipulated the fields, the process is terminated. This ensures that multiple entities (libraries) can utilize RSEQ without interfering. 4) Todays extended RSEQ feature (time slice extensions) and future extensions are only enabled in the v2 enabled mode. Registrations with the original size of 32 bytes operate in backwards compatible legacy mode without performance improvements and extended features. Unfortunately that also affects users of older GLIBC versions which register the original size of 32 bytes and do not evaluate the kernel required size in the auxiliary vector AT_RSEQ_FEATURE_SIZE. That's the result of the lack of enforcement in the original implementation and the unwillingness of a single entity to cooperate with the larger ecosystem for many years. Implement the required registration changes by restructuring the spaghetti code and adding the size/version check. Also add documentation about the differences of legacy and optimized RSEQ V2 mode. Thanks to Mathieu for pointing out the ORIG_RSEQ_SIZE constraints! Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.927160119%40kernel.org Cc: stable@vger.kernel.org
2026-05-06	rseq: Implement read only ABI enforcement for optimized RSEQ V2 mode	Thomas Gleixner
	The optimized RSEQ V2 mode requires that user space adheres to the ABI specification and does not modify the read-only fields cpu_id_start, cpu_id, node_id and mm_cid behind the kernel's back. While the kernel does not rely on these fields, the adherence to this is a fundamental prerequisite to allow multiple entities, e.g. libraries, in an application to utilize the full potential of RSEQ without stepping on each other toes. Validate this adherence on every update of these fields. If the kernel detects that user space modified the fields, the application is force terminated. Fixes: d6200245c75e ("rseq: Allow registering RSEQ with slice extension") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://patch.msgid.link/20260428224427.845230956%40kernel.org Cc: stable@vger.kernel.org
2026-05-06	hrtimer: Return ktime_t from ↵	Thomas Weißschuh
	hrtimer_get_next_event()/hrtimer_next_event_without() These functions really work in terms of ktime_t and not u64. Change their return types and adapt the callers. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260504-hrtimer-next_event-v2-1-7a5d0550b42f@linutronix.de
2026-05-06	clocksource: Clean up clocksource_update_freq() functions	Thomas Weißschuh
	Remove the unused functions __clocksource_update_freq_hz() and __clocksource_update_freq_khz(). Then make __clocksource_update_freq_scale() static as it is not used from external callers anymore. Also clean up the comment accordingly. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/20260504-clocksource-update_freq-v2-1-3e696fb01776@linutronix.de
2026-05-06	alarmtimer: Remove stale return description from alarm_handle_timer()	Zhan Xusheng
	alarm_handle_timer() was converted from returning enum alarmtimer_restart to void, but the kernel-doc "Return:" line was not removed. Remove the stale description. Fixes: 2634303f8773 ("alarmtimers: Remove return value from alarm functions") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260429080635.166790-1-zhanxusheng@xiaomi.com
2026-05-06	timers/migration: Handle capacity in connect tracepoints	Frederic Weisbecker
	This let tracers know to which hierarchy a CPU belongs to. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-6-frederic@kernel.org
2026-05-06	timers/migration: Split per-capacity hierarchies	Frederic Weisbecker
	Systems with heterogeneous CPU capacities, such as big.LITTLE, have reported power issues since the introduction of the new timer migration code. Timers migrate from small capacity CPUs to big ones, degrading their target residency and thus overall power consumption. Solve this with splitting hierarchies per CPU capacity. For example in a big.LITTLE machine, split a single hierarchy in two: one for big capacity CPUs and another one for small capacity CPUs. This way global timers only migrate across CPUs of the same capacity. For simplicity purpose, split hierarchies keep the same number of possible levels as if there were a single hierarchy, even though the CPUs are distributed between multiple hierarchies. This could be a problem on NUMA systems with heterogeneous CPU capacities (provided that ever exists yet) where useless intermediate nodes may be created. Solving this properly will imply on boot to know in advance how many capacities are available and the number of CPUs for each of them. Reported-by: Sehee Jeong <sehee1.jeong@samsung.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-5-frederic@kernel.org
2026-05-06	timers/migration: Track CPUs in a hierarchy	Frederic Weisbecker
	When a new root is created, the old root is connected to it and propagates up its own assumed to be active state, since the hotplug control CPU is itself active and part of the old root. However with per-capacity hierarchies, this assumption won't be true anymore because the hotplug control CPU calling the timer migration prepare callback may not belong to the same hierarchy as the booting CPU. To solve this, track the available CPUs per hierarchies so that the root connection can be offlined to safe CPUs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-4-frederic@kernel.org
2026-05-06	timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness	Frederic Weisbecker
	In order to prepare for separating out CPUs from different capacities in distinct hierarchies, create a hierarchy structure that group setup must rely upon. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-3-frederic@kernel.org
2026-05-06	Merge branch 'timers/urgent' into timers/core	Thomas Gleixner
	Pull in the timer migration fix so further changes can be applied.
2026-05-06	timers/migration: Fix another hotplug activation race	Frederic Weisbecker
	The hotplug control CPU is assumed to be active in the hierarchy but that doesn't imply that the root is active. If the current CPU is not the one that activated the current hierarchy, and the CPU performing this duty is still halfway through the tree, the root may still be observed inactive. And this can break the activation of a new root as in the following scenario: 1) Initially, the whole system has 64 CPUs and only CPU 63 is awake. [GRP1:0] active / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle active / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle active 2) CPU 63 goes idle _but_ due to a #VMEXIT it hasn't yet reached the [GRP1:0]->parent dereference (that would be NULL and stop the walk) in __walk_groups_from(). [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] idle idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle idle idle 3) CPU 1 wakes up, activates GRP0:0 but didn't yet manage to propagate up to GRP1:0 due to yet another #VMEXIT. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 idle active idle 3) CPU 0 wakes up and doesn't need to walk above GRP0:0 as it's CPU 1 role. [GRP1:0] idle / \| \ / \| \ [GRP0:0] [...] [GRP0:7] active idle idle / \| \ \| CPU 0 CPU 1 ... CPU 63 active active idle 4) CPU 0 boots CPU 64. It creates a new root for it. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 5) CPU 0 activates the new root, but note that GRP1:0 is still idle, waiting for CPU 1 to resume from #VMEXIT and activate it. [GRP2:0] active / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 6) CPU 63 resumes after #VMEXIT and sees the new GRP1:0 parent. Therefore it propagates the stale inactive state of GRP1:0 up to GRP2:0. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] idle idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline 7) CPU 1 resumes after #VMEXIT and finally activates GRP1:0. But it doesn't observe its parent link because no ordering enforced that. Therefore GRP2:0 is spuriously left idle. [GRP2:0] idle / \ / \ [GRP1:0] [GRP1:1] active idle / \| \ \ / \| \ \ [GRP0:0] [...] [GRP0:7] [GRP0:8] active idle idle idle / \| \ \| \| CPU 0 CPU 1 ... CPU 63 CPU 64 active active idle offline Such races are highly theoretical and the problem would solve itself once the old root ever becomes idle again. But it still leaves a taste of discomfort. Fix it with enforcing a fully ordered atomic read of the old root state before propagating the activate state up to the new root. It has a two directions ordering effect: * Acquire + release of the latest old root state: If the hotplug control CPU is not the one that woke up the old root, make sure to acquire its active state and propagate it upwards through the ordered chain of activation (the acquire pairs with the cmpxchg() in tmigr_active_up() and subsequent releases will pair with atomic_read_acquire() and smp_mb__after_atomic() in tmigr_inactive_up()). * Release: If the hotplug control CPU is not the one that must wake up the old root, but the CPU covering that is lagging behind its duty, publish the links from the old root to the new parents. This way the lagging CPU will propagate the active state itself. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260423165354.95152-2-frederic@kernel.org
2026-05-05	Merge tag 'wq-for-7.1-rc2-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Fix devm_alloc_workqueue() passing a va_list as a positional arg to the variadic alloc_workqueue() macro, which garbled wq->name and skipped lockdep init on the devm path. Fold both noprof entry points onto a va_list helper. Also, annotate it using __printf(1, 0) * tag 'wq-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Annotate alloc_workqueue_va() with __printf(1, 0) workqueue: fix devm_alloc_workqueue() va_list misuse
2026-05-05	Merge tag 'cgroup-for-7.1-rc2-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - During v6.19, cgroup task unlink was moved from do_exit() to after the final task switch to satisfy a controller invariant. That left the kernel seeing tasks past exit_signals() longer than userspace expected, and several v7.0 follow-ups tried to bridge the gap by making rmdir wait for the kernel side. None held up. The latest is an A-A deadlock when rmdir is invoked by the reaper of zombies whose pidns teardown the rmdir itself is waiting on, which points at the synchronizing approach being fundamentally wrong. Take a different approach: drop the wait, leave rmdir's user-visible side returning as soon as cgroup.procs is empty, and defer the css percpu_ref kill that drives ->css_offline() until the cgroup is fully depopulated. Tagged for stable. Somewhat invasive but contained. The hope is that fixing forward sticks. If not, the fallback is to revert the entire chain and rework on the development branch. Note that this doesn't plug a pre-existing analogous race in cgroup_apply_control_disable() (controller disable via subtree_control). Not a regression. The development branch will do the more invasive restructuring needed for that. - Documentation update for cgroup-v1 charge-commit section that still referenced functions removed when the memcg hugetlb try-commit-cancel protocol was retired. * tag 'cgroup-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup-v1: Update charge-commit section cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated
2026-05-05	Merge tag 'sched_ext-for-7.1-rc2-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix idle CPU selection returning prev_cpu outside the task's cpus_ptr when the BPF caller's allowed mask was wider. Stable backport. - Two opposite-direction gaps in scx_task_iter's cgroup-scoped mode versus the global mode: - Tasks past exit_signals() are filtered by the cgroup walk but kept by global. Sub-scheduler enable abort leaked __scx_init_task() state. Add a CSS_TASK_ITER_WITH_DEAD flag to cgroup's task iterator (scx_task_iter is its only user) and use it. - Tasks past sched_ext_dead() are still returned, tripping WARN_ON_ONCE() in callers or making them touch torn-down state. Mark and skip under the per-task rq lock. * tag 'sched_ext-for-7.1-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Recheck prev_cpu after narrowing allowed mask sched_ext: Skip past-sched_ext_dead() tasks in scx_task_iter_next_locked() cgroup, sched_ext: Include exiting tasks in cgroup iter
2026-05-05	rseq: Revert to historical performance killing behaviour	Thomas Gleixner
	The recent RSEQ optimization work broke the TCMalloc abuse of the RSEQ ABI as it not longer unconditionally updates the CPU, node, mm_cid fields, which are documented as read only for user space. Due to the observed behavior of the kernel it was possible for TCMalloc to overwrite the cpu_id_start field for their own purposes and rely on the kernel to update it unconditionally after each context switch and before signal delivery. The RSEQ ABI only guarantees that these fields are updated when the data changes, i.e. the task is migrated or the MMCID of the task changes due to switching from or to per CPU ownership mode. The optimization work eliminated the unconditional updates and reduced them to the documented ABI guarantees, which results in a massive performance win for syscall, scheduling heavy work loads, which in turn breaks the TCMalloc expectations. There have been several options discussed to restore the TCMalloc functionality while preserving the optimization benefits. They all end up in a series of hard to maintain workarounds, which in the worst case introduce overhead for everyone, e.g. in the scheduler. The requirements of TCMalloc and the optimization work are diametral and the required work arounds are a maintainence burden. They end up as fragile constructs, which are blocking further optimization work and are pretty much guaranteed to cause more subtle issues down the road. The optimization work heavily depends on the generic entry code, which is not used by all architectures yet. So the rework preserved the original mechanism moslty unmodified to keep the support for architectures, which handle rseq in their own exit to user space loop. That code is currently optimized out by the compiler on architectures which use the generic entry code. This allows to revert back to the original behaviour by replacing the compile time constant conditions with a runtime condition where required, which disables the optimization and the dependend time slice extension feature until the run-time condition can be enabled in the RSEQ registration code on a per task basis again. The following changes are required to restore the original behavior, which makes TCMalloc work again: 1) Replace the compile time constant conditionals with runtime conditionals where appropriate to prevent the compiler from optimizing the legacy mode out 2) Enforce unconditional update of IDs on context switch for the non-optimized v1 mode 3) Enforce update of IDs in the pre signal delivery path for the non-optimized v1 mode 4) Enforce update of IDs in the membarrier(RSEQ) IPI for the non-optimized v1 mode 5) Make time slice and future extensions depend on optimized v2 mode This brings back the full performance problems, but preserves the v2 optimization code and for generic entry code using architectures also the TIF_RSEQ optimization which avoids a full evaluation of the exit to user mode loop in many cases. Fixes: 566d8015f7ee ("rseq: Avoid CPU/MM CID updates when no event pending") Reported-by: Mathias Stearn <mathias@mongodb.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Closes: https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com Link: https://patch.msgid.link/20260428224427.517051752%40kernel.org Cc: stable@vger.kernel.org
2026-05-05	sched: Use trace_call__<tp>() to save a static branch	Gabriele Monaco
	The wrapper functions __trace_set_current_state() and __trace_set_need_resched() allow the tracepoints to be called from code outside sched/core.c, those calls are already guarded by a tracepoint_enabled(<tp>) so there is no need to repeat this check once again inside the call using trace_<tp>(). Use the new trace_call__<tp>() API to directly call the tracepoint without check. Those helper functions must be called after the appropriate check. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260429094227.34087-1-gmonaco@redhat.com
2026-05-05	sched/membarrier: Modernize membarrier_global_expedited with cleanup guards	Aniket Gattani
	Replace explicit lock/unlock and free calls with scoped guards and automatic cleanup constructs. Signed-off-by: Aniket Gattani <aniketgattani@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503212205.3714217-3-aniketgattani@google.com