linux-stable.git - Linux kernel stable tree

Age	Commit message	Author
2026-04-28	kho: skip KHO for crash kernel	Evangelos Petrongonas
	kho_fill_kimage() unconditionally populates the kimage with KHO metadata for every kexec image type. When the image is a crash kernel, this can be problematic as the crash kernel can run in a small reserved region and the KHO scratch areas can sit outside it. The crash kernel then faults during kho_memory_init() when it tries phys_to_virt() on the KHO FDT address:

	  Unable to handle kernel paging request at virtual address xxxxxxxx
	  ...
	  fdt_offset_ptr+...
	  fdt_check_node_offset_+...
	  fdt_first_property_offset+...
	  fdt_get_property_namelen_+...
	  fdt_getprop+...
	  kho_memory_init+...
	  mm_core_init+...
	  start_kernel+...

	kho_locate_mem_hole() already skips KHO logic for KEXEC_TYPE_CRASH images, but kho_fill_kimage() was missing the same guard. As kho_fill_kimage() is the single point that populates image->kho.fdt and image->kho.scratch, fixing it here is sufficient for both arm64 and x86 as the FDT and boot_params path are bailing out when these fields are unset.

	Fixes: d7255959b69a ("kho: allow kexec load before KHO finalization")
	Signed-off-by: Evangelos Petrongonas <epetron@amazon.de>
	Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
	Link: https://patch.msgid.link/20260410011609.1103-1-epetron@amazon.de
	Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-04-28	sched/fair: Clear rel_deadline when initializing forked entities	Zicheng Qu
	A yield-triggered crash can happen when a newly forked sched_entity enters the fair class with se->rel_deadline unexpectedly set. The failing sequence is:

	1. A task is forked while se->rel_deadline is still set.
	2. __sched_fork() initializes vruntime, vlag and other sched_entity state, but does not clear rel_deadline.
	3. On the first enqueue, enqueue_entity() calls place_entity().
	4. Because se->rel_deadline is set, place_entity() treats se->deadline as a relative deadline and converts it to an absolute deadline by adding the current vruntime.
	5. However, the forked entity's deadline is not a valid inherited relative deadline for this new scheduling instance, so the conversion produces an abnormally large deadline.
	6. If the task later calls sched_yield(), yield_task_fair() advances se->vruntime to se->deadline.
	7. The inflated vruntime is then used by the following enqueue path, where the vruntime-derived key can overflow when multiplied by the entity weight.
	8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility calculation, and can eventually make all entities appear ineligible. pick_next_entity() may then return NULL unexpectedly, leading to a later NULL dereference.

	A captured trace shows the effect clearly. Before yield, the entity's vruntime was around:

	  9834017729983308

	After yield_task_fair() executed:

	  se->vruntime = se->deadline

	the vruntime jumped to:

	  19668035460670230

	and the deadline was later advanced further to:

	  19668035463470230

	This shows that the deadline had already become abnormally large before yield_task_fair() copied it into vruntime.

	rel_deadline is only meaningful when se->deadline really carries a relative deadline that still needs to be placed against vruntime. A freshly forked sched_entity should not inherit or retain this state.

	Clear se->rel_deadline in __sched_fork(), together with the other sched_entity runtime state, so that the first enqueue does not interpret the new entity's deadline as a stale relative deadline.

	Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
	Analyzed-by: Hui Tang <tanghui20@huawei.com>
	Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com>
	Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
	Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
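The relative-to-absolute conversion described in steps 3 to 5 can be illustrated with a small userspace model. This is only a sketch: `toy_entity`, `toy_place()` and `toy_fork()` are hypothetical names, and the model keeps just the one branch of place_entity() that the commit message talks about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the few sched_entity fields discussed above. */
struct toy_entity {
	uint64_t vruntime;
	uint64_t deadline;
	bool rel_deadline;	/* true: deadline is stored relative to vruntime */
};

/* Models only the relative-to-absolute step described for place_entity(). */
static void toy_place(struct toy_entity *se, uint64_t slice)
{
	assert(se);
	if (se->rel_deadline) {
		/* A stale flag makes an absolute deadline grow by vruntime. */
		se->deadline += se->vruntime;
		se->rel_deadline = false;
		return;
	}
	se->deadline = se->vruntime + slice;
}

/* Fork-time copy; clear_flag selects the behaviour after the fix. */
static struct toy_entity toy_fork(const struct toy_entity *parent, bool clear_flag)
{
	struct toy_entity child = *parent;

	if (clear_flag)
		child.rel_deadline = false;
	return child;
}
```

With a parent at vruntime 1000 and deadline 1003, a child forked with the flag still set ends up with a deadline of roughly twice the vruntime after `toy_place()`, which is the same shape as the jump in the captured trace.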
2026-04-28	sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue	Vincent Guittot
	Similar to how pick_next_entity() must dequeue delayed entities, so too must wakeup_preempt_fair(). Any delayed task being found means it is eligible and hence past the 0-lag point, ready for removal.

	Worse, by not removing delayed entities from consideration, it can skew the preemption decision, with the end result that a short slice wakeup will not result in a preemption.

	                       | tip/sched/core | tip/sched/core | +this patch
	cyclictest slice  (ms) | (default)2.8   | 8              | 8
	hackbench slice   (ms) | (default)2.8   | 20             | 20
	Total Samples          | 22559          | 22595          | 22683
	Average           (us) | 157            | 64( 59%)       | 59( 8%)
	Median (P50)      (us) | 57             | 57( 0%)        | 58(- 2%)
	90th Percentile   (us) | 64             | 60( 6%)        | 60( 0%)
	99th Percentile   (us) | 2407           | 67( 97%)       | 67( 0%)
	99.9th Percentile (us) | 3400           | 2288( 33%)     | 727( 68%)
	Maximum           (us) | 5037           | 9252(-84%)     | 7461( 19%)

	Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
	Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
	Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org
2026-04-28	sched/fair: Fix the negative lag increase fix	Peter Zijlstra
	Vincent reported that my rework of his original patch lost a little something. Specifically it got the return value wrong; it should not compare against the old se->vlag, but rather against the current value. What matters is whether the effective vruntime of an entity is affected, and hence whether the entity needs repositioning. Fixes: 059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue") Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
2026-04-27	Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	Linus Torvalds
	Pull cgroup fixes from Tejun Heo:

	- Fix UAF race in psi pressure_write() against cgroup file release by extending cgroup_mutex coverage and ordering of->priv access after cgroup_kn_lock_live()
	- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX by performing the increment in s64
	- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by recording the CPU used by dl_bw_alloc() so cancel_attach() returns the reservation to the same root domain
	- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after rmdir by incrementing from kill_css() instead of offline_css()
	- Typo fix in cgroup-v2 documentation

	* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
	  docs: cgroup: fix typo 'protetion' -> 'protection'
	  cgroup: Increment nr_dying_subsys_* from rmdir context
	  cgroup/cpuset: record DL BW alloc CPU for attach rollback
	  cgroup/rdma: fix integer overflow in rdmacg_try_charge()
	  sched/psi: fix race between file release and pressure write
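The rdmacg item above (doing the increment in a 64-bit signed type when usage equals INT_MAX) can be sketched in plain C as follows. `toy_try_charge()` is a hypothetical helper written for illustration; it is not the body of rdmacg_try_charge().

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical charge helper: bump *usage by one if the result stays
 * within max. The sum is formed in int64_t, so usage == INT_MAX cannot
 * overflow the int before the comparison happens.
 */
static bool toy_try_charge(int *usage, int max)
{
	int64_t new_usage = (int64_t)*usage + 1;	/* widen before adding */

	if (new_usage > max)
		return false;	/* over the limit, leave *usage untouched */

	*usage = (int)new_usage;
	return true;
}
```

Computing `*usage + 1` directly in `int` would be undefined behaviour at INT_MAX, so the limit check could not be relied on.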
2026-04-27	bpf: Remove obsolete WARN_ON call	Jiri Olsa
	The WARN_ON call in bpf_trampoline_update could never hit, because we direct the code path with (total == 0) to out label, which effectively skips the WARN_ON call. The WARN_ON made sense back then when it checked tr->selector, but now with total being set just inside the function it's useless. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20260424153905.354922-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-27	bpf: Export cnum_umin/umax() helpers for netronome driver	Alan Maguire
	ERROR: modpost: "cnum64_umin" [drivers/net/ethernet/netronome/nfp/nfp.ko] undefined! ERROR: modpost: "cnum64_umax" [drivers/net/ethernet/netronome/nfp/nfp.ko] undefined! Export symbols for these references. Reported-by: Kaitao Cheng <pilgrimtao@gmail.com> Fixes: bbc631085503 ("bpf: replace min/max fields with struct cnum{32,64}") Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260427112205.1346733-1-alan.maguire@oracle.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-27	bpf: range_within() must check cnum ranges instead of min/max pairs	Eduard Zingerman
	states.c:range_within() must be updated to properly check if cnum-based range in an old state is a superset of a range in the cur state. Currently it makes the decision using min/max accessors: reg_umin(old) <= reg_umin(cur) <= reg_umax(old) This is wrong for cnums that cross both UT_MAX/0 and ST_MAX/ST_MIN boundaries. Consider cnum32{base=0x7FFFFFF0, size=0x80000020}, which represents values [0x7FFFFFF0, ..., U32_MAX, 0, ..., 0x10]. Its projections are u32_min/max=0/U32_MAX, s32_min/max=S32_MIN/MAX. A register with range [0x100, 0x200] (which lies entirely in the gap of the wrapping range) would pass the min/max check despite having no overlap with the actual cnum arc. This commit replaces min/max comparison with cnum{32,64}_is_subset() operation. The operation implementation is verified using cbmc model checker in [1]. [1] https://github.com/eddyz87/cnum-verif/ Fixes: bbc631085503 ("bpf: replace min/max fields with struct cnum{32,64}") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260425-cnum-range-within-v1-1-2fdca70cb09d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
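The counterexample in this message can be checked with a small userspace model. The sketch below assumes an arc covers the values base, base+1, ..., base+size modulo 2^32, as the example implies; `arc32`, `minmax_within()` and `arc32_is_subset()` are hypothetical names and not the kernel's cnum32 helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical arc: values base, base+1, ..., base+size (mod 2^32). */
struct arc32 {
	uint32_t base;
	uint32_t size;
};

/* True when the arc crosses the U32_MAX/0 boundary. */
static bool arc32_wraps(struct arc32 a)
{
	return (uint64_t)a.base + a.size > UINT32_MAX;
}

/* Unsigned projections: a wrapping arc projects to the full range. */
static uint32_t arc32_umin(struct arc32 a)
{
	return arc32_wraps(a) ? 0 : a.base;
}

static uint32_t arc32_umax(struct arc32 a)
{
	return arc32_wraps(a) ? UINT32_MAX : a.base + a.size;
}

/* Unsigned part of the min/max style comparison the commit replaces. */
static bool minmax_within(struct arc32 old, struct arc32 cur)
{
	return arc32_umin(old) <= arc32_umin(cur) &&
	       arc32_umax(cur) <= arc32_umax(old);
}

/* Arc containment: cur must start inside old and must not run past its end. */
static bool arc32_is_subset(struct arc32 old, struct arc32 cur)
{
	uint32_t off = cur.base - old.base;	/* distance along the circle */

	if (old.size == UINT32_MAX)
		return true;			/* old covers every value */
	return (uint64_t)off + cur.size <= old.size;
}
```

For old = {0x7FFFFFF0, 0x80000020} and cur = {0x100, 0x100}, the projection-based check accepts cur while the arc check rejects it, which is the gap case the commit describes.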
2026-04-27	kho: fix error handling in kho_add_subtree()	Breno Leitao
	Fix two error handling issues in kho_add_subtree(), where it doesn't handle the error path correctly.

	1. If fdt_setprop() fails after the subnode has been created, the subnode is not removed. This leaves an incomplete node in the FDT (missing "preserved-data" or "blob-size" properties).
	2. The fdt_setprop() return value (an FDT error code) is stored directly in err and returned to the caller, which expects -errno.

	Fix both by storing fdt_setprop() results in fdt_err, jumping to a new out_del_node label that removes the subnode on failure, and only setting err = 0 on the success path, otherwise returning -ENOMEM (instead of FDT_ERR_ errors that would come from fdt_setprop).

	No user-visible changes. This patch fixes error handling in the KHO (Kexec HandOver) subsystem, which is used to preserve data across kexec reboots. The fix only affects a rare failure path during kexec preparation, specifically when the kernel runs out of space in the Flattened Device Tree buffer while registering preserved memory regions.

	In the unlikely event that this error path was triggered, the old code would leave a malformed node in the device tree and return an incorrect error code to the calling subsystem, which could lead to confusing log messages or incorrect recovery decisions. With this fix, the incomplete node is properly cleaned up and the appropriate errno value is propagated; this error code is not returned to the user.

	Link: https://lore.kernel.org/20260410-kho_fix_send-v2-1-1b4debf7ee08@debian.org
	Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers")
	Signed-off-by: Breno Leitao <leitao@debian.org>
	Suggested-by: Pratyush Yadav <pratyush@kernel.org>
	Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
	Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
	Cc: Alexander Graf <graf@amazon.com>
	Cc: Breno Leitao <leitao@debian.org>
	Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
	Cc: <stable@vger.kernel.org>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
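The unwind shape described here (a separate variable for the library's error code, an out_del_node label, and a fixed -ENOMEM for the caller) might look roughly like the following. Everything prefixed with `toy_` or `TOY_` is a made-up stand-in so the sketch compiles on its own; it does not use libfdt and is not the kernel function.

```c
#include <errno.h>

/* Toy stand-in for an FDT buffer: a node count and remaining property slots. */
struct toy_tree {
	int nodes;
	int free_slots;
};

#define TOY_ERR_NOSPACE 3	/* library-private code, deliberately not an errno */

static int toy_add_node(struct toy_tree *t) { t->nodes++; return 0; }
static void toy_del_node(struct toy_tree *t) { t->nodes--; }

static int toy_set_prop(struct toy_tree *t)
{
	if (t->free_slots == 0)
		return -TOY_ERR_NOSPACE;
	t->free_slots--;
	return 0;
}

static int toy_add_subtree(struct toy_tree *t)
{
	int err = -ENOMEM;	/* what the caller sees on any failure */
	int lib_err;		/* library code, never returned directly */

	lib_err = toy_add_node(t);
	if (lib_err)
		goto out;

	lib_err = toy_set_prop(t);	/* first property */
	if (lib_err)
		goto out_del_node;

	lib_err = toy_set_prop(t);	/* second property */
	if (lib_err)
		goto out_del_node;

	err = 0;			/* only the success path clears err */
	goto out;

out_del_node:
	toy_del_node(t);		/* do not leave a half-built node behind */
out:
	return err;
}
```

Keeping the library code in its own variable makes it hard to leak a non-errno value to callers by accident.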
2026-04-27	liveupdate: fix return value on session allocation failure	Pasha Tatashin
	When session allocation fails during deserialization, the global 'err' variable was not updated before returning. This caused subsequent calls to luo_session_deserialize() to incorrectly report success. Ensure 'err' is set to the error code from PTR_ERR(session). This ensures that an error is correctly returned to userspace when it attempts to open /dev/liveupdate in the new kernel if deserialization failed. Link: https://lore.kernel.org/20260415193738.515491-1-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-26	driver core: Replace dev->offline + ->offline_disabled with accessors	Douglas Anderson
	In C, bitfields are not necessarily safe to modify from multiple threads without locking. Switch "offline" and "offline_disabled" over to the "flags" field so modifications are safe. Cc: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Mark Brown <broonie@kernel.org> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://patch.msgid.link/20260406162231.v5.9.I897d478b4a9361d79cd5073207c1062fd4d0d0e4@changeid Signed-off-by: Danilo Krummrich <dakr@kernel.org>
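A rough userspace analogue of moving a bitfield member into a flags word is shown below. It uses C11 atomics as a stand-in for the kernel's bit operations, and `toy_device`, `toy_dev_assign_flag()` and `toy_dev_test_flag()` are hypothetical names, not the accessors added by this series.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One bit number per former bitfield member. */
enum toy_dev_flag {
	TOY_DEV_OFFLINE = 0,
	TOY_DEV_OFFLINE_DISABLED = 1,
};

struct toy_device {
	atomic_ulong flags;
};

/* Set or clear a single bit with an atomic read-modify-write. */
static void toy_dev_assign_flag(struct toy_device *dev, enum toy_dev_flag f, bool v)
{
	unsigned long mask = 1UL << f;

	if (v)
		atomic_fetch_or(&dev->flags, mask);
	else
		atomic_fetch_and(&dev->flags, ~mask);
}

static bool toy_dev_test_flag(struct toy_device *dev, enum toy_dev_flag f)
{
	return (atomic_load(&dev->flags) & (1UL << f)) != 0;
}
```

Adjacent bitfield members usually share a storage unit, so a plain store to one of them is a non-atomic read-modify-write of its neighbours as well; per-bit atomic operations on a single word avoid that.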
2026-04-26	driver core: Replace dev->dma_ops_bypass with dev_dma_ops_bypass()	Douglas Anderson
	In C, bitfields are not necessarily safe to modify from multiple threads without locking. Switch "dma_ops_bypass" over to the "flags" field so modifications are safe. Cc: Christoph Hellwig <hch@lst.de> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://patch.msgid.link/20260406162231.v5.5.If62b84471ef2c85e7ad250f0468867d6dba965ab@changeid Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-04-26	driver core: Replace dev->dma_skip_sync with dev_dma_skip_sync()	Douglas Anderson
	In C, bitfields are not necessarily safe to modify from multiple threads without locking. Switch "dma_skip_sync" over to the "flags" field so modifications are safe. Cc: Alexander Lobakin <aleksander.lobakin@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://patch.msgid.link/20260406162231.v5.4.Icf072aa4184dd86a88fa8ca195b09d1651984000@changeid Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-04-24	bpf: replace min/max fields with struct cnum{32,64}	Eduard Zingerman
	Replace eight independent s64, u64, s32, u32 min/max fields in bpf_reg_state with two circular number fields:
	- cnum64 for a unified signed/unsigned 64-bit range tracking;
	- cnum32 for a unified signed/unsigned 32-bit range tracking.

	Each cnum represents a range as a single arc on the circular number line (base + size), from which signed and unsigned bounds are derived on demand via accessor functions introduced in the preceding commit.

	Notable changes:
	- Signed<->unsigned deductions in __reg_deduce_bounds() are removed.
	- 64<->32 bit deductions are replaced with:
	  - reg->r32 = cnum32_intersect(reg->r32, cnum32_from_cnum64(reg->r64)); this is functionally equivalent to the old code.
	  - reg->r64 = cnum64_cnum32_intersect(reg->r64, reg->r32); this handles a few additional cases, see commit message for "bpf: representation and basic operations on circular numbers".
	- regs_refine_cond_op() now computes results in terms of operations on sets, e.g. for JNE:

	    /* Complement of the range [val, val] as cnum64. */
	    lo = (struct cnum64){ val + 1, U64_MAX - 1 };
	    reg1->r64 = cnum64_intersect(reg1->r64, lo);

	- For add, sub operations on scalars replace explicit bounds computations with cnum{32,64}_{add,negate}.
	- For add, sub operations on pointers deduplicate with arithmetic operations on scalars and use cnum{32,64}_{add,negate}.
	- For and, or, xor operations on scalars remove explicit signed bounds computations.
	- range_bounds_violation() reduces to checking cnum_is_empty().
	- const_tnum_range_mismatch() reduces to checking cnum_is_const().

	Selftest adjustments: a few existing tests are updated because a single cnum arc cannot always represent what the old system expressed as the intersection of independent signed and unsigned ranges. For example, if the old system tracked u64=[0, U64_MAX-U32_MAX+2] and s64=[S64_MIN+2, 2] independently, their intersection is a tight two-point set. A single cnum must pick the shorter arc, losing the other constraint. These cases are documented with comments in the adjusted tests.

	reg_bounds.c is updated with logic similar to cnum64_cnum32_intersect(). Instead of using cnums it inspects intersection between 'b' and first / last / next-after-first / previous-before-last sub-ranges of 'a'.

	reg_bounds.c is also updated to skip test cases that rely on signed and unsigned ranges intersecting in two intervals, as such cases are not representable by a single cnum. The following "crafted" test cases are affected:
	- reg_bounds_crafted/(s64)[0xffffffffffff8000; 0x7fff] (u32)<op> [0; 0x1f]
	- reg_bounds_crafted/(s64)[0; 0x1f] (u32)<op> [0xffffffffffffff80; 0x7f]
	- reg_bounds_crafted/(s64)[0xffffffffffffff80; 0x7f] (u32)<op> [0; 0x1f]
	- reg_bounds_crafted/(u64)[0; 1] (s32)<op> [1; 2147483648]
	- reg_bounds_crafted/(u64)[1; 2147483648] (s32)<op> [0; 1]
	- reg_bounds_crafted/(u64)[0; 0xffffffff00000000] (s64)<op> 0
	- reg_bounds_crafted/(u64)0 (s64)<op> [0; 0xffffffff00000000]
	- reg_bounds_crafted/(u64)[0; 0xffffffff00000000] (s32)<op> 0
	- reg_bounds_crafted/(u64)0 (s32)<op> [0; 0xffffffff00000000]
	- reg_bounds_crafted/(s64)[S64_MIN; 0] (u64)<op> S64_MIN
	- reg_bounds_crafted/(s64)S64_MIN (u64)<op> [S64_MIN; 0]
	- reg_bounds_crafted/(s32)[S32_MIN; 0] (u32)<op> S32_MIN
	- reg_bounds_crafted/(s32)S32_MIN (u32)<op> [S32_MIN; 0]
	- reg_bounds_crafted/(s64)[0; 0x1f] (u32)<op> [0xffffffff80000000; 0x7fffffff]
	- reg_bounds_crafted/(s64)[0xffffffff80000000; 0x7fffffff] (u32)<op> [0; 0x1f]
	- reg_bounds_crafted/(s64)[0; 0x1f] (u32)<op> [0xffffffffffff8000; 0x7fff]

	As well as some reg_bounds_rand_{consts,ranges}_A_B, where A and B differ in sign domain.

	Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
	Link: https://lore.kernel.org/r/20260424-cnums-everywhere-rfc-v1-v3-3-ca434b39a486@gmail.com
	Signed-off-by: Alexei Starovoitov <ast@kernel.org>
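The JNE example in this message builds the complement of a single value as one arc. A small model of that construction follows, assuming an arc covers base, base+1, ..., base+size modulo 2^64; `arc64`, `arc64_contains()` and `arc64_not_equal()` are hypothetical names used only for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical arc: values base, base+1, ..., base+size (mod 2^64). */
struct arc64 {
	uint64_t base;
	uint64_t size;
};

/* v is on the arc when its distance from base, going upward, fits in size. */
static bool arc64_contains(struct arc64 a, uint64_t v)
{
	return v - a.base <= a.size;	/* unsigned subtraction wraps mod 2^64 */
}

/* Every 64-bit value except val: start right after val and go almost all the way around. */
static struct arc64 arc64_not_equal(uint64_t val)
{
	return (struct arc64){ val + 1, UINT64_MAX - 1 };
}
```

For val = 7 the arc starts at 8 and wraps around to 6, so 7 is the only value left out; the same construction works unchanged when val is 0 or U64_MAX.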
2026-04-24	bpf: use accessor functions for bpf_reg_state min/max fields	Eduard Zingerman
	Replace direct access to bpf_reg_state->{smin,smax,umin,umax, s32_min,s32_max,u32_min,u32_max}_value with getter/setter inline functions, preparing for future switch to cnum-based internal representation. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260424-cnums-everywhere-rfc-v1-v3-2-ca434b39a486@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-24	bpf: representation and basic operations on circular numbers	Eduard Zingerman
	This commit adds basic definitions for cnum32/cnum64. This is a unified numeric range representation for signed and unsigned domains. Inspired by an old post from Shung-Hsi Yu [1] and paper [2].

	Operations correctness is verified using cbmc model checker, tests source code can be found in a separate repo [3].

	The cnum64_cnum32_intersect() function is notable, because it handles several cases verifier.c:deduce_bounds_64_from_32() does not. Given:
	- a is a 64-bit range
	- b is a 32-bit range
	- t is a refined 64-bit range, such that ∀ v ∈ a, (u32)v ∈ b: v ∈ t.

	cnum64_cnum32_intersect() makes the following deductions:

	(A): 'b' is a sub-range of the first or the last 32-bit sub-range of 'a':

	                            64-bit number axis --->
	N2^32                 (N+1)2^32                 (N+2)2^32              (N+3)2^32
	||------|---|=====|-------||----------|=====|-------||----------|=====|----|--||
	        |   |< b >|                   |< b >|                   |< b >|    |
	        |   |                                                         |    |
	        |<--+--------------------------- a ---------------------------+--->|
	            |                                                         |
	            |<-------------------------- t -------------------------->|

	(B) 'b' does not intersect with the first or the last 32-bit sub-range of 'a':

	N2^32                 (N+1)2^32                 (N+2)2^32              (N+3)2^32
	||--|=====|----|----------||--|=====|---------------||--|=====|------------|--||
	    |< b >|    |              |< b >|                   |< b >|            |
	               |              |                               |            |
	               |<-------------+--------- a -------------------|----------->|
	                              |                               |
	                              |<-------- t ------------------>|

	(C) 'b' crosses 0/U32_MAX boundary:

	N2^32                 (N+1)2^32                 (N+2)2^32              (N+3)2^32
	||===|---------|------|===||===|----------------|===||===|---------|------|===||
	 |b >|         |      |< b||b >|                |< b||b >|         |      |< b|
	               |      |                                  |         |
	               |<-----+----------------- a --------------+-------->|
	                      |                                  |
	                      |<---------------- t ------------->|

	Current implementation of deduce_bounds_64_from_32() only handles case (A).

	[1] https://lore.kernel.org/all/ZTZxoDJJbX9mrQ9w@u94a/
	[2] https://jorgenavas.github.io/papers/ACM-TOPLAS-wrapped.pdf
	[3] https://github.com/eddyz87/cnum-verif/tree/master

	Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
	Link: https://lore.kernel.org/r/20260424-cnums-everywhere-rfc-v1-v3-1-ca434b39a486@gmail.com
	Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-24	Merge branch 'for-7.1-fixes' into for-7.2	Tejun Heo
	Pull to receive: c0e8ddc76d54 ("sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED") which conflicts with: 41e3312861ea ("sched_ext: add p->scx.tid and SCX_OPS_TID_TO_TASK lookup") It's a simple context conflict. Take changes from both. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24	sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable	Tejun Heo
	scx_root_enable_workfn() takes cpus_read_lock() before scx_link_sched(sch), but the `if (ret) goto err_disable` on failure skips the matching cpus_read_unlock() - all other err_disable gotos along this path drop the lock first. scx_link_sched() only returns non-zero on the sub-sched path (parent != NULL), so the leak path is unreachable via the root caller today. Still, the unwind is out of line with the surrounding paths. Drop cpus_read_lock() before goto err_disable. v2: Correct Fixes: tag (Andrea Righi). Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24	sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime	Tejun Heo
	scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that have no struct_ops association when the root scheduler has sub_attach set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass that NULL into scx_task_on_sched(sch, p), which under CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For any non-scx task p->scx.sched is NULL, so NULL == NULL returns true and the authority gate is bypassed - a privileged but non-struct_ops-associated prog can poke p->scx.slice / p->scx.dsq_vtime on arbitrary tasks. Reject !sch up front so the gate only admits callers with a resolved scheduler. Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
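The NULL == NULL bypass is easy to reproduce in isolation. In the sketch below, `toy_sched`, `toy_task` and the `toy_may_update_*()` helpers are hypothetical stand-ins for the ownership gate, not the sched_ext functions themselves.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_sched { int id; };

struct toy_task {
	struct toy_sched *sched;	/* NULL for tasks no scheduler manages */
};

/* Ownership test: does p belong to sch? */
static bool toy_task_on_sched(const struct toy_sched *sch, const struct toy_task *p)
{
	return p->sched == sch;
}

/* Gate before the fix: a NULL caller matches every unmanaged task. */
static bool toy_may_update_unfixed(const struct toy_sched *sch, const struct toy_task *p)
{
	return toy_task_on_sched(sch, p);
}

/* Gate after the fix: an unresolved caller is rejected up front. */
static bool toy_may_update_fixed(const struct toy_sched *sch, const struct toy_task *p)
{
	if (!sch)
		return false;
	return toy_task_on_sched(sch, p);
}
```

An equality check used as an authorization gate needs both sides to be valid identities; a sentinel such as NULL on either side should be ruled out before comparing.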
2026-04-24	sched_ext: Refuse cross-task select_cpu_from_kfunc calls	Tejun Heo
	select_cpu_from_kfunc() skipped pi_lock for @p when called from ops.select_cpu() or another rq-locked SCX op, assuming the held lock protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's, not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU. Abort the scheduler on cross-task calls in both branches: for ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET(); for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq(). v2: Switch the in_select_cpu cross-task check from direct_dispatch_task comparison to scx_kf_arg_task_ok(). The former spuriously rejects when ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls scx_bpf_select_cpu_*() on the same task. (Andrea Righi) Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED	Tejun Heo
	Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:
	- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB` only (via sch->cgrp, stored only under SUB). GROUP-only would leak a reference on every root-sched enable.
	- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't compile.

	GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization triggers a real bug in any reachable config, but keep the guards honest:
	- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side put).
	- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.

	Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
	Reported-by: Chris Mason <clm@meta.com>
	Signed-off-by: Tejun Heo <tj@kernel.org>
	Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Make bypass LB cpumasks per-scheduler	Tejun Heo
	scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW the global cpumasks with no lock, corrupting donee/resched decisions. Move the cpumasks into struct scx_sched, allocate them alongside the timer in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work(). Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before	Tejun Heo
	scx_prio_less() runs from core-sched's pick_next_task() path with rq locked but invokes ops.core_sched_before() with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass task_rq(a). Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task	Tejun Heo
	scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too. The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps rq=NULL. Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP	Tejun Heo
	SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on exit. Correct at the top level, but ops can recurse via scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner call returns, the NULL clobbers the outer's state. The parent's BPF then calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and re-acquire the already-held rq. Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local can be typed as 'struct rq *' without colliding with the parameter token in the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel through the two base macros and inherit the fix. Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
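The clobbering of the outer state by a nested call, and the snapshot-and-restore fix, can be modelled with an ordinary global pointer. All `toy_` names below are hypothetical; the real code is a set of macros, while this sketch uses functions and callbacks to stay self-contained.

```c
#include <stddef.h>

struct toy_rq { int cpu; };

/* Stand-in for the "which rq do we currently hold" state. */
static struct toy_rq *toy_locked_rq;

/* What the parent op observes right after its nested call returns. */
static struct toy_rq *toy_seen_after_inner;

typedef void (*toy_op_fn)(void *arg);

/* Pre-fix wrapper: unconditionally resets the state to NULL on exit. */
static void toy_call_op_clobber(struct toy_rq *locked_rq, toy_op_fn op, void *arg)
{
	toy_locked_rq = locked_rq;
	op(arg);
	toy_locked_rq = NULL;
}

/* Post-fix wrapper: snapshot on entry, restore on exit. */
static void toy_call_op_restore(struct toy_rq *locked_rq, toy_op_fn op, void *arg)
{
	struct toy_rq *saved = toy_locked_rq;

	toy_locked_rq = locked_rq;
	op(arg);
	toy_locked_rq = saved;
}

/* Child op: does nothing, it only needs to run under a nested wrapper. */
static void toy_child_dispatch(void *arg)
{
	(void)arg;
}

/* Parent ops: recurse into the child, then look at the shared state. */
static void toy_parent_dispatch_clobber(void *arg)
{
	toy_call_op_clobber(arg, toy_child_dispatch, NULL);
	toy_seen_after_inner = toy_locked_rq;
}

static void toy_parent_dispatch_restore(void *arg)
{
	toy_call_op_restore(arg, toy_child_dispatch, NULL);
	toy_seen_after_inner = toy_locked_rq;
}
```

With the clobbering wrapper the parent sees NULL after the nested call even though its rq is still held; with the restoring wrapper it sees its own rq again, and the state only returns to NULL once the outermost call exits.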
2026-04-24	sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail	Tejun Heo
	dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a reliable "no real task" check. If the last real task is unlinked while a cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then sees list_empty() == false and skips the first_task update, leaving scx_bpf_dsq_peek() returning NULL for a non-empty DSQ. Test dsq->first_task directly, which already tracks only real tasks and is maintained under dsq->lock. Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Ryan Newton <newton@meta.com>
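Why a parked cursor defeats the list_empty() test can be shown with a deliberately tiny model in which the list is reduced to a node count. `toy_dsq`, `toy_task` and both enqueue helpers are hypothetical and only mirror the decision being changed, not the real list handling or locking.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_task { int pid; };

struct toy_dsq {
	int nr_nodes;			/* real tasks plus parked iterator cursors */
	struct toy_task *first_task;	/* first real task, NULL when there is none */
};

/* Pre-fix decision: an empty list is taken to mean "no real task". */
static void toy_enqueue_tail_list_empty(struct toy_dsq *dsq, struct toy_task *p)
{
	bool no_task = dsq->nr_nodes == 0;

	dsq->nr_nodes++;
	if (no_task)
		dsq->first_task = p;
}

/* Post-fix decision: consult first_task, which never counts cursors. */
static void toy_enqueue_tail_first_task(struct toy_dsq *dsq, struct toy_task *p)
{
	bool no_task = dsq->first_task == NULL;

	dsq->nr_nodes++;
	if (no_task)
		dsq->first_task = p;
}
```

Starting from a queue that holds only a cursor (nr_nodes == 1, first_task == NULL), the first helper leaves first_task NULL after an enqueue, while the second one records the new task.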
2026-04-24	sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()	Tejun Heo
	scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux) and inserts the new DSQ into that scheduler's dsq_hash. Its inverse scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only destroy or query DSQs in the root scheduler's hash - never its own. If the root had a DSQ with the same id, the sub-sched silently destroyed it and the root aborted on the next dispatch ("invalid DSQ ID 0x0.."). Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq(). Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters	Tejun Heo
	scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs. If the loaded scheduler is disabled and freed (via RCU work) and another is enabled between the naked load and the rwsem acquire, the reader sees scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one - UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...). scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write (scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section correlates @sch with the enabled snapshot. Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations") Cc: stable@vger.kernel.org # v6.18+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path	Tejun Heo
	scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false) without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the marked tasks. Task state is still the parent's ENABLED, so that dispatches to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e. child->ops.disable() - for tasks on which child->ops.enable() never ran. A BPF sub-scheduler allocating per-task state in enable/freeing in disable would operate on uninitialized state. The dying-task branch in scx_disable_and_exit_task() has the same problem, and scx_enabling_sub_sched was cleared before the abort cleanup loop - a task exiting during cleanup tripped the WARN and skipped both ops.exit_task and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the task stuck. Introduce scx_sub_init_cancel_task() that calls ops.exit_task with cancelled=true - matching what the top-level init path does when init_task itself returns -errno. Use it in the abort loop and in the dying-task branch. scx_enabling_sub_sched now stays set until the abort loop finishes clearing SUB_INIT, so concurrent exits hitting the dying-task branch can still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally when seen, leaving the task unmarked even if the WARN fires. Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()	Tejun Heo
	bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without migrating them - task_cpu() only updates when the donee later consumes the task via move_remote_task_to_local_dsq(). If the LB timer fires again before consumption and the new DSQ becomes a donor, @p is still on the previous CPU and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked. Skip such tasks. Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new	Tejun Heo
	bpf_iter_scx_dsq_new() clears kit->dsq on failure and bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't - it dereferences kit->dsq immediately, so a BPF program that calls scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel. Return false if kit->dsq is NULL. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Unregister sub_kset on scheduler disable	Tejun Heo
	When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a child of &sch->kobj, which pins the parent with its own reference. The disable paths never call kset_unregister(), so the final kobject_put() in bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs, leaking the whole struct scx_sched on every load/unload cycle. Unregister sub_kset in scx_root_disable() and scx_sub_disable() before kobject_del(&sch->kobj). Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	sched_ext: Defer scx_hardlockup() out of NMI	Tejun Heo
	scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(), which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing it from NMI context can lead to deadlocks. The hardlockup handler is best-effort recovery and the disable path it triggers runs off of irq_work anyway. Move the handle_lockup() call into an irq_work so it runs in IRQ context. Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24	Merge tag 'trace-ring-buffer-v7.1-3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fix from Steven Rostedt: - Fix accounting of persistent ring buffer rewind On boot up, the head page is moved back to the earliest point of the saved ring buffer. This is because the ring buffer being read by user space on a crash may not save the part it read. Rewinding the head page back to the earliest saved position helps keep those events from being lost. The number of events is also read during boot up and displayed in the stats file in the tracefs directory. It's also used for other accounting as well. On boot up, the "reader page" is accounted for but a rewind may put it back into the buffer and then the reader page may be accounted for again. Save off the original reader page and skip accounting it when scanning the pages in the ring buffer. * tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not double count the reader_page
2026-04-24	ring-buffer: Do not double count the reader_page	Masami Hiramatsu (Google)
	cpu_buffer->reader_page is updated if there are unwound pages. After that update, skip the page if it is the original reader_page, because the original reader_page has already been checked. Cc: stable@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177701353063.2223789.1471163147644103306.stgit@mhiramat.tok.corp.google.com Fixes: ca296d32ece3 ("tracing: ring_buffer: Rewind persistent ring buffer on reboot") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-24	sched_ext: sync disable_irq_work in bpf_scx_unreg()	Richard Cheng
	When unregistering my self-written scx scheduler, the following panic occurs. [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8! [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full) [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107 [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms [ 230.116843] pc : 0xffff80009bc2c1f8 [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0 [ 230.217749] Call trace: [ 230.228515] 0xffff80009bc2c1f8 (P) [ 230.232077] dequeue_task+0x84/0x188 [ 230.235728] sched_change_begin+0x1dc/0x250 [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240 [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0 [ 230.249701] ___migrate_enable+0x4c/0xa0 [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0 [ 230.258246] process_one_work+0x184/0x540 [ 230.262342] worker_thread+0x19c/0x348 [ 230.266170] kthread+0x13c/0x150 [ 230.269465] ret_from_fork+0x10/0x20 [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000) [ 230.287621] ---[ end trace 0000000000000000 ]--- [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt The root cause is that the JIT page backing ops->quiescent() is freed before all callers of that function have stopped. The expected ordering during teardown is: bitmap_zero(sch->has_op) + synchronize_rcu() -> guarantees no CPU will ever call sch->ops.* again -> only THEN free the BPF struct_ops JIT page bpf_scx_unreg() is supposed to enforce the order, but after commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work"), disable_work is no longer queued directly, causing kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops map too early and the JIT page is poisoned with AARCH64_BREAK_FAULT before disable_workfn ever executes. So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent) as true and calls ops.quiescent, which hits the poisoned page and triggers a BRK panic. Add a helper scx_flush_disable_work() so that future users that want to flush disable_work can use it. Also amend the calls in scx_root_enable_workfn() and scx_sub_enable_workfn(), which have a similar pattern in the error path. Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Richard Cheng <icheng@nvidia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24	Merge tag 'locking-urgent-2026-04-24' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: - Fix ww_mutex regression, which caused hangs/pauses in some DRM drivers - Fix rtmutex proxy-rollback bug * tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/mutex: Fix ww_mutex wait_list operations rtmutex: Use waiter::task instead of current in remove_waiter()
2026-04-23	cgroup: Increment nr_dying_subsys_* from rmdir context	Petr Malat
	Incrementing nr_dying_subsys_* in offline_css(), which is executed by the cgroup_offline_wq worker, leads to a race where a user can see the value as 0 if they read cgroup.stat after calling rmdir and before the worker executes. This makes the user wrongly expect resources released by the removed cgroup to be available for a new assignment. Increment nr_dying_subsys_* from kill_css(), which is called from the cgroup_rmdir() context. Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat") Signed-off-by: Petr Malat <oss@malat.biz> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23	cgroup/rdma: refactor resource parsing with match_table_t/match_token()	Tao Cui
	Replace the hand-rolled strsep/strcmp/match_string parsing in rdmacg_resource_set_max() with a match_table_t and match_token() pattern, following the convention used by user_proactive_reclaim() and ioc_cost_model_write(). The old strncmp(value, RDMACG_MAX_STR, strlen(value)) also had two bugs that are fixed by this refactor: - It matched "ma" as "max" because strncmp only compared the shorter strlen(value) bytes. - It silently accepted "hca_handle=" (empty value) as "max" because strncmp with n=0 always returns 0. The match_token() approach also robustly handles extra whitespace in the input by splitting on " \t\n" and skipping empty tokens. Suggested-by: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Tao Cui <cuitao@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
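The two strncmp() pitfalls named above are easy to reproduce in user space. The following is a minimal sketch: `MAX_STR` stands in for RDMACG_MAX_STR, and `old_is_max()` / `exact_is_max()` are hypothetical helper names. The exact comparison is shown only for contrast and is not the match_token() code the commit introduces.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_STR "max"	/* stand-in for RDMACG_MAX_STR */

/* Old-style check: compares only strlen(value) bytes of the input. */
static inline bool old_is_max(const char *value)
{
	return strncmp(value, MAX_STR, strlen(value)) == 0;
}

/* Exact comparison, for contrast only. */
static inline bool exact_is_max(const char *value)
{
	return strcmp(value, MAX_STR) == 0;
}

/* Self-check covering both pitfalls. */
void max_str_selfcheck(void)
{
	assert(old_is_max("max"));
	assert(old_is_max("ma"));	/* prefix accepted as "max" */
	assert(old_is_max(""));		/* n == 0: strncmp() returns 0 */
	assert(!exact_is_max("ma"));
	assert(!exact_is_max(""));
	assert(exact_is_max("max"));
}
```

Any proper prefix of "max", including the empty string, passes the old check because the comparison length is taken from the user-supplied value rather than from the keyword.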
2026-04-23	Merge branch 'for-7.1-fixes' into for-7.2	Tejun Heo
	Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23	sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched	zhidao su
	local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of the scheduler instance actually managing the task. When CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler whose ops.dequeue() callback differs from root's. Using scx_root causes the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is never invoked, violating the custody exit semantics. Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq() and move_local_task_to_local_dsq(), and propagating the correct scheduler from their callers dispatch_enqueue(), move_task_between_dsqs(), and consume_dispatch_q(). This is consistent with dispatch_enqueue()'s non-local path which already passes 'sch' directly to call_task_dequeue() for global/bypass DSQs. Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23	bpf: Introduce bpf register BPF_REG_PARAMS	Yonghong Song
	Introduce BPF_REG_PARAMS as a dedicated BPF register for stack argument accesses. It occupies the BPF register number 11 (R11), which is used as the base pointer for the stack argument area, keeping it separate from the R10-based (BPF_REG_FP) program stack. The kernel-internal hidden register BPF_REG_AX previously occupied slot 11 (MAX_BPF_REG). With BPF_REG_PARAMS taking that slot, BPF_REG_AX moves to slot 12 and MAX_BPF_EXT_REG increases accordingly. Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033506.2542005-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
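The resulting slot layout can be summarised in a small sketch. The `SKETCH_*` names are hypothetical and are not the kernel's definitions; the value of 13 for the extended register count is an assumption inferred from "increases accordingly".

```c
#include <assert.h>

/* Hypothetical mirror of the register slots described in the commit message. */
enum sketch_reg {
	SKETCH_REG_FP      = 10,	/* R10: base of the program stack */
	SKETCH_REG_PARAMS  = 11,	/* R11: base of the stack argument area */
	SKETCH_REG_AX      = 12,	/* hidden register, previously in slot 11 */
	SKETCH_MAX_EXT_REG = 13,	/* assumed: one past the last slot */
};

/* The new register sits directly after the frame pointer slot. */
static_assert(SKETCH_REG_PARAMS == SKETCH_REG_FP + 1,
	      "stack argument base follows the frame pointer");
static_assert(SKETCH_REG_AX == SKETCH_REG_PARAMS + 1,
	      "hidden register moves up by one slot");
```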
2026-04-23	bpf: Prepare verifier logs for upcoming kfunc stack arguments	Yonghong Song
	This change prepares verifier log reporting for upcoming kfunc stack argument support. Currently verifier log code mostly assumes that an argument can be described directly by a register number. That works for arguments passed in `R1` to `R5`, but it does not work once kfunc arguments can also be passed on the stack. Introduce an opaque `argno_t` type that encodes both register-based and arg-based references. Four helpers form the interface: - argno_from_reg(regno): create from a register number - argno_from_arg(arg): create from a 1-based arg number - reg_from_argno(a): extract register number, or -1 - arg_from_argno(a): extract arg number, or -1 reg_arg_name() converts an argno_t to a human-readable string for verifier logs: "R%d" for register arguments, or "*(R11-off)" for stack arguments beyond R5. Update selftests accordingly. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033501.2539667-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
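One possible shape for such an opaque type is sketched below. The four helper names and their -1 return convention come from the commit message; the internal encoding (non-negative values for registers, negated 1-based numbers for arguments) is purely an assumption made for illustration and may differ from the actual patch.

```c
#include <assert.h>

/* Assumed encoding: v >= 0 is a register number, v < 0 is -(1-based arg). */
typedef struct { int v; } argno_t;

static inline argno_t argno_from_reg(int regno)
{
	return (argno_t){ .v = regno };
}

static inline argno_t argno_from_arg(int arg)
{
	return (argno_t){ .v = -arg };
}

/* Returns the register number, or -1 for an arg-based reference. */
static inline int reg_from_argno(argno_t a)
{
	return a.v >= 0 ? a.v : -1;
}

/* Returns the 1-based arg number, or -1 for a register-based reference. */
static inline int arg_from_argno(argno_t a)
{
	return a.v < 0 ? -a.v : -1;
}

/* Self-check: round trips and the -1 convention. */
void argno_selfcheck(void)
{
	assert(reg_from_argno(argno_from_reg(3)) == 3);
	assert(arg_from_argno(argno_from_reg(3)) == -1);
	assert(arg_from_argno(argno_from_arg(6)) == 6);
	assert(reg_from_argno(argno_from_arg(6)) == -1);
}
```

Wrapping the integer in a struct keeps callers from mixing a raw register number with an encoded reference, which is the usual motivation for an opaque type of this kind.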
2026-04-23	bpf: Rename existing argno to arg	Yonghong Song
	To support stack arguments, in later patches, argno will represent both registers and stack arguments. To avoid confusion, rename existing argno to arg. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033456.2539340-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	bpf: Refactor to handle memory and size together	Yonghong Song
	Similar to the previous patch, try to pass bpf_reg_state from caller to callee. Both mem_reg and size_reg are passed to helper functions. This is important for stack arguments as they may be beyond registers 1-5. Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033451.2539065-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	bpf: Refactor to avoid redundant calculation of bpf_reg_state	Yonghong Song
	In many cases, once a bpf_reg_state is defined, it can be passed to callees. Otherwise, the callee will need to get bpf_reg_state again based on regno. More importantly, this is needed for later stack arguments for kfuncs since the register state for stack arguments does not have a corresponding regno. So it makes sense to pass reg state to callees. The following is the only change to avoid compilation warning: static int sanitize_check_bounds(struct bpf_verifier_env *env, const struct bpf_insn *insn, - const struct bpf_reg_state *dst_reg) + struct bpf_reg_state *dst_reg) Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033446.2538321-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	bpf: Remove WARN_ON_ONCE in check_kfunc_mem_size_reg()	Yonghong Song
	The warning is too late if it does happen. Remove it. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033441.2538149-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	bpf: Fix tail_call_reachable leak	Yonghong Song
	In check_max_stack_depth_subprog(), the local variable tail_call_reachable is set when entering a callee that has a tail call, but never reset when popping back to the parent. This causes the flag to leak across sibling subprogs in the DFS traversal. This results in unnecessary JIT overhead: the JIT emits tail call counter preservation code for subprogs that can never be reached via a tail call path. Fix this by resetting tail_call_reachable to the parent's actual per-subprog flag when popping a frame. If the parent was already marked tail_call_reachable by a previous sibling's traversal, the local variable stays true. Otherwise it resets to false, so subsequent siblings start with a clean state. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033435.2538013-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	bpf: Remove unused parameter from check_map_kptr_access()	Yonghong Song
	The parameter 'regno' in check_map_kptr_access() is unused. Remove it. Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260423033430.2537615-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-23	locking/mutex: Fix ww_mutex wait_list operations	Peter Zijlstra
	Chaitanya, John and Mikhail reported commit 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex") wrecked ww_mutex. Specifically there were 2 issues: - __ww_waiter_prev() had the termination condition wrong; it would terminate when the previous entry was the first, which results in a truncated iteration: W3, W2, (no W1). - __mutex_add_waiter(@pos != NULL), as used by __ww_waiter_add() / __ww_mutex_add_waiter(); this inserts @waiter before @pos (which is what list_add_tail() does). But this should then also update lock->first_waiter. Much thanks to Prateek for spotting the __mutex_add_waiter() issue! Fixes: 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex") Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Closes: https://lore.kernel.org/r/af005996-05e9-4336-8450-d14ca652ba5d%40intel.com Reported-by: John Stultz <jstultz@google.com> Closes: https://lore.kernel.org/r/CANDhNCq%3Doizzud3hH3oqGzTrcjB8OwGeineJ3mwZuGdDWG8fRQ%40mail.gmail.com Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Closes: https://lore.kernel.org/r/CABXGCsO5fKq2nD9nO8yO1z50ZzgCPWqueNXHANjntaswoOh2Dg@mail.gmail.com Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Link: https://patch.msgid.link/20260422092335.GH3102924%40noisy.programming.kicks-ass.net