summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-03-03bpf: extract check_global_subprog_return_code() for clarityEduard Zingerman
From: Eduard Zingerman <eddyz87@gmail.com> Both main progs and subprogs use the same function in the verifier, check_return_code, to verify the type and value range of the register being returned. However, subprogs only need a subset of the logic in check_return_code. this also goes the way - check_return_code explicitly checks whether it is handling a subprogram in multiple places, complicating the logic. Separate the handling of the two into separate functions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-4-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03bpf: Extract program_returns_void() for clarityEduard Zingerman
From: Eduard Zingerman <eddyz87@gmail.com> The check_return_code function has explicit checks on whether a program type can return void. Factor this logic out to reuse it later for both main progs and subprogs. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-3-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03bpf: Factor out program return value calculationEmil Tsalapatis
Factor the return value range calculation logic in check_return_code out of the function in preparation for separating the return value validation logic for BPF_EXIT and bpf_throw(). No functional changes. The change made in return_retval_code's handling of PROG_TRACING program types (not error'ing on the default case) is a no-op because the match on the program attach type is exhaustive. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260228184759.108145-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03bpf: Relax fixed offset check for PTR_TO_CTXKumar Kartikeya Dwivedi
The limitation on fixed offsets stems from the fact that certain program types rewrite the accesses to the context structure and translate them to accesses to the real underlying type. Technically, in the past, we could have stashed the register offset in insn_aux and made rewrites work, but we've never needed it in the past since the offset for such context structures easily fit in the s16 signed instruction offset. Regardless, the consequence is that for program types where the program type's verifier ops doesn't supply a convert_ctx_access callback, we unnecessarily reject accesses with a modified ctx pointer (i.e., one whose offset has been shifted) in check_ptr_off_reg. Make an exception for such program types (like syscall, tracepoint, raw_tp, etc.). Pass in fixed_off_ok as true to __check_ptr_off_reg for such cases, and accumulate the register offset into insn->off passed to check_ctx_access. In particular, the accumulation is critical since we need to correctly track the max_ctx_offset which is used for bounds checking the buffer for syscall programs at runtime. Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Schatzberg <dschatzberg@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260227005725.1247305-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03bpf: Add missing XDP_ABORTED handling in cpumapAnand Kumar Shaw
cpu_map_bpf_prog_run_xdp() handles XDP_PASS, XDP_REDIRECT, and XDP_DROP but is missing an XDP_ABORTED case. Without it, XDP_ABORTED falls into the default case which logs a misleading "invalid XDP action" warning instead of tracing the abort via trace_xdp_exception(). Add the missing XDP_ABORTED case with trace_xdp_exception(), matching the handling already present in the skb path (cpu_map_bpf_prog_run_skb), devmap (dev_map_bpf_prog_run), and the generic XDP path (do_xdp_generic). Also pass xdpf->dev_rx instead of NULL to bpf_warn_invalid_xdp_action() in the default case, so the warning includes the actual device name. This aligns with the generic XDP path in net/core/dev.c which already passes the real device. Signed-off-by: Anand Kumar Shaw <anandkrshawheritage@gmail.com> Link: https://lore.kernel.org/r/20260218042924.42931-1-anandkrshawheritage@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03s390: Introduce bpf_get_lowcore() kfuncIlya Leoshkevich
Implementing BPF version of preempt_count() requires accessing lowcore from BPF. Since lowcore can be relocated, open-coding (struct lowcore *)0 does not work, so add a kfunc. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/20260217160813.100855-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-03sched_ext: Remove redundant css_put() in scx_cgroup_init()Cheng-Yang Chou
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy under cgroup_lock(). It does not increment the reference counts on yielded css structs. According to the cgroup documentation, css_put() should only be used to release a reference obtained via css_get() or css_tryget_online(). Since the iterator does not use either of these to acquire a reference, calling css_put() in the error path of scx_cgroup_init() causes a refcount underflow. Remove the unbalanced css_put() to prevent a potential Use-After-Free (UAF) vulnerability. Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeoutzhidao su
scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable(): WRITE_ONCE(scx_watchdog_timeout, timeout); However, three read-side accesses use plain reads without the matching READ_ONCE(): /* check_rq_for_timeouts() - L2824 */ last_runnable + scx_watchdog_timeout /* scx_watchdog_workfn() - L2852 */ scx_watchdog_timeout / 2 /* scx_enable() - L5179 */ scx_watchdog_timeout / 2 The KCSAN documentation requires that if one accessor uses WRITE_ONCE() to annotate lock-free access, all other accesses must also use the appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair incomplete and can trigger KCSAN warnings. Note that scx_tick() already uses the correct READ_ONCE() annotation: last_check + READ_ONCE(scx_watchdog_timeout) Fix the three remaining plain reads to match, making all accesses to scx_watchdog_timeout consistently annotated and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02audit: remove redundant initialization of static variables to 0Ricardo Robaina
Static variables are automatically initialized to 0 by the compiler. Remove the redundant explicit assignments in kernel/audit.c to clean up the code, align with standard kernel coding style, and fix the following checkpatch.pl errors: ./scripts/checkpatch.pl kernel/audit.c | grep -A2 ERROR: ERROR: do not initialise statics to 0 + static unsigned long last_check = 0; -- ERROR: do not initialise statics to 0 + static int messages = 0; -- ERROR: do not initialise statics to 0 + static unsigned long last_msg = 0; Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2026-03-02ftrace: Add missing ftrace_lock to update_ftrace_direct_add/delJiri Olsa
Ihor and Kumar reported splat from ftrace_get_addr_curr [1], which happened because of the missing ftrace_lock in update_ftrace_direct_add/del functions allowing concurrent access to ftrace internals. The ftrace_update_ops function must be guarded by ftrace_lock, adding that. Fixes: 05dc5e9c1fe1 ("ftrace: Add update_ftrace_direct_add function") Fixes: 8d2c1233f371 ("ftrace: Add update_ftrace_direct_del function") Reported-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Closes: https://lore.kernel.org/bpf/1b58ffb2-92ae-433a-ba46-95294d6edea2@linux.dev/ Tested-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20260302081622.165713-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-02sched_ext: Replace naked scx_root dereferences in kobject callbackszhidao su
scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly. This is problematic for two reasons: 1. The file-level comment explicitly identifies naked scx_root dereferences as a temporary measure that needs to be replaced with proper per-instance access. 2. scx_attr_events_show(), the neighboring sysfs show function in the same group, already uses the correct pattern: struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj); Having inconsistent access patterns in the same sysfs/uevent group is error-prone. The kobject embedded in struct scx_sched is initialized as: kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root"); so container_of(kobj, struct scx_sched, kobj) correctly retrieves the owning scx_sched instance in both callbacks. Replace the naked scx_root dereferences with container_of()-based access, consistent with scx_attr_events_show() and in preparation for proper multi-instance scx_sched support. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02sched_ext: Use READ_ONCE() for the read side of dsq->nr updatezhidao su
scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding any lock, making dsq->nr a lock-free concurrently accessed variable. However, dsq_mod_nr(), the sole writer of dsq->nr, only uses WRITE_ONCE() on the write side without the matching READ_ONCE() on the read side: WRITE_ONCE(dsq->nr, dsq->nr + delta); ^^^^^^^ plain read -- KCSAN data race The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain read on the right-hand side of WRITE_ONCE() leaves the pair incomplete and will trigger KCSAN warnings. Fix by using READ_ONCE() for the read side of the update: WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); This is consistent with scx_bpf_dsq_nr_queued() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02blktrace: fix __this_cpu_read/write in preemptible contextChaitanya Kulkarni
tracing_record_cmdline() internally uses __this_cpu_read() and __this_cpu_write() on the per-CPU variable trace_cmdline_save, and trace_save_cmdline() explicitly asserts preemption is disabled via lockdep_assert_preemption_disabled(). These operations are only safe when preemption is off, as they were designed to be called from the scheduler context (probe_wakeup_sched_switch() / probe_wakeup()). __blk_add_trace() was calling tracing_record_cmdline(current) early in the blk_tracer path, before ring buffer reservation, from process context where preemption is fully enabled. This triggers the following using blktests/blktrace/002: blktrace/002 (blktrace ftrace corruption with sysfs trace) [failed] runtime 0.367s ... 0.437s something found in dmesg: [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 ... (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message) [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 [ 81.362895] check_preemption_disabled+0xce/0xe0 [ 81.362902] tracing_record_cmdline+0x10/0x40 [ 81.362923] __blk_add_trace+0x307/0x5d0 [ 81.362934] ? lock_acquire+0xe0/0x300 [ 81.362940] ? iov_iter_extract_pages+0x101/0xa30 [ 81.362959] blk_add_trace_bio+0x106/0x1e0 [ 81.362968] submit_bio_noacct_nocheck+0x24b/0x3a0 [ 81.362979] ? lockdep_init_map_type+0x58/0x260 [ 81.362988] submit_bio_wait+0x56/0x90 [ 81.363009] __blkdev_direct_IO_simple+0x16c/0x250 [ 81.363026] ? __pfx_submit_bio_wait_endio+0x10/0x10 [ 81.363038] ? rcu_read_lock_any_held+0x73/0xa0 [ 81.363051] blkdev_read_iter+0xc1/0x140 [ 81.363059] vfs_read+0x20b/0x330 [ 81.363083] ksys_read+0x67/0xe0 [ 81.363090] do_syscall_64+0xbf/0xf00 [ 81.363102] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 81.363106] RIP: 0033:0x7f281906029d [ 81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec [ 81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d [ 81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000 [ 81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000 [ 81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000 [ 81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a [ 81.363142] </TASK> The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(), and blk_add_trace_rq() paths as well. The purpose of tracing_record_cmdline() is to cache the task->comm for a given PID so that the trace can later resolve it. It is only meaningful when a trace event is actually being recorded. Ring buffer reservation via ring_buffer_lock_reserve() disables preemption, and preemption remains disabled until the event is committed :- __blk_add_trace() __trace_buffer_lock_reserve() __trace_buffer_lock_reserve() ring_buffer_lock_reserve() preempt_disable_notrace(); <--- With this fix blktests for blktrace pass: blktests (master) # ./check blktrace blktrace/001 (blktrace zone management command tracing) [passed] runtime 3.650s ... 3.647s blktrace/002 (blktrace ftrace corruption with sysfs trace) [passed] runtime 0.411s ... 0.384s Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-01Merge tag 'timers-urgent-2026-03-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Improve the inlining of jiffies_to_msecs() and jiffies_to_usecs(), for the common HZ=100, 250 or 1000 cases. Only use a function call for odd HZ values like HZ=300 that generate more code. The function call overhead showed up in performance tests of the TCP code" * tag 'timers-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/jiffies: Inline jiffies_to_msecs() and jiffies_to_usecs()
2026-03-01Merge tag 'sched-urgent-2026-03-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix zero_vruntime tracking when there's a single task running - Fix slice protection logic - Fix the ->vprot logic for reniced tasks - Fix lag clamping in mixed slice workloads - Fix objtool uaccess warning (and bug) in the !CONFIG_RSEQ_SLICE_EXTENSION case caused by unexpected un-inlining, which triggers with older compilers - Fix a comment in the rseq registration rseq_size bound check code - Fix a legacy RSEQ ABI quirk that handled 32-byte area sizes differently, which special size we now reached naturally and want to avoid. The visible ugliness of the new reserved field will be avoided the next time the RSEQ area is extended. * tag 'sched-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: slice ext: Ensure rseq feature size differs from original rseq size rseq: Clarify rseq registration rseq_size bound check comment sched/core: Fix wakeup_preempt's next_class tracking rseq: Mark rseq_arm_slice_extension_timer() __always_inline sched/fair: Fix lag clamp sched/eevdf: Update se->vprot in reweight_entity() sched/fair: Only set slice protection at pick time sched/fair: Fix zero_vruntime tracking
2026-03-01Merge tag 'perf-urgent-2026-03-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events fixes from Ingo Molnar: - Fix lock ordering bug found by lockdep in perf_event_wakeup() - Fix uncore counter enumeration on Granite Rapids and Sierra Forest - Fix perf_mmap() refcount bug found by Syzkaller - Fix __perf_event_overflow() vs perf_remove_from_context() race * tag 'perf-urgent-2026-03-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix __perf_event_overflow() vs perf_remove_from_context() race perf/core: Fix refcount bug and potential UAF in perf_mmap perf/x86/intel/uncore: Add per-scheduler IMC CAS count events perf/core: Fix invalid wait context in ctx_sched_in()
2026-03-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf before 7.0-rc2Alexei Starovoitov
Cross-merge BPF and other fixes after downstream PR. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-28Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix alignment of arm64 JIT buffer to prevent atomic tearing (Fuad Tabba) - Fix invariant violation for single value tnums in the verifier (Harishankar Vishwanathan, Paul Chaignon) - Fix a bunch of issues found by ASAN in selftests/bpf (Ihor Solodrai) - Fix race in devmpa and cpumap on PREEMPT_RT (Jiayuan Chen) - Fix show_fdinfo of kprobe_multi when cookies are not present (Jiri Olsa) - Fix race in freeing special fields in BPF maps to prevent memory leaks (Kumar Kartikeya Dwivedi) - Fix OOB read in dmabuf_collector (T.J. Mercier) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (36 commits) selftests/bpf: Avoid simplification of crafted bounds test selftests/bpf: Test refinement of single-value tnum bpf: Improve bounds when tnum has a single possible value bpf: Introduce tnum_step to step through tnum's members bpf: Fix race in devmap on PREEMPT_RT bpf: Fix race in cpumap on PREEMPT_RT selftests/bpf: Add tests for special fields races bpf: Retire rcu_trace_implies_rcu_gp() from local storage bpf: Delay freeing fields in local storage bpf: Lose const-ness of map in map_check_btf() bpf: Register dtor for freeing special fields selftests/bpf: Fix OOB read in dmabuf_collector selftests/bpf: Fix a memory leak in xdp_flowtable test bpf: Fix stack-out-of-bounds write in devmap bpf: Fix kprobe_multi cookies access in show_fdinfo callback bpf, arm64: Force 8-byte alignment for JIT buffer to prevent atomic tearing selftests/bpf: Don't override SIGSEGV handler with ASAN selftests/bpf: Check BPFTOOL env var in detect_bpftool_path() selftests/bpf: Fix out-of-bounds array access bugs reported by ASAN selftests/bpf: Fix array bounds warning in jit_disasm_helpers ...
2026-02-27bpf: Improve bounds when tnum has a single possible valuePaul Chaignon
We're hitting an invariant violation in Cilium that sometimes leads to BPF programs being rejected and Cilium failing to start [1]. The following extract from verifier logs shows what's happening: from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0 ; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337 236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 237: (16) if w9 == 0xf00 goto pc+1 verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0) We reach instruction 236 with two possible values for R9, 0xe00 and 0xf00. This is perfectly reflected in the tnum, but of course the ranges are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path at instruction 236 allows the verifier to reduce the range to [0xe01; 0xf00]. The tnum is however not updated. With these ranges, at instruction 237, the verifier is not able to deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is explored first, the verifier refines the bounds using the assumption that R9 != 0xf00, and ends up with an invariant violation. This pattern of impossible branch + bounds refinement is common to all invariant violations seen so far. The long-term solution is likely to rely on the refinement + invariant violation check to detect dead branches, as started by Eduard. To fix the current issue, we need something with less refactoring that we can backport. This patch uses the tnum_step helper introduced in the previous patch to detect the above situation. In particular, three cases are now detected in the bounds refinement: 1. The u64 range and the tnum only overlap in umin. u64: ---[xxxxxx]----- tnum: --xx----------x- 2. The u64 range and the tnum only overlap in the maximum value represented by the tnum, called tmax. u64: ---[xxxxxx]----- tnum: xx-----x-------- 3. The u64 range and the tnum only overlap in between umin (excluded) and umax. u64: ---[xxxxxx]----- tnum: xx----x-------x- To detect these three cases, we call tnum_step(tnum, umin), which returns the smallest member of the tnum greater than umin, called tnum_next here. We're in case (1) if umin is part of the tnum and tnum_next is greater than umax. We're in case (2) if umin is not part of the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if umin is not part of the tnum, tnum_next is inferior or equal to umax, and calling tnum_step a second time gives us a value past umax. This change implements these three cases. With it, the above bytecode looks as follows: 0: (85) call bpf_get_prandom_u32#7 ; R0=scalar() 1: (47) r0 |= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff)) 2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) 3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840 4: (15) if r0 == 0xf00 goto pc+1 4: R0=3840 6: (95) exit In addition to the new selftests, this change was also verified with Agni [3]. For the record, the raw SMT is available at [4]. The property it verifies is that: If a concrete value x is contained in all input abstract values, after __update_reg_bounds, it will continue to be contained in all output abstract values. Link: https://github.com/cilium/cilium/issues/44216 [1] Link: https://pchaigno.github.io/test-verifier-complexity.html [2] Link: https://github.com/bpfverif/agni [3] Link: https://pastebin.com/raw/naCfaqNx [4] Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com> Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Introduce tnum_step to step through tnum's membersHarishankar Vishwanathan
This commit introduces tnum_step(), a function that, when given t, and a number z returns the smallest member of t larger than z. The number z must be greater or equal to the smallest member of t and less than the largest member of t. The first step is to compute j, a number that keeps all of t's known bits, and matches all unknown bits to z's bits. Since j is a member of the t, it is already a candidate for result. However, we want our result to be (minimally) greater than z. There are only two possible cases: (1) Case j <= z. In this case, we want to increase the value of j and make it > z. (2) Case j > z. In this case, we want to decrease the value of j while keeping it > z. (Case 1) j <= z t = xx11x0x0 z = 10111101 (189) j = 10111000 (184) ^ k (Case 1.1) Let's first consider the case where j < z. We will address j == z later. Since z > j, there had to be a bit position that was 1 in z and a 0 in j, beyond which all positions of higher significance are equal in j and z. Further, this position could not have been unknown in a, because the unknown positions of a match z. This position had to be a 1 in z and known 0 in t. Let k be position of the most significant 1-to-0 flip. In our example, k = 3 (starting the count at 1 at the least significant bit). Setting (to 1) the unknown bits of t in positions of significance smaller than k will not produce a result > z. Hence, we must set/unset the unknown bits at positions of significance higher than k. Specifically, we look for the next larger combination of 1s and 0s to place in those positions, relative to the combination that exists in z. We can achieve this by concatenating bits at unknown positions of t into an integer, adding 1, and writing the bits of that result back into the corresponding bit positions previously extracted from z. >From our example, considering only positions of significance greater than k: t = xx..x z = 10..1 + 1 ----- 11..0 This is the exact combination 1s and 0s we need at the unknown bits of t in positions of significance greater than k. Further, our result must only increase the value minimally above z. Hence, unknown bits in positions of significance smaller than k should remain 0. We finally have, result = 11110000 (240) (Case 1.2) Now consider the case when j = z, for example t = 1x1x0xxx z = 10110100 (180) j = 10110100 (180) Matching the unknown bits of the t to the bits of z yielded exactly z. To produce a number greater than z, we must set/unset the unknown bits in t, and *all* the unknown bits of t candidates for being set/unset. We can do this similar to Case 1.1, by adding 1 to the bits extracted from the masked bit positions of z. Essentially, this case is equivalent to Case 1.1, with k = 0. t = 1x1x0xxx z = .0.1.100 + 1 --------- .0.1.101 This is the exact combination of bits needed in the unknown positions of t. After recalling the known positions of t, we get result = 10110101 (181) (Case 2) j > z t = x00010x1 z = 10000010 (130) j = 10001011 (139) ^ k Since j > z, there had to be a bit position which was 0 in z, and a 1 in j, beyond which all positions of higher significance are equal in j and z. This position had to be a 0 in z and known 1 in t. Let k be the position of the most significant 0-to-1 flip. In our example, k = 4. Because of the 0-to-1 flip at position k, a member of t can become greater than z if the bits in positions greater than k are themselves >= to z. To make that member *minimally* greater than z, the bits in positions greater than k must be exactly = z. Hence, we simply match all of t's unknown bits in positions more significant than k to z's bits. In positions less significant than k, we set all t's unknown bits to 0 to retain minimality. In our example, in positions of greater significance than k (=4), t=x000. These positions are matched with z (1000) to produce 1000. In positions of lower significance than k, t=10x1. All unknown bits are set to 0 to produce 1001. The final result is: result = 10001001 (137) This concludes the computation for a result > z that is a member of t. The procedure for tnum_step() in this commit implements the idea described above. As a proof of correctness, we verified the algorithm against a logical specification of tnum_step. The specification asserts the following about the inputs t, z and output res that: 1. res is a member of t, and 2. res is strictly greater than z, and 3. there does not exist another value res2 such that 3a. res2 is also a member of t, and 3b. res2 is greater than z 3c. res2 is smaller than res We checked the implementation against this logical specification using an SMT solver. The verification formula in SMTLIB format is available at [1]. The verification returned an "unsat": indicating that no input assignment exists for which the implementation and the specification produce different outputs. In addition, we also automatically generated the logical encoding of the C implementation using Agni [2] and verified it against the same specification. This verification also returned an "unsat", confirming that the implementation is equivalent to the specification. The formula for this check is also available at [3]. Link: https://pastebin.com/raw/2eRWbiit [1] Link: https://github.com/bpfverif/agni [2] Link: https://pastebin.com/raw/EztVbBJ2 [3] Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu> Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu> Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Fix race in devmap on PREEMPT_RTJiayuan Chen
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __dev_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_xmit_all(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames. If preempted after the snapshot, a second task can call bq_enqueue() -> bq_xmit_all() on the same bq, transmitting (and freeing) the same frames. When the first task resumes, it operates on stale pointers in bq->q[], causing use-after-free. 2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying bq->count and bq->q[] while bq_xmit_all() is reading them. 3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and bq->xdp_prog after bq_xmit_all(). If preempted between bq_xmit_all() return and bq->dev_rx = NULL, a preempting bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to the flush_list, and enqueues a frame. When __dev_flush() resumes, it clears dev_rx and removes bq from the flush_list, orphaning the newly enqueued frame. 4. __list_del_clearprev() on flush_node: similar to the cpumap race, both tasks can call __list_del_clearprev() on the same flush_node, the second dereferences the prev pointer already set to NULL. The race between task A (__dev_flush -> bq_xmit_all) and task B (bq_enqueue -> bq_xmit_all) on the same CPU: Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect) ---------------------- -------------------------------- __dev_flush(flush_list) bq_xmit_all(bq) cnt = bq->count /* e.g. 16 */ /* start iterating bq->q[] */ <-- CFS preempts Task A --> bq_enqueue(dev, xdpf) bq->count == DEV_MAP_BULK_SIZE bq_xmit_all(bq, 0) cnt = bq->count /* same 16! */ ndo_xdp_xmit(bq->q[]) /* frames freed by driver */ bq->count = 0 <-- Task A resumes --> ndo_xdp_xmit(bq->q[]) /* use-after-free: frames already freed! */ Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring it in bq_enqueue() and __dev_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Fix race in cpumap on PREEMPT_RTJiayuan Chen
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU. The original code assumes bq_enqueue() and __cpu_map_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_flush_to_queue(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently. This leads to several races: 1. Double __list_del_clearprev(): after bq->count is reset in bq_flush_to_queue(), a preempting task can call bq_enqueue() -> bq_flush_to_queue() on the same bq when bq->count reaches CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev() on the same bq->flush_node, the second call dereferences the prev pointer that was already set to NULL by the first. 2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt the packet queue while bq_flush_to_queue() is processing it. The race between task A (__cpu_map_flush -> bq_flush_to_queue) and task B (bq_enqueue -> bq_flush_to_queue) on the same CPU: Task A (xdp_do_flush) Task B (cpu_map_enqueue) ---------------------- ------------------------ bq_flush_to_queue(bq) spin_lock(&q->producer_lock) /* flush bq->q[] to ptr_ring */ bq->count = 0 spin_unlock(&q->producer_lock) bq_enqueue(rcpu, xdpf) <-- CFS preempts Task A --> bq->q[bq->count++] = xdpf /* ... more enqueues until full ... */ bq_flush_to_queue(bq) spin_lock(&q->producer_lock) /* flush to ptr_ring */ spin_unlock(&q->producer_lock) __list_del_clearprev(flush_node) /* sets flush_node.prev = NULL */ <-- Task A resumes --> __list_del_clearprev(flush_node) flush_node.prev->next = ... /* prev is NULL -> kernel oops */ Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it in bq_enqueue() and __cpu_map_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq. To reproduce, insert an mdelay(100) between bq->count = 0 and __list_del_clearprev() in bq_flush_to_queue(), then run reproducer provided by syzkaller. Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT") Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/ Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Retire rcu_trace_implies_rcu_gp() from local storageKumar Kartikeya Dwivedi
This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Delay freeing fields in local storageKumar Kartikeya Dwivedi
Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since migration should be disabled for RCU callbacks already. Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Lose const-ness of map in map_check_btf()Kumar Kartikeya Dwivedi
BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27bpf: Register dtor for freeing special fieldsKumar Kartikeya Dwivedi
There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the mean time will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. Note that the destructor registration happens in map_check_btf, after the BTF record is populated and (at that point) avaiable for inspection and duplication. Duplication is necessary since the freeing of embedded bpf_mem_alloc can be decoupled from actual map lifetime due to logic introduced to reduce the cost of rcu_barrier()s in mem alloc free path in 9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc."). As such, once all callbacks are done, we must also free the duplicated record. To remove dependency on the bpf_map itself, also stash the key size of the map to obtain value from htab_elem long after the map is gone. [0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27nstree: tighten permission checks for listingChristian Brauner
Even privileged services should not necessarily be able to see other privileged service's namespaces so they can't leak information to each other. Use may_see_all_namespaces() helper that centralizes this policy until the nstree adapts. Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org Fixes: 76b6f5dfb3fd ("nstree: add listns()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@kernel.org # v6.19+ Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-27nsfs: tighten permission checks for ns iteration ioctlsChristian Brauner
Even privileged services should not necessarily be able to see other privileged service's namespaces so they can't leak information to each other. Use may_see_all_namespaces() helper that centralizes this policy until the nstree adapts. Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org Fixes: a1d220d9dafa ("nsfs: iterate through mount namespaces") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@kernel.org # v6.12+ Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-27workqueue: Allow to expose ordered workqueues via sysfsSebastian Andrzej Siewior
Ordered workqueues are not exposed via sysfs because the 'max_active' attribute changes the number actives worker. More than one active worker can break ordering guarantees. This can be avoided by forbidding writes the file for ordered workqueues. Exposing it via sysfs allows to alter other attributes such as the cpumask on which CPU the worker can run. The 'max_active' value shouldn't be changed for BH worker because the core never spawns additional worker and the worker itself can not be preempted. So this make no sense. Allow to expose ordered workqueues via sysfs if requested and forbid changing 'max_active' value for ordered and BH worker. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-27perf/core: Simplify __detach_global_ctx_data()Namhyung Kim
Like in the attach_global_ctx_data() it has a O(N^2) loop to delete task context data for each thread. But perf_free_ctx_data_rcu() can be called under RCU read lock, so just calls it directly rather than iterating the whole thread list again. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-4-namhyung@kernel.org
2026-02-27perf/core: Try to allocate task_ctx_data quicklyNamhyung Kim
The attach_global_ctx_data() has O(N^2) algorithm to allocate the context data for each thread. This caused perfomance problems on large systems with O(100k) threads. Because kmalloc(GFP_KERNEL) can go sleep it cannot be called under the RCU lock. So let's try with GFP_NOWAIT first so that it can proceed in normal cases. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-3-namhyung@kernel.org
2026-02-27perf/core: Pass GFP flags to attach_task_ctx_data()Namhyung Kim
This is a preparation for the next change to reduce the computational complexity in the global context data handling for LBR callstacks. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-2-namhyung@kernel.org
2026-02-27sched: Default enable HRTICK when deferred rearming is enabledPeter Zijlstra
The deferred rearm of the clock event device after an interrupt and and other hrtimer optimizations allow now to enable HRTICK for generic entry architectures. This decouples preemption from CONFIG_HZ, leaving only the periodic load-balancer and various accounting things relying on the tick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.937531564@kernel.org
2026-02-27hrtimer: Try to modify timers in placeThomas Gleixner
When modifying the expiry of a armed timer it is first dequeued, then the expiry value is updated and then it is queued again. This can be avoided when the new expiry value is within the range of the previous and the next timer as that does not change the position in the RB tree. The linked timerqueue allows to peak ahead to the neighbours and check whether the new expiry time is within the range of the previous and next timer. If so just modify the timer in place and spare the enqueue and requeue effort, which might end up rotating the RB tree twice for nothing. This speeds up the handling of frequently rearmed hrtimers, like the hrtick scheduler timer significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.873359816@kernel.org
2026-02-27hrtimer: Use linked timerqueueThomas Gleixner
To prepare for optimizing the rearming of enqueued timers, switch to the linked timerqueue. That allows to check whether the new expiry time changes the position of the timer in the RB tree or not, by checking the new expiry time against the previous and the next timers expiry. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.806643179@kernel.org
2026-02-27hrtimer: Optimize for_each_active_base()Thomas Gleixner
Give the compiler some help to emit way better code. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.599804894@kernel.org
2026-02-27hrtimer: Simplify run_hrtimer_queues()Thomas Gleixner
Replace the open coded container_of() orgy with a trivial clock_base_next_timer() helper. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.532927977@kernel.org
2026-02-27hrtimer: Rework next event evaluationThomas Gleixner
The per clock base cached expiry time allows to do a more efficient evaluation of the next expiry on a CPU. Separate the reprogramming evaluation from the NOHZ idle evaluation which needs to exclude the NOHZ timer to keep the reprogramming path lean and clean. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.468186893@kernel.org
2026-02-27hrtimer: Keep track of first expiring timer per clock baseThomas Gleixner
Evaluating the next expiry time of all clock bases is cache line expensive as the expiry time of the first expiring timer is not cached in the base and requires to access the timer itself, which is definitely in a different cache line. It's way more efficient to keep track of the expiry time on enqueue and dequeue operations as the relevant data is already in the cache at that point. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.404839710@kernel.org
2026-02-27hrtimer: Avoid re-evaluation when nothing changedThomas Gleixner
Most times there is no change between hrtimer_interrupt() deferring the rearm and the invocation of hrtimer_rearm_deferred(). In those cases it's a pointless exercise to re-evaluate the next expiring timer. Cache the required data and use it if nothing changed. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.338569372@kernel.org
2026-02-27hrtimer: Push reprogramming timers into the interrupt return pathPeter Zijlstra
Currently hrtimer_interrupt() runs expired timers, which can re-arm themselves, after which it computes the next expiration time and re-programs the hardware. However, things like HRTICK, a highres timer driving preemption, cannot re-arm itself at the point of running, since the next task has not been determined yet. The schedule() in the interrupt return path will switch to the next task, which then causes a new hrtimer to be programmed. This then results in reprogramming the hardware at least twice, once after running the timers, and once upon selecting the new task. Notably, *both* events happen in the interrupt. By pushing the hrtimer reprogram all the way into the interrupt return path, it runs after schedule() picks the new task and the double reprogram can be avoided. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.273488269@kernel.org
2026-02-27sched/core: Prepare for deferred hrtimer rearmingPeter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that a expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in a immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. Add the rearm checks to the existing sched_hrtick_enter/exit() functions, which already handle the batched rearm of the hrtick timer. For now this is just placing empty stubs at the right places which are all optimized out by the compiler until the guard condition becomes true. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.208580085@kernel.org
2026-02-27softirq: Prepare for deferred hrtimer rearmingPeter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that a expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in a immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. In case that the return from interrupt ends up handling softirqs before reaching the rearm conditions in the return to user entry code functions, a deferred rearm has to be handled before softirq handling enables interrupts as soft interrupt handling can be long and would therefore introduce hard to diagnose latencies to the timer interrupt. Place the for now empty stub call right before invoking the softirq handling routine. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.142854488@kernel.org
2026-02-27entry: Prepare for deferred hrtimer rearmingPeter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that a expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in a immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. As this is only relevant for interrupt to user return split the work masks up and hand them in as arguments from the relevant exit to user functions, which allows the compiler to optimize the deferred handling out for the syscall exit to user case. Add the rearm checks to the approritate places in the exit to user loop and the interrupt return to kernel path, so that the rearming is always guaranteed. In the return to user space path this is handled in the same way as TIF_RSEQ to avoid extra instructions in the fast path, which are truly hurtful for device interrupt heavy work loads as the extra instructions and conditionals while benign at first sight accumulate quickly into measurable regressions. The return from syscall path is completely unaffected due to the above mentioned split so syscall heavy workloads wont have any extra burden. For now this is just placing empty stubs at the right places which are all optimized out by the compiler until the actual functionality is in place. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.066469985@kernel.org
2026-02-27hrtimer: Prepare stubs for deferred rearmingPeter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that a expired timer set NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in a immediate schedule() invocation then it can be deferred until the end of schedule(). To make this correct the affected code parts need to be made aware of this. Provide empty stubs for the deferred rearming mechanism, so that the relevant code changes for entry, softirq and scheduler can be split up into separate changes independent of the actual enablement in the hrtimer code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.000891171@kernel.org
2026-02-27hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearmThomas Gleixner
The upcoming deferred rearming scheme has the same effect as the deferred rearming when the hrtimer interrupt is executing. So it can reuse the in_hrtirq flag, but when it gets deferred beyond the hrtimer interrupt path, then the name does not make sense anymore. Rename it to deferred_rearm upfront to keep the actual functional change separate from the mechanical rename churn. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.935623347@kernel.org
2026-02-27hrtimer: Re-arrange hrtimer_interrupt()Peter Zijlstra
Rework hrtimer_interrupt() such that reprogramming is split out into an independent function at the end of the interrupt. This prepares for reprogramming getting delayed beyond the end of hrtimer_interrupt(). Notably, this changes the hang handling to always wait 100ms instead of trying to keep it proportional to the actual delay. This simplifies the state, also this really shouldn't be happening. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.870639266@kernel.org
2026-02-27hrtimer: Separate remove/enqueue handling for local timersThomas Gleixner
As the base switch can be avoided completely when the base stays the same the remove/enqueue handling can be more streamlined. Split it out into a separate function which handles both in one go which is way more efficient and makes the code simpler to follow. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.737600486@kernel.org
2026-02-27hrtimer: Use NOHZ information for localityThomas Gleixner
The decision to keep a timer which is associated to the local CPU on that CPU does not take NOHZ information into account. As a result there are a lot of hrtimer base switch invocations which end up not switching the base and stay on the local CPU. That's just work for nothing and can be further improved. If the local CPU is part of the NOISE housekeeping mask, then check: 1) Whether the local CPU has the tick running, which means it is either not idle or already expecting a timer soon. 2) Whether the tick is stopped and need_resched() is set, which means the CPU is about to exit idle. This reduces the amount of hrtimer base switch attempts, which end up on the local CPU anyway, significantly and prepares for further optimizations. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.673473029@kernel.org
2026-02-27hrtimer: Optimize for local timersThomas Gleixner
The decision whether to keep timers on the local CPU or on the CPU they are associated to is suboptimal and causes the expensive switch_hrtimer_base() mechanism to be invoked more than necessary. This is especially true for pinned timers. Rewrite the decision logic so that the current base is kept if: 1) The callback is running on the base 2) The timer is associated to the local CPU and the first expiring timer as that allows to optimize for reprogramming avoidance 3) The timer is associated to the local CPU and pinned 4) The timer is associated to the local CPU and timer migration is disabled. Only #2 was covered by the original code, but especially #3 makes a difference for high frequency rearming timers like the scheduler hrtick timer. If timer migration is disabled, then #4 avoids most of the base switches. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.607935269@kernel.org