path: root/kernel
Age  Commit message  Author
2026-02-27  bpf: Fix race in devmap on PREEMPT_RT  Jiayuan Chen
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __dev_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_xmit_all(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently.

This leads to several races:

1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames. If preempted after the snapshot, a second task can call bq_enqueue() -> bq_xmit_all() on the same bq, transmitting (and freeing) the same frames. When the first task resumes, it operates on stale pointers in bq->q[], causing use-after-free.

2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying bq->count and bq->q[] while bq_xmit_all() is reading them.

3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and bq->xdp_prog after bq_xmit_all(). If preempted between bq_xmit_all() return and bq->dev_rx = NULL, a preempting bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to the flush_list, and enqueues a frame. When __dev_flush() resumes, it clears dev_rx and removes bq from the flush_list, orphaning the newly enqueued frame.

4. __list_del_clearprev() on flush_node: similar to the cpumap race, both tasks can call __list_del_clearprev() on the same flush_node, the second dereferences the prev pointer already set to NULL.

The race between task A (__dev_flush -> bq_xmit_all) and task B (bq_enqueue -> bq_xmit_all) on the same CPU:

  Task A (xdp_do_flush)          Task B (ndo_xdp_xmit redirect)
  ----------------------         --------------------------------
  __dev_flush(flush_list)
    bq_xmit_all(bq)
      cnt = bq->count  /* e.g. 16 */
      /* start iterating bq->q[] */
    <-- CFS preempts Task A -->
                                   bq_enqueue(dev, xdpf)
                                     bq->count == DEV_MAP_BULK_SIZE
                                     bq_xmit_all(bq, 0)
                                       cnt = bq->count  /* same 16! */
                                       ndo_xdp_xmit(bq->q[])
                                       /* frames freed by driver */
                                       bq->count = 0
    <-- Task A resumes -->
      ndo_xdp_xmit(bq->q[])
      /* use-after-free: frames already freed! */

Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring it in bq_enqueue() and __dev_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq.

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
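The locking discipline described in the devmap fix can be sketched as a small userspace model. This is not the kernel code: `struct bulk_queue`, the `locked` flag, `BULK_SIZE` and the `sent` counter are illustrative assumptions, and the flag only stands in for the local lock by asserting that nobody enters the queue twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BULK_SIZE 16 /* stands in for DEV_MAP_BULK_SIZE; value assumed */

/* Userspace model of a per-CPU bulk queue, not the kernel structure. */
struct bulk_queue {
	bool locked;        /* models the per-bq local lock */
	unsigned int count; /* number of queued frames */
	void *q[BULK_SIZE]; /* queued frame pointers */
	unsigned int sent;  /* frames handed to the "driver" so far */
};

void bq_lock(struct bulk_queue *bq)
{
	/* A second entry while locked is the race the fix prevents. */
	assert(!bq->locked);
	bq->locked = true;
}

void bq_unlock(struct bulk_queue *bq)
{
	assert(bq->locked);
	bq->locked = false;
}

/* Transmit everything queued; the caller must hold the lock. */
void bq_xmit_all(struct bulk_queue *bq)
{
	assert(bq->locked);
	unsigned int cnt = bq->count; /* snapshot is only stable under the lock */

	for (unsigned int i = 0; i < cnt; i++) {
		bq->q[i] = NULL; /* ownership passes to the driver */
		bq->sent++;
	}
	bq->count = 0;
}

/* Queue one frame, flushing first when the queue is full. */
void bq_enqueue(struct bulk_queue *bq, void *frame)
{
	bq_lock(bq);
	if (bq->count == BULK_SIZE)
		bq_xmit_all(bq);
	bq->q[bq->count++] = frame;
	bq_unlock(bq);
}

/* Flush path: takes the same lock as the enqueue path. */
void bq_flush(struct bulk_queue *bq)
{
	bq_lock(bq);
	bq_xmit_all(bq);
	bq_unlock(bq);
}
```

Because both entry points take the same lock, the `cnt` snapshot and the walk over `q[]` cannot interleave with another enqueue in this model.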
2026-02-27  bpf: Fix race in cpumap on PREEMPT_RT  Jiayuan Chen
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run atomically with respect to each other on the same CPU, relying on local_bh_disable() to prevent preemption. However, on PREEMPT_RT, local_bh_disable() only calls migrate_disable() (when PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable preemption, which allows CFS scheduling to preempt a task during bq_flush_to_queue(), enabling another task on the same CPU to enter bq_enqueue() and operate on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after bq->count is reset in bq_flush_to_queue(), a preempting task can call bq_enqueue() -> bq_flush_to_queue() on the same bq when bq->count reaches CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev() on the same bq->flush_node, the second call dereferences the prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

  Task A (xdp_do_flush)              Task B (cpu_map_enqueue)
  ----------------------             ------------------------
  bq_flush_to_queue(bq)
    spin_lock(&q->producer_lock)
    /* flush bq->q[] to ptr_ring */
    bq->count = 0
    spin_unlock(&q->producer_lock)
                                     bq_enqueue(rcpu, xdpf)
  <-- CFS preempts Task A -->          bq->q[bq->count++] = xdpf
                                       /* ... more enqueues until full ... */
                                       bq_flush_to_queue(bq)
                                         spin_lock(&q->producer_lock)
                                         /* flush to ptr_ring */
                                         spin_unlock(&q->producer_lock)
                                         __list_del_clearprev(flush_node)
                                           /* sets flush_node.prev = NULL */
  <-- Task A resumes -->
    __list_del_clearprev(flush_node)
      flush_node.prev->next = ...
      /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it in bq_enqueue() and __cpu_map_flush(). These paths already run under local_bh_disable(), so use local_lock_nested_bh() which on non-RT is a pure annotation with no overhead, and on PREEMPT_RT provides a per-CPU sleeping lock that serializes access to the bq.

To reproduce, insert an mdelay(100) between bq->count = 0 and __list_del_clearprev() in bq_flush_to_queue(), then run the reproducer provided by syzkaller.

Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
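To see why a second __list_del_clearprev() on the same node is fatal, the following is a minimal userspace model of the list operation. The `struct list_node` type and the helper names are illustrative assumptions modelled on the kernel's list primitives, not copies of them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, modelled on struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert node right after head. */
void list_add(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/*
 * Modelled on __list_del_clearprev(): unlink the node, then set prev
 * to NULL. A second call on the same node would dereference the NULL
 * prev pointer in the first statement.
 */
void list_del_clearprev(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = NULL;
}

/* In this sketch, prev == NULL doubles as "not on the flush list". */
bool on_flush_list(const struct list_node *node)
{
	return node->prev != NULL;
}
```

Serializing enqueue and flush ensures only one of them performs the unlink, so `list_del_clearprev()` never runs twice for a single insertion.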
2026-02-27  bpf: Retire rcu_trace_implies_rcu_gp() from local storage  Kumar Kartikeya Dwivedi
This assumption will always hold going forward, hence just remove the various checks and assume it is true with a comment for the uninformed reader. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27  bpf: Delay freeing fields in local storage  Kumar Kartikeya Dwivedi
Currently, when use_kmalloc_nolock is false, the freeing of fields for a local storage selem is done eagerly before waiting for the RCU or RCU tasks trace grace period to elapse. This opens up a window where the program which has access to the selem can recreate the fields after the freeing of fields is done eagerly, causing memory leaks when the element is finally freed and returned to the kernel. Make a few changes to address this. First, delay the freeing of fields until after the grace periods have expired using a __bpf_selem_free_rcu wrapper which is eventually invoked after transitioning through the necessary number of grace period waits. Replace usage of the kfree_rcu with call_rcu to be able to take a custom callback. Finally, care needs to be taken to extend the rcu barriers for all cases, and not just when use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be in flight for either case and access the smap field, which is used to obtain the BTF record to walk over special fields in the map value. While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since migration should be disabled for RCU callbacks already. Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback") Reviewed-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27  bpf: Lose const-ness of map in map_check_btf()  Kumar Kartikeya Dwivedi
BPF hash map may now use the map_check_btf() callback to decide whether to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can opt out of const-ness using mutable, we must lose the const qualifier on the callback such that we can avoid the ugly cast. Make the change and adjust all existing users, and lose the comment in hashtab.c. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-27  bpf: Register dtor for freeing special fields  Kumar Kartikeya Dwivedi
There is a race window where BPF hash map elements can leak special fields if the program with access to the map value recreates these special fields between the check_and_free_fields done on the map value and its eventual return to the memory allocator. Several ways were explored prior to this patch, most notably [0] tried to use a poison value to reject attempts to recreate special fields for map values that have been logically deleted but still accessible to BPF programs (either while sitting in the free list or when reused). While this approach works well for task work, timers, wq, etc., it is harder to apply the idea to kptrs, which have a similar race and failure mode. Instead, we change bpf_mem_alloc to allow registering a destructor for allocated elements, such that when they are returned to the allocator, any special fields created while they were accessible to programs in the meantime will be freed. If these values get reused, we do not free the fields again before handing the element back. The special fields thus may remain initialized while the map value sits in a free list. When bpf_mem_alloc is retired in the future, a similar concept can be introduced to kmalloc_nolock-backed kmem_cache, paired with the existing idea of a constructor. Note that the destructor registration happens in map_check_btf, after the BTF record is populated and (at that point) available for inspection and duplication. Duplication is necessary since the freeing of embedded bpf_mem_alloc can be decoupled from actual map lifetime due to logic introduced to reduce the cost of rcu_barrier()s in mem alloc free path in 9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc."). As such, once all callbacks are done, we must also free the duplicated record. To remove dependency on the bpf_map itself, also stash the key size of the map to obtain value from htab_elem long after the map is gone. 
[0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context") Reported-by: Alexei Starovoitov <ast@kernel.org> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
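The destructor idea can be illustrated with a toy allocator in userspace C. Everything here (`struct mem_alloc`, `ma_alloc()`, `ma_free()`, `ma_drain()`, the `special` field) is a hypothetical stand-in and not the bpf_mem_alloc API; the sketch only shows that fields survive on the free list and on reuse, and are destroyed when the element finally leaves the allocator.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy element; 'special' stands in for a kptr- or timer-like field. */
struct elem {
	struct elem *next_free;
	int *special;
};

/* Toy caching allocator with an optional destructor hook (hypothetical). */
struct mem_alloc {
	struct elem *free_list;
	void (*dtor)(struct elem *e, void *ctx);
	void *ctx;
};

/* Hand out a cached element if there is one; its fields are left untouched. */
struct elem *ma_alloc(struct mem_alloc *ma)
{
	struct elem *e = ma->free_list;

	if (e) {
		ma->free_list = e->next_free;
		return e;
	}
	return calloc(1, sizeof(*e));
}

/* Logical free: the element is only cached and may still be reachable. */
void ma_free(struct mem_alloc *ma, struct elem *e)
{
	e->next_free = ma->free_list;
	ma->free_list = e;
}

/* Final release: run the destructor before giving the memory back. */
void ma_drain(struct mem_alloc *ma)
{
	while (ma->free_list) {
		struct elem *e = ma->free_list;

		ma->free_list = e->next_free;
		if (ma->dtor)
			ma->dtor(e, ma->ctx);
		free(e);
	}
}

/* Example destructor: frees the special field and counts how often it did so. */
void free_special(struct elem *e, void *ctx)
{
	int *freed = ctx;

	if (e->special) {
		free(e->special);
		e->special = NULL;
		(*freed)++;
	}
}
```

A field recreated after `ma_free()` is therefore not leaked: the destructor picks it up in `ma_drain()`.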
2026-02-27  nstree: tighten permission checks for listing  Christian Brauner
Even privileged services should not necessarily be able to see other privileged services' namespaces so they can't leak information to each other. Use may_see_all_namespaces() helper that centralizes this policy until the nstree adapts. Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-3-d2c2853313bd@kernel.org Fixes: 76b6f5dfb3fd ("nstree: add listns()") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@kernel.org # v6.19+ Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-27  nsfs: tighten permission checks for ns iteration ioctls  Christian Brauner
Even privileged services should not necessarily be able to see other privileged services' namespaces so they can't leak information to each other. Use may_see_all_namespaces() helper that centralizes this policy until the nstree adapts. Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org Fixes: a1d220d9dafa ("nsfs: iterate through mount namespaces") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@kernel.org # v6.12+ Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-27  workqueue: Allow to expose ordered workqueues via sysfs  Sebastian Andrzej Siewior
Ordered workqueues are not exposed via sysfs because the 'max_active' attribute changes the number of active workers. More than one active worker can break ordering guarantees. This can be avoided by forbidding writes to the file for ordered workqueues. Exposing them via sysfs allows altering other attributes, such as the cpumask that determines on which CPUs the worker can run. The 'max_active' value shouldn't be changed for BH workers because the core never spawns additional workers and the worker itself cannot be preempted, so changing it makes no sense. Allow exposing ordered workqueues via sysfs if requested and forbid changing the 'max_active' value for ordered and BH workers. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-27  perf/core: Simplify __detach_global_ctx_data()  Namhyung Kim
Like attach_global_ctx_data(), it has an O(N^2) loop to delete task context data for each thread. But perf_free_ctx_data_rcu() can be called under the RCU read lock, so just call it directly rather than iterating the whole thread list again. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-4-namhyung@kernel.org
2026-02-27  perf/core: Try to allocate task_ctx_data quickly  Namhyung Kim
attach_global_ctx_data() has an O(N^2) algorithm to allocate the context data for each thread. This caused performance problems on large systems with O(100k) threads. Because kmalloc(GFP_KERNEL) can sleep, it cannot be called under the RCU read lock. So try GFP_NOWAIT first so that it can proceed in normal cases. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-3-namhyung@kernel.org
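The "try a non-sleeping allocation first" pattern can be sketched in userspace C as follows. This is only a model: `model_alloc()`, the `pressure` flag, the `in_read_section` flag (standing in for the RCU read-side section) and the counters are assumptions for illustration, and the real code has to restart or revalidate its thread walk after leaving the read section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

enum alloc_mode { ALLOC_NOWAIT, ALLOC_MAY_SLEEP };

struct ctx_stats {
	unsigned int fast; /* allocations that succeeded without sleeping */
	unsigned int slow; /* allocations that needed the sleeping fallback */
};

/* Hypothetical allocator: non-sleeping attempts fail under memory pressure. */
void *model_alloc(size_t size, enum alloc_mode mode, bool pressure)
{
	if (mode == ALLOC_NOWAIT && pressure)
		return NULL;
	return malloc(size);
}

/*
 * Try the non-sleeping allocation inside the read-side section first.
 * Only on failure leave the section, allocate with sleeping allowed,
 * and re-enter.
 */
void *attach_ctx_data(size_t size, bool pressure, bool *in_read_section,
		      struct ctx_stats *st)
{
	assert(*in_read_section);

	void *data = model_alloc(size, ALLOC_NOWAIT, pressure);
	if (data) {
		st->fast++;
		return data;
	}

	*in_read_section = false; /* models leaving the RCU read section */
	data = model_alloc(size, ALLOC_MAY_SLEEP, pressure);
	*in_read_section = true;  /* models re-entering it */
	st->slow++;
	return data;
}
```

In the common case the fast path succeeds and the read-side section is never left.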
2026-02-27  perf/core: Pass GFP flags to attach_task_ctx_data()  Namhyung Kim
This is a preparation for the next change to reduce the computational complexity in the global context data handling for LBR callstacks. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260211223222.3119790-2-namhyung@kernel.org
2026-02-27  sched: Default enable HRTICK when deferred rearming is enabled  Peter Zijlstra
The deferred rearm of the clock event device after an interrupt and other hrtimer optimizations now allow enabling HRTICK for generic entry architectures. This decouples preemption from CONFIG_HZ, leaving only the periodic load-balancer and various accounting things relying on the tick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.937531564@kernel.org
2026-02-27  hrtimer: Try to modify timers in place  Thomas Gleixner
When modifying the expiry of an armed timer, it is first dequeued, then the expiry value is updated, and then it is queued again. This can be avoided when the new expiry value is within the range of the previous and the next timer, as that does not change the position in the RB tree. The linked timerqueue allows peeking ahead to the neighbours and checking whether the new expiry time is within the range of the previous and next timer. If so, just modify the timer in place and spare the dequeue and requeue effort, which might end up rotating the RB tree twice for nothing. This speeds up the handling of frequently rearmed hrtimers, like the hrtick scheduler timer, significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.873359816@kernel.org
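The neighbour check can be sketched as below. This is a simplified userspace model, not the hrtimer code: `struct linked_timer` and both helpers are hypothetical, and treating an expiry equal to a neighbour's as "still in range" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Timer with direct links to its neighbours in expiry order. */
struct linked_timer {
	int64_t expires;
	struct linked_timer *prev; /* earlier neighbour, NULL if first */
	struct linked_timer *next; /* later neighbour, NULL if last */
};

/* True if the new expiry keeps the timer between its current neighbours. */
bool can_modify_in_place(const struct linked_timer *t, int64_t new_expires)
{
	if (t->prev && new_expires < t->prev->expires)
		return false;
	if (t->next && new_expires > t->next->expires)
		return false;
	return true;
}

/*
 * Update the expiry without requeueing when the position is unchanged.
 * Returns false if the caller has to dequeue and enqueue the timer.
 */
bool timer_set_expiry(struct linked_timer *t, int64_t new_expires)
{
	if (!can_modify_in_place(t, new_expires))
		return false;
	t->expires = new_expires;
	return true;
}
```

With timers at 100, 200 and 300, moving the middle one to 250 succeeds in place, while moving it to 350 reports that a requeue is needed.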
2026-02-27  hrtimer: Use linked timerqueue  Thomas Gleixner
To prepare for optimizing the rearming of enqueued timers, switch to the linked timerqueue. That allows checking whether the new expiry time changes the position of the timer in the RB tree or not, by checking the new expiry time against the previous and the next timers' expiry. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.806643179@kernel.org
2026-02-27  hrtimer: Optimize for_each_active_base()  Thomas Gleixner
Give the compiler some help to emit way better code. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.599804894@kernel.org
2026-02-27  hrtimer: Simplify run_hrtimer_queues()  Thomas Gleixner
Replace the open coded container_of() orgy with a trivial clock_base_next_timer() helper. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.532927977@kernel.org
2026-02-27  hrtimer: Rework next event evaluation  Thomas Gleixner
The per clock base cached expiry time allows to do a more efficient evaluation of the next expiry on a CPU. Separate the reprogramming evaluation from the NOHZ idle evaluation which needs to exclude the NOHZ timer to keep the reprogramming path lean and clean. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.468186893@kernel.org
2026-02-27  hrtimer: Keep track of first expiring timer per clock base  Thomas Gleixner
Evaluating the next expiry time of all clock bases is cache line expensive as the expiry time of the first expiring timer is not cached in the base and requires to access the timer itself, which is definitely in a different cache line. It's way more efficient to keep track of the expiry time on enqueue and dequeue operations as the relevant data is already in the cache at that point. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.404839710@kernel.org
2026-02-27  hrtimer: Avoid re-evaluation when nothing changed  Thomas Gleixner
Most times there is no change between hrtimer_interrupt() deferring the rearm and the invocation of hrtimer_rearm_deferred(). In those cases it's a pointless exercise to re-evaluate the next expiring timer. Cache the required data and use it if nothing changed. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.338569372@kernel.org
2026-02-27  hrtimer: Push reprogramming timers into the interrupt return path  Peter Zijlstra
Currently hrtimer_interrupt() runs expired timers, which can re-arm themselves, after which it computes the next expiration time and re-programs the hardware. However, things like HRTICK, a highres timer driving preemption, cannot re-arm itself at the point of running, since the next task has not been determined yet. The schedule() in the interrupt return path will switch to the next task, which then causes a new hrtimer to be programmed. This then results in reprogramming the hardware at least twice, once after running the timers, and once upon selecting the new task. Notably, *both* events happen in the interrupt. By pushing the hrtimer reprogram all the way into the interrupt return path, it runs after schedule() picks the new task and the double reprogram can be avoided. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.273488269@kernel.org
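The effect of deferring the reprogram can be shown with a toy counter model. None of this is kernel code: `struct cpu_model` and the four functions are hypothetical, and the model only counts how often the hardware would be programmed in the interrupt-plus-schedule sequence described above.

```c
#include <stdbool.h>

/* Toy per-CPU state; only counts hardware programming operations. */
struct cpu_model {
	bool deferred_rearm;      /* rearm is postponed to interrupt return */
	unsigned int hw_programs; /* times the clock event device was programmed */
};

/* Program the hardware unless a deferred rearm is pending. */
void reprogram(struct cpu_model *cpu)
{
	if (cpu->deferred_rearm)
		return; /* done once, later, on interrupt return */
	cpu->hw_programs++;
}

/* Timer interrupt: expire timers, then rearm now or mark it deferred. */
void timer_interrupt(struct cpu_model *cpu, bool defer)
{
	cpu->deferred_rearm = defer;
	/* ... expired timers would run here ... */
	reprogram(cpu);
}

/* schedule() picks a new task and arms the hrtick timer for it. */
void schedule_with_hrtick(struct cpu_model *cpu)
{
	reprogram(cpu);
}

/* Interrupt return path: perform the single deferred rearm, if any. */
void interrupt_return(struct cpu_model *cpu)
{
	if (!cpu->deferred_rearm)
		return;
	cpu->deferred_rearm = false;
	cpu->hw_programs++;
}
```

Without deferral the sequence programs the hardware twice; with deferral the same sequence programs it once.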
2026-02-27  sched/core: Prepare for deferred hrtimer rearming  Peter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that an expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in an immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. Add the rearm checks to the existing sched_hrtick_enter/exit() functions, which already handle the batched rearm of the hrtick timer. For now this is just placing empty stubs at the right places which are all optimized out by the compiler until the guard condition becomes true. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.208580085@kernel.org
2026-02-27  softirq: Prepare for deferred hrtimer rearming  Peter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that an expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in an immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. In case the return from interrupt ends up handling softirqs before reaching the rearm conditions in the return to user entry code functions, a deferred rearm has to be handled before softirq handling enables interrupts, as soft interrupt handling can be long and would therefore introduce hard to diagnose latencies to the timer interrupt. Place the for now empty stub call right before invoking the softirq handling routine. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.142854488@kernel.org
2026-02-27  entry: Prepare for deferred hrtimer rearming  Peter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that an expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in an immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. As this is only relevant for interrupt to user return, split the work masks up and hand them in as arguments from the relevant exit to user functions, which allows the compiler to optimize the deferred handling out for the syscall exit to user case. Add the rearm checks to the appropriate places in the exit to user loop and the interrupt return to kernel path, so that the rearming is always guaranteed. In the return to user space path this is handled in the same way as TIF_RSEQ to avoid extra instructions in the fast path, which are truly hurtful for device interrupt heavy workloads as the extra instructions and conditionals, while benign at first sight, accumulate quickly into measurable regressions. The return from syscall path is completely unaffected due to the above mentioned split, so syscall heavy workloads won't have any extra burden. For now this is just placing empty stubs at the right places which are all optimized out by the compiler until the actual functionality is in place. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.066469985@kernel.org
2026-02-27  hrtimer: Prepare stubs for deferred rearming  Peter Zijlstra
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that an expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in an immediate schedule() invocation then it can be deferred until the end of schedule(). To make this correct the affected code parts need to be made aware of this. Provide empty stubs for the deferred rearming mechanism, so that the relevant code changes for entry, softirq and scheduler can be split up into separate changes independent of the actual enablement in the hrtimer code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.000891171@kernel.org
2026-02-27  hrtimer: Rename hrtimer_cpu_base::in_hrtirq to deferred_rearm  Thomas Gleixner
The upcoming deferred rearming scheme has the same effect as the deferred rearming when the hrtimer interrupt is executing. So it can reuse the in_hrtirq flag, but when it gets deferred beyond the hrtimer interrupt path, then the name does not make sense anymore. Rename it to deferred_rearm upfront to keep the actual functional change separate from the mechanical rename churn. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.935623347@kernel.org
2026-02-27  hrtimer: Re-arrange hrtimer_interrupt()  Peter Zijlstra
Rework hrtimer_interrupt() such that reprogramming is split out into an independent function at the end of the interrupt. This prepares for reprogramming getting delayed beyond the end of hrtimer_interrupt(). Notably, this changes the hang handling to always wait 100ms instead of trying to keep it proportional to the actual delay. This simplifies the state, also this really shouldn't be happening. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.870639266@kernel.org
2026-02-27  hrtimer: Separate remove/enqueue handling for local timers  Thomas Gleixner
As the base switch can be avoided completely when the base stays the same the remove/enqueue handling can be more streamlined. Split it out into a separate function which handles both in one go which is way more efficient and makes the code simpler to follow. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.737600486@kernel.org
2026-02-27  hrtimer: Use NOHZ information for locality  Thomas Gleixner
The decision to keep a timer which is associated to the local CPU on that CPU does not take NOHZ information into account. As a result there are a lot of hrtimer base switch invocations which end up not switching the base and stay on the local CPU. That's just work for nothing and can be further improved. If the local CPU is part of the NOISE housekeeping mask, then check: 1) Whether the local CPU has the tick running, which means it is either not idle or already expecting a timer soon. 2) Whether the tick is stopped and need_resched() is set, which means the CPU is about to exit idle. This reduces the amount of hrtimer base switch attempts, which end up on the local CPU anyway, significantly and prepares for further optimizations. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.673473029@kernel.org
2026-02-27  hrtimer: Optimize for local timers  Thomas Gleixner
The decision whether to keep timers on the local CPU or on the CPU they are associated to is suboptimal and causes the expensive switch_hrtimer_base() mechanism to be invoked more than necessary. This is especially true for pinned timers. Rewrite the decision logic so that the current base is kept if: 1) The callback is running on the base 2) The timer is associated to the local CPU and the first expiring timer as that allows to optimize for reprogramming avoidance 3) The timer is associated to the local CPU and pinned 4) The timer is associated to the local CPU and timer migration is disabled. Only #2 was covered by the original code, but especially #3 makes a difference for high frequency rearming timers like the scheduler hrtick timer. If timer migration is disabled, then #4 avoids most of the base switches. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.607935269@kernel.org
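The four conditions for keeping the current base can be written as a single predicate. The struct and field names below are illustrative assumptions, not the kernel's data structures; the sketch only encodes the rules as listed in the commit message.

```c
#include <stdbool.h>

/* Inputs to the decision; all names are hypothetical. */
struct base_decision {
	bool callback_running_on_base; /* 1) callback is running on the base */
	bool timer_is_local;           /* timer is associated to the local CPU */
	bool is_first_expiring;        /* 2) timer is the first expiring timer */
	bool pinned;                   /* 3) timer is pinned */
	bool migration_enabled;        /* 4) applies when this is false */
};

/* True if the timer should stay on its current base (no base switch). */
bool keep_current_base(const struct base_decision *d)
{
	if (d->callback_running_on_base)
		return true;
	if (!d->timer_is_local)
		return false;
	return d->is_first_expiring || d->pinned || !d->migration_enabled;
}
```

A local, pinned timer such as the scheduler hrtick timer satisfies rule 3 and so skips the base switch in this model.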
2026-02-27  hrtimer: Convert state and properties to boolean  Thomas Gleixner
All 'u8' flags are true booleans, so make it entirely clear that these can only contain true or false. This is especially true for hrtimer::state, which has a historical leftover of using the state with bitwise operations. That was used in the early hrtimer implementation with several bits, but then converted to a boolean state. But that conversion missed replacing the bit OR and bit check operations all over the place, which creates suboptimal code. As of today 'state' is a misnomer because its only purpose is to reflect whether the timer is enqueued into the RB-tree or not. Rename it to 'is_queued' and make all operations on it boolean. This reduces text size from 8926 to 8732 bytes. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.542427240@kernel.org
2026-02-27  hrtimer: Replace the bitfield in hrtimer_cpu_base  Thomas Gleixner
Use bool for the various flags as that creates better code in the hot path. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.475262618@kernel.org
2026-02-27  hrtimer: Evaluate timer expiry only once  Thomas Gleixner
No point in accessing the timer twice. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.409352042@kernel.org
2026-02-27  hrtimer: Cleanup coding style and comments  Thomas Gleixner
As this code has some major surgery ahead, clean up coding style and bring comments up to date. No functional change intended. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.342740952@kernel.org
2026-02-27 hrtimer: Use guards where appropriate Thomas Gleixner
Simplify and tidy up the code where possible. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.275551488@kernel.org
2026-02-27 hrtimer: Reduce trace noise in hrtimer_start() Thomas Gleixner
hrtimer_start() when invoked with an already armed timer traces like:

  <comm>-.. [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= ....
  <comm>-.. [032] d.h1. 5.002263: hrtimer_start: hrtimer= ....

Which is incorrect as the timer doesn't get canceled. Just the expiry time changes. The internal dequeue operation which is required for that is not really interesting for trace analysis. But it makes it tedious to keep real cancellations and the above case apart.

Remove the cancel tracing in hrtimer_start() and add a 'was_armed' indicator to the hrtimer start tracepoint, which clearly indicates what the state of the hrtimer is when hrtimer_start() is invoked:

  <comm>-.. [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0
  <comm>-.. [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1

Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.208491877@kernel.org
2026-02-27 hrtimer: Add debug object init assertion Thomas Gleixner
The debug object coverage in hrtimer_start_range_ns() happens too late to do anything useful. Implement the init assertion part and invoke that early in hrtimer_start_range_ns(). Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.143098153@kernel.org
2026-02-27 clockevents: Provide support for clocksource coupled comparators Thomas Gleixner
Some clockevent devices are coupled to the system clocksource by implementing a less than or equal comparator which compares the programmed absolute expiry time against the underlying time counter. The timekeeping core provides a function to convert an absolute CLOCK_MONOTONIC based expiry time to an absolute clock cycles time which can be directly fed into the comparator. That spares two time reads in the next event programming path, one to convert the absolute nanoseconds time to a delta value and the other to convert the delta value back to an absolute time value suitable for the comparator. Provide a new clocksource callback which takes the absolute cycle value and wire it up in clockevents_program_event(). Similar to clocksources, allow architectures to inline the rearm operation. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.010425428@kernel.org
2026-02-27 timekeeping: Provide infrastructure for coupled clockevents Thomas Gleixner
Some architectures have clockevent devices which are coupled to the system clocksource by implementing a less than or equal comparator which compares the programmed absolute expiry time against the underlying time counter. Well known examples are TSC/TSC deadline timer and the S390 TOD clocksource/comparator.

While the concept is nice it has some downsides:

  1) The clockevents core code is strictly based on relative expiry times as that's the most common case for clockevent device hardware. That requires converting the absolute expiry time provided by the caller (hrtimers, NOHZ code) to a relative expiry time by reading and subtracting the current time. The clockevent::set_next_event() callback must then read the counter again to convert the relative expiry back into an absolute one.

  2) The conversion factors from nanoseconds to counter clock cycles are set up when the clockevent is registered. When NTP applies corrections then the clockevent conversion factors can deviate from the clocksource conversion substantially which either results in timers firing late or in the worst case early. The early expiry then needs to do a reprogram with a short delta. In most cases this is papered over by the fact that the read in the set_next_event() callback happens after the read which is used to calculate the delta. So the tendency is that timers expire mostly late.

All of this can be avoided by providing support for these devices in the core code:

  1) The timekeeping core keeps track of the last update to the clocksource by storing the base nanoseconds and the corresponding clocksource counter value. That's used to keep the conversion math for reading the time within 64-bit in the common case.

     This information can be used to avoid both reads of the underlying clocksource in the clockevents reprogramming path:

       delta = expiry - base_ns;
       cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);

     The resulting cycles value can be directly used to program the comparator.

  2) As #1 no longer provides the "compensation" through the second read, the deviation of the clocksource and clockevent conversions caused by NTP becomes more prominent.

     This can be cured by letting the timekeeping core compute and store the reverse conversion factors when the clocksource cycles to nanoseconds factors are modified by NTP:

           CS::MULT          (1 << NS_TO_CYC_SHIFT)
       ----------------  =  ----------------------
       (1 << CS::SHIFT)         NS_TO_CYC_MULT

     Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT

     The NS_TO_CYC_SHIFT value is calculated when the clocksource is installed so that it aims for a one hour maximum sleep time.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.944763521@kernel.org
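The arithmetic in this commit message can be tried out in userspace. The sketch below is an illustration under stated assumptions, not the kernel code: `struct tk_snapshot` and both helpers are hypothetical names, the reverse factor is assumed to fit into 32 bits, `cs_shift + ns_to_cyc_shift` is assumed to be below 64, and the multiplication is assumed not to overflow 64 bits for the deltas used.

```c
#include <stdint.h>

/* Hypothetical snapshot of what the timekeeping core is described as storing. */
struct tk_snapshot {
	uint64_t base_ns;         /* nanoseconds at the last clocksource update */
	uint64_t base_cycles;     /* counter value at the last clocksource update */
	uint32_t cs_mult;         /* clocksource cycles -> ns multiplier (CS::MULT) */
	uint32_t cs_shift;        /* clocksource cycles -> ns shift (CS::SHIFT) */
	uint32_t ns_to_cyc_mult;  /* reverse factor (NS_TO_CYC_MULT) */
	uint32_t ns_to_cyc_shift; /* reverse shift (NS_TO_CYC_SHIFT) */
};

/* NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT */
static void update_reverse_factor(struct tk_snapshot *tk)
{
	uint64_t numerator = 1ULL << (tk->cs_shift + tk->ns_to_cyc_shift);

	tk->ns_to_cyc_mult = (uint32_t)(numerator / tk->cs_mult);
}

/* cycles = base_cycles + ((delta * mult) >> shift), using the reverse factors */
static uint64_t expiry_to_cycles(const struct tk_snapshot *tk, uint64_t expiry_ns)
{
	uint64_t delta = expiry_ns - tk->base_ns;

	return tk->base_cycles + ((delta * tk->ns_to_cyc_mult) >> tk->ns_to_cyc_shift);
}
```

With `cs_mult = 512` and `cs_shift = 10` (one cycle is half a nanosecond) and `ns_to_cyc_shift = 10`, the reverse multiplier comes out as 2048, so a delta of 1000 ns maps to 2000 cycles past `base_cycles`.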
2026-02-27 timekeeping: Allow inlining clocksource::read() Thomas Gleixner
On some architectures clocksource::read() boils down to a single instruction, so the indirect function call is just a massive overhead especially with speculative execution mitigations in effect.

Allow architectures to enable conditional inlining of that read to avoid that by:

  - providing a static branch to switch to the inlined variant

  - disabling the branch before clocksource changes

  - enabling the branch after a clocksource change, when the clocksource indicates in a feature flag that it is the one which provides the inlined variant

This is intentionally not a static call as that would only remove the indirect call, but not the rest of the overhead.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.675151545@kernel.org
2026-02-27 clockevents: Remove redundant CLOCK_EVT_FEAT_KTIME Thomas Gleixner
The only real use case for this is the hrtimer based broadcast device. No point in using two different feature flags for this. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.609049777@kernel.org
2026-02-27 tick/sched: Avoid hrtimer_cancel/start() sequence Thomas Gleixner
The sequence of cancel and start is inefficient. It has to do the timer lock/unlock twice and in the worst case has to reprogram the underlying clock event device twice. The reason why it is done this way is the usage of hrtimer_forward_now(), which requires the timer to be inactive. But that can be completely avoided as the forward can be done on a variable and does not need any of the overrun accounting provided by hrtimer_forward_now(). Implement a trivial forwarding mechanism and replace the cancel/reprogram sequence with hrtimer_start(..., new_expiry). For the non high resolution case the timer is not actually armed, but used for storage so that code checking for expiry times can unconditionally look it up in the timer. So it is safe for that case to set the new expiry time directly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.542178086@kernel.org
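Forwarding "on a variable" without overrun accounting can be pictured as plain integer arithmetic. This is a minimal sketch of one way to do it, assuming nanosecond timestamps in `int64_t` and a positive period; `forward_expiry()` is a hypothetical helper, and the commit's actual forwarding code may differ.

```c
#include <stdint.h>

/*
 * Advance 'expiry' in whole multiples of 'period' until it lies strictly
 * after 'now'. No overrun count is returned. Assumes period > 0.
 */
static int64_t forward_expiry(int64_t expiry, int64_t now, int64_t period)
{
	int64_t missed;

	if (expiry > now)
		return expiry; /* already in the future, nothing to forward */

	missed = (now - expiry) / period + 1;
	return expiry + missed * period;
}
```

The result would then be passed as the new expiry to a single hrtimer_start() call instead of cancelling, forwarding, and restarting the timer.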
2026-02-27 sched/hrtick: Mark hrtick timer LAZY_REARM Peter Zijlstra
The hrtick timer is frequently rearmed before expiry and most of the time the new expiry is past the armed one. As this happens on every context switch it becomes expensive with scheduling heavy workloads, especially in virtual machines, as the "hardware" reprogramming implies a VM exit.

hrtimers now provide a lazy rearm mode flag which skips the reprogramming if:

  1) The timer was the first expiring timer before the rearm

  2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick timer for the price of eventually taking the already armed interrupt for nothing.

Mark the hrtick timer accordingly.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.475409346@kernel.org
2026-02-27 hrtimer: Provide LAZY_REARM mode Peter Zijlstra
The hrtick timer is frequently rearmed before expiry and most of the time the new expiry is past the armed one. As this happens on every context switch it becomes expensive with scheduling heavy workloads, especially in virtual machines, as the "hardware" reprogramming implies a VM exit.

Add a lazy rearm mode flag which skips the reprogramming if:

  1) The timer was the first expiring timer before the rearm

  2) The new expiry time is farther out than the armed time

This avoids a massive amount of reprogramming operations of the hrtick timer for the price of eventually taking the already armed interrupt for nothing.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.408524456@kernel.org
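The two skip conditions of the lazy rearm mode combine into one check. The sketch below only models that check, assuming nanosecond expiry values; `lazy_rearm_can_skip()` and its parameters are hypothetical and do not reflect the kernel's function names or flag layout.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when reprogramming the clock event device can be skipped:
 * the timer is in lazy rearm mode, it was the first expiring timer before
 * the rearm, and the new expiry is farther out than the armed one.
 */
static bool lazy_rearm_can_skip(bool lazy_rearm, bool was_first,
				int64_t armed_expiry, int64_t new_expiry)
{
	return lazy_rearm && was_first && new_expiry > armed_expiry;
}
```

When the check returns true the hardware stays armed for the earlier time, so the interrupt may fire with nothing to do, which is the trade-off the commit message describes.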
2026-02-27 sched/hrtick: Avoid tiny hrtick rearms Thomas Gleixner
Tiny adjustments to the hrtick expiry time below 5 microseconds are just causing extra work for no real value. Filter them out when restarting the hrtick. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.340593047@kernel.org
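As an illustration of the filter, the sketch below compares the armed and the requested expiry and reports whether the change reaches the 5 microsecond threshold. The macro name, the helper, and the choice to treat adjustments in both directions alike are assumptions made for this example, not details taken from the commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; 5 microseconds expressed in nanoseconds. */
#define HRTICK_MIN_ADJUST_NS 5000

/* Returns true when the expiry change is large enough to justify a rearm. */
static bool hrtick_adjust_worthwhile(int64_t armed_expiry_ns, int64_t new_expiry_ns)
{
	int64_t diff = new_expiry_ns - armed_expiry_ns;

	if (diff < 0)
		diff = -diff;

	return diff >= HRTICK_MIN_ADJUST_NS;
}
```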
2026-02-27 sched: Optimize hrtimer handling Thomas Gleixner
schedule() provides several mechanisms to update the hrtick timer:

  1) When the next task is picked

  2) When the balance callbacks are invoked before rq::lock is released

Each of them can result in a first expiring timer and cause a reprogram of the clock event device.

Solve this by deferring the rearm to the end of schedule() right before releasing rq::lock by setting a flag on entry which tells hrtick_start() to cache the runtime constraint in rq::hrtick_delay without touching the timer itself.

Right before releasing rq::lock evaluate the flags and either rearm or cancel the hrtick timer.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.273068659@kernel.org
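The deferral scheme can be modelled with a small state machine. Everything in the sketch below is hypothetical (`struct rq_model`, the helper names, and the rule that an armed timer is cancelled when no rearm was requested); it only illustrates how several update requests inside one schedule() pass collapse into a single reprogram.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for the runqueue state. */
struct rq_model {
	bool defer;           /* set on entry to schedule() */
	bool rearm_pending;   /* a start request was cached while deferring */
	int64_t hrtick_delay; /* cached runtime constraint */
	int64_t timer_expiry; /* 0 means the timer is not armed */
	int reprogram_count;  /* how often the "hardware" was touched */
};

static void schedule_enter(struct rq_model *rq)
{
	rq->defer = true;
	rq->rearm_pending = false;
}

/* While deferring, only cache the delay; otherwise arm immediately. */
static void hrtick_start_model(struct rq_model *rq, int64_t now, int64_t delay)
{
	if (rq->defer) {
		rq->hrtick_delay = delay;
		rq->rearm_pending = true;
		return;
	}
	rq->timer_expiry = now + delay;
	rq->reprogram_count++;
}

/* Right before the lock would be released: rearm once or cancel. */
static void schedule_exit(struct rq_model *rq, int64_t now)
{
	rq->defer = false;
	if (rq->rearm_pending) {
		rq->timer_expiry = now + rq->hrtick_delay;
		rq->reprogram_count++;
	} else {
		rq->timer_expiry = 0; /* assumption: no request means cancel */
	}
	rq->rearm_pending = false;
}
```

Two start requests issued between `schedule_enter()` and `schedule_exit()` leave `reprogram_count` at 1 in this model, with the last cached delay winning.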
2026-02-27 sched: Use hrtimer_highres_enabled() Thomas Gleixner
Use the static branch based variant and thereby avoid following three pointers. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.203610956@kernel.org
2026-02-27 hrtimer: Provide a static branch based hrtimer_hres_enabled() Thomas Gleixner
The scheduler evaluates this via hrtimer_is_hres_active() every time it has to update HRTICK. This needs to follow three pointers, which is expensive. Provide a static branch based mechanism to avoid that. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.136503358@kernel.org
2026-02-27 hrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns() Peter Zijlstra
Much like hrtimer_reprogram(), skip programming if the cpu_base is running the hrtimer interrupt. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260224163429.069535561@kernel.org
2026-02-27 sched: Avoid ktime_get() indirection Thomas Gleixner
The clock of the hrtick and deadline timers is known to be CLOCK_MONOTONIC. No point in looking it up via hrtimer_cb_get_time(). Just use ktime_get() directly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.001511662@kernel.org