|
Even privileged services should not necessarily be able to see other
privileged services' namespaces, so that they cannot leak information
to each other. Use the may_see_all_namespaces() helper that centralizes
this policy until the nstree code adapts.
Link: https://patch.msgid.link/20260226-work-visibility-fixes-v1-1-d2c2853313bd@kernel.org
Fixes: a1d220d9dafa ("nsfs: iterate through mount namespaces")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: stable@kernel.org # v6.12+
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Ordered workqueues are not exposed via sysfs because the 'max_active'
attribute changes the number of active workers. More than one active
worker can break the ordering guarantees.
This can be avoided by forbidding writes to the file for ordered
workqueues. Exposing them via sysfs still allows altering other
attributes, such as the cpumask restricting the CPUs on which the
workers can run.
The 'max_active' value shouldn't be changed for BH workqueues either,
because the core never spawns additional workers and the worker itself
cannot be preempted, so changing it makes no sense.
Allow exposing ordered workqueues via sysfs if requested and forbid
changing the 'max_active' value for ordered and BH workqueues.
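A minimal sketch of the intended policy in the sysfs store path; the
store function's shape is an assumption, the flags are the existing
workqueue flags:

	/* Hypothetical sketch: reject max_active writes for ordered
	 * and BH workqueues, where changing it must not be allowed. */
	static ssize_t max_active_store(struct workqueue_struct *wq,
					const char *buf, size_t count)
	{
		if (wq->flags & (__WQ_ORDERED | WQ_BH))
			return -EINVAL;
		/* ... parse and apply the new value otherwise ... */
		return count;
	}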
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Like attach_global_ctx_data(), the detach path has an O(N^2) loop to
delete the task context data for each thread. But perf_free_ctx_data_rcu()
can be called under the RCU read lock, so just call it directly rather
than iterating the whole thread list again.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260211223222.3119790-4-namhyung@kernel.org
|
|
The attach_global_ctx_data() function has an O(N^2) algorithm to
allocate the context data for each thread. This caused performance
problems on large systems with O(100k) threads.
Because kmalloc(GFP_KERNEL) can sleep, it cannot be called under the
RCU read lock. So try GFP_NOWAIT first so that the allocation can
proceed in the normal case.
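A minimal sketch of the allocation pattern, with a hypothetical
alloc_ctx_data() helper standing in for the actual allocation:

	struct perf_ctx_data *cd;

	/* Hypothetical sketch: non-sleeping allocation under the RCU
	 * read lock first, sleeping fallback outside of it. */
	cd = alloc_ctx_data(GFP_NOWAIT);	/* safe under rcu_read_lock() */
	if (!cd) {
		rcu_read_unlock();
		cd = alloc_ctx_data(GFP_KERNEL);	/* may sleep */
		rcu_read_lock();
	}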
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260211223222.3119790-3-namhyung@kernel.org
|
|
This is a preparation for the next change to reduce the computational
complexity in the global context data handling for LBR callstacks.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260211223222.3119790-2-namhyung@kernel.org
|
|
The deferred rearm of the clock event device after an interrupt and the
other hrtimer optimizations now allow enabling HRTICK for generic entry
architectures.
This decouples preemption from CONFIG_HZ, leaving only the periodic
load-balancer and various accounting things relying on the tick.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.937531564@kernel.org
|
|
When modifying the expiry of an armed timer, it is first dequeued, then
the expiry value is updated and then it is queued again.
This can be avoided when the new expiry value is within the range of the
previous and the next timer, as that does not change the position in the
RB tree.
The linked timerqueue allows peeking ahead at the neighbours and
checking whether the new expiry time is within the range of the previous
and the next timer. If so, just modify the timer in place and spare the
dequeue and requeue effort, which might end up rotating the RB tree
twice for nothing.
This speeds up the handling of frequently rearmed hrtimers, like the
hrtick scheduler timer, significantly.
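A minimal sketch of the in-place update; the neighbour lookup helpers
are assumptions:

	/* Hypothetical sketch: if the new expiry stays between the
	 * neighbouring nodes, the RB tree position is unchanged and
	 * the node can be updated in place. */
	prev = timerqueue_node_prev(node);	/* helper name assumed */
	next = timerqueue_node_next(node);	/* helper name assumed */
	if ((!prev || prev->expires <= new_expiry) &&
	    (!next || new_expiry <= next->expires)) {
		node->expires = new_expiry;
		return true;			/* modified in place */
	}
	return false;				/* dequeue + requeue needed */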
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.873359816@kernel.org
|
|
To prepare for optimizing the rearming of enqueued timers, switch to the
linked timerqueue. That allows checking whether the new expiry time
changes the position of the timer in the RB tree or not, by comparing
the new expiry time against the expiry of the previous and the next
timer.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.806643179@kernel.org
|
|
Give the compiler some help to emit way better code.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.599804894@kernel.org
|
|
Replace the open coded container_of() orgy with a trivial
clock_base_next_timer() helper.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.532927977@kernel.org
|
|
The per clock base cached expiry time allows a more efficient evaluation
of the next expiry on a CPU.
Separate the reprogramming evaluation from the NOHZ idle evaluation,
which needs to exclude the NOHZ timer, to keep the reprogramming path
lean and clean.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.468186893@kernel.org
|
|
Evaluating the next expiry time of all clock bases is cache line
expensive, as the expiry time of the first expiring timer is not cached
in the base and requires accessing the timer itself, which is definitely
in a different cache line.
It's way more efficient to keep track of the expiry time on enqueue and
dequeue operations, as the relevant data is already in the cache at that
point.
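A minimal sketch of the enqueue side bookkeeping, with assumed struct
and field names:

	/* Hypothetical sketch: update the cached expiry while the data
	 * is already cache hot instead of chasing the first timer's
	 * cache line on every next-expiry evaluation. */
	struct clock_base_sketch {
		u64	cached_expiry;
	};

	static void base_enqueue_update(struct clock_base_sketch *base,
					u64 expires)
	{
		if (expires < base->cached_expiry)
			base->cached_expiry = expires;
	}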
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.404839710@kernel.org
|
|
Most of the time nothing changes between hrtimer_interrupt() deferring
the rearm and the invocation of hrtimer_rearm_deferred(). In those cases
it's a pointless exercise to re-evaluate the next expiring timer.
Cache the required data and use it if nothing changed.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.338569372@kernel.org
|
|
Currently hrtimer_interrupt() runs expired timers, which can re-arm
themselves, after which it computes the next expiration time and
re-programs the hardware.
However, things like HRTICK, a highres timer driving preemption, cannot
re-arm itself at the point of running, since the next task has not been
determined yet. The schedule() in the interrupt return path will switch to
the next task, which then causes a new hrtimer to be programmed.
This then results in reprogramming the hardware at least twice, once after
running the timers, and once upon selecting the new task.
Notably, *both* events happen in the interrupt.
By pushing the hrtimer reprogram all the way into the interrupt return
path, it runs after schedule() picks the new task and the double reprogram
can be avoided.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.273488269@kernel.org
|
|
The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.
That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK
is enabled then schedule() will modify the hrtick timer, which causes
another reprogramming of the hardware.
That can be avoided by deferring the rearming to the return from
interrupt path, and if the return results in an immediate schedule()
invocation then it can be deferred until the end of schedule(), which
avoids multiple rearms and re-evaluation of the timer wheel.
Add the rearm checks to the existing sched_hrtick_enter/exit() functions,
which already handle the batched rearm of the hrtick timer.
For now this is just placing empty stubs at the right places which are all
optimized out by the compiler until the guard condition becomes true.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.208580085@kernel.org
|
|
The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.
That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK
is enabled then schedule() will modify the hrtick timer, which causes
another reprogramming of the hardware.
That can be avoided by deferring the rearming to the return from
interrupt path, and if the return results in an immediate schedule()
invocation then it can be deferred until the end of schedule(), which
avoids multiple rearms and re-evaluation of the timer wheel.
In case the return from interrupt ends up handling softirqs before
reaching the rearm conditions in the return to user entry code
functions, a deferred rearm has to be handled before softirq handling
enables interrupts, as soft interrupt handling can be long and would
therefore introduce hard to diagnose latencies to the timer interrupt.
Place the for now empty stub call right before invoking the softirq
handling routine.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.142854488@kernel.org
|
|
The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.
That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK
is enabled then schedule() will modify the hrtick timer, which causes
another reprogramming of the hardware.
That can be avoided by deferring the rearming to the return from
interrupt path, and if the return results in an immediate schedule()
invocation then it can be deferred until the end of schedule(), which
avoids multiple rearms and re-evaluation of the timer wheel.
As this is only relevant for the interrupt to user return, split the
work masks up and hand them in as arguments from the relevant exit to
user functions, which allows the compiler to optimize the deferred
handling out for the syscall exit to user case.
Add the rearm checks to the appropriate places in the exit to user loop
and the interrupt return to kernel path, so that the rearming is always
guaranteed.
In the return to user space path this is handled in the same way as
TIF_RSEQ to avoid extra instructions in the fast path, which are truly
hurtful for device interrupt heavy workloads: the extra instructions and
conditionals, while benign at first sight, quickly accumulate into
measurable regressions. The return from syscall path is completely
unaffected due to the above mentioned split, so syscall heavy workloads
won't have any extra burden.
For now this is just placing empty stubs at the right places which are all
optimized out by the compiler until the actual functionality is in place.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.066469985@kernel.org
|
|
The hrtimer interrupt expires timers and at the end of the interrupt it
rearms the clockevent device for the next expiring timer.
That's obviously correct, but in the case that an expired timer sets
NEED_RESCHED, the return from interrupt ends up in schedule(). If HRTICK
is enabled then schedule() will modify the hrtick timer, which causes
another reprogramming of the hardware.
That can be avoided by deferring the rearming to the return from
interrupt path, and if the return results in an immediate schedule()
invocation then it can be deferred until the end of schedule().
To make this correct the affected code parts need to be made aware of this.
Provide empty stubs for the deferred rearming mechanism, so that the
relevant code changes for entry, softirq and scheduler can be split up into
separate changes independent of the actual enablement in the hrtimer code.
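A minimal sketch of such stubs; the config guard name is an assumption,
the function name is the one used elsewhere in the series:

	/* Hypothetical sketch: empty stubs compile away entirely until
	 * the hrtimer core enables the deferred rearm mechanism. */
	#ifdef CONFIG_HRTIMER_DEFERRED_REARM	/* guard name assumed */
	extern void hrtimer_rearm_deferred(void);
	#else
	static inline void hrtimer_rearm_deferred(void) { }
	#endif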
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163431.000891171@kernel.org
|
|
The upcoming deferred rearming scheme has the same effect as the
existing deferral of rearming while the hrtimer interrupt is executing.
So it can reuse the in_hrtirq flag, but once the rearm is deferred
beyond the hrtimer interrupt path the name does not make sense anymore.
Rename it to deferred_rearm upfront to keep the actual functional change
separate from the mechanical rename churn.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.935623347@kernel.org
|
|
Rework hrtimer_interrupt() such that reprogramming is split out into an
independent function at the end of the interrupt.
This prepares for reprogramming getting delayed beyond the end of
hrtimer_interrupt().
Notably, this changes the hang handling to always wait 100ms instead of
trying to keep it proportional to the actual delay. This simplifies the
state, and this situation really shouldn't be happening anyway.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.870639266@kernel.org
|
|
As the base switch can be avoided completely when the base stays the
same, the remove/enqueue handling can be streamlined.
Split it out into a separate function which handles both in one go,
which is way more efficient and makes the code simpler to follow.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.737600486@kernel.org
|
|
The decision to keep a timer which is associated to the local CPU on
that CPU does not take NOHZ information into account. As a result there
are a lot of hrtimer base switch invocations which end up not switching
the base and staying on the local CPU. That's just work for nothing and
can be improved.
If the local CPU is part of the NOISE housekeeping mask, then check:
1) Whether the local CPU has the tick running, which means it is
either not idle or already expecting a timer soon.
2) Whether the tick is stopped and need_resched() is set, which
means the CPU is about to exit idle.
This significantly reduces the number of hrtimer base switch attempts
which end up on the local CPU anyway, and prepares for further
optimizations.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.673473029@kernel.org
|
|
The decision whether to keep timers on the local CPU or on the CPU they are
associated to is suboptimal and causes the expensive switch_hrtimer_base()
mechanism to be invoked more than necessary. This is especially true for
pinned timers.
Rewrite the decision logic so that the current base is kept if:
1) The callback is running on the base
2) The timer is associated to the local CPU and is the first expiring
timer, as that allows optimizing for reprogramming avoidance
3) The timer is associated to the local CPU and pinned
4) The timer is associated to the local CPU and timer migration is
disabled.
Only #2 was covered by the original code, but especially #3 makes a
difference for high frequency rearming timers like the scheduler hrtick
timer. If timer migration is disabled, then #4 avoids most of the base
switches.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.607935269@kernel.org
|
|
All 'u8' flags are true booleans, so make it entirely clear that these
can only contain true or false.
This is especially true for hrtimer::state, which is a historical
leftover from using the state with bitwise operations. That was used in
the early hrtimer implementation with several bits, but it was then
converted to a boolean state. That conversion missed replacing the bit
OR and bit check operations all over the place, which creates suboptimal
code. As of today 'state' is a misnomer because its only purpose is to
reflect whether the timer is enqueued into the RB-tree or not. Rename it
to 'is_queued' and make all operations on it boolean.
This reduces text size from 8926 to 8732 bytes.
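A minimal sketch of the before/after; the flag name is from the old
code, the call sites are assumptions:

	/* Before: bitwise operations on what is effectively a boolean */
	timer->state |= HRTIMER_STATE_ENQUEUED;
	if (timer->state & HRTIMER_STATE_ENQUEUED)
		remove_hrtimer(timer);		/* call site assumed */

	/* After: plain boolean semantics */
	timer->is_queued = true;
	if (timer->is_queued)
		remove_hrtimer(timer);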
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.542427240@kernel.org
|
|
Use bool for the various flags as that creates better code in the hot path.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.475262618@kernel.org
|
|
No point in accessing the timer twice.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.409352042@kernel.org
|
|
As this code has some major surgery ahead, clean up coding style and bring
comments up to date.
No functional change intended.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.342740952@kernel.org
|
|
Simplify and tidy up the code where possible.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.275551488@kernel.org
|
|
hrtimer_start() when invoked with an already armed timer traces like:
<comm>-.. [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= ....
<comm>-.. [032] d.h1. 5.002263: hrtimer_start: hrtimer= ....
That is incorrect, as the timer doesn't get canceled; just the expiry
time changes. The internal dequeue operation which is required for that
is not really interesting for trace analysis, but it makes it tedious to
keep real cancellations and the above case apart.
Remove the cancel tracing in hrtimer_start() and add a 'was_armed'
indicator to the hrtimer start tracepoint, which clearly indicates what the
state of the hrtimer is when hrtimer_start() is invoked:
<comm>-.. [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0
<comm>-.. [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1
Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers")
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.208491877@kernel.org
|
|
The debug object coverage in hrtimer_start_range_ns() happens too late
to do anything useful. Implement the init assertion part and invoke it
early in hrtimer_start_range_ns().
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.143098153@kernel.org
|
|
Some clockevent devices are coupled to the system clocksource by
implementing a less than or equal comparator which compares the
programmed absolute expiry time against the underlying time counter.
The timekeeping core provides a function to convert an absolute
CLOCK_MONOTONIC based expiry time to an absolute clock cycles value
which can be directly fed into the comparator. That spares two time
reads in the next event programming path: one to convert the absolute
nanoseconds time to a delta value and the other to convert the delta
value back to an absolute time value suitable for the comparator.
Provide a new clocksource callback which takes the absolute cycle value
and wire it up in clockevents_program_event(). Similar to clocksources,
allow architectures to inline the rearm operation.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163430.010425428@kernel.org
|
|
Some architectures have clockevent devices which are coupled to the system
clocksource by implementing a less than or equal comparator which compares
the programmed absolute expiry time against the underlying time
counter. Well known examples are TSC/TSC deadline timer and the S390 TOD
clocksource/comparator.
While the concept is nice it has some downsides:
1) The clockevents core code is strictly based on relative expiry times
as that's the most common case for clockevent device hardware. That
requires converting the absolute expiry time provided by the caller
(hrtimers, NOHZ code) to a relative expiry time by reading and
subtracting the current time.
The clockevent::set_next_event() callback must then read the counter
again to convert the relative expiry back into an absolute one.
2) The conversion factors from nanoseconds to counter clock cycles are
set up when the clockevent is registered. When NTP applies corrections
then the clockevent conversion factors can deviate from the
clocksource conversion substantially which either results in timers
firing late or in the worst case early. The early expiry then needs to
do a reprogram with a short delta.
In most cases this is papered over by the fact that the read in the
set_next_event() callback happens after the read which is used to
calculate the delta. So the tendency is that timers expire mostly
late.
All of this can be avoided by providing support for these devices in the
core code:
1) The timekeeping core keeps track of the last update to the clocksource
by storing the base nanoseconds and the corresponding clocksource
counter value. That's used to keep the conversion math for reading the
time within 64-bit in the common case.
This information can be used to avoid both reads of the underlying
clocksource in the clockevents reprogramming path:
delta = expiry - base_ns;
cycles = base_cycles + ((delta * clockevent::mult) >> clockevent::shift);
The resulting cycles value can be directly used to program the
comparator.
2) As #1 no longer provides the "compensation" through the second
read, the deviation of the clocksource and clockevent conversions
caused by NTP becomes more prominent.
This can be cured by letting the timekeeping core compute and store
the reverse conversion factors when the clocksource cycles to
nanoseconds factors are modified by NTP:
CS::MULT (1 << NS_TO_CYC_SHIFT)
--------------- = ----------------------
(1 << CS:SHIFT) NS_TO_CYC_MULT
Ergo: NS_TO_CYC_MULT = (1 << (CS::SHIFT + NS_TO_CYC_SHIFT)) / CS::MULT
The NS_TO_CYC_SHIFT value is calculated when the clocksource is
installed so that it aims for a one hour maximum sleep time.
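A minimal sketch of the resulting conversion, with assumed variable
names for the cached timekeeping data; the math is the formula given
above:

	/* Hypothetical sketch: convert an absolute CLOCK_MONOTONIC
	 * expiry directly to an absolute counter value using the
	 * cached (base_ns, base_cycles) pair and the reverse, NTP
	 * corrected conversion factors. */
	u64 delta  = expiry_ns - base_ns;
	u64 cycles = base_cycles +
		     ((delta * ns_to_cyc_mult) >> ns_to_cyc_shift);
	/* 'cycles' can be fed directly into the comparator */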
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.944763521@kernel.org
|
|
On some architectures clocksource::read() boils down to a single
instruction, so the indirect function call is just a massive overhead
especially with speculative execution mitigations in effect.
Allow architectures to enable conditional inlining of that read to avoid
that by:
- providing a static branch to switch to the inlined variant
- disabling the branch before clocksource changes
- enabling the branch after a clocksource change, when the clocksource
indicates in a feature flag that it is the one which provides the
inlined variant
This is intentionally not a static call as that would only remove the
indirect call, but not the rest of the overhead.
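A minimal sketch of the conditional inlining; the key name and the arch
provided arch_cs_read() are assumptions:

	DEFINE_STATIC_KEY_FALSE(cs_inline_read_key);	/* key name assumed */

	static inline u64 cs_read(struct clocksource *cs)
	{
		if (static_branch_likely(&cs_inline_read_key))
			return arch_cs_read();		/* arch provided, assumed */
		return cs->read(cs);
	}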
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.675151545@kernel.org
|
|
The only real use case for this is the hrtimer based broadcast device.
No point in using two different feature flags for this.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.609049777@kernel.org
|
|
The sequence of cancel and start is inefficient. It has to do the timer
lock/unlock twice and in the worst case has to reprogram the underlying
clock event device twice.
The reason why it is done this way is the usage of hrtimer_forward_now(),
which requires the timer to be inactive.
But that can be completely avoided as the forward can be done on a variable
and does not need any of the overrun accounting provided by
hrtimer_forward_now().
Implement a trivial forwarding mechanism and replace the cancel/reprogram
sequence with hrtimer_start(..., new_expiry).
For the non high resolution case the timer is not actually armed, but
used for storage so that code checking for expiry times can
unconditionally look it up in the timer. So it is safe for that case to
set the new expiry time directly.
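A minimal sketch of the trivial forward on a local variable, which
avoids hrtimer_forward_now()'s requirement that the timer be inactive;
'period' stands in for the tick period:

	/* Hypothetical sketch: forward the expiry without the overrun
	 * accounting, then rearm in one go. */
	ktime_t expires = hrtimer_get_expires(timer);
	ktime_t now = ktime_get();

	while (ktime_compare(expires, now) <= 0)
		expires = ktime_add(expires, period);

	hrtimer_start(timer, expires, HRTIMER_MODE_ABS_PINNED_HARD);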
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.542178086@kernel.org
|
|
The hrtick timer is frequently rearmed before expiry, and most of the
time the new expiry is past the armed one. As this happens on every
context switch it becomes expensive with scheduling heavy workloads,
especially in virtual machines where the "hardware" reprogramming
implies a VM exit.
hrtimers now provide a lazy rearm mode flag which skips the
reprogramming if:
1) The timer was the first expiring timer before the rearm
2) The new expiry time is farther out than the armed time
This avoids a massive amount of reprogramming operations of the hrtick
timer for the price of eventually taking the already armed interrupt
for nothing.
Mark the hrtick timer accordingly.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.475409346@kernel.org
|
|
The hrtick timer is frequently rearmed before expiry, and most of the
time the new expiry is past the armed one. As this happens on every
context switch it becomes expensive with scheduling heavy workloads,
especially in virtual machines where the "hardware" reprogramming
implies a VM exit.
Add a lazy rearm mode flag which skips the reprogramming if:
1) The timer was the first expiring timer before the rearm
2) The new expiry time is farther out than the armed time
This avoids a massive amount of reprogramming operations of the hrtick
timer for the price of eventually taking the already armed interrupt
for nothing.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.408524456@kernel.org
|
|
Tiny adjustments to the hrtick expiry time below 5 microseconds are just
causing extra work for no real value. Filter them out when restarting the
hrtick.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.340593047@kernel.org
|
|
schedule() provides several mechanisms to update the hrtick timer:
1) When the next task is picked
2) When the balance callbacks are invoked before rq::lock is released
Each of them can result in a first expiring timer and cause a reprogram of
the clock event device.
Solve this by deferring the rearm to the end of schedule(): set a flag
on entry which tells hrtick_start() to cache the runtime constraint in
rq::hrtick_delay without touching the timer itself.
Right before releasing rq::lock, evaluate the flags and either rearm or
cancel the hrtick timer.
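A minimal sketch of the deferral; rq::hrtick_delay is named in the
commit, the flag and helper names are assumptions:

	/* In hrtick_start(), hypothetical sketch: */
	if (rq->hrtick_deferred) {		/* flag name assumed */
		rq->hrtick_delay = delay;	/* cache, don't touch the timer */
		return;
	}

	/* At the end of schedule(), right before releasing rq::lock: */
	if (rq->hrtick_delay)
		__hrtick_restart(rq);		/* helper name assumed */
	else
		hrtimer_cancel(&rq->hrtick_timer);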
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.273068659@kernel.org
|
|
Use the static branch based variant and thereby avoid following three
pointers.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.203610956@kernel.org
|
|
The scheduler evaluates this via hrtimer_is_hres_active() every time it has
to update HRTICK. This needs to follow three pointers, which is expensive.
Provide a static branch based mechanism to avoid that.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.136503358@kernel.org
|
|
Much like hrtimer_reprogram(), skip programming if the cpu_base is running
the hrtimer interrupt.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260224163429.069535561@kernel.org
|
|
The clock of the hrtick and deadline timers is known to be CLOCK_MONOTONIC.
No point in looking it up via hrtimer_cb_get_time().
Just use ktime_get() directly.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163429.001511662@kernel.org
|
|
Since the tick causes hard preemption, the hrtick should too.
Letting the hrtick do lazy preemption completely defeats the purpose,
since preemption would then still be delayed until the regular tick and
remain dependent on CONFIG_HZ.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163428.933894105@kernel.org
|
|
hrtick_update() was needed when the slice depended on nr_running; all
that code is gone. All that remains is starting the hrtick when
nr_running becomes greater than 1.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260224163428.866374835@kernel.org
|
|
The nominal duration for an EEVDF task to run is until its deadline. At
which point the deadline is moved ahead and a new task selection is done.
Try and predict the time 'lost' to higher scheduling classes. Since this is
an estimate, the timer can be either early or late. In case it is
early, task_tick_fair() will take the !need_resched() path and restart
the timer.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20260224163428.798198874@kernel.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes. 7 are cc:stable. 8 are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: update Yosry Ahmed's email address
mailmap: add entry for Daniele Alessandrelli
mm: fix NULL NODE_DATA dereference for memoryless nodes on boot
mm/tracing: rss_stat: ensure curr is false from kthread context
mm/kfence: fix KASAN hardware tag faults during late enablement
mm/damon/core: disallow non-power of two min_region_sz
Squashfs: check metadata block offset is within range
MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka
liveupdate: luo_file: remember retrieve() status
mm: thp: deny THP for files on anonymous inodes
mm: change vma_alloc_folio_noprof() macro to inline function
mm/kfence: disable KFENCE upon KASAN HW tags enablement
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko
and Stian Halseth)"
* tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
sparc: Fix page alignment in dma mapping
dma-mapping: avoid random addr value print out on error path
|
|
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no
explicit value, so the compiler assigns it 0. This makes the bitwise OR
in scx_ops_init() a no-op:
sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */
As a result, BPF schedulers cannot distinguish whether ops.init()
completed successfully by inspecting exit_info->flags.
Assign the value 1LLU << 0 so the flag is actually set.
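A minimal sketch of the fix, with the enum shape as described in this
commit:

	/* Before: no explicit value, so the sole enumerator becomes 0
	 * and 'flags |= SCX_EFLAG_INITIALIZED' never sets anything. */
	enum scx_exit_flags {
		SCX_EFLAG_INITIALIZED = 1LLU << 0,	/* after the fix */
	};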
Fixes: f3aec2adce8d ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()")
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
get_upper_ifindexes() iterates over all upper devices and writes their
indices into an array without checking bounds.
Also the callers assume that the max number of upper devices is
MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack,
but that assumption is not correct and the number of upper devices could
be larger than MAX_NEST_DEV (e.g., many macvlans), causing a
stack-out-of-bounds write.
Add a max parameter to get_upper_ifindexes() to avoid the issue.
When there are too many upper devices, return -EOVERFLOW and abort the
redirect.
To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with
an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS.
Then send a packet to the device to trigger the XDP redirect path.
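A minimal sketch of the bounded variant; the exact signature is an
assumption, the iteration macro is the existing upper-device walker:

	/* Hypothetical sketch: stop collecting once 'max' entries are
	 * written and let the caller abort the redirect. */
	static int get_upper_ifindexes(struct net_device *dev,
				       int *indexes, int max)
	{
		struct net_device *upper;
		struct list_head *iter;
		int count = 0;

		netdev_for_each_upper_dev_rcu(dev, upper, iter) {
			if (count >= max)
				return -EOVERFLOW;
			indexes[count++] = upper->ifindex;
		}
		return count;
	}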
Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/
Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|