Merge tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix lockup in requeue-PI during signal/timeout wakeups, by Sebastian
Andrzej Siewior"
* tag 'locking-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Prevent lockup in requeue-PI during signal/timeout wakeup
|
|
Merge tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
- Fix the delayed dequeue negative lag increase fix in the
fair scheduler (Peter Zijlstra)
- Fix wakeup_preempt_fair() to do proper delayed dequeue
(Vincent Guittot)
- Clear sched_entity::rel_deadline when initializing forked
entities; a stale value can make all tasks EEVDF-ineligible,
causing a NULL pointer dereference crash in
pick_next_entity() (Zicheng Qu)
* tag 'sched-urgent-2026-05-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Clear rel_deadline when initializing forked entities
sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue
sched/fair: Fix the negative lag increase fix
|
|
Currently need_futex_hash_allocate_default() depends on strict pthread
semantics, abusing CLONE_THREAD. This breaks the non-concurrency
assumptions when doing the mm->futex_ref pcpu allocations, leading to
bugs[0] when sharing the mm in other ways, e.g.:
BUG: KASAN: slab-use-after-free in futex_hash_put
... where the +1 bias can end up on a percpu counter that mm->futex_ref
no longer points at.
Loosen the check to cover any CLONE_VM clone, except vfork(). Excluding
vfork keeps the existing paths untouched (no overhead), and we can't
race in the first place: either the parent is suspended and the child
runs alone, or mm->futex_ref is already allocated from an earlier
CLONE_VM.
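A minimal sketch of the loosened condition (the helper name is from the
text above; the exact signature and call site are assumptions):
    /*
     * Sketch only: allocate the default futex hash for any mm-sharing
     * clone except vfork(), where parent and child cannot run
     * concurrently.
     */
    static bool need_futex_hash_allocate_default(u64 clone_flags)
    {
    	/* old check: clone_flags & CLONE_THREAD (pthread-only) */
    	if (!(clone_flags & CLONE_VM) || (clone_flags & CLONE_VFORK))
    		return false;
    	return true;
    }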
Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/ [0]
Fixes: d9b05321e21e ("futex: Move futex_hash_free() back to __mmput()")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM fixes from Andrew Morton:
"20 hotfixes. All are for MM (and for MMish maintainers). 9 are
cc:stable and the remainder are for post-7.0 issues or aren't deemed
suitable for backporting.
There are two DAMON series from SeongJae Park which address races
which could lead to use-after-free errors, and avoid the possibility
of presenting stale parameter values to users"
* tag 'mm-hotfixes-stable-2026-04-30-15-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memcontrol: fix rcu unbalance in get_non_dying_memcg_end()
mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()
MAINTAINERS: remove stale kdump project URL
mm/damon/stat: detect and use fresh enabled value
mm/damon/lru_sort: detect and use fresh enabled and kdamond_pid values
mm/damon/reclaim: detect and use fresh enabled and kdamond_pid values
selftests/mm: specify requirement for PROC_MEM_ALWAYS_FORCE=y
mm/damon/sysfs-schemes: protect path kfree() with damon_sysfs_lock
mm/damon/sysfs-schemes: protect memcg_path kfree() with damon_sysfs_lock
MAINTAINERS: update Li Wang's email address
MAINTAINERS, mailmap: update email address for Qi Zheng
MAINTAINERS: update Liam's email address
mm/hugetlb_cma: round up per_node before logging it
MAINTAINERS: fix regex pattern in CORE MM category
mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()
mm: start background writeback based on per-wb threshold for strictlimit BDIs
kho: fix error handling in kho_add_subtree()
liveupdate: fix return value on session allocation failure
mailmap: update entry for Dan Carpenter
vmalloc: fix buffer overflow in vrealloc_node_align()
|
|
Merge tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix inverted check when registering the stats for branch tracing
When calling register_stat_tracer(), which returns zero on success and
negative on error, the callers treated a return of zero as an error
and printed a warning message. Because this was just a normal printk()
message and not a WARN(), it wasn't caught in any testing.
Fix the check to print the warning message when an error actually
happens.
- Fix a typo in a comment in tracepoint.h
- Limit the size of event probes to 3K in size
It is possible to create a dynamic event probe via the tracefs system
that is greater than the max size of an event that the ring buffer
can hold. This basically causes the event to become useless.
Limit the size of an event probe to be 3K as that should be large
enough to handle any dynamic events being created, and fits within
the PAGE_SIZE sub-buffers of the ring buffer.
* tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Limit size of event probe to 3K
tracepoint: Fix typo in tracepoint.h comment
tracing: branch: Fix inverted check on stat tracer registration
|
|
There currently is no limit on how large an event probe can be. One
could make an event greater than PAGE_SIZE, which makes the event
useless: if it's bigger than the largest event that can be recorded
into the ring buffer, it will never be recorded.
An event probe should never need to be greater than 3K, so make that
the max size. As long as the max is less than the largest event that
can be recorded onto the ring buffer, it should be fine.
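A sketch of what the cap amounts to (the constant and helper names are
hypothetical):
    #define EPROBE_MAX_EVENT_SIZE	3072	/* 3K: fits PAGE_SIZE sub-buffers */

    static int eprobe_check_size(size_t event_size)
    {
    	if (event_size > EPROBE_MAX_EVENT_SIZE)
    		return -E2BIG;	/* would never fit in the ring buffer */
    	return 0;
    }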
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events")
Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
During wait-requeue-pi (task A) and requeue-PI (task B) the following
race can happen:
Task A                                  Task B
futex_wait_requeue_pi()
  futex_setup_timer()
  futex_do_wait()
                                        futex_requeue()
                                          CLASS(hb, hb1)(&key1);
                                          CLASS(hb, hb2)(&key2);
  *timeout*
  futex_requeue_pi_wakeup_sync()
    requeue_state = Q_REQUEUE_PI_IGNORE
  *blocks on hb->lock*
                                          futex_proxy_trylock_atomic()
                                            futex_requeue_pi_prepare()
                                              Q_REQUEUE_PI_IGNORE => -EAGAIN
                                          double_unlock_hb(hb1, hb2)
                                        *retry*
Task B acquires both hb locks and attempts to acquire the PI-lock of the
top most waiter (task A). Task A is leaving early due to a signal/
timeout and has started removing itself from the queue. It updates its
requeue_state but cannot remove itself from the list because that
requires the hb lock, which is owned by task B.
Usually task A is able to grab the lock after task B unlocks it.
However, if task B is of higher priority, task A may not be able to
wake up in time and acquire the lock before task B gets it again,
especially on a UP system where A is never scheduled.
As a result task A blocks on the lock and task B busy loops, trying to
make progress, but live locks the system instead. Tragic.
This can be fixed by removing the top most waiter from the list in this
case. This allows task B to grab the next top waiter (if any) in the
next iteration and make progress.
Remove the top most waiter if futex_requeue_pi_prepare() fails.
Let the waiter conditionally remove itself from the list in
handle_early_requeue_pi_wakeup().
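A rough sketch of the requeue-side change (function names appear in the
race diagram above; the exact return-value handling is an assumption):
    /*
     * Sketch only: in futex_proxy_trylock_atomic(), if the top waiter
     * has already left the requeue state machine (signal/timeout),
     * unqueue it so the retry can pick the next waiter instead of
     * live-locking on the same one.
     */
    if (!futex_requeue_pi_prepare(top_waiter, NULL)) {
    	__futex_unqueue(top_waiter);	/* drop the departing waiter */
    	return -EAGAIN;			/* retry with the next top waiter */
    }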
Fixes: 07d91ef510fb1 ("futex: Prevent requeue_pi() lock nesting issue on RT")
Reported-by: Moritz Klammler <Moritz.Klammler@ferchau.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de
Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com
|
|
Merge tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The merge window pulled in the cgroup sub-scheduler infrastructure,
and new AI reviews are accelerating bug reporting and fixing - hence
the larger than usual fixes batch:
- Use-after-frees during scheduler load/unload:
- The disable path could free the BPF scheduler while deferred
irq_work / kthread work was still in flight
- cgroup setter callbacks read the active scheduler outside the
rwsem that synchronizes against teardown
Fix both, and reuse the disable drain in the enable error paths so
the BPF JIT page can't be freed under live callbacks.
- Several BPF op invocations didn't tell the framework which runqueue
was already locked, so helper kfuncs that re-acquire the runqueue
by CPU could deadlock on the held lock
Fix the affected callsites, including recursive parent-into-child
dispatch.
- The hardlockup notifier ran from NMI but eventually took a
non-NMI-safe lock. Bounce it through irq_work.
- A handful of bugs in the new sub-scheduler hierarchy:
- helper kfuncs hard-coded the root instead of resolving the
caller's scheduler
- the enable error path tried to disable per-task state that had
never been initialized, and leaked cpus_read_lock on the way
out
- a sysfs object was leaked on every load/unload
- the dispatch fast-path used the root scheduler instead of the
task's
- a couple of CONFIG #ifdef guards were misclassified
- Verifier-time hardening: BPF programs of unrelated struct_ops types
(e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic
bug and, once sub-sched was enabled, a KASAN out-of-bounds read.
Now rejected at load. Plus a few NULL and cross-task argument
checks on sched_ext kfuncs, and a selftest covering the new deny.
- rhashtable (Herbert): restore the insecure_elasticity toggle and
bounce the deferred-resize kick through irq_work to break a
lock-order cycle observable from raw-spinlock callers. sched_ext's
scheduler-instance hash is the first user of both.
- The bypass-mode load balancer used file-scope cpumasks; with
multiple scheduler instances now possible, those raced. Move to
per-instance cpumasks, plus a follow-up to skip tasks whose
recorded CPU is stale relative to the new owning runqueue.
- Smaller fixes:
- a dispatch queue's first-task tracking misbehaved when a parked
iterator cursor sat in the list
- the runqueue's next-class wasn't promoted on local-queue
enqueue, leaving an SCX task behind RT in edge cases
- the reference qmap scheduler stopped erroring on legitimate
cross-scheduler task-storage misses"
* tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits)
sched_ext: Fix scx_flush_disable_work() UAF race
sched_ext: Call wakeup_preempt() in local_dsq_post_enq()
sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable
sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime
sched_ext: Refuse cross-task select_cpu_from_kfunc calls
sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED
sched_ext: Make bypass LB cpumasks per-scheduler
sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before
sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task
sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP
sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail
sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued()
sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters
sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path
sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()
sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new
sched_ext: Unregister sub_kset on scheduler disable
sched_ext: Defer scx_hardlockup() out of NMI
sched_ext: sync disable_irq_work in bpf_scx_unreg()
sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched
...
|
|
init_annotated_branch_stats() and all_annotated_branch_stats() check the
return value of register_stat_tracer() with "if (!ret)", but
register_stat_tracer() returns 0 on success and a negative errno on
failure. The inverted check causes the warning to be printed on every
successful registration, e.g.:
Warning: could not register annotated branches stats
while leaving real failures silent. The initcall also returned a
hard-coded 1 instead of the actual error.
Invert the check and propagate ret so that the warning fires on real
errors and the initcall reports the correct status.
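A sketch of the corrected initcall per the description above (the
stat-tracer struct name is an assumption):
    static __init int init_annotated_branch_stats(void)
    {
    	int ret;

    	ret = register_stat_tracer(&annotated_branch_stats);
    	if (ret) {	/* was: if (!ret), firing on every success */
    		pr_warn("Warning: could not register annotated branches stats\n");
    		return ret;	/* was: a hard-coded 1 */
    	}
    	return 0;
    }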
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org
Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
scx_flush_disable_work() calls irq_work_sync() followed by
kthread_flush_work() to ensure that the disable kthread work has
fully completed before bpf_scx_unreg() frees the SCX scheduler.
However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall)
creates a race window between scx_claim_exit() and irq_work_queue():
CPU A (scx_vexit (watchdog))            CPU B (bpf_scx_unreg)
----                                    ----
scx_claim_exit()
  atomic_try_cmpxchg(NONE->kind)
  stack_trace_save()
  vscnprintf()
                                        scx_disable()
                                          scx_claim_exit() -> FAIL
                                        scx_flush_disable_work()
                                          irq_work_sync()      // no-op: not queued yet
                                          kthread_flush_work() // no-op: not queued yet
                                        kobject_put(&sch->kobj) -> free %sch
irq_work_queue() -> UAF on %sch
scx_disable_irq_workfn()
  kthread_queue_work() -> UAF
The root cause is that CPU B's scx_flush_disable_work() returns after
syncing an irq_work that has not yet been queued, while CPU A is still
executing the code between scx_claim_exit() and irq_work_queue().
Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining
disable_irq_work and disable_work in each pass. This ensures that any
work queued after the previous check is caught, while also correctly
handling cases where no disable was triggered (e.g., the
scx_sub_enable_workfn() abort path).
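A sketch of the drain loop described above (field names follow the
text; exact types are assumptions):
    static void scx_flush_disable_work(struct scx_sched *sch)
    {
    	int kind;

    	do {
    		/* catch work queued after the previous pass */
    		irq_work_sync(&sch->disable_irq_work);
    		kthread_flush_work(&sch->disable_work);
    		kind = atomic_read(&sch->exit_kind);
    	} while (kind != SCX_EXIT_DONE && kind != SCX_EXIT_NONE);
    }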
Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()")
Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
There are several edge cases (see linked thread) where an IMMED task
can be left lingering on a local DSQ if an RT task swoops in at the
wrong time. All of these edge cases are due to rq->next_class being idle
even after dispatching a task to rq's local DSQ. We should bump
rq->next_class to &ext_sched_class as soon as we've inserted a task into
the local DSQ.
To optimize the common case of rq->next_class == &ext_sched_class,
only call wakeup_preempt() if rq->next_class is below EXT. If next_class
is EXT or above, wakeup_preempt() is a no-op anyway.
This lets us also simplify the preempt_curr() logic a bit since
wakeup_preempt() will call preempt_curr() for us if next_class is
below EXT.
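A sketch of the post-enqueue path this describes (field and helper
semantics per the text; details are assumptions):
    /* after inserting @p into rq's local DSQ */
    if (sched_class_above(&ext_sched_class, rq->next_class)) {
    	/* bump next_class so a racing RT pick can't strand the task */
    	rq->next_class = &ext_sched_class;
    	wakeup_preempt(rq, p, 0);	/* also handles preempt_curr */
    }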
Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A yield-triggered crash can happen when a newly forked sched_entity
enters the fair class with se->rel_deadline unexpectedly set.
The failing sequence is:
1. A task is forked while se->rel_deadline is still set.
2. __sched_fork() initializes vruntime, vlag and other sched_entity
state, but does not clear rel_deadline.
3. On the first enqueue, enqueue_entity() calls place_entity().
4. Because se->rel_deadline is set, place_entity() treats se->deadline
as a relative deadline and converts it to an absolute deadline by
adding the current vruntime.
5. However, the forked entity's deadline is not a valid inherited
relative deadline for this new scheduling instance, so the conversion
produces an abnormally large deadline.
6. If the task later calls sched_yield(), yield_task_fair() advances
se->vruntime to se->deadline.
7. The inflated vruntime is then used by the following enqueue path,
where the vruntime-derived key can overflow when multiplied by the
entity weight.
8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility
calculation, and can eventually make all entities appear ineligible.
pick_next_entity() may then return NULL unexpectedly, leading to a
later NULL dereference.
A captured trace shows the effect clearly. Before yield, the entity's
vruntime was around:
9834017729983308
After yield_task_fair() executed:
se->vruntime = se->deadline
the vruntime jumped to:
19668035460670230
and the deadline was later advanced further to:
19668035463470230
This shows that the deadline had already become abnormally large before
yield_task_fair() copied it into vruntime.
rel_deadline is only meaningful when se->deadline really carries a
relative deadline that still needs to be placed against vruntime. A
freshly forked sched_entity should not inherit or retain this state.
Clear se->rel_deadline in __sched_fork(), together with the other
sched_entity runtime state, so that the first enqueue does not interpret
the new entity's deadline as a stale relative deadline.
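A sketch of the fix in __sched_fork(), alongside the existing
sched_entity init (the neighboring fields shown are assumptions):
    p->se.vruntime     = 0;
    p->se.vlag         = 0;
    p->se.rel_deadline = 0;	/* don't treat the inherited deadline as relative */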
Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
Analyzed-by: Hui Tang <tanghui20@huawei.com>
Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com
|
|
Similar to how pick_next_entity() must dequeue delayed entities, so too must
wakeup_preempt_fair(). Any delayed task being found means it is eligible and
hence past the 0-lag point, ready for removal.
Worse, by not removing delayed entities from consideration, it can skew the
preemption decision, with the end result that a short slice wakeup will not
result in a preemption.
                         tip/sched/core  tip/sched/core  +this patch
cyclictest slice (ms)    (default) 2.8   8               8
hackbench slice (ms)     (default) 2.8   20              20
Total Samples          |  22559          22595           22683
Average (us)           |    157             64 ( 59%)       59 (  8%)
Median (P50) (us)      |     57             57 (  0%)       58 ( -2%)
90th Percentile (us)   |     64             60 (  6%)       60 (  0%)
99th Percentile (us)   |   2407             67 ( 97%)       67 (  0%)
99.9th Percentile (us) |   3400           2288 ( 33%)      727 ( 68%)
Maximum (us)           |   5037           9252 (-84%)     7461 ( 19%)
Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org
|
|
Vincent reported that my rework of his original patch lost a little
something.
Specifically, it got the return value wrong: it should not compare
against the old se->vlag, but rather against the current value, since
what matters is whether the effective vruntime of the entity is
affected and needs repositioning.
Fixes: 059258b0d424 ("sched/fair: Prevent negative lag increase during delayed dequeue")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260423094107.GT3102624%40noisy.programming.kicks-ass.net
|
|
Merge tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix UAF race in psi pressure_write() against cgroup file release by
extending cgroup_mutex coverage and ordering of->priv access after
cgroup_kn_lock_live()
- Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX
by performing the increment in s64
- Fix asymmetric DL bandwidth accounting on cpuset attach rollback by
recording the CPU used by dl_bw_alloc() so cancel_attach() returns
the reservation to the same root domain
- Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after
rmdir by incrementing from kill_css() instead of offline_css()
- Typo fix in cgroup-v2 documentation
* tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fix typo 'protetion' -> 'protection'
cgroup: Increment nr_dying_subsys_* from rmdir context
cgroup/cpuset: record DL BW alloc CPU for attach rollback
cgroup/rdma: fix integer overflow in rdmacg_try_charge()
sched/psi: fix race between file release and pressure write
|
|
Fix two error handling issues in kho_add_subtree():
1. If fdt_setprop() fails after the subnode has been created, the
subnode is not removed. This leaves an incomplete node in the FDT
(missing "preserved-data" or "blob-size" properties).
2. The fdt_setprop() return value (an FDT error code) is stored
directly in err and returned to the caller, which expects -errno.
Fix both by storing fdt_setprop() results in fdt_err, jumping to a new
out_del_node label that removes the subnode on failure, and only setting
err = 0 on the success path, otherwise returning -ENOMEM (instead of
FDT_ERR_ errors that would come from fdt_setprop).
No user-visible changes. This patch fixes error handling in the KHO
(Kexec HandOver) subsystem, which is used to preserve data across kexec
reboots. The fix only affects a rare failure path during kexec
preparation — specifically when the kernel runs out of space in the
Flattened Device Tree buffer while registering preserved memory regions.
In the unlikely event that this error path was triggered, the old code
would leave a malformed node in the device tree and return an incorrect
error code to the calling subsystem, which could lead to confusing log
messages or incorrect recovery decisions. With this fix, the incomplete
node is properly cleaned up and the appropriate errno value is
propagated; this error code is not returned to the user.
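A sketch of the corrected unwind (the out_del_node label and property
names are from the text; surrounding details are assumptions):
    	fdt_err = fdt_setprop(fdt, node, "preserved-data", &pd, sizeof(pd));
    	if (fdt_err < 0)
    		goto out_del_node;
    	fdt_err = fdt_setprop(fdt, node, "blob-size", &size, sizeof(size));
    	if (fdt_err < 0)
    		goto out_del_node;
    	err = 0;			/* only set on full success */
    	goto out;
    out_del_node:
    	fdt_del_node(fdt, node);	/* don't leave a half-built node */
    	err = -ENOMEM;			/* don't leak FDT_ERR_* to callers */
    out:
    	return err;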
Link: https://lore.kernel.org/20260410-kho_fix_send-v2-1-1b4debf7ee08@debian.org
Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers")
Signed-off-by: Breno Leitao <leitao@debian.org>
Suggested-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Breno Leitao <leitao@debian.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When session allocation fails during deserialization, the global 'err'
variable was not updated before returning. This caused subsequent calls
to luo_session_deserialize() to incorrectly report success.
Set 'err' to the error code from PTR_ERR(session). This ensures that an
error is correctly returned to userspace when it attempts to open
/dev/liveupdate in the new kernel after a failed deserialization.
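The shape of the fix (a sketch; the allocator call is hypothetical):
    session = luo_session_alloc(name);	/* hypothetical call */
    if (IS_ERR(session)) {
    	err = PTR_ERR(session);	/* was missing: 'err' stayed stale */
    	goto out;
    }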
Link: https://lore.kernel.org/20260415193738.515491-1-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Cc: David Matlack <dmatlack@google.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
scx_root_enable_workfn() takes cpus_read_lock() before
scx_link_sched(sch), but the `if (ret) goto err_disable` on failure
skips the matching cpus_read_unlock() - all other err_disable gotos
along this path drop the lock first.
scx_link_sched() only returns non-zero on the sub-sched path
(parent != NULL), so the leak path is unreachable via the root
caller today. Still, the unwind is out of line with the surrounding
paths.
Drop cpus_read_lock() before goto err_disable.
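The shape of the fix (a sketch; surrounding code is elided):
    cpus_read_lock();
    /* ... earlier setup ... */
    ret = scx_link_sched(sch);
    if (ret) {
    	cpus_read_unlock();	/* was leaked on this goto */
    	goto err_disable;
    }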
v2: Correct Fixes: tag (Andrea Righi).
Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that
have no struct_ops association when the root scheduler has sub_attach
set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass
that NULL into scx_task_on_sched(sch, p), which under
CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For
any non-scx task p->scx.sched is NULL, so NULL == NULL returns true
and the authority gate is bypassed - a privileged but
non-struct_ops-associated prog can poke p->scx.slice /
p->scx.dsq_vtime on arbitrary tasks.
Reject !sch up front so the gate only admits callers with a resolved
scheduler.
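A sketch of the guard (helper and field names are from the text above;
surrounding code is an assumption):
    sch = scx_prog_sched(aux);
    if (!sch)
    	return;		/* no struct_ops association: deny, don't compare NULLs */
    if (!scx_task_on_sched(sch, p))
    	return;
    p->scx.slice = slice;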
Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
select_cpu_from_kfunc() skipped pi_lock for @p when called from
ops.select_cpu() or another rq-locked SCX op, assuming the held lock
protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an
arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or
ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's,
not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with
set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU.
Abort the scheduler on cross-task calls in both branches: for
ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up
task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET();
for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq().
v2: Switch the in_select_cpu cross-task check from direct_dispatch_task
comparison to scx_kf_arg_task_ok(). The former spuriously rejects when
ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls
scx_bpf_select_cpu_*() on the same task. (Andrea Righi)
Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrea Righi <arighi@nvidia.com>
|
|
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified:
- scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind
in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the
matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB`
only (via sch->cgrp, stored only under SUB). GROUP-only would leak a
reference on every root-sched enable.
- sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch
SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't
compile.
GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n
gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't
reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization
triggers a real bug in any reachable config, but keep the guards honest:
- Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side
put).
- Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block
with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{
lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core.
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all
scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances
each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW
the global cpumasks with no lock, corrupting donee/resched decisions.
Move the cpumasks into struct scx_sched, allocate them alongside the timer
in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work().
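A sketch of the per-instance layout (field names follow the text;
surrounding members are elided):
    struct scx_sched {
    	/* ... */
    	struct timer_list	bypass_lb_timer;
    	cpumask_var_t		bypass_lb_donee_cpumask;
    	cpumask_var_t		bypass_lb_resched_cpumask;
    	/* ... */
    };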
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_prio_less() runs from core-sched's pick_next_task() path with rq
locked but invokes ops.core_sched_before() with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.
Pass task_rq(a).
Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes
ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving
scx_locked_rq_state NULL. If the BPF callback calls a kfunc that
re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu)
- it re-acquires the already-held rq.
Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too.
The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps
rq=NULL.
Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on
exit. Correct at the top level, but ops can recurse via
scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which
invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner
call returns, the NULL clobbers the outer's state. The parent's BPF then
calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and
re-acquire the already-held rq.
Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq
parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local
can be typed as 'struct rq *' without colliding with the parameter token in
the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel
through the two base macros and inherit the fix.
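A sketch of the base macro's entry/exit handling (the save/restore
helper name is an assumption):
    struct rq *outer_rq = scx_locked_rq();	/* snapshot on entry */

    update_locked_rq(locked_rq);
    /* ... invoke the BPF op; it may recurse via scx_bpf_sub_dispatch() ... */
    update_locked_rq(outer_rq);		/* restore the outer state, not NULL */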
Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide
whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF
iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a
reliable "no real task" check. If the last real task is unlinked while a
cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then
sees list_empty() == false and skips the first_task update, leaving
scx_bpf_dsq_peek() returning NULL for a non-empty DSQ.
Test dsq->first_task directly, which already tracks only real tasks and is
maintained under dsq->lock.
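A sketch of the FIFO-tail path after the change (the list field names
are assumptions):
    /* was: if (list_empty(&dsq->list)) -- fooled by parked cursors */
    if (!dsq->first_task)
    	dsq->first_task = p;
    list_add_tail(&p->scx.dsq_list.node, &dsq->list);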
Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Cc: Ryan Newton <newton@meta.com>
|
|
scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux)
and inserts the new DSQ into that scheduler's dsq_hash. Its inverse
scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were
hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only
destroy or query DSQs in the root scheduler's hash - never its own. If the
root had a DSQ with the same id, the sub-sched silently destroyed it and the
root aborted on the next dispatch ("invalid DSQ ID 0x0..").
Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the
scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq().
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring
scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs.
If the loaded scheduler is disabled and freed (via RCU work) and another is
enabled between the naked load and the rwsem acquire, the reader sees
scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one
- UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...).
scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write
(scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section
correlates @sch with the enabled snapshot.
Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations")
Cc: stable@vger.kernel.org # v6.18+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false)
without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails
partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the
marked tasks. Task state is still the parent's ENABLED, so that dispatches
to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e.
child->ops.disable() - for tasks on which child->ops.enable() never ran. A
BPF sub-scheduler allocating per-task state in enable/freeing in disable
would operate on uninitialized state.
The dying-task branch in scx_disable_and_exit_task() has the same problem,
and scx_enabling_sub_sched was cleared before the abort cleanup loop - a
task exiting during cleanup tripped the WARN and skipped both ops.exit_task
and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the
task stuck.
Introduce scx_sub_init_cancel_task() that calls ops.exit_task with
cancelled=true - matching what the top-level init path does when init_task
itself returns -errno. Use it in the abort loop and in the dying-task
branch. scx_enabling_sub_sched now stays set until the abort loop finishes
clearing SUB_INIT, so concurrent exits hitting the dying-task branch can
still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally
when seen, leaving the task unmarked even if the WARN fires.
Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without
migrating them - task_cpu() only updates when the donee later consumes the
task via move_remote_task_to_local_dsq(). If the LB timer fires again before
consumption and the new DSQ becomes a donor, @p is still on the previous CPU
and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked.
Skip such tasks.
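The skip, roughly (variable names follow the text):
    /* inside bypass_lb_cpu()'s transfer loop */
    if (task_rq(p) != donor_rq)
    	continue;	/* @p's recorded rq isn't locked here; can't move it */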
Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode")
Cc: stable@vger.kernel.org # v6.19+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
bpf_iter_scx_dsq_new() clears kit->dsq on failure and
bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't -
it dereferences kit->dsq immediately, so a BPF program that calls
scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel.
Return false if kit->dsq is NULL.
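The guard, roughly:
    /* at the top of scx_dsq_move() */
    if (!kit->dsq)		/* iter_new failed and cleared it */
    	return false;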
Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a
child of &sch->kobj, which pins the parent with its own reference. The
disable paths never call kset_unregister(), so the final kobject_put() in
bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs,
leaking the whole struct scx_sched on every load/unload cycle.
Unregister sub_kset in scx_root_disable() and scx_sub_disable() before
kobject_del(&sch->kobj).
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(),
which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing
it from NMI context can lead to deadlocks.
The hardlockup handler is best-effort recovery and the disable path it
triggers runs off of irq_work anyway. Move the handle_lockup() call into
an irq_work so it runs in IRQ context.
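A sketch of the bounce (names are illustrative; handle_lockup() and its
arguments are simplified from the text):
    static void scx_hardlockup_workfn(struct irq_work *work)
    {
    	handle_lockup();	/* now runs in IRQ context */
    }
    static DEFINE_IRQ_WORK(scx_hardlockup_irq_work, scx_hardlockup_workfn);

    void scx_hardlockup(void)
    {
    	/* irq_work_queue() is NMI-safe; scx_sched_lock is not */
    	irq_work_queue(&scx_hardlockup_irq_work);
    }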
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Merge tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fix from Steven Rostedt:
- Fix accounting of persistent ring buffer rewind
On boot up, the head page is moved back to the earliest point of the
saved ring buffer. This is because the ring buffer being read by user
space on a crash may not save the part it read. Rewinding the head
page back to the earliest saved position helps keep those events from
being lost.
The number of events is also read during boot up and displayed in the
stats file in the tracefs directory. It's also used for other
accounting as well. On boot up, the "reader page" is accounted for
but a rewind may put it back into the buffer and then the reader page
may be accounted for again.
Save off the original reader page and skip accounting it when
scanning the pages in the ring buffer.
* tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Do not double count the reader_page
|
|
The cpu_buffer->reader_page is updated if there are unwound pages.
After that update, we should skip the page if it is the original
reader_page, because the original reader_page has already been
counted.
Cc: stable@vger.kernel.org
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/177701353063.2223789.1471163147644103306.stgit@mhiramat.tok.corp.google.com
Fixes: ca296d32ece3 ("tracing: ring_buffer: Rewind persistent ring buffer on reboot")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
When unregistering my self-written scx scheduler, the following panic
occurs.
[ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8!
[ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP
[ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full)
[ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107
[ 230.093972] Workqueue: events_unbound bpf_map_free_deferred
[ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms
[ 230.116843] pc : 0xffff80009bc2c1f8
[ 230.120406] lr : dequeue_task_scx+0x270/0x2d0
[ 230.217749] Call trace:
[ 230.228515] 0xffff80009bc2c1f8 (P)
[ 230.232077] dequeue_task+0x84/0x188
[ 230.235728] sched_change_begin+0x1dc/0x250
[ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240
[ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0
[ 230.249701] ___migrate_enable+0x4c/0xa0
[ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0
[ 230.258246] process_one_work+0x184/0x540
[ 230.262342] worker_thread+0x19c/0x348
[ 230.266170] kthread+0x13c/0x150
[ 230.269465] ret_from_fork+0x10/0x20
[ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
[ 230.287621] ---[ end trace 0000000000000000 ]---
[ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt
The root cause is that the JIT page backing ops->quiescent() is freed
before all callers of that function have stopped.
The expected ordering during teardown is:
bitmap_zero(sch->has_op) + synchronize_rcu()
-> guarantees no CPU will ever call sch->ops.* again
-> only THEN free the BPF struct_ops JIT page
bpf_scx_unreg() is supposed to enforce the order, but after
commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through
irq_work"), disable_work is no longer queued directly, making
kthread_flush_work() a no-op. Thus, the caller drops the struct_ops
map too early and the map is poisoned with AARCH64_BREAK_FAULT before
disable_workfn ever executes.
The subsequent dequeue_task() therefore still sees SCX_HAS_OP(sch,
quiescent) as true and calls ops.quiescent, which hits the poisoned
page and triggers the BRK panic.
Add a helper scx_flush_disable_work() so future users that want to
flush disable_work can use it.
Also amend the calls in scx_root_enable_workfn() and
scx_sub_enable_workfn(), which have a similar pattern in their error
paths.
Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work")
Signed-off-by: Richard Cheng <icheng@nvidia.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Merge tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
- Fix ww_mutex regression, which caused hangs/pauses in some DRM drivers
- Fix rtmutex proxy-rollback bug
* tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/mutex: Fix ww_mutex wait_list operations
rtmutex: Use waiter::task instead of current in remove_waiter()
|
|
Incrementing nr_dying_subsys_* in offline_css(), which is executed by a
cgroup_offline_wq worker, leads to a race where a user can read the
value as 0 from cgroup.stat after calling rmdir but before the worker
executes. This makes the user wrongly expect resources released by the
removed cgroup to be available for a new assignment.
Increment nr_dying_subsys_* from kill_css(), which is called from the
cgroup_rmdir() context.
Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat")
Signed-off-by: Petr Malat <oss@malat.biz>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of
the scheduler instance actually managing the task. When
CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler
whose ops.dequeue() callback differs from root's. Using scx_root causes
the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks
dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have
SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is
never invoked, violating the custody exit semantics.
Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq()
and move_local_task_to_local_dsq(), and propagating the correct scheduler
from their callers dispatch_enqueue(), move_task_between_dsqs(), and
consume_dispatch_q().
This is consistent with dispatch_enqueue()'s non-local path which already
passes 'sch' directly to call_task_dequeue() for global/bypass DSQs.
Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Chaitanya, John and Mikhail reported commit 25500ba7e77c ("locking/mutex:
Remove the list_head from struct mutex") wrecked ww_mutex.
Specifically there were 2 issues:
- __ww_waiter_prev() had the termination condition wrong; it would terminate
when the previous entry was the first, which results in a truncated
iteration: W3, W2, (no W1).
- __mutex_add_waiter(@pos != NULL), as used by __ww_waiter_add() /
__ww_mutex_add_waiter(); this inserts @waiter before @pos (which is what
list_add_tail() does). But this should then also update lock->first_waiter.
Much thanks to Prateek for spotting the __mutex_add_waiter() issue!
Fixes: 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex")
Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com>
Closes: https://lore.kernel.org/r/af005996-05e9-4336-8450-d14ca652ba5d%40intel.com
Reported-by: John Stultz <jstultz@google.com>
Closes: https://lore.kernel.org/r/CANDhNCq%3Doizzud3hH3oqGzTrcjB8OwGeineJ3mwZuGdDWG8fRQ%40mail.gmail.com
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/r/CABXGCsO5fKq2nD9nO8yO1z50ZzgCPWqueNXHANjntaswoOh2Dg@mail.gmail.com
Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://patch.msgid.link/20260422092335.GH3102924%40noisy.programming.kicks-ass.net
|
|
Merge tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fix from Steven Rostedt:
- Make undefsyms_base.c into a real file
The file undefsyms_base.c is used to catch any symbols used by a
remote ring buffer that is built for use by a pKVM hypervisor. As the
hypervisor doesn't share the same text as the rest of the kernel,
referencing any symbols within the kernel will make the build fail for
the standalone hypervisor.
A file was created by the Makefile that checked for any symbols that
could cause issues. There's no reason to have this file created by
the Makefile, just create it as a normal file instead.
* tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Make undefsyms_base.c a first-class citizen
|
|
Merge tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb update from Daniel Thompson:
"Only a very small update for kgdb this cycle: a single patch from
Kexin Sun that fixes some outdated comments"
* tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kgdb: update outdated references to kgdb_wait()
|
|
Linus points out that dumping undefsyms_base.c from the Makefile
is rather ugly, and that a much better course of action would be
to have this file as a first-class citizen in the git tree.
This allows some extra cleanup in the Makefile, and the removal of
the .gitignore file in kernel/trace.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/CAHk-=wieqGd_XKpu8UxDoyADZx8TDe8CF3RmkUXt5N_9t5Pf_w@mail.gmail.com
Link: https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/
Link: https://patch.msgid.link/20260421100455.324333-1-pbonzini@redhat.com
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
Fix fprobe to unregister the ftrace_ops if no fprobe of the
corresponding type remains in the fprobe_ip_table, as it is expected
to be empty when unloading modules.
Since ftrace treats an empty hash as meaning everything is to be
traced, if fprobes were set only on the unloaded module, all functions
end up being traced unexpectedly after the module is unloaded.
e.g.
# modprobe xt_LOG.ko
# echo 'f:test log_tg*' > dynamic_events
# echo 1 > events/fprobes/test/enable
# cat enabled_functions
log_tg [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_check [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
log_tg_destroy [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490
# rmmod xt_LOG
# wc -l enabled_functions
34085 enabled_functions
Link: https://lore.kernel.org/all/177669368776.132053.10042301916765771279.stgit@mhiramat.tok.corp.google.com/
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
The function kgdb_wait() was folded into the static function
kgdb_cpu_enter() by commit 62fae312197a ("kgdb: eliminate
kgdb_wait(), all cpus enter the same way"). Update the four stale
references accordingly:
- include/linux/kgdb.h and arch/x86/kernel/kgdb.c: the
kgdb_roundup_cpus() kdoc describes what other CPUs are rounded up
to call. Because kgdb_cpu_enter() is static, the correct public
entry point is kgdb_handle_exception(); also fix a pre-existing
grammar error ("get them be" -> "get them into") and reflow the
text.
- kernel/debug/debug_core.c: replace with the generic description
"the debug trap handler", since the actual entry path is
architecture-specific.
- kernel/debug/gdbstub.c: kgdb_cpu_enter() is correct here (it
describes internal state, not a call target); add the missing
parentheses.
Suggested-by: Daniel Thompson <daniel@riscstar.com>
Assisted-by: unnamed:deepseek-v3.2 coccinelle
Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
|
|
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
introduced a different ftrace_ops for entry-only fprobes.
However, when unregistering an fprobe, the kernel only checks if another
fprobe exists at the same address, without checking which type of fprobe
it is.
If different fprobes are registered at the same address, the same address
will be registered in both fgraph_ops and ftrace_ops, but only one of
them will be deleted when unregistering. (the one removed first will not
be deleted from the ops).
This results in junk entries remaining in either fgraph_ops or ftrace_ops.
For example:
=======
cd /sys/kernel/tracing
# 'Add entry and exit events on the same place'
echo 'f:event1 vfs_read' >> dynamic_events
echo 'f:event2 vfs_read%return' >> dynamic_events
# 'Enable both of them'
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210
# 'Disable and remove exit event'
echo 0 > events/fprobes/event2/enable
echo -:event2 >> dynamic_events
# 'Disable and remove all events'
echo 0 > events/fprobes/enable
echo > dynamic_events
# 'Add another event'
echo 'f:event3 vfs_open%return' > dynamic_events
cat dynamic_events
f:fprobes/event3 vfs_open%return
echo 1 > events/fprobes/enable
cat enabled_functions
vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150}
=======
As you can see, an entry for vfs_read remains.
To fix this issue, when unregistering, the kernel should also check
whether a fprobe of the same type still exists at the same address,
and if not, delete its entry from either fgraph_ops or ftrace_ops.
Link: https://lore.kernel.org/all/177669367993.132053.10553046138528674802.stgit@mhiramat.tok.corp.google.com/
Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
fprobe_remove_node_in_module() is called with the RCU read lock held,
but it invokes kcalloc() if there are more than 8 fprobes installed
on the module. Sashiko warns about this because kcalloc() can sleep [1].
[1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com
To fix this issue, expand the batch size to 128 and, instead of
expanding the fprobe_addr_list, cancel the walk over fprobe_ip_table,
update fgraph/ftrace_ops, and retry the loop.
Link: https://lore.kernel.org/all/177669367206.132053.1493637946869032744.stgit@mhiramat.tok.corp.google.com/
Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
When register_fprobe_ips() fails, it tries to remove a list of
fprobe_hash_nodes from the fprobe_ip_table, but it misses removing the
fprobe itself from the fprobe_table. Moreover, a fprobe_hash_node that
has been added to the rhltable must be freed with kfree_rcu() after it
is removed from the rhltable.
To fix these issues, reuse the unregister_fprobe() internal code to
roll back the half-registered fprobe.
Link: https://lore.kernel.org/all/177669366417.132053.17874946321744910456.stgit@mhiramat.tok.corp.google.com/
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
unregister_fprobe() can fail under memory pressure because of a memory
allocation failure, but it may be called from module unloading, where
there is usually no way to retry it. Moreover, trace_fprobe does not
check the return value.
To fix this problem, unregister the fprobe and fprobe_hash_node even
if the working memory allocation fails. If the last fprobe is removed,
the filter will be freed anyway.
Link: https://lore.kernel.org/all/177669365629.132053.8433032896213721288.stgit@mhiramat.tok.corp.google.com/
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
Reject registration of an already-registered fprobe, one that is on
the fprobe hash table, before initializing the fprobe.
add_fprobe_hash() checks for this re-registered fprobe, but since
fprobe_init() clears the hlist_array field first, that check comes too
late. The re-registration has to be checked before touching the fprobe.
Link: https://lore.kernel.org/all/177669364845.132053.18375367916162315835.stgit@mhiramat.tok.corp.google.com/
Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|