linux-stable.git/kernel, branch v6.12.95

locking: rtmutex: Fix wake_q logic in task_blocks_on_rt_mutex

2026-07-04T11:43:36+00:00

commit 82f9cc094975240885c93effbca7f4603f5de1bf upstream.

Anders had bisected a crash using PREEMPT_RT with linux-next and
isolated it down to commit 894d1b3db41c ("locking/mutex: Remove
wakeups from under mutex::wait_lock"), where it seemed the
wake_q structure was somehow getting corrupted causing a null
pointer traversal.

I was able to easily repoduce this with PREEMPT_RT and managed
to isolate down that through various call stacks we were
actually calling wake_up_q() twice on the same wake_q.

I found that in the problematic commit, I had added the
wake_up_q() call in task_blocks_on_rt_mutex() around
__ww_mutex_add_waiter(), following a similar pattern in
__mutex_lock_common().

However, its just wrong. We haven't dropped the lock->wait_lock,
so its contrary to the point of the original patch. And it
didn't match the __mutex_lock_common() logic of re-initializing
the wake_q after calling it midway in the stack.

Looking at it now, the wake_up_q() call is incorrect and should
just be removed. So drop the erronious logic I had added.

Fixes: 894d1b3db41c ("locking/mutex: Remove wakeups from under mutex::wait_lock")
Closes: https://lore.kernel.org/lkml/6afb936f-17c7-43fa-90e0-b9e780866097@app.fastmail.com/
Reported-by: Anders Roxell
Reported-by: Arnd Bergmann
Signed-off-by: John Stultz
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Juri Lelli
Tested-by: Anders Roxell
Tested-by: K Prateek Nayak
Link: https://lore.kernel.org/r/20241114190051.552665-1-jstultz@google.com
Signed-off-by: Greg Kroah-Hartman

bpf: use kvfree() for replaced sysctl write buffer

2026-07-04T11:43:33+00:00

commit 4c21b5927d4364bfe7365f2700da5fea0ed0d004 upstream.

proc_sys_call_handler() allocates its temporary sysctl buffer with
kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since
kvzalloc() may fall back to vmalloc() for large allocations, freeing
that buffer with kfree() is wrong and can corrupt memory.

Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc
allocations.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc5.

Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with
KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the
test tree also included the accompanying fix for the stale ret == 1
check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines
failslab injections to the proc_sys_call_handler() range, uses
stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to
/proc/sys/kernel/domainname from a task in the target cgroup. Under
that setup, fail-nth=1 triggered the fault:

  BUG: unable to handle page fault for address: ffffeb0200024d48
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000  SMP KASAN NOPTI
  CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4  PREEMPT(lazy)
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:kfree+0x6e/0x510
  ...
  Call Trace:
   
   ? __cgroup_bpf_run_filter_sysctl+0x626/0xc30
   __cgroup_bpf_run_filter_sysctl+0x74d/0xc30
   ? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? __kvmalloc_node_noprof+0x345/0x870
   ? proc_sys_call_handler+0x250/0x480
   ? srso_return_thunk+0x5/0x5f
   proc_sys_call_handler+0x3a2/0x480
   ? __pfx_proc_sys_call_handler+0x10/0x10
   ? srso_return_thunk+0x5/0x5f
   ? selinux_file_permission+0x39f/0x500
   ? srso_return_thunk+0x5/0x5f
   ? lock_is_held_type+0x9e/0x120
   vfs_write+0x98e/0x1000
   ...
   

With this fix applied on top of the same test setup, rerunning the
reproducer with fail-nth=1 yields no corresponding Oops reports.

Fixes: 4508943794ef ("proc: use kvzalloc for our kernel buffer")
Cc: stable@vger.kernel.org

Reviewed-by: Emil Tsalapatis 
Reviewed-by: Jiayuan Chen 
Acked-by: Yonghong Song 
Signed-off-by: Zilin Guan 
Signed-off-by: Dawei Feng 
Link: https://lore.kernel.org/r/20260603105317.944304-3-dawei.feng@seu.edu.cn
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Greg Kroah-Hartman

block: invalidate cached plug timestamp after task switch

2026-07-04T11:43:32+00:00

commit fad156c2af227f42ca796cbb20ddc354a6dd9932 upstream.

blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime
and marks the task with PF_BLOCK_TS. That cache is only valid while the
task keeps running; if the task is switched out, wall-clock time
advances and the cached value must not be reused when the task runs again.

The existing invalidation covers explicit plug flushes through
__blk_flush_plug(), and the schedule() / rtmutex paths through
sched_update_worker(). It does not cover in-kernel preemption paths such
as preempt_schedule(), preempt_schedule_notrace(), and
preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and
return without calling sched_update_worker().

As a result, a task preempted while holding a plug with PF_BLOCK_TS set
can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost
then consumes that stale timestamp through ioc_now(), producing stale vnow
values for throttle decisions, and through ioc_rqos_done(), inflating
on-queue time and feeding false missed-QoS samples into vrate
adjustment.

Move the schedule-side invalidation to finish_task_switch(), which runs
for the scheduled-in task after every actual context switch regardless
of which schedule entry point was used. Keep __blk_flush_plug() as the
explicit flush/finish-plug invalidation path, and remove only the
PF_BLOCK_TS handling from sched_update_worker().

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif 
Link: https://patch.msgid.link/20260616141604.328820-3-usama.arif@linux.dev
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

kernel/fork: clear PF_BLOCK_TS in copy_process()

2026-07-04T11:43:32+00:00

commit fd38b75c4b43295b10d69772a46d1c74dbd6fc81 upstream.

PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is
non-NULL, and blk_finish_plug() clears it via __blk_flush_plug()
before NULLing the plug pointer.  copy_process() breaks the
invariant by inheriting PF_BLOCK_TS from the parent while resetting
the child's plug to NULL.

Clear PF_BLOCK_TS alongside that assignment so callers can rely on
"PF_BLOCK_TS set implies current->plug != NULL" and dereference
current->plug unguarded.

Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption")
Cc: stable@vger.kernel.org
Signed-off-by: Usama Arif 
Link: https://patch.msgid.link/20260616141604.328820-2-usama.arif@linux.dev
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

ftrace: Check against is_kernel_text() instead of kaslr_offset()

2026-07-04T11:43:28+00:00

[ Upstream commit da0f622b344be769ed61e7c1caf95cd0cdb47964 ]

As kaslr_offset() is architecture dependent and also may not be defined by
all architectures, when zeroing out unused weak functions, do not check
against kaslr_offset(), but instead check if the address is within the
kernel text sections. If KASLR added a shift to the zeroed out function,
it would still not be located in the kernel text. This is a more robust
way to test if the text is valid or not.

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Masahiro Yamada 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: "Arnd Bergmann" 
Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org
Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table")
Reported-by: Nathan Chancellor 
Reported-by: Mark Brown 
Tested-by: Nathan Chancellor 
Closes: https://lore.kernel.org/all/20250224180805.GA1536711@ax162/
Closes: https://lore.kernel.org/all/5225b07b-a9b2-4558-9d5f-aa60b19f6317@sirena.org.uk/
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman

ftrace: Test mcount_loc addr before calling ftrace_call_addr()

2026-07-04T11:43:28+00:00

[ Upstream commit 6eeca746fa5f1dd03c6ee05cb03f5eb1ddda1c81 ]

The addresses in the mcount_loc can be zeroed and then moved by KASLR
making them invalid addresses. ftrace_call_addr() for ARM 64 expects a
valid address to kernel text. If the addr read from the mcount_loc section
is invalid, it must not call ftrace_call_addr(). Move the addr check
before calling ftrace_call_addr() in ftrace_process_locs().

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Masahiro Yamada 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Brown 
Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org
Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table")
Reported-by: Nathan Chancellor 
Reported-by: "Arnd Bergmann" 
Tested-by: Nathan Chancellor 
Closes: https://lore.kernel.org/all/20250225025631.GA271248@ax162/
Closes: https://lore.kernel.org/all/91523154-072b-437b-bbdc-0b70e9783fd0@app.fastmail.com/
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman

ftrace: Do not over-allocate ftrace memory

2026-07-04T11:43:28+00:00

[ Upstream commit be55257fab181b93af38f8c4b1b3cb453a78d742 ]

The pg_remaining calculation in ftrace_process_locs() assumes that
ENTRIES_PER_PAGE multiplied by 2^order equals the actual capacity of the
allocated page group. However, ENTRIES_PER_PAGE is PAGE_SIZE / ENTRY_SIZE
(integer division). When PAGE_SIZE is not a multiple of ENTRY_SIZE (e.g.
4096 / 24 = 170 with remainder 16), high-order allocations (like 256 pages)
have significantly more capacity than 256 * 170. This leads to pg_remaining
being underestimated, which in turn makes skip (derived from skipped -
pg_remaining) larger than expected, causing the WARN(skip != remaining)
to trigger.

Extra allocated pages for ftrace: 2 with 654 skipped
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7295 ftrace_process_locs+0x5bf/0x5e0

A similar problem in ftrace_allocate_records() can result in allocating
too many pages. This can trigger the second warning in
ftrace_process_locs().

Extra allocated pages for ftrace
WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:7276 ftrace_process_locs+0x548/0x580

Use the actual capacity of a page group to determine the number of pages
to allocate. Have ftrace_allocate_pages() return the number of allocated
pages to avoid having to calculate it. Use the actual page group capacity
when validating the number of unused pages due to skipped entries.
Drop the definition of ENTRIES_PER_PAGE since it is no longer used.

Cc: stable@vger.kernel.org
Fixes: 4a3efc6baff93 ("ftrace: Update the mcount_loc check of skipped entries")
Link: https://patch.msgid.link/20260113152243.3557219-1-linux@roeck-us.net
Signed-off-by: Guenter Roeck 
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman

ftrace: Have ftrace pages output reflect freed pages

2026-07-04T11:43:28+00:00

[ Upstream commit 264143c4e54412095f4b615e65bf736fc3c60af0 ]

The amount of memory that ftrace uses to save the descriptors to manage
the functions it can trace is shown at output. But if there are a lot of
functions that are skipped because they were weak or the architecture
added holes into the tables, then the extra pages that were allocated are
freed. But these freed pages are not reflected in the numbers shown, and
they can even be inconsistent with what is reported:

 ftrace: allocating 57482 entries in 225 pages
 ftrace: allocated 224 pages with 3 groups

The above shows the number of original entries that are in the mcount_loc
section and the pages needed to save them (225), but the second output
reflects the number of pages that were actually used. The two should be
consistent as:

 ftrace: allocating 56739 entries in 224 pages
 ftrace: allocated 224 pages with 3 groups

The above also shows the accurate number of entires that were actually
stored and does not include the entries that were removed.

Cc: bpf 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Masahiro Yamada 
Cc: Nathan Chancellor 
Cc: Nicolas Schier 
Cc: Zheng Yejian 
Cc: Martin  Kelly 
Cc: Christophe Leroy 
Cc: Josh Poimboeuf 
Cc: Heiko Carstens 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Link: https://lore.kernel.org/20250218200023.221100846@goodmis.org
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman

ftrace: Update the mcount_loc check of skipped entries

2026-07-04T11:43:28+00:00

[ Upstream commit 4a3efc6baff931da9a85c6d2e42c87bd9a827399 ]

Now that weak functions turn into skipped entries, update the check to
make sure the amount that was allocated would fit both the entries that
were allocated as well as those that were skipped.

Cc: bpf 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Masahiro Yamada 
Cc: Nathan Chancellor 
Cc: Nicolas Schier 
Cc: Zheng Yejian 
Cc: Martin  Kelly 
Cc: Christophe Leroy 
Cc: Josh Poimboeuf 
Cc: Heiko Carstens 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Link: https://lore.kernel.org/20250218200023.055162048@goodmis.org
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman

scripts/sorttable: Zero out weak functions in mcount_loc table

2026-07-04T11:43:28+00:00

[ Upstream commit ef378c3b8233855497a414b9d67bf22592c928a4 ]

When a function is annotated as "weak" and is overridden, the code is not
removed. If it is traced, the fentry/mcount location in the weak function
will be referenced by the "__mcount_loc" section. This will then be added
to the available_filter_functions list. Since only the address of the
functions are listed, to find the name to show, a search of kallsyms is
used.

Since kallsyms will return the function by simply finding the function
that the address is after but before the next function, an address of a
weak function will show up as the function before it. This is because
kallsyms does not save names of weak functions. This has caused issues in
the past, as now the traced weak function will be listed in
available_filter_functions with the name of the function before it.

At best, this will cause the previous function's name to be listed twice.
At worse, if the previous function was marked notrace, it will now show up
as a function that can be traced. Note that it only shows up that it can
be traced but will not be if enabled, which causes confusion.

 https://lore.kernel.org/all/20220412094923.0abe90955e5db486b7bca279@kernel.org/

The commit b39181f7c6907 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid
adding weak function") was a workaround to this by checking the function
address before printing its name. If the address was too far from the
function given by the name then instead of printing the name it would
print: __ftrace_invalid_address___

The real issue is that these invalid addresses are listed in the ftrace
table look up which available_filter_functions is derived from. A place
holder must be listed in that file because set_ftrace_filter may take a
series of indexes into that file instead of names to be able to do O(1)
lookups to enable filtering (many tools use this method).

Even if kallsyms saved the size of the function, it does not remove the
need of having these place holders. The real solution is to not add a weak
function into the ftrace table in the first place.

To solve this, the sorttable.c code that sorts the mcount regions during
the build is modified to take a "nm -S vmlinux" input, sort it, and any
function listed in the mcount_loc section that is not within a boundary of
the function list given by nm is considered a weak function and is zeroed
out.

Note, this does not mean they will remain zero when booting as KASLR
will still shift those addresses. To handle this, the entries in the
mcount_loc section will be ignored if they are zero or match the
kaslr_offset() value.

Before:

 ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l
 551

After:

 ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l
 0

Cc: bpf 
Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: Masahiro Yamada 
Cc: Nathan Chancellor 
Cc: Nicolas Schier 
Cc: Zheng Yejian 
Cc: Martin  Kelly 
Cc: Christophe Leroy 
Cc: Josh Poimboeuf 
Cc: Heiko Carstens 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Vasily Gorbik 
Cc: Alexander Gordeev 
Link: https://lore.kernel.org/20250218200022.883095980@goodmis.org
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Andrey Grodzovsky 
Signed-off-by: Greg Kroah-Hartman