summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-04-24sched_ext: Defer scx_hardlockup() out of NMITejun Heo
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(), which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing it from NMI context can lead to deadlocks. The hardlockup handler is best-effort recovery and the disable path it triggers runs off of irq_work anyway. Move the handle_lockup() call into an irq_work so it runs in IRQ context. Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-04-24Merge tag 'trace-ring-buffer-v7.1-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fix from Steven Rostedt: - Fix accounting of persistent ring buffer rewind On boot up, the head page is moved back to the earliest point of the saved ring buffer. This is because the ring buffer being read by user space on a crash may not save the part it read. Rewinding the head page back to the earliest saved position helps keep those events from being lost. The number of events is also read during boot up and displayed in the stats file in the tracefs directory. It's also used for other accounting as well. On boot up, the "reader page" is accounted for but a rewind may put it back into the buffer and then the reader page may be accounted for again. Save off the original reader page and skip accounting it when scanning the pages in the ring buffer. * tag 'trace-ring-buffer-v7.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not double count the reader_page
2026-04-24ring-buffer: Do not double count the reader_pageMasami Hiramatsu (Google)
Since the cpu_buffer->reader_page is updated if there are unwound pages. After that update, we should skip the page if it is the original reader_page, because the original reader_page is already checked. Cc: stable@vger.kernel.org Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177701353063.2223789.1471163147644103306.stgit@mhiramat.tok.corp.google.com Fixes: ca296d32ece3 ("tracing: ring_buffer: Rewind persistent ring buffer on reboot") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-24sched_ext: sync disable_irq_work in bpf_scx_unreg()Richard Cheng
When unregistered my self-written scx scheduler, the following panic occurs. [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8! [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full) [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107 [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms [ 230.116843] pc : 0xffff80009bc2c1f8 [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0 [ 230.217749] Call trace: [ 230.228515] 0xffff80009bc2c1f8 (P) [ 230.232077] dequeue_task+0x84/0x188 [ 230.235728] sched_change_begin+0x1dc/0x250 [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240 [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0 [ 230.249701] ___migrate_enable+0x4c/0xa0 [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0 [ 230.258246] process_one_work+0x184/0x540 [ 230.262342] worker_thread+0x19c/0x348 [ 230.266170] kthread+0x13c/0x150 [ 230.269465] ret_from_fork+0x10/0x20 [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000) [ 230.287621] ---[ end trace 0000000000000000 ]--- [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt The root cause is that the JIT page backing ops->quiescent() is freed before all callers of that function have stopped. The expected ordering during teardown is: bitmap_zero(sch->has_op) + synchronize_rcu() -> guarantees no CPU will ever call sch->ops.* again -> only THEN free the BPF struct_ops JIT page bpf_scx_unreg() is supposed to enforce the order, but after commit f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work"), disable_work is no longer queued directly, causing kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops map too early and poisoned with AARCH64_BREAK_FAULT before disable_workfn ever execute. So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent) as true and calls ops.quiescent, which hit on the poisoned page and BRK panic. Add a helper scx_flush_disable_work() so the future use cases that want to flush disable_work can use it. Also amend the call for scx_root_enable_workfn() and scx_sub_enable_workfn() which have similar pattern in the error path. Fixes: f4a6c506d118 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Richard Cheng <icheng@nvidia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-24Merge tag 'locking-urgent-2026-04-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: - Fix ww_mutex regression, which caused hangs/pauses in some DRM drivers - Fix rtmutex proxy-rollback bug * tag 'locking-urgent-2026-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/mutex: Fix ww_mutex wait_list operations rtmutex: Use waiter::task instead of current in remove_waiter()
2026-04-23cgroup: Increment nr_dying_subsys_* from rmdir contextPetr Malat
Incrementing nr_dying_subsys_* in offline_css(), which is executed by cgroup_offline_wq worker, leads to a race where user can see the value to be 0 if he reads cgroup.stat after calling rmdir and before the worker executes. This makes the user wrongly expect resources released by the removed cgroup to be available for a new assignment. Increment nr_dying_subsys_* from kill_css(), which is called from the cgroup_rmdir() context. Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat") Signed-off-by: Petr Malat <oss@malat.biz> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-schedzhidao su
local_dsq_post_enq() calls call_task_dequeue() with scx_root instead of the scheduler instance actually managing the task. When CONFIG_EXT_SUB_SCHED is enabled, tasks may be managed by a sub-scheduler whose ops.dequeue() callback differs from root's. Using scx_root causes the wrong scheduler's ops.dequeue() to be consulted: sub-sched tasks dispatched to a local DSQ via scx_bpf_dsq_move_to_local() will have SCX_TASK_IN_CUSTODY cleared but the sub-scheduler's ops.dequeue() is never invoked, violating the custody exit semantics. Fix by adding a 'struct scx_sched *sch' parameter to local_dsq_post_enq() and move_local_task_to_local_dsq(), and propagating the correct scheduler from their callers dispatch_enqueue(), move_task_between_dsqs(), and consume_dispatch_q(). This is consistent with dispatch_enqueue()'s non-local path which already passes 'sch' directly to call_task_dequeue() for global/bypass DSQs. Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics") Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-23locking/mutex: Fix ww_mutex wait_list operationsPeter Zijlstra
Chaitanya, John and Mikhail reported commit 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex") wrecked ww_mutex. Specifically there were 2 issues: - __ww_waiter_prev() had the termination condition wrong; it would terminate when the previous entry was the first, which results in a truncated iteration: W3, W2, (no W1). - __mutex_add_waiter(@pos != NULL), as used by __ww_waiter_add() / __ww_mutex_add_waiter(); this inserts @waiter before @pos (which is what list_add_tail() does). But this should then also update lock->first_waiter. Much thanks to Prateek for spotting the __mutex_add_waiter() issue! Fixes: 25500ba7e77c ("locking/mutex: Remove the list_head from struct mutex") Reported-by: "Borah, Chaitanya Kumar" <chaitanya.kumar.borah@intel.com> Closes: https://lore.kernel.org/r/af005996-05e9-4336-8450-d14ca652ba5d%40intel.com Reported-by: John Stultz <jstultz@google.com> Closes: https://lore.kernel.org/r/CANDhNCq%3Doizzud3hH3oqGzTrcjB8OwGeineJ3mwZuGdDWG8fRQ%40mail.gmail.com Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Closes: https://lore.kernel.org/r/CABXGCsO5fKq2nD9nO8yO1z50ZzgCPWqueNXHANjntaswoOh2Dg@mail.gmail.com Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Link: https://patch.msgid.link/20260422092335.GH3102924%40noisy.programming.kicks-ass.net
2026-04-22Merge tag 'trace-ring-buffer-v7.1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fix from Steven Rostedt: - Make undefsyms_base.c into a real file The file undefsyms_base.c is used to catch any symbols used by a remote ring buffer that is made for use of a pKVM hypervisor. As it doesn't share the same text as the rest of the kernel, referencing any symbols within the kernel will make it fail to be built for the standalone hypervisor. A file was created by the Makefile that checked for any symbols that could cause issues. There's no reason to have this file created by the Makefile, just create it as a normal file instead. * tag 'trace-ring-buffer-v7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Make undefsyms_base.c a first-class citizen
2026-04-22Merge tag 'kgdb-7.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux Pull kgdb update from Daniel Thompson: "Only a very small update for kgdb this cycle: a single patch from Kexin Sun that fixes some outdated comments" * tag 'kgdb-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux: kgdb: update outdated references to kgdb_wait()
2026-04-22tracing: Make undefsyms_base.c a first-class citizenPaolo Bonzini
Linus points out that dumping undefsyms_base.c form the Makefile is rather ugly, and that a much better course of action would be to have this file as a first-class citizen in the git tree. This allows some extra cleanup in the Makefile, and the removal of the .gitignore file in kernel/trace. Cc: Marc Zyngier <maz@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/CAHk-=wieqGd_XKpu8UxDoyADZx8TDe8CF3RmkUXt5N_9t5Pf_w@mail.gmail.com Link: https://lore.kernel.org/all/20260421095446.2951646-1-maz@kernel.org/ Link: https://patch.msgid.link/20260421100455.324333-1-pbonzini@redhat.com Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-04-22tracing/fprobe: Fix to unregister ftrace_ops if it is empty on module unloadingMasami Hiramatsu (Google)
Fix fprobe to unregister ftrace_ops if corresponding type of fprobe does not exist on the fprobe_ip_table and it is expected to be empty when unloading modules. Since ftrace thinks that the empty hash means everything to be traced, if we set fprobes only on the unloaded module, all functions are traced unexpectedly after unloading module. e.g. # modprobe xt_LOG.ko # echo 'f:test log_tg*' > dynamic_events # echo 1 > events/fprobes/test/enable # cat enabled_functions log_tg [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490 log_tg_check [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490 log_tg_destroy [xt_LOG] (1) tramp: 0xffffffffa0004000 (fprobe_ftrace_entry+0x0/0x490) ->fprobe_ftrace_entry+0x0/0x490 # rmmod xt_LOG # wc -l enabled_functions 34085 enabled_functions Link: https://lore.kernel.org/all/177669368776.132053.10042301916765771279.stgit@mhiramat.tok.corp.google.com/ Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21kgdb: update outdated references to kgdb_wait()Kexin Sun
The function kgdb_wait() was folded into the static function kgdb_cpu_enter() by commit 62fae312197a ("kgdb: eliminate kgdb_wait(), all cpus enter the same way"). Update the four stale references accordingly: - include/linux/kgdb.h and arch/x86/kernel/kgdb.c: the kgdb_roundup_cpus() kdoc describes what other CPUs are rounded up to call. Because kgdb_cpu_enter() is static, the correct public entry point is kgdb_handle_exception(); also fix a pre-existing grammar error ("get them be" -> "get them into") and reflow the text. - kernel/debug/debug_core.c: replace with the generic description "the debug trap handler", since the actual entry path is architecture-specific. - kernel/debug/gdbstub.c: kgdb_cpu_enter() is correct here (it describes internal state, not a call target); add the missing parentheses. Suggested-by: Daniel Thompson <daniel@riscstar.com> Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
2026-04-22tracing/fprobe: Check the same type fprobe on table as the unregistered oneMasami Hiramatsu (Google)
Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") introduced a different ftrace_ops for entry-only fprobes. However, when unregistering an fprobe, the kernel only checks if another fprobe exists at the same address, without checking which type of fprobe it is. If different fprobes are registered at the same address, the same address will be registered in both fgraph_ops and ftrace_ops, but only one of them will be deleted when unregistering. (the one removed first will not be deleted from the ops). This results in junk entries remaining in either fgraph_ops or ftrace_ops. For example: ======= cd /sys/kernel/tracing # 'Add entry and exit events on the same place' echo 'f:event1 vfs_read' >> dynamic_events echo 'f:event2 vfs_read%return' >> dynamic_events # 'Enable both of them' echo 1 > events/fprobes/enable cat enabled_functions vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210 # 'Disable and remove exit event' echo 0 > events/fprobes/event2/enable echo -:event2 >> dynamic_events # 'Disable and remove all events' echo 0 > events/fprobes/enable echo > dynamic_events # 'Add another event' echo 'f:event3 vfs_open%return' > dynamic_events cat dynamic_events f:fprobes/event3 vfs_open%return echo 1 > events/fprobes/enable cat enabled_functions vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} ======= As you can see, an entry for the vfs_read remains. To fix this issue, when unregistering, the kernel should also check if there is the same type of fprobes still exist at the same address, and if not, delete its entry from either fgraph_ops or ftrace_ops. Link: https://lore.kernel.org/all/177669367993.132053.10553046138528674802.stgit@mhiramat.tok.corp.google.com/ Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-22tracing/fprobe: Avoid kcalloc() in rcu_read_lock sectionMasami Hiramatsu (Google)
fprobe_remove_node_in_module() is called under RCU read locked, but this invokes kcalloc() if there are more than 8 fprobes installed on the module. Sashiko warns it because kcalloc() can sleep [1]. [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com To fix this issue, expand the batch size to 128 and do not expand the fprobe_addr_list, but just cancel walking on fprobe_ip_table, update fgraph/ftrace_ops and retry the loop again. Link: https://lore.kernel.org/all/177669367206.132053.1493637946869032744.stgit@mhiramat.tok.corp.google.com/ Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21tracing/fprobe: Remove fprobe from hash in failure pathMasami Hiramatsu (Google)
When register_fprobe_ips() fails, it tries to remove a list of fprobe_hash_node from fprobe_ip_table, but it missed to remove fprobe itself from fprobe_table. Moreover, when removing the fprobe_hash_node which is added to rhltable once, it must use kfree_rcu() after removing from rhltable. To fix these issues, this reuses unregister_fprobe() internal code to rollback the half-way registered fprobe. Link: https://lore.kernel.org/all/177669366417.132053.17874946321744910456.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21tracing/fprobe: Unregister fprobe even if memory allocation failsMasami Hiramatsu (Google)
unregister_fprobe() can fail under memory pressure because of memory allocation failure, but this maybe called from module unloading, and usually there is no way to retry it. Moreover. trace_fprobe does not check the return value. To fix this problem, unregister fprobe and fprobe_hash_node even if working memory allocation fails. Anyway, if the last fprobe is removed, the filter will be freed. Link: https://lore.kernel.org/all/177669365629.132053.8433032896213721288.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-21tracing/fprobe: Reject registration of a registered fprobe before initMasami Hiramatsu (Google)
Reject registration of a registered fprobe which is on the fprobe hash table before initializing fprobe. The add_fprobe_hash() checks this re-register fprobe, but since fprobe_init() clears hlist_array field, it is too late to check it. It has to check the re-registration before touncing fprobe. Link: https://lore.kernel.org/all/177669364845.132053.18375367916162315835.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba160409 ("fprobe: Rewrite fprobe on function-graph tracer") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-04-20tracing: tell git to ignore the generated 'undefsyms_base.c' fileLinus Torvalds
This odd file was added to automatically figure out tool-generated symbols. Honestly, it *should* have been just a real honest-to-goodness regular file in git, instead of having strange code to generate it in the Makefile, but that is not how that silly thing works. So now we need to ignore it explicitly. Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer") Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-04-20Merge tag 'printk-for-7.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Fix printk ring buffer initialization and sanity checks - Workaround printf kunit test compilation with gcc < 12.1 - Add IPv6 address printf format tests - Misc code and documentation cleanup * tag 'printk-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printf: Compile the kunit test with DISABLE_BRANCH_PROFILING DISABLE_BRANCH_PROFILING lib/vsprintf: use bool for local decode variable lib/hexdump: print_hex_dump_bytes() calls print_hex_dump_debug() printk: ringbuffer: fix errors in comments printk_ringbuffer: Add sanity check for 0-size data printk_ringbuffer: Fix get_data() size sanity check printf: add IPv6 address format tests printk: Fix _DESCS_COUNT type for 64-bit systems
2026-04-20Merge tag 'timers-urgent-2026-04-20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix timer stalls caused by incorrect handling of the dev->next_event_forced flag" * tag 'timers-urgent-2026-04-20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: clockevents: Add missing resets of the next_event_forced flag
2026-04-21rtmutex: Use waiter::task instead of current in remove_waiter()Keenan Dong
remove_waiter() is used by the slowlock paths, but it is also used for proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from futex_requeue(). In the latter case waiter::task is not current, but remove_waiter() operates on current for the dequeue operation. That results in several problems: 1) the rbtree dequeue happens without waiter::task::pi_lock being held 2) the waiter task's pi_blocked_on state is not cleared, which leaves a dangling pointer primed for UAF around. 3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter task Use waiter::task instead of current in all related operations in remove_waiter() to cure those problems. [ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the changelog ] Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org
2026-04-20sched_ext: Deny SCX kfuncs to non-SCX struct_ops programsCheng-Yang Chou
scx_kfunc_context_filter() currently allows non-SCX struct_ops programs (e.g. tcp_congestion_ops) to call SCX unlocked kfuncs. This is wrong for two reasons: - It is semantically incorrect: a TCP congestion control program has no business calling SCX kfuncs such as scx_bpf_kick_cpu(). - With CONFIG_EXT_SUB_SCHED=y, kfuncs like scx_bpf_kick_cpu() call scx_prog_sched(aux), which invokes bpf_prog_get_assoc_struct_ops(aux) and casts the result to struct sched_ext_ops * before reading ops->priv. For a non-SCX struct_ops program the returned pointer is the kdata of that struct_ops type, which is far smaller than sched_ext_ops, making the read an out-of-bounds access (confirmed with KASAN). Extend the filter to cover scx_kfunc_set_any and scx_kfunc_set_idle as well, and deny all SCX kfuncs for any struct_ops program that is not the SCX struct_ops. This addresses both issues: the semantic contract is enforced at the verifier level, and the runtime out-of-bounds access becomes unreachable. Fixes: d1d3c1c6ae36 ("sched_ext: Add verifier-time kfunc context filter") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-20Merge branch 'rework/prb-fixes' into for-linusPetr Mladek
2026-04-19sched_ext: Mark scx_sched_hash insecure_elasticityTejun Heo
scx_sched_hash is inserted into under scx_sched_lock (raw_spinlock_irq) in scx_link_sched(). rhashtable's sync grow path calls get_random_u32() and does a GFP_ATOMIC allocation; both acquire regular spinlocks, which is unsafe under raw_spinlock_t. Set insecure_elasticity to skip the sync grow. v2: - Dropped dsq_hash changes. Insertion is not under raw_spin_lock. - Switched from no_sync_grow flag to insecure_elasticity. Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers") Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-19Merge tag 'mm-stable-2026-04-18-02-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-18Merge tag 'memblock-v7.1-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - improve debuggability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved - Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. * tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/alternative: delay freeing of smp_locks section memblock: warn when freeing reserved memory before memory map is initialized memblock, treewide: make memblock_free() handle late freeing memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y memblock: extract page freeing from free_reserved_area() into a helper memblock: make free_reserved_area() more robust mm: move free_reserved_area() to mm/memblock.c powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact() powerpc: fadump: pair alloc_pages_exact() with free_pages_exact() memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name() memblock: move reserve_bootmem_range() to memblock.c and make it static memblock: Add reserve_mem debugfs info memblock: Print out errors on reserve_mem parser
2026-04-18liveupdate: defer file handler module refcounting to active sessionsPasha Tatashin
Stop pinning modules indefinitely upon file handler registration. Instead, dynamically increment the module reference count only when a live update session actively uses the file handler (e.g., during preservation or deserialization), and release it when the session ends. This allows modules providing live update handlers to be gracefully unloaded when no live update is in progress. Link: https://lore.kernel.org/20260327033335.696621-11-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: make unregister functions return voidPasha Tatashin
Change liveupdate_unregister_file_handler and liveupdate_unregister_flb to return void instead of an error code. This follows the design principle that unregistration during module unload should not fail, as the unload cannot be stopped at that point. Link: https://lore.kernel.org/20260327033335.696621-10-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: remove liveupdate_test_unregister()Pasha Tatashin
Now that file handler unregistration automatically unregisters all associated file handlers (FLBs), the liveupdate_test_unregister() function is no longer needed. Remove it along with its usages and declarations. Link: https://lore.kernel.org/20260327033335.696621-9-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: auto unregister FLBs on file handler unregistrationPasha Tatashin
To ensure that unregistration is always successful and doesn't leave dangling resources, introduce auto-unregistration of FLBs: when a file handler is unregistered, all FLBs associated with it are automatically unregistered. Introduce a new helper luo_flb_unregister_all() which unregisters all FLBs linked to the given file handler. Link: https://lore.kernel.org/20260327033335.696621-8-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: remove luo_session_quiesce()Pasha Tatashin
Now that FLB module references are handled dynamically during active sessions, we can safely remove the luo_session_quiesce() and luo_session_resume() mechanism. Link: https://lore.kernel.org/20260327033335.696621-7-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: defer FLB module refcounting to active sessionsPasha Tatashin
Stop pinning modules indefinitely upon FLB registration. Instead, dynamically take a module reference when the FLB is actively used in a session (e.g., during preserve and retrieve) and release it when the session concludes. This allows modules providing FLB operations to be cleanly unloaded when not in active use by the live update orchestrator. Link: https://lore.kernel.org/20260327033335.696621-6-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: protect FLB lists with luo_register_rwlockPasha Tatashin
Because liveupdate FLB objects will soon drop their persistent module references when registered, list traversals must be protected against concurrent module unloading. To provide this protection, utilize the global luo_register_rwlock. It protects the global registry of FLBs and the handler's specific list of FLB dependencies. Read locks are used during concurrent list traversals (e.g., during preservation and serialization). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-5-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: protect file handler list with rwsemPasha Tatashin
Because liveupdate file handlers will no longer hold a module reference when registered, we must ensure that the access to the handler list is protected against concurrent module unloading. Utilize the global luo_register_rwlock to protect the global registry of file handlers. Read locks are taken during list traversals in luo_preserve_file() and luo_file_deserialize(). Write locks are taken during registration and unregistration. Link: https://lore.kernel.org/20260327033335.696621-4-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: synchronize lazy initialization of FLB private statePasha Tatashin
The luo_flb_get_private() function, which is responsible for lazily initializing the private state of FLB objects, can be called concurrently from multiple threads. This creates a data race on the 'initialized' flag and can lead to multiple executions of mutex_init() and INIT_LIST_HEAD() on the same memory. Introduce a static spinlock (luo_flb_init_lock) local to the function to synchronize the initialization path. Use smp_load_acquire() and smp_store_release() for memory ordering between the fast path and the slow path. Link: https://lore.kernel.org/20260327033335.696621-3-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: safely print untrusted stringsPasha Tatashin
Patch series "liveupdate: Fix module unloading and unregister API", v3. This patch series addresses an issue with how LUO handles module reference counting and unregistration during a module unload (e.g., via rmmod). Currently, modules that register live update file handlers are pinned for the entire duration they are registered. This prevents the modules from being unloaded gracefully, even when no live update session is in progress. Furthermore, if a module is forcefully unloaded, the unregistration functions return an error (e.g. -EBUSY) if a session is active, which is ignored by the kernel's module unload path, leaving dangling pointers in the LUO global lists. To resolve these issues, this series introduces the following changes: 1. Adds a global read-write semaphore (luo_register_rwlock) to protect the registration lists for both file handlers and FLBs. 2. Reduces the scope of module reference counting for file handlers and FLBs. Instead of pinning modules indefinitely upon registration, references are now taken only when they are actively used in a live update session (e.g., during preservation, retrieval, or deserialization). 3. Removes the global luo_session_quiesce() mechanism since module unload behavior now handles active sessions implicitly. 4. Introduces auto-unregistration of FLBs during file handler unregistration to prevent leaving dangling resources. 5. Changes the unregistration functions to return void instead of an error code. 6. Fixes a data race in luo_flb_get_private() by introducing a spinlock for thread-safe lazy initialization. 7. Strengthens security by using %.*s when printing untrusted deserialized compatible strings and session names to prevent out-of-bounds reads. This patch (of 10): Deserialized strings from KHO data (such as file handler compatible strings and session names) are provided by the previous kernel and might not be null-terminated if the data is corrupted or maliciously crafted. When printing these strings in error messages, use the %.*s format specifier with the maximum buffer size to prevent out-of-bounds reads into adjacent kernel memory. Link: https://lore.kernel.org/20260327033335.696621-1-pasha.tatashin@soleen.com Link: https://lore.kernel.org/20260327033335.696621-2-pasha.tatashin@soleen.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Samiullah Khawaja <skhawaja@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18liveupdate: prevent double management of filesPasha Tatashin
Patch series "liveupdate: prevent double preservation", v4. Currently, LUO does not prevent the same file from being managed twice across different active sessions. Because LUO preserves files of absolutely different types: memfd, and upcoming vfiofd [1], iommufd [2], guestmefd (and possible kvmfd/cpufd). There is no common private data or guarantee on how to prevent that the same file is not preserved twice beside using inode or some slower and expensive method like hashtables. This patch (of 4) Currently, LUO does not prevent the same file from being managed twice across different active sessions. Use a global xarray luo_preserved_files to keep track of file identifiers being preserved by LUO. Update luo_preserve_file() to check and insert the file identifier into this xarray when it is preserved, and erase it in luo_file_unpreserve_files() when it is released. To allow handlers to define what constitutes a "unique" file (e.g., different struct file objects pointing to the same hardware resource), add a get_id() callback to struct liveupdate_file_ops. If not provided, the default identifier is the struct file pointer itself. This ensures that the same file (or resource) cannot be managed by multiple sessions. If another session attempts to preserve an already managed file, it will now fail with -EBUSY. Link: https://lore.kernel.org/20260326163943.574070-1-pasha.tatashin@soleen.com Link: https://lore.kernel.org/20260326163943.574070-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20260129212510.967611-1-dmatlack@google.com [1] Link: https://lore.kernel.org/all/20260203220948.2176157-1-skhawaja@google.com [2] Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Matlack <dmatlack@google.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18kho: kexec-metadata: track previous kernel chainBreno Leitao
Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example output: [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include the following commits: * commit eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * commit 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * commit 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Implementation ============== The kexec metadata is stored as a plain C struct (struct kho_kexec_metadata) rather than FDT format, for simplicity and direct field access. It is registered via kho_add_subtree() as a separate subtree, keeping it independent from the core KHO ABI. This design choice: - Keeps the core KHO ABI minimal and stable - Allows the metadata format to evolve independently - Avoids requiring version bumps for all KHO consumers (LUO, etc.) when the metadata format changes The struct kho_kexec_metadata contains two fields: - previous_release: The kernel version that initiated the kexec - kexec_count: Number of kexec boots since last cold boot On cold boot, kexec_count starts at 0 and increments with each kexec. The count helps identify issues that only manifest after multiple consecutive kexec reboots. [leitao@debian.org: call kho_kexec_metadata_init() for both boot paths] Link: https://lore.kernel.org/all/20260309-kho-v8-5-c3abcf4ac750@debian.org/ [1] Link: https://lore.kernel.org/20260409-kho_fix_merge_issue-v1-1-710c84ceaa85@debian.org Link: https://lore.kernel.org/20260316-kho-v9-5-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18kho: fix kho_in_debugfs_init() to handle non-FDT blobsBreno Leitao
kho_in_debugfs_init() calls fdt_totalsize() to determine blob sizes, which assumes all blobs are FDTs. This breaks for non-FDT blobs like struct kho_kexec_metadata. Fix this by reading the "blob-size" property from the FDT (persisted by kho_add_subtree()) instead of calling fdt_totalsize(). Also rename local variables from fdt_phys/sub_fdt to blob_phys/blob for consistency with the non-FDT-specific naming. Link: https://lore.kernel.org/20260316-kho-v9-4-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18kho: persist blob size in KHO FDTBreno Leitao
kho_add_subtree() accepts a size parameter but only forwards it to debugfs. The size is not persisted in the KHO FDT, so it is lost across kexec. This makes it impossible for the incoming kernel to determine the blob size without understanding the blob format. Store the blob size as a "blob-size" property in the KHO FDT alongside the "preserved-data" physical address. This allows the receiving kernel to recover the size for any blob regardless of format. Also extend kho_retrieve_subtree() with an optional size output parameter so callers can learn the blob size without needing to understand the blob format. Update all callers to pass NULL for the new parameter. Link: https://lore.kernel.org/20260316-kho-v9-3-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18kho: rename fdt parameter to blob in kho_add/remove_subtree()Breno Leitao
Since kho_add_subtree() now accepts arbitrary data blobs (not just FDTs), rename the parameter from 'fdt' to 'blob' to better reflect its purpose. Apply the same rename to kho_remove_subtree() for consistency. Also rename kho_debugfs_fdt_add() and kho_debugfs_fdt_remove() to kho_debugfs_blob_add() and kho_debugfs_blob_remove() respectively, with the same parameter rename from 'fdt' to 'blob'. Link: https://lore.kernel.org/20260316-kho-v9-2-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18kho: add size parameter to kho_add_subtree()Breno Leitao
Patch series "kho: history: track previous kernel version and kexec boot count", v9. Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time. Example ======= [ 0.000000] Linux version 6.19.0-rc3-upstream-00047-ge5d992347849 ... [ 0.000000] KHO: exec from: 6.19.0-rc4-next-20260107upstream-00004-g3071b0dc4498 (count 1) Motivation ========== Bugs that only reproduce when kexecing from specific kernel versions are difficult to diagnose. These issues occur when a buggy kernel kexecs into a new kernel, with the bug manifesting only in the second kernel. Recent examples include: * eb2266312507 ("x86/boot: Fix page table access in 5-level to 4-level paging transition") * 77d48d39e991 ("efistub/tpm: Use ACPI reclaim memory for event log to avoid corruption") * 64b45dd46e15 ("x86/efi: skip memattr table on kexec boot") As kexec-based reboots become more common, these version-dependent bugs are appearing more frequently. At scale, correlating crashes to the previous kernel version is challenging, especially when issues only occur in specific transition scenarios. Some bugs manifest only after multiple consecutive kexec reboots. Tracking the kexec count helps identify these cases (this metric is already used by live update sub-system). KHO provides a reliable mechanism to pass information between kernels. By carrying the previous kernel's release string and kexec count forward, we can print this context at boot time to aid debugging. The goal of this feature is to have this information being printed in early boot, so, users can trace back kernel releases in kexec. Systemd is not helpful because we cannot assume that the previous kernel has systemd or even write access to the disk (common when using Linux as bootloaders) This patch (of 6): kho_add_subtree() assumes the fdt argument is always an FDT and calls fdt_totalsize() on it in the debugfs code path. This assumption will break if a caller passes arbitrary data instead of an FDT. When CONFIG_KEXEC_HANDOVER_DEBUGFS is enabled, kho_debugfs_fdt_add() calls __kho_debugfs_fdt_add(), which executes: f->wrapper.size = fdt_totalsize(fdt); Fix this by adding an explicit size parameter to kho_add_subtree() so callers specify the blob size. This allows subtrees to contain arbitrary data formats, not just FDTs. Update all callers: - memblock.c: use fdt_totalsize(fdt) - luo_core.c: use fdt_totalsize(fdt_out) - test_kho.c: use fdt_totalsize() - kexec_handover.c (root fdt): use fdt_totalsize(kho_out.fdt) Also update __kho_debugfs_fdt_add() to receive the size explicitly instead of computing it internally via fdt_totalsize(). In kho_in_debugfs_init(), pass fdt_totalsize() for the root FDT and sub-blobs since all current users are FDTs. A subsequent patch will persist the size in the KHO FDT so the incoming side can handle non-FDT blobs correctly. Link: https://lore.kernel.org/20260323110747.193569-1-duanchenghao@kylinos.cn Link: https://lore.kernel.org/20260316-kho-v9-1-ed6dcd951988@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: SeongJae Park <sj@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: memcontrol: prepare for reparenting non-hierarchical statsQi Zheng
To resolve the dying memcg issue, we need to reparent LRU folios of child memcg to its parent memcg. This could cause problems for non-hierarchical stats. As Yosry Ahmed pointed out: In short, if memory is charged to a dying cgroup at the time of reparenting, when the memory gets uncharged the stats updates will occur at the parent. This will update both hierarchical and non-hierarchical stats of the parent, which would corrupt the parent's non-hierarchical stats (because those counters were never incremented when the memory was charged). Now we have the following two types of non-hierarchical stats, and they are only used in CONFIG_MEMCG_V1: a. memcg->vmstats->state_local[i] b. pn->lruvec_stats->state_local[i] To ensure that these non-hierarchical stats work properly, we need to reparent these non-hierarchical stats after reparenting LRU folios. To this end, this commit makes the following preparations: 1. implement reparent_state_local() to reparent non-hierarchical stats 2. make css_killed_work_fn() to be called in rcu work, and implement get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race between mod_memcg_state()/mod_memcg_lruvec_state() and reparent_state_local() Link: https://lore.kernel.org/e862995c45a7101a541284b6ebee5e5c32c89066.1772711148.git.zhengqi.arch@bytedance.com Co-developed-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Allen Pais <apais@linux.microsoft.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chen Ridong <chenridong@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Imran Khan <imran.f.khan@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamalesh Babulal <kamalesh.babulal@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-17Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: "Most of the diff stat comes from Xu Kuohai's fix to emit ENDBR/BTI, since all JITs had to be touched to move constant blinding out and pass bpf_verifier_env in. - Fix use-after-free in arena_vm_close on fork (Alexei Starovoitov) - Dissociate struct_ops program with map if map_update fails (Amery Hung) - Fix out-of-range and off-by-one bugs in arm64 JIT (Daniel Borkmann) - Fix precedence bug in convert_bpf_ld_abs alignment check (Daniel Borkmann) - Fix arg tracking for imprecise/multi-offset in BPF_ST/STX insns (Eduard Zingerman) - Copy token from main to subprogs to fix missing kallsyms (Eduard Zingerman) - Prevent double close and leak of btf objects in libbpf (Jiri Olsa) - Fix af_unix null-ptr-deref in sockmap (Michal Luczaj) - Fix NULL deref in map_kptr_match_type for scalar regs (Mykyta Yatsenko) - Avoid unnecessary IPIs. Remove redundant bpf_flush_icache() in arm64 and riscv JITs (Puranjay Mohan) - Fix out of bounds access. Validate node_id in arena_alloc_pages() (Puranjay Mohan) - Reject BPF-to-BPF calls and callbacks in arm32 JIT (Puranjay Mohan) - Refactor all JITs to pass bpf_verifier_env to emit ENDBR/BTI for indirect jump targets on x86-64, arm64 JITs (Xu Kuohai) - Allow UTF-8 literals in bpf_bprintf_prepare() (Yihan Ding)" * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (32 commits) bpf, arm32: Reject BPF-to-BPF calls and callbacks in the JIT bpf: Dissociate struct_ops program with map if map_update fails bpf: Validate node_id in arena_alloc_pages() libbpf: Prevent double close and leak of btf objects selftests/bpf: cover UTF-8 trace_printk output bpf: allow UTF-8 literals in bpf_bprintf_prepare() selftests/bpf: Reject scalar store into kptr slot bpf: Fix NULL deref in map_kptr_match_type for scalar regs bpf: Fix precedence bug in convert_bpf_ld_abs alignment check bpf, arm64: Emit BTI for indirect jump target bpf, x86: Emit ENDBR for indirect jump targets bpf: Add helper to detect indirect jump targets bpf: Pass bpf_verifier_env to JIT bpf: Move constants blinding out of arch-specific JITs bpf, sockmap: Take state lock for af_unix iter bpf, sockmap: Fix af_unix null-ptr-deref in proto update selftests/bpf: Extend bpf_iter_unix to attempt deadlocking bpf, sockmap: Fix af_unix iter deadlock bpf, sockmap: Annotate af_unix sock:: Sk_state data-races selftests/bpf: verify kallsyms entries for token-loaded subprograms ...
2026-04-17bpf: Dissociate struct_ops program with map if map_update failsAmery Hung
Currently, when bpf_struct_ops_map_update_elem() fails, the programs' st_ops_assoc will remain set. They may become dangling pointers if the map is freed later, but they will never be dereferenced since the struct_ops attachment did not succeed. However, if one of the programs is subsequently attached as part of another struct_ops map, its st_ops_assoc will be poisoned even though its old st_ops_assoc was stale from a failed attachment. Fix the spurious poisoned st_ops_assoc by dissociating struct_ops programs with a map if the attachment fails. Move bpf_prog_assoc_struct_ops() to after *plink++ to make sure bpf_prog_disassoc_struct_ops() will not miss a program when iterating st_map->links. Note that, dissociating a program from a map requires some attention as it must not reset a poisoned st_ops_assoc or a st_ops_assoc pointing to another map. The former is already guarded in bpf_prog_disassoc_struct_ops(). The latter also will not happen since st_ops_assoc of programs in st_map->links are set by bpf_prog_assoc_struct_ops(), which can only be poisoned or pointing to the current map. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260417174900.2895486-1-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-17cgroup/cpuset: record DL BW alloc CPU for attach rollbackGuopeng Zhang
cpuset_can_attach() allocates DL bandwidth only when migrating deadline tasks to a disjoint CPU mask, but cpuset_cancel_attach() rolls back based only on nr_migrate_dl_tasks. This makes the DL bandwidth alloc/free paths asymmetric: rollback can call dl_bw_free() even when no dl_bw_alloc() was done. Rollback also needs to undo the reservation against the same CPU/root domain that was charged. Record the CPU used by dl_bw_alloc() and use that state in cpuset_cancel_attach(). If no allocation happened, dl_bw_cpu stays at -1 and rollback skips dl_bw_free(). If allocation did happen, bandwidth is returned to the same CPU/root domain. Successful attach paths are unchanged. This only fixes failed attach rollback accounting. Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-17Merge tag 'dma-mapping-7.1-2026-04-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping updates from Marek Szyprowski: - added support for batched cache sync, what improves performance of dma_map/unmap_sg() operations on ARM64 architecture (Barry Song) - introduced DMA_ATTR_CC_SHARED attribute for explicitly shared memory used in confidential computing (Jiri Pirko) - refactored spaghetti-like code in drivers/of/of_reserved_mem.c and its clients (Marek Szyprowski, shared branch with device-tree updates to avoid merge conflicts) - prepared Contiguous Memory Allocator related code for making dma-buf drivers modularized (Maxime Ripard) - added support for benchmarking dma_map_sg() calls to tools/dma utility (Qinxin Xia) * tag 'dma-mapping-7.1-2026-04-16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: (24 commits) dma-buf: heaps: system: document system_cc_shared heap dma-buf: heaps: system: add system_cc_shared heap for explicitly shared memory dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory mm: cma: Export cma_alloc(), cma_release() and cma_get_name() dma: contiguous: Export dev_get_cma_area() dma: contiguous: Make dma_contiguous_default_area static dma: contiguous: Make dev_get_cma_area() a proper function dma: contiguous: Turn heap registration logic around of: reserved_mem: rework fdt_init_reserved_mem_node() of: reserved_mem: clarify fdt_scan_reserved_mem*() functions of: reserved_mem: rearrange code a bit of: reserved_mem: replace CMA quirks by generic methods of: reserved_mem: switch to ops based OF_DECLARE() of: reserved_mem: use -ENODEV instead of -ENOENT of: reserved_mem: remove fdt node from the structure dma-mapping: fix false kernel-doc comment marker dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg dma-mapping: Separate DMA sync issuing and completion waiting arm64: Provide dcache_inval_poc_nosync helper arm64: Provide dcache_clean_poc_nosync helper ...
2026-04-17cgroup/rdma: fix integer overflow in rdmacg_try_charge()cuitao
The expression `rpool->resources[index].usage + 1` is computed in int arithmetic before being assigned to s64 variable `new`. When usage equals INT_MAX (the default "max" value), the addition overflows to INT_MIN. This negative value then passes the `new > max` check incorrectly, allowing a charge that should be rejected and corrupting usage to negative. Fix by casting usage to s64 before the addition so the arithmetic is done in 64-bit. Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller") Signed-off-by: cuitao <cuitao@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-17sched/psi: fix race between file release and pressure writeEdward Adam Davis
A potential race condition exists between pressure write and cgroup file release regarding the priv member of struct kernfs_open_file, which triggers the uaf reported in [1]. Consider the following scenario involving execution on two separate CPUs: CPU0 CPU1 ==== ==== vfs_rmdir() kernfs_iop_rmdir() cgroup_rmdir() cgroup_kn_lock_live() cgroup_destroy_locked() cgroup_addrm_files() cgroup_rm_file() kernfs_remove_by_name() kernfs_remove_by_name_ns() vfs_write() __kernfs_remove() new_sync_write() kernfs_drain() kernfs_fop_write_iter() kernfs_drain_open_files() cgroup_file_write() kernfs_release_file() pressure_write() cgroup_file_release() ctx = of->priv; kfree(ctx); of->priv = NULL; cgroup_kn_unlock() cgroup_kn_lock_live() cgroup_get(cgrp) cgroup_kn_unlock() if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv The cgroup_rmdir() is protected by the cgroup_mutex, it also safeguards the memory deallocation of of->priv performed within cgroup_file_release(). However, the operations involving of->priv executed within pressure_write() are not entirely covered by the protection of cgroup_mutex. Consequently, if the code in pressure_write(), specifically the section handling the ctx variable executes after cgroup_file_release() has completed, a uaf vulnerability involving of->priv is triggered. Therefore, the issue can be resolved by extending the scope of the cgroup_mutex lock within pressure_write() to encompass all code paths involving of->priv, thereby properly synchronizing the race condition occurring between cgroup_file_release() and pressure_write(). And, if an live kn lock can be successfully acquired while executing the pressure write operation, it indicates that the cgroup deletion process has not yet reached its final stage; consequently, the priv pointer within open_file cannot be NULL. Therefore, the operation to retrieve the ctx value must be moved to a point *after* the live kn lock has been successfully acquired. In another situation, specifically after entering cgroup_kn_lock_live() but before acquiring cgroup_mutex, there exists a different class of race condition: CPU0: write memory.pressure CPU1: write cgroup.pressure=0 =========================== ============================= kernfs_fop_write_iter() kernfs_get_active_of(of) pressure_write() cgroup_kn_lock_live(memory.pressure) cgroup_tryget(cgrp) kernfs_break_active_protection(kn) ... blocks on cgroup_mutex cgroup_pressure_write() cgroup_kn_lock_live(cgroup.pressure) cgroup_file_show(memory.pressure, false) kernfs_show(false) kernfs_drain_open_files() cgroup_file_release(of) kfree(ctx) of->priv = NULL cgroup_kn_unlock() ... acquires cgroup_mutex ctx = of->priv; // may now be NULL if (ctx->psi.trigger) // NULL dereference Consequently, there is a possibility that of->priv is NULL, the pressure write needs to check for this. Now that the scope of the cgroup_mutex has been expanded, the original explicit cgroup_get/put operations are no longer necessary, this is because acquiring/releasing the live kn lock inherently executes a cgroup get/put operation. [1] BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011 Call Trace: pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011 cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352 Allocated by task 9352: cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256 kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724 do_dentry_open+0x83d/0x13e0 fs/open.c:949 Freed by task 9353: cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283 kernfs_release_file fs/kernfs/file.c:764 [inline] kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834 kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525 Fixes: 0e94682b73bf ("psi: introduce psi monitor") Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Tejun Heo <tj@kernel.org>