summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-06-09bpf: Move insn if/else into do_check_insn()Luis Gerhorst
This is required to catch the errors later and fall back to a nospec if on a speculative path. Eliminate the regs variable as it is only used once and insn_idx is not modified in-between the definition and usage. Do not pass insn but compute it in the function itself. As Eduard points out [1], insn is assumed to correspond to env->insn_idx in many places (e.g, __check_reg_arg()). Move code into do_check_insn(), replace * "continue" with "return 0" after modifying insn_idx * "goto process_bpf_exit" with "return PROCESS_BPF_EXIT" * "goto process_bpf_exit_full" with "return process_bpf_exit_full()" * "do_print_state = " with "*do_print_state = " [1] https://lore.kernel.org/all/293dbe3950a782b8eb3b87b71d7a967e120191fd.camel@gmail.com/ Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603205800.334980-2-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-09bpf: Add cookie in fdinfo for raw_tpTao Chen
Add cookie in fdinfo for raw_tp, the info as follows: link_type: raw_tracepoint link_id: 31 prog_tag: 9dfdf8ef453843bf prog_id: 32 tp_name: sys_enter cookie: 23925373020405760 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-5-chen.dylane@linux.dev
2025-06-09bpf: Add cookie in fdinfo for tracingTao Chen
Add cookie in fdinfo for tracing, the info as follows: link_type: tracing link_id: 6 prog_tag: 9dfdf8ef453843bf prog_id: 35 attach_type: 25 target_obj_id: 1 target_btf_id: 60355 cookie: 9007199254740992 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-4-chen.dylane@linux.dev
2025-06-09bpf: Add cookie to tracing bpf_link_infoTao Chen
bpf_tramp_link includes cookie info, we can add it in bpf_link_info. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-1-chen.dylane@linux.dev
2025-06-09bpf: Make reg_not_null() true for CONST_PTR_TO_MAPIhor Solodrai
When reg->type is CONST_PTR_TO_MAP, it can not be null. However the verifier explores the branches under rX == 0 in check_cond_jmp_op() even if reg->type is CONST_PTR_TO_MAP, because it was not checked for in reg_not_null(). Fix this by adding CONST_PTR_TO_MAP to the set of types that are considered non nullable in reg_not_null(). An old "unpriv: cmp map pointer with zero" selftest fails with this change, because now early out correctly triggers in check_cond_jmp_op(), making the verification to pass. In practice verifier may allow pointer to null comparison in unpriv, since in many cases the relevant branch and comparison op are removed as dead code. So change the expected test result to __success_unpriv. Signed-off-by: Ihor Solodrai <isolodrai@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250609183024.359974-2-isolodrai@meta.com
2025-06-09bpf: Add show_fdinfo for perf_eventTao Chen
After commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") add perf_event info, we can also show the info with the method of cat /proc/[fd]/fdinfo. kprobe fdinfo: link_type: perf link_id: 10 prog_tag: bcf7977d3b93787c prog_id: 20 name: bpf_fentry_test1 offset: 0x0 missed: 0 addr: 0xffffffffa28a2904 event_type: kprobe cookie: 3735928559 uprobe fdinfo: link_type: perf link_id: 13 prog_tag: bcf7977d3b93787c prog_id: 21 name: /proc/self/exe offset: 0x63dce4 ref_ctr_offset: 0x33eee2a event_type: uprobe cookie: 3735928559 tracepoint fdinfo: link_type: perf link_id: 11 prog_tag: bcf7977d3b93787c prog_id: 22 tp_name: sched_switch event_type: tracepoint cookie: 3735928559 perf_event fdinfo: link_type: perf link_id: 12 prog_tag: bcf7977d3b93787c prog_id: 23 type: 1 config: 2 event_type: event cookie: 3735928559 Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606150258.3385166-1-chen.dylane@linux.dev
2025-06-09bpf: Implement mprog API on top of existing cgroup progsYonghong Song
Current cgroup prog ordering is appending at attachment time. This is not ideal. In some cases, users want specific ordering at a particular cgroup level. To address this, the existing mprog API seems an ideal solution with supporting BPF_F_BEFORE and BPF_F_AFTER flags. But there are a few obstacles to directly use kernel mprog interface. Currently cgroup bpf progs already support prog attach/detach/replace and link-based attach/detach/replace. For example, in struct bpf_prog_array_item, the cgroup_storage field needs to be together with bpf prog. But the mprog API struct bpf_mprog_fp only has bpf_prog as the member, which makes it difficult to use kernel mprog interface. In another case, the current cgroup prog detach tries to use the same flag as in attach. This is different from mprog kernel interface which uses flags passed from user space. So to avoid modifying existing behavior, I made the following changes to support mprog API for cgroup progs: - The support is for prog list at cgroup level. Cross-level prog list (a.k.a. effective prog list) is not supported. - Previously, BPF_F_PREORDER is supported only for prog attach, now BPF_F_PREORDER is also supported by link-based attach. - For attach, BPF_F_BEFORE/BPF_F_AFTER/BPF_F_ID/BPF_F_LINK is supported similar to kernel mprog but with different implementation. - For detach and replace, use the existing implementation. - For attach, detach and replace, the revision for a particular prog list, associated with a particular attach type, will be updated by increasing count by 1. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606163141.2428937-1-yonghong.song@linux.dev
2025-06-09cgroup: Add bpf prog revisions to struct cgroup_bpfYonghong Song
One of key items in mprog API is revision for prog list. The revision number will be increased if the prog list changed, e.g., attach, detach or replace. Add 'revisions' field to struct cgroup_bpf, representing revisions for all cgroup related attachment types. The initial revision value is set to 1, the same as kernel mprog implementations. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606163136.2428732-1-yonghong.song@linux.dev
2025-06-09workqueue: fix opencoded cpumask_next_and_wrap() in wq_select_unbound_cpu()Yury Norov [NVIDIA]
The dedicated helper is more verbose and effective comparing to cpumask_first() followed by cpumask_next(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09sched_ext: Make scx_locked_rq() inlineAndrea Righi
scx_locked_rq() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. v2: Rename locked_rq to scx_locked_rq_state, expose it and make scx_locked_rq() inline, as suggested by Tejun. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09sched_ext: Make scx_rq_bypassing() inlineAndrea Righi
scx_rq_bypassing() is used both from ext.c and ext_idle.c, move it to ext.h as a static inline function. No functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09sched_ext: idle: Make local functions static in ext_idle.cAndrea Righi
Functions that are only used within ext_idle.c can be marked as static to limit their scope. No functional changes. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09sched_ext: idle: Remove unnecessary ifdef in scx_bpf_cpu_node()Andrea Righi
There's no need to make scx_bpf_cpu_node() dependent on CONFIG_NUMA, since cpu_to_node() can be used also in systems with CONFIG_NUMA disabled. This also allows to always validate the @cpu argument regardless of the CONFIG_NUMA settings. Fixes: 01059219b0cfd ("sched_ext: idle: Introduce node-aware idle cpu kfunc helpers") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-09printk: Allow to use the printk kthread immediately even for 1st nbconPetr Mladek
The kthreads for nbcon consoles are created by nbcon_alloc() at the beginning of the console registration. But it currently works only for the 2nd or later nbcon console because the code checks @printk_kthreads_running. The kthread for the 1st registered nbcon console is created at the very end of register_console() by printk_kthreads_check_locked(). As a result, the entire log is replayed synchronously when the "enabled" message gets printed. It might block the boot for a long time with a slow serial console. Prevent the synchronous flush by creating the kthread even for the 1st nbcon console when it is safe (kthreads ready and no boot consoles). Also inform printk() to use the kthread by setting @printk_kthreads_running. Note that the kthreads already must be running when it is safe and this is not the 1st nbcon console. Symmetrically, clear @printk_kthreads_running when the last nbcon console was unregistered by nbcon_free(). This requires updating @have_nbcon_console before nbcon_free() gets called. Note that there is _no_ problem when the 1st nbcon console replaces boot consoles. In this case, the kthread will be started at the end of registration after the boot consoles are removed. But the console does not reply the entire log buffer in this case. Note that the flag CON_PRINTBUFFER is always cleared when the boot consoles are removed and vice versa. Closes: https://lore.kernel.org/r/20250514173514.2117832-1-mcobb@thegoodpenguin.co.uk Tested-by: Michael Cobb <mcobb@thegoodpenguin.co.uk> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20250604142045.253301-1-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-06-08Merge tag 'timers-cleanups-2025-06-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanup from Thomas Gleixner: "The delayed from_timer() API cleanup: The renaming to the timer_*() namespace was delayed due massive conflicts against Linux-next. Now that everything is upstream finish the conversion" * tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename from_timer() to timer_container_of()
2025-06-08Merge tag 'trace-v6.16-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull more tracing fixes from Steven Rostedt: - Fix regression of waiting a long time on updating trace event filters When the faultable trace points were added, it needed task trace RCU synchronization. This was added to the tracepoint_synchronize_unregister() function. The filter logic always called this function whenever it updated the trace event filters before freeing the old filters. This increased the time of "trace-cmd record" from taking 13 seconds to running over 2 minutes to complete. Move the freeing of the filters to call_rcu*() logic, which brings the time back down to 13 seconds. - Fix ring_buffer_subbuf_order_set() error path lock protection The error path of the ring_buffer_subbuf_order_set() released the mutex too early and allowed subsequent accesses to setting the subbuffer size to corrupt the data and cause a bug. By moving the mutex locking to the end of the error path, it prevents the reentrant access to the critical data and also allows the function to convert the taking of the mutex over to the guard() logic. - Remove unused power management clock events The clock events were added in 2010 for power management. In 2011 arm used them. In 2013 the code they were used in was removed. These events have been wasting memory since then. - Fix sparse warnings There was a few places that sparse warned about trace_events_filter.c where file->filter was referenced directly, but it is annotated with an __rcu tag. Use the helper functions and fix them up to use rcu_dereference() properly. * tag 'trace-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Add rcu annotation around file->filter accesses tracing: PM: Remove unused clock events ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set() tracing: Fix regression of filter waiting a long time on RCU synchronization
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-07Merge tag 'kbuild-v6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Add support for the EXPORT_SYMBOL_GPL_FOR_MODULES() macro, which exports a symbol only to specified modules - Improve ABI handling in gendwarfksyms - Forcibly link lib-y objects to vmlinux even if CONFIG_MODULES=n - Add checkers for redundant or missing <linux/export.h> inclusion - Deprecate the extra-y syntax - Fix a genksyms bug when including enum constants from *.symref files * tag 'kbuild-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (28 commits) genksyms: Fix enum consts from a reference affecting new values arch: use always-$(KBUILD_BUILTIN) for vmlinux.lds kbuild: set y instead of 1 to KBUILD_{BUILTIN,MODULES} efi/libstub: use 'targets' instead of extra-y in Makefile module: make __mod_device_table__* symbols static scripts/misc-check: check unnecessary #include <linux/export.h> when W=1 scripts/misc-check: check missing #include <linux/export.h> when W=1 scripts/misc-check: add double-quotes to satisfy shellcheck kbuild: move W=1 check for scripts/misc-check to top-level Makefile scripts/tags.sh: allow to use alternative ctags implementation kconfig: introduce menu type enum docs: symbol-namespaces: fix reST warning with literal block kbuild: link lib-y objects to vmlinux forcibly even when CONFIG_MODULES=n tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION docs/core-api/symbol-namespaces: drop table of contents and section numbering modpost: check forbidden MODULE_IMPORT_NS("module:") at compile time kbuild: move kbuild syntax processing to scripts/Makefile.build Makefile: remove dependency on archscripts for header installation Documentation/kbuild: Add new gendwarfksyms kABI rules Documentation/kbuild: Drop section numbers ...
2025-06-07tracing: Add rcu annotation around file->filter accessesSteven Rostedt
Running sparse on trace_events_filter.c triggered several warnings about file->filter being accessed directly even though it's annotated with __rcu. Add rcu_dereference() around it and shuffle the logic slightly so that it's always referenced via accessor functions. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250607102821.6c7effbf@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06Merge tag 'mm-hotfixes-stable-2025-06-06-16-02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes. 6 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 11 are for MM" * tag 'mm-hotfixes-stable-2025-06-06-16-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count MAINTAINERS: add mm swap section kmsan: test: add module description MAINTAINERS: add tlb trace events to MMU GATHER AND TLB INVALIDATION mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race mm/hugetlb: unshare page tables during VMA split, not before MAINTAINERS: add Alistair as reviewer of mm memory policy iov_iter: use iov_offset for length calculation in iov_iter_aligned_bvec mm/mempolicy: fix incorrect freeing of wi_kobj alloc_tag: handle module codetag load errors as module load failures mm/madvise: handle madvise_lock() failure during race unwinding mm: fix vmstat after removing NR_BOUNCE KVM: s390: rename PROT_NONE to PROT_TYPE_DUMMY
2025-06-06Merge tag 'riscv-for-linus-6.16-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the FWFT SBI extension, which is part of SBI 3.0 and a dependency for many new SBI and ISA extensions - Support for getrandom() in the VDSO - Support for mseal - Optimized routines for raid6 syndrome and recovery calculations - kexec_file() supports loading Image-formatted kernel binaries - Improvements to the instruction patching framework to allow for atomic instruction patching, along with rules as to how systems need to behave in order to function correctly - Support for a handful of new ISA extensions: Svinval, Zicbop, Zabha, some SiFive vendor extensions - Various fixes and cleanups, including: misaligned access handling, perf symbol mangling, module loading, PUD THPs, and improved uaccess routines * tag 'riscv-for-linus-6.16-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (69 commits) riscv: uaccess: Only restore the CSR_STATUS SUM bit RISC-V: vDSO: Wire up getrandom() vDSO implementation riscv: enable mseal sysmap for RV64 raid6: Add RISC-V SIMD syndrome and recovery calculations riscv: mm: Add support for Svinval extension RISC-V: Documentation: Add enough title underlines to CMODX riscv: Improve Kconfig help for RISCV_ISA_V_PREEMPTIVE MAINTAINERS: Update Atish's email address riscv: uaccess: do not do misaligned accesses in get/put_user() riscv: process: use unsigned int instead of unsigned long for put_user() riscv: make unsafe user copy routines use existing assembly routines riscv: hwprobe: export Zabha extension riscv: Make regs_irqs_disabled() more clear perf symbols: Ignore mapping symbols on riscv RISC-V: Kconfig: Fix help text of CMDLINE_EXTEND riscv: module: Optimize PLT/GOT entry counting riscv: Add support for PUD THP riscv: xchg: Prefetch the destination word for sc.w riscv: Add ARCH_HAS_PREFETCH[W] support with Zicbop riscv: Add support for Zicbop ...
2025-06-06ring-buffer: Fix buffer locking in ring_buffer_subbuf_order_set()Dmitry Antipov
Enlarge the critical section in ring_buffer_subbuf_order_set() to ensure that error handling takes place with per-buffer mutex held, thus preventing list corruption and other concurrency-related issues. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com> Link: https://lore.kernel.org/20250606112242.1510605-1-dmantipov@yandex.ru Reported-by: syzbot+05d673e83ec640f0ced9@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=05d673e83ec640f0ced9 Fixes: f9b94daa542a8 ("ring-buffer: Set new size of the ring buffer sub page") Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-06tracing: Fix regression of filter waiting a long time on RCU synchronizationSteven Rostedt
When faultable trace events were added, a trace event may no longer use normal RCU to synchronize but instead used synchronize_rcu_tasks_trace(). This synchronization takes a much longer time to synchronize. The filter logic would free the filters by calling tracepoint_synchronize_unregister() after it unhooked the filter strings and before freeing them. With this function now calling synchronize_rcu_tasks_trace() this increased the time to free a filter tremendously. On a PREEMPT_RT system, it was even more noticeable. # time trace-cmd record -p function sleep 1 [..] real 2m29.052s user 0m0.244s sys 0m20.136s As trace-cmd would clear out all the filters before recording, it could take up to 2 minutes to do a recording of "sleep 1". To find out where the issues was: ~# trace-cmd sqlhist -e -n sched_stack select start.prev_state as state, end.next_comm as comm, TIMESTAMP_DELTA_USECS as delta, start.STACKTRACE as stack from sched_switch as start join sched_switch as end on start.prev_pid = end.next_pid Which will produce the following commands (and -e will also execute them): echo 's:sched_stack s64 state; char comm[16]; u64 delta; unsigned long stack[];' >> /sys/kernel/tracing/dynamic_events echo 'hist:keys=prev_pid:__arg_18057_2=prev_state,__arg_18057_4=common_timestamp.usecs,__arg_18057_7=common_stacktrace' >> /sys/kernel/tracing/events/sched/sched_switch/trigger echo 'hist:keys=next_pid:__state_18057_1=$__arg_18057_2,__comm_18057_3=next_comm,__delta_18057_5=common_timestamp.usecs-$__arg_18057_4,__stack_18057_6=$__arg_18057_7:onmatch(sched.sched_switch).trace(sched_stack,$__state_18057_1,$__comm_18057_3,$__delta_18057_5,$__stack_18057_6)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger The above creates a synthetic event that creates a stack trace when a task schedules out and records it with the time it scheduled back in. Basically the time a task is off the CPU. It also records the state of the task when it left the CPU (running, blocked, sleeping, etc). It also saves the comm of the task as "comm" (needed for the next command). ~# echo 'hist:keys=state,stack.stacktrace:vals=delta:sort=state,delta if comm == "trace-cmd" && state & 3' > /sys/kernel/tracing/events/synthetic/sched_stack/trigger The above creates a histogram with buckets per state, per stack, and the value of the total time it was off the CPU for that stack trace. It filters on tasks with "comm == trace-cmd" and only the sleeping and blocked states (1 - sleeping, 2 - blocked). ~# trace-cmd record -p function sleep 1 ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -18 { state: 2, stack.stacktrace __schedule+0x1545/0x3700 schedule+0xe2/0x390 schedule_timeout+0x175/0x200 wait_for_completion_state+0x294/0x440 __wait_rcu_gp+0x247/0x4f0 synchronize_rcu_tasks_generic+0x151/0x230 apply_subsystem_event_filter+0xa2b/0x1300 subsystem_filter_write+0x67/0xc0 vfs_write+0x1e2/0xeb0 ksys_write+0xff/0x1d0 do_syscall_64+0x7b/0x420 entry_SYSCALL_64_after_hwframe+0x76/0x7e } hitcount: 237 delta: 99756288 <<--------------- Delta is 99 seconds! Totals: Hits: 525 Entries: 21 Dropped: 0 This shows that this particular trace waited for 99 seconds on synchronize_rcu_tasks() in apply_subsystem_event_filter(). In fact, there's a lot of places in the filter code that spends a lot of time waiting for synchronize_rcu_tasks_trace() in order to free the filters. Add helper functions that will use call_rcu*() variants to asynchronously free the filters. This brings the timings back to normal: # time trace-cmd record -p function sleep 1 [..] real 0m14.681s user 0m0.335s sys 0m28.616s And the histogram also shows this: ~# cat /sys/kernel/tracing/events/synthetic/sched_stack/hist | tail -21 { state: 2, stack.stacktrace __schedule+0x1545/0x3700 schedule+0xe2/0x390 schedule_timeout+0x175/0x200 wait_for_completion_state+0x294/0x440 __wait_rcu_gp+0x247/0x4f0 synchronize_rcu_normal+0x3db/0x5c0 tracing_reset_online_cpus+0x8f/0x1e0 tracing_open+0x335/0x440 do_dentry_open+0x4c6/0x17a0 vfs_open+0x82/0x360 path_openat+0x1a36/0x2990 do_filp_open+0x1c5/0x420 do_sys_openat2+0xed/0x180 __x64_sys_openat+0x108/0x1d0 do_syscall_64+0x7b/0x420 } hitcount: 2 delta: 77044 Totals: Hits: 55 Entries: 28 Dropped: 0 Where the total waiting time of synchronize_rcu_tasks_trace() is 77 milliseconds. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Andreas Ziegler <ziegler.andreas@siemens.com> Cc: Felix MOESSBAUER <felix.moessbauer@siemens.com> Link: https://lore.kernel.org/20250606201936.1e3d09a9@batman.local.home Reported-by: "Flot, Julien" <julien.flot@siemens.com> Tested-by: Julien Flot <julien.flot@siemens.com> Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults") Closes: https://lore.kernel.org/all/240017f656631c7dd4017aa93d91f41f653788ea.camel@siemens.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-05kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_countMax Kellermann
Expose a simple counter to userspace for monitoring tools. (akpm: 2536c5c7d6ae added the documentation but the code changes were lost) Link: https://lkml.kernel.org/r/20250504180831.4190860-3-max.kellermann@ionos.com Fixes: 2536c5c7d6ae ("kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Cc: Core Minyard <cminyard@mvista.com> Cc: Doug Anderson <dianders@chromium.org> Cc: Joel Granados <joel.granados@kernel.org> Cc: Max Kellermann <max.kellermann@ionos.com> Cc: Song Liu <song@kernel.org> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05alloc_tag: handle module codetag load errors as module load failuresSuren Baghdasaryan
Failures inside codetag_load_module() are currently ignored. As a result an error there would not cause a module load failure and freeing of the associated resources. Correct this behavior by propagating the error code to the caller and handling possible errors. With this change, error to allocate percpu counters, which happens at this stage, will not be ignored and will cause a module load failure and freeing of resources. With this change we also do not need to disable memory allocation profiling when this error happens, instead we fail to load the module. Link: https://lkml.kernel.org/r/20250521160602.1940771-1-surenb@google.com Fixes: 10075262888b ("alloc_tag: allocate percpu counters for module tags dynamically") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Casey Chen <cachen@purestorage.com> Closes: https://lore.kernel.org/all/20250520231620.15259-1-cachen@purestorage.com/ Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Wang <00107082@163.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-05Merge tag 'riscv-mw2-6.16-rc1' of ↵Palmer Dabbelt
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/alexghiti/linux into for-next riscv patches for 6.16-rc1, part 2 * Performance improvements - Add support for vdso getrandom - Implement raid6 calculations using vectors - Introduce svinval tlb invalidation * Cleanup - A bunch of deduplication of the macros we use for manipulating instructions * Misc - Introduce a kunit test for kprobes - Add support for mseal as riscv fits the requirements (thanks to Lorenzo for making sure of that :)) [Palmer: There was a rebase between part 1 and part 2, so I've had to do some more git surgery here... at least two rounds of surgery...] * alex-pr-2: (866 commits) RISC-V: vDSO: Wire up getrandom() vDSO implementation riscv: enable mseal sysmap for RV64 raid6: Add RISC-V SIMD syndrome and recovery calculations riscv: mm: Add support for Svinval extension riscv: Add kprobes KUnit test riscv: kprobes: Remove duplication of RV_EXTRACT_ITYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_UTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_RD_REG riscv: kprobes: Remove duplication of RVC_EXTRACT_BTYPE_IMM riscv: kprobes: Remove duplication of RVC_EXTRACT_C2_RS1_REG riscv: kproves: Remove duplication of RVC_EXTRACT_JTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_BTYPE_IMM riscv: kprobes: Remove duplication of RV_EXTRACT_RS1_REG riscv: kprobes: Remove duplication of RV_EXTRACT_JTYPE_IMM riscv: kprobes: Move branch_funct3 to insn.h riscv: kprobes: Move branch_rs2_idx to insn.h Linux 6.15-rc6 Input: xpad - fix xpad_device sorting Input: xpad - add support for several more controllers Input: xpad - fix Share button on Xbox One controllers ...
2025-06-06tinyconfig: enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATIONMasahiro Yamada
This CONFIG option, if supported by the architecture, helps reduce the size of vmlinux. For example, the size of vmlinux with ARCH=arm tinyconfig decreases as follows: text data bss dec hex filename 631684 104500 18176 754360 b82b8 vmlinux.before 455316 93404 15472 564192 89be0 vmlinux.after Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-06-05bpf: Clarify sanitize_check_bounds()Luis Gerhorst
As is, it appears as if pointer arithmetic is allowed for everything except PTR_TO_{STACK,MAP_VALUE} if one only looks at sanitize_check_bounds(). However, this is misleading as the function only works together with retrieve_ptr_limit() and the two must be kept in sync. This patch documents the interdependency and adds a check to ensure they stay in sync. adjust_ptr_min_max_vals(): Because the preceding switch returns -EACCES for every opcode except for ADD/SUB, the sanitize_needed() following the sanitize_check_bounds() call is always true if reached. This means, unless sanitize_check_bounds() detected that the pointer goes OOB because of the ADD/SUB and returns -EACCES, sanitize_ptr_alu() always executes after sanitize_check_bounds(). The following shows that this also implies that retrieve_ptr_limit() runs in all relevant cases. Note that there are two calls to sanitize_ptr_alu(), these are simply needed to easily calculate the correct alu_limit as explained in commit 7fedb63a8307 ("bpf: Tighten speculative pointer arithmetic mask"). The truncation-simulation is already performed on the first call. In the second sanitize_ptr_alu(commit_window = true), we always run retrieve_ptr_limit(), unless: * can_skip_alu_sanititation() is true, notably `BPF_SRC(insn->code) == BPF_K`. BPF_K is fine because it means that there is no scalar register (which could be subject to speculative scalar confusion due to Spectre v4) that goes into the ALU operation. The pointer register can not be subject to v4-based value confusion due to the nospec added. Thus, in this case it would have been fine to also skip sanitize_check_bounds(). * If we are on a speculative path (`vstate->speculative`) and in the second "commit" phase, sanitize_ptr_alu() always just returns 0. This makes sense because there are no ALU sanitization limits to be learned from speculative paths. Furthermore, because the sanitization will ensure that pointer arithmetic stays in (architectural) bounds, the sanitize_check_bounds() on the speculative path could also be skipped. The second case needs more attention: Assume we have some ALU operation that is used with scalars architecturally, but with a non-PTR_TO_{STACK,MAP_VALUE} pointer (e.g., PTR_TO_PACKET) speculatively. It might appear as if this would allow an unsanitized pointer ALU operations, but this can not happen because one of the following two always holds: * The type mismatch stems from Spectre v4, then it is prevented by a nospec after the possibly-bypassed store involving the pointer. There is no speculative path simulated for this case thus it never happens. * The type mismatch stems from a Spectre v1 gadget like the following: r1 = slow(0) r4 = fast(0) r3 = SCALAR // Spectre v4 scalar confusion if (r1) { r2 = PTR_TO_PACKET } else { r2 = 42 } if (r4) { r2 += r3 *r2 } If `r2 = PTR_TO_PACKET` is indeed dead code, it will be sanitized to `goto -1` (as is the case for the r4-if block). If it is not (e.g., if `r1 = r4 = 1` is possible), it will also be explored on an architectural path and retrieve_ptr_limit() will reject it. To summarize, the exception for `vstate->speculative` is safe. Back to retrieve_ptr_limit(): It only allows the ALU operation if the involved pointer register (can be either source or destination for ADD) is PTR_TO_STACK or PTR_TO_MAP_VALUE. Otherwise, it returns -EOPNOTSUPP. Therefore, sanitize_check_bounds() returning 0 for non-PTR_TO_{STACK,MAP_VALUE} is fine because retrieve_ptr_limit() also runs for all relevant cases and prevents unsafe operations. To summarize, we allow unsanitized pointer arithmetic with 64-bit ADD/SUB for the following instructions if the requirements from retrieve_ptr_limit() AND sanitize_check_bounds() hold: * ptr -=/+= imm32 (i.e. `BPF_SRC(insn->code) == BPF_K`) * PTR_TO_{STACK,MAP_VALUE} -= scalar * PTR_TO_{STACK,MAP_VALUE} += scalar * scalar += PTR_TO_{STACK,MAP_VALUE} To document the interdependency between sanitize_check_bounds() and retrieve_ptr_limit(), add a verifier_bug_if() to make sure they stay in sync. Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/CAP01T76HZ+s5h+_REqRFkRjjoKwnZZn9YswpSVinGicah1pGJw@mail.gmail.com/ Link: https://lore.kernel.org/bpf/CAP01T75oU0zfZCiymEcH3r-GQ5A6GOc6GmYzJEnMa3=53XuUQQ@mail.gmail.com/ Link: https://lore.kernel.org/r/20250603204557.332447-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-05bpf: Add cookie to raw_tp bpf_link_infoTao Chen
After commit 68ca5d4eebb8 ("bpf: support BPF cookie in raw tracepoint (raw_tp, tp_btf) programs"), we can show the cookie in bpf_link_info like kprobe etc. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250603154309.3063644-1-chen.dylane@linux.dev
2025-06-05kernel: ftrace: export ftrace_sync_ipiAndy Chiu
The following ftrace patch for riscv uses a data store to update ftrace function. Therefore, a romote fence is required to order it against function_trace_op updates. The mechanism is similar to the fence between function_trace_op and update_ftrace_func in the generic ftrace, so we leverage the same ftrace_sync_ipi function. [ alex: Fix build warning when !CONFIG_DYNAMIC_FTRACE ] Signed-off-by: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/r/20250407180838.42877-4-andybnac@gmail.com Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
2025-06-05futex: Handle invalid node numbers supplied by userPeter Zijlstra
syzbot used a negative node number which was not rejected early and led to invalid memory access in node_possible(). Reject negative node numbers except for FUTEX_NO_NODE. [bigeasy: Keep the FUTEX_NO_NODE check] Closes: https://lore.kernel.org/all/6835bfe3.a70a0220.253bc2.00b5.GAE@google.com/ Fixes: cec199c5e39bd ("futex: Implement FUTEX2_NUMA") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: syzbot+9afaf6749e3a7aa1bdf3@syzkaller.appspotmail.com Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250528085521.1938355-4-bigeasy@linutronix.de
2025-06-05perf/core: Fix WARN in perf_cgroup_switch()Luo Gengkun
There may be concurrency between perf_cgroup_switch and perf_cgroup_event_disable. Consider the following scenario: after a new perf cgroup event is created on CPU0, the new event may not trigger a reprogramming, causing ctx->is_active to be 0. In this case, when CPU1 disables this perf event, it executes __perf_remove_from_context-> list _del_event->perf_cgroup_event_disable on CPU1, which causes a race with perf_cgroup_switch running on CPU0. The following describes the details of this concurrency scenario: CPU0 CPU1 perf_cgroup_switch: ... # cpuctx->cgrp is not NULL here if (READ_ONCE(cpuctx->cgrp) == NULL) return; perf_remove_from_context: ... raw_spin_lock_irq(&ctx->lock); ... # ctx->is_active == 0 because reprogramm is not # tigger, so CPU1 can do __perf_remove_from_context # for CPU0 __perf_remove_from_context: perf_cgroup_event_disable: ... if (--ctx->nr_cgroups) ... # this warning will happened because CPU1 changed # ctx.nr_cgroups to 0. WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0); [peterz: use guard instead of goto unlock] Fixes: db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250604033924.3914647-3-luogengkun@huaweicloud.com
2025-06-05perf: Fix dangling cgroup pointer in cpuctxYeoreum Yun
Commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit") moves the event->state update to before list_del_event(). This makes the event->state test in list_del_event() always false; never calling perf_cgroup_event_disable(). As a result, cpuctx->cgrp won't be cleared properly; causing havoc. Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit") Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: David Wang <00107082@163.com> Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/
2025-06-05perf: Fix cgroup state vs ERRORPeter Zijlstra
While chasing down a missing perf_cgroup_event_disable() elsewhere, Leo Yan found that both perf_put_aux_event() and perf_remove_sibling_event() were also missing one. Specifically, the rule is that events that switch to OFF,ERROR need to call perf_cgroup_event_disable(). Unify the disable paths to ensure this. Fixes: ab43762ef010 ("perf: Allow normal events to output AUX data") Fixes: 9f0c4fa111dc ("perf/core: Add a new PERF_EV_CAP_SIBLING event capability") Reported-by: Leo Yan <leo.yan@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605123343.GD35970@noisy.programming.kicks-ass.net
2025-06-05perf: Fix sample vs do_exit()Peter Zijlstra
Baisheng Gao reported an ARM64 crash, which Mark decoded as being a synchronous external abort -- most likely due to trying to access MMIO in bad ways. The crash further shows perf trying to do a user stack sample while in exit_mmap()'s tlb_finish_mmu() -- i.e. while tearing down the address space it is trying to access. It turns out that we stop perf after we tear down the userspace mm; a receipie for disaster, since perf likes to access userspace for various reasons. Flip this order by moving up where we stop perf in do_exit(). Additionally, harden PERF_SAMPLE_CALLCHAIN and PERF_SAMPLE_STACK_USER to abort when the current task does not have an mm (exit_mm() makes sure to set current->mm = NULL; before commencing with the actual teardown). Such that CPU wide events don't trip on this same problem. Fixes: c5ebcedb566e ("perf: Add ability to attach user stack dump to sample") Reported-by: Baisheng Gao <baisheng.gao@unisoc.com> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250605110815.GQ39944@noisy.programming.kicks-ass.net
2025-06-04Merge tag 'sched_ext-for-6.16-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "Two fixes in the built-in idle selection helpers: - Fix prev_cpu handling to guarantee that idle selection never returns a CPU that's not allowed - Skip cross-node search when !NUMA which could lead to infinite looping due to a bug in NUMA iterator" * tag 'sched_ext-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: idle: Skip cross-node search with !CONFIG_NUMA sched_ext: idle: Properly handle invalid prev_cpu during idle selection
2025-06-03Merge tag 'trace-v6.16-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix UAF in module unload in ftrace when there's a bug in the module If a module is buggy and triggers ftrace_disable which is set when an anomaly is detected, when it gets unloaded it doesn't free the hooks into kallsyms, and when a kallsyms lookup is performed it may access the mod->modname field and crash via UAF. Fix this by still freeing the mod_maps that are attached to kallsyms on module unload regardless if ftrace_disable is set or not. - Do not bother allocating mod_maps for kallsyms if ftrace_disable is set - Remove unused trace events When a trace event or tracepoint is created but not used, it still creates the code and data structures needed for that trace event. This just wastes memory. Remove the trace events that are created but not used. This does not remove trace events that are created but are not used due configs not being set. That will be handled later. This only removes events that have no user under any config. * tag 'trace-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fsdax: Remove unused trace events for dax insert mapping genirq/matrix: Remove unused irq_matrix_alloc_reserved tracepoint xdp: Remove unused mem_return_failed event ftrace: Don't allocate ftrace module map if ftrace is disabled ftrace: Fix UAF when lookup kallsym after ftrace disabled
2025-06-03Merge tag 'cgroup-for-6.16-rc1-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "The rstat per-subsystem split change skipped per-cpu allocation on UP configs; however even on UP, depending on config options, the size of the percpu struct may not be zero leading to crashes. Fix it by conditionalizing the per-cpu area allocation and usage on the size of the per-cpu struct" * tag 'cgroup-for-6.16-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: adjust criteria for rstat subsystem cpu lock access
2025-06-03sched_ext: idle: Skip cross-node search with !CONFIG_NUMAAndrea Righi
In the idle CPU selection logic, attempting cross-node searches adds unnecessary complexity when CONFIG_NUMA is disabled. Since there's no meaningful concept of nodes in this case, simplify the logic by restricting the idle CPU search to the current node only. Fixes: 48849271e6611 ("sched_ext: idle: Per-node idle cpumasks") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-02Merge tag 'modules-6.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux Pull module updates from Petr Pavlu: - Make .static_call_sites in modules read-only after init The .static_call_sites sections in modules have been made read-only after init to avoid any (non-)accidental modifications, similarly to how they are read-only after init in vmlinux - The rest are minor cleanups * tag 'modules-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux: module: Remove outdated comment about text_size module: Make .static_call_sites read-only after init module: Add a separate function to mark sections as read-only after init module: Constify parameters of module_enforce_rwx_sections()
2025-06-02Merge tag 'mm-stable-2025-06-01-14-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ...
2025-06-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull more kvm updates from Paolo Bonzini: Generic: - Clean up locking of all vCPUs for a VM by using the *_nest_lock() family of functions, and move duplicated code to virt/kvm/. kernel/ patches acked by Peter Zijlstra - Add MGLRU support to the access tracking perf test ARM fixes: - Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change - Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated - Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum - Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping - Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet s390: - Fix interaction between some filesystems and Secure Execution - Some cleanups and refactorings, preparing for an upcoming big series x86: - Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to fix a race between AP destroy and VMRUN - Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM - Refine and harden handling of spurious faults - Add support for ALLOWED_SEV_FEATURES - Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y - Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features that utilize those bits - Don't account temporary allocations in sev_send_update_data() - Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold - Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between SVM and VMX - Advertise support to userspace for WRMSRNS and PREFETCHI - Rescan I/O APIC routes after handling EOI that needed to be intercepted due to the old/previous routing, but not the new/current routing - Add a module param to control and enumerate support for device posted interrupts - Fix a potential overflow with nested virt on Intel systems running 32-bit kernels - Flush shadow VMCSes on emergency reboot - Add support for SNP to the various SEV selftests - Add a selftest to verify fastops instructions via forced emulation - Refine and optimize KVM's software processing of the posted interrupt bitmap, and share the harvesting code between KVM and the kernel's Posted MSI handler" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) rtmutex_api: provide correct extern functions KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race KVM: arm64: Unmap vLPIs affected by changes to GSI routing information KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding() KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock KVM: arm64: Use lock guard in vgic_v4_set_forwarding() KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue KVM: s390: Simplify and move pv code KVM: s390: Refactor and split some gmap helpers KVM: s390: Remove unneeded srcu lock s390: Remove unneeded includes s390/uv: Improve splitting of large folios that cannot be split while dirty s390/uv: Always return 0 from s390_wiggle_split_folio() if successful s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful rust: add helper for mutex_trylock RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation ...
2025-06-02ftrace: Don't allocate ftrace module map if ftrace is disabledYe Bin
If ftrace is disabled, it is meaningless to allocate a module map. Add a check in allocate_ftrace_mod_map() to not allocate if ftrace is disabled. Link: https://lore.kernel.org/20250529111955.2349189-3-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-02ftrace: Fix UAF when lookup kallsym after ftrace disabledYe Bin
The following issue happens with a buggy module: BUG: unable to handle page fault for address: ffffffffc05d0218 PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS RIP: 0010:sized_strscpy+0x81/0x2f0 RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000 RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68 R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038 R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ftrace_mod_get_kallsym+0x1ac/0x590 update_iter_mod+0x239/0x5b0 s_next+0x5b/0xa0 seq_read_iter+0x8c9/0x1070 seq_read+0x249/0x3b0 proc_reg_read+0x1b0/0x280 vfs_read+0x17f/0x920 ksys_read+0xf3/0x1c0 do_syscall_64+0x5f/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue may happen as follows: (1) Add kprobe tracepoint; (2) insmod test.ko; (3) Module triggers ftrace disabled; (4) rmmod test.ko; (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed; ftrace_mod_get_kallsym() ... strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); ... The problem is when a module triggers an issue with ftrace and sets ftrace_disable. The ftrace_disable is set when an anomaly is discovered and to prevent any more damage, ftrace stops all text modification. The issue that happened was that the ftrace_disable stops more than just the text modification. When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses. When a module is removed, ftrace_release_mod() is called, and if ftrace_disable is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around and if kallsyms is called, it will call into this code and access the module memory that has already been freed as it will return: strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); Where the "mod" no longer exists and triggers a UAF bug. Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/ Cc: stable@vger.kernel.org Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing") Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-02rtmutex_api: provide correct extern functionsPaolo Bonzini
Commit fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") changed the set of functions that mutex.c defines when CONFIG_DEBUG_LOCK_ALLOC is set. - it removed the "extern" declaration of mutex_lock_killable_nested from include/linux/mutex.h, and replaced it with a macro since it could be treated as a special case of _mutex_lock_killable. It also removed a definition of the function in kernel/locking/mutex.c. - likewise, it replaced mutex_trylock() with the more generic mutex_trylock_nest_lock() and replaced mutex_trylock() with a macro. However, it left the old definitions in place in kernel/locking/rtmutex_api.c, which causes failures when building with CONFIG_RT_MUTEXES=y. Bring over the changes. Fixes: fb49f07ba1d9 ("locking/mutex: implement mutex_lock_killable_nest_lock") Reported-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-05-31sched/numa: add statistics of numa balance taskChen Yu
On systems with NUMA balancing enabled, it has been found that tracking task activities resulting from NUMA balancing is beneficial. NUMA balancing employs two mechanisms for task migration: one is to migrate a task to an idle CPU within its preferred node, and the other is to swap tasks located on different nodes when they are on each other's preferred nodes. The kernel already provides NUMA page migration statistics in /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched. However, it lacks statistics regarding task migration and swapping. Therefore, relevant counts for task migration and swapping should be added. The following two new fields: numa_task_migrated numa_task_swapped will be shown in /sys/fs/cgroup/{GROUP}/memory.stat, /proc/{PID}/sched and /proc/vmstat. Introducing both per-task and per-memory cgroup (memcg) NUMA balancing statistics facilitates a rapid evaluation of the performance and resource utilization of the target workload. For instance, users can first identify the container with high NUMA balancing activity and then further pinpoint a specific task within that group, and subsequently adjust the memory policy for that task. In short, although it is possible to iterate through /proc/$pid/sched to locate the problematic task, the introduction of aggregated NUMA balancing activity for tasks within each memcg can assist users in identifying the task more efficiently through a divide-and-conquer approach. As Libo Chen pointed out, the memcg event relies on the text names in vmstat_text, and /proc/vmstat generates corresponding items based on vmstat_text. Thus, the relevant task migration and swapping events introduced in vmstat_text also need to be populated by count_vm_numa_event(), otherwise these values are zero in /proc/vmstat. In theory, task migration and swap events are part of the scheduler's activities. The reason for exposing them through the memory.stat/vmstat interface is that we already have NUMA balancing statistics in memory.stat/vmstat, and these events are closely related to each other. Following Shakeel's suggestion, we describe the end-to-end flow/story of all these events occurring on a timeline for future reference: The goal of NUMA balancing is to co-locate a task and its memory pages on the same NUMA node. There are two strategies: migrate the pages to the task's node, or migrate the task to the node where its pages reside. Suppose a task p1 is running on Node 0, but its pages are located on Node 1. NUMA page fault statistics for p1 reveal its "page footprint" across nodes. If NUMA balancing detects that most of p1's pages are on Node 1: 1.Page Migration Attempt: The Numa balance first tries to migrate p1's pages to Node 0. The numa_page_migrate counter increments. 2.Task Migration Strategies: After the page migration finishes, Numa balance checks every 1 second to see if p1 can be migrated to Node 1. Case 2.1: Idle CPU Available If Node 1 has an idle CPU, p1 is directly scheduled there. This event is logged as numa_task_migrated. Case 2.2: No Idle CPU (Task Swap) If all CPUs on Node1 are busy, direct migration could cause CPU contention or load imbalance. Instead: The Numa balance selects a candidate task p2 on Node 1 that prefers Node 0 (e.g., due to its own page footprint). p1 and p2 are swapped. This cross-node swap is recorded as numa_task_swapped. Link: https://lkml.kernel.org/r/d00edb12ba0f0de3c5222f61487e65f2ac58f5b1.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/7ef90a88602ed536be46eba7152ed0d33bad5790.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Ayush Jain <Ayush.jain3@amd.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Libo Chen <libo.chen@oracle.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31sched/numa: fix task swap by skipping kernel threadsLibo Chen
Patch series "sched/numa: add statistics of numa balance task migration", v6. Introduce task migration and swap statistics in the following places: /sys/fs/cgroup/{GROUP}/memory.stat /proc/{PID}/sched /proc/vmstat These statistics facilitate a rapid evaluation of the performance and resource utilization of the target workload. This patch (of 2): Task swapping is triggered when there are no idle CPUs in task A's preferred node. In this case, the NUMA load balancer chooses a task B on A's preferred node and swaps B with A. This helps improve NUMA locality without introducing load imbalance between nodes. In the current implementation, B's NUMA node preference is not mandatory. That is to say, a kernel thread might be incorrectly chosen as B. However, kernel thread and user space thread that does not have mm are not supposed to be covered by NUMA balancing because NUMA balancing only considers user pages via VMAs. According to Peter's suggestion for fixing this issue, we use PF_KTHREAD to skip the kernel thread. curr->mm is also checked because it is possible that user_mode_thread() might create a user thread without an mm. As per Prateek's analysis, after adding the PF_KTHREAD check, there is no need to further check the PF_IDLE flag: : - play_idle_precise() already ensures PF_KTHREAD is set before adding : PF_IDLE : : - cpu_startup_entry() is only called from the startup thread which : should be marked with PF_KTHREAD (based on my understanding looking at : commit cff9b2332ab7 ("kernel/sched: Modify initial boot task idle : setup")) In summary, the check in task_numa_compare() now aligns with task_tick_numa(). Link: https://lkml.kernel.org/r/cover.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/43d68b356b25d124f0d222ebedf3859e86eefb9f.1748493462.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/cover.1748002400.git.yu.c.chen@intel.com Link: https://lkml.kernel.org/r/eaacc9c9bd37bac92d43a671867d85b2fdad3b06.1748002400.git.yu.c.chen@intel.com Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Libo Chen <libo.chen@oracle.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Tested-by: Ayush Jain <Ayush.jain3@amd.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Aubrey Li <aubrey.li@intel.com> Cc: "Chen, Tim C" <tim.c.chen@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Cc: Mel Gorman <mgorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31mm: rename page->index to page->__folio_indexMatthew Wilcox (Oracle)
All users of page->index have been converted to not refer to it any more. Update a few pieces of documentation that were missed and prevent new users from appearing (or at least make them easy to grep for). Link: https://lkml.kernel.org/r/20250514181508.3019795-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
2025-05-31Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...