linux-stable.git - Linux kernel stable tree

Age	Commit message	Author
2026-05-31	ring-buffer: Better comment the use of RB_MISSED_EVENTS	Steven Rostedt
	If the persistent ring buffer is detected on boot up to have a corrupted sub-buffer, that sub-buffer is cleared to zero and its commit value has the RB_MISSED_EVENTS bit set. That bit is to allow the "trace", "trace_pipe" and "trace_pipe_raw" files to know that events were dropped by outputting "[LOST EVENTS]". Only in this case does that bit get set in the writeable portion of the ring buffer. When events are dropped in the normal ring buffer, that information is stored in the cpu_buffer descriptor and the RB_MISSED_EVENTS bit is set in the buffer page at the time the page is consumed. It is never set in the writeable portion of the buffer. Add comments to describe this better as it can be confusing to know when the RB_MISSED_EVENTS bit is set in the commit portion of the buffer page. Link: https://lore.kernel.org/all/20260529001500.14178455a046a5cbc6180861@kernel.org/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://patch.msgid.link/20260528223738.41276c0e@fedora Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-06-01	liveupdate: Reference count incoming FLB data	David Matlack
	Increment the incoming FLB refcount in liveupdate_flb_get_incoming() so that the FLB structure cannot be freed while the caller is actively using it. Add an additional liveupdate_flb_put_incoming() function so the caller can explicitly indicate when it is done using the FLB data. During a Live Update, a subsystem might need to hold onto the incoming File-Lifecycle-Bound (FLB) data for an extended period, such as during device enumeration. Incrementing the reference count guarantees that the data remains valid and accessible until the subsystem releases it, preventing future use-after-free bugs. Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/r/20260423174032.3140399-3-dmatlack@google.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-06-01	liveupdate: Use refcount_t for FLB reference counts	David Matlack
	Use refcount_t instead of a raw integer to keep track of references on incoming and outgoing FLBs. Using refcount_t provides protection from overflow, underflow, and other issues. Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/r/20260423174032.3140399-2-dmatlack@google.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
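The get/put pairing and the overflow/underflow protection described in the two liveupdate refcount commits above can be illustrated with a minimal userspace sketch. Everything here is hypothetical: `struct example_ref` and its helpers are stand-ins, not the kernel's `refcount_t` API, and the sketch is single-threaded (the kernel type is atomic and also emits warnings).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a saturating reference counter. */
struct example_ref {
	unsigned int count;
};

#define EXAMPLE_REF_SATURATED UINT_MAX

/* Take a reference. Fails on a dead object (count == 0) instead of reviving it. */
static bool example_ref_get(struct example_ref *r)
{
	assert(r != NULL);
	if (r->count == 0)
		return false;
	/* Once saturated, the counter sticks: the object leaks rather than overflows. */
	if (r->count != EXAMPLE_REF_SATURATED)
		r->count++;
	return true;
}

/*
 * Drop a reference. Returns true only when the caller dropped the last
 * reference and must free the object. Underflow and saturated counters
 * are left untouched.
 */
static bool example_ref_put(struct example_ref *r)
{
	assert(r != NULL);
	if (r->count == 0 || r->count == EXAMPLE_REF_SATURATED)
		return false;
	return --r->count == 0;
}
```

A caller that needs the data for an extended period would call the get helper once, use the data, and call the put helper when done; the object is freed only by whoever sees the put return true.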
2026-06-01	liveupdate: add LIVEUPDATE_SESSION_GET_NAME ioctl	Luca Boccassi
	When userspace requests a session via the ioctl, it specifies a name and gets an FD, but there is no ioctl to go the other way and get the name given a LUO session FD. This is especially problematic when a userspace orchestrator wants to check which FDs it is handling for clients without manually scraping strings from procfs, or without procfs at all. Add an ioctl to simply get the name from an FD. Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Link: https://lore.kernel.org/r/20260429212221.814107-4-luca.boccassi@gmail.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-06-01	liveupdate: reject LIVEUPDATE_IOCTL_CREATE_SESSION with invalid name length	Luca Boccassi
	A session name must not be an empty string, and must not exceed the maximum size defined in the uapi header, including null termination. Fixes: 0153094d03df ("liveupdate: luo_session: add sessions support") Cc: stable@vger.kernel.org Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://lore.kernel.org/r/20260429212221.814107-2-luca.boccassi@gmail.com Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
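The rule in the commit above (non-empty, and NUL-terminated within the maximum size) can be sketched as a small check. The constant `EXAMPLE_SESSION_NAME_MAX`, the function name, and the choice of `-EINVAL` are assumptions made for illustration; the real limit lives in the liveupdate uapi header and the commit message does not say which error code is returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit; the real value is defined in the uapi header. */
#define EXAMPLE_SESSION_NAME_MAX 64

/*
 * Validate a name stored in a fixed-size buffer of buf_len bytes.
 * Returns 0 if the name is non-empty and NUL-terminated inside the
 * buffer, -EINVAL (an assumed error code) otherwise.
 */
static int example_check_session_name(const char *buf, size_t buf_len)
{
	const char *nul;

	assert(buf != NULL);
	nul = memchr(buf, '\0', buf_len);
	if (nul == NULL)	/* too long: no terminator fits in the buffer */
		return -EINVAL;
	if (nul == buf)		/* empty string */
		return -EINVAL;
	return 0;
}
```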
2026-06-01	kho: make preserved pages compatible with deferred struct page init	Evangelos Petrongonas
	When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, struct page initialization is deferred to parallel kthreads that run later in the boot process. During KHO restoration, kho_preserved_memory_reserve() writes metadata for each preserved memory region. However, if the struct page has not been initialized, this write targets uninitialized memory, potentially leading to errors like: BUG: unable to handle page fault for address: ... Fix this by introducing kho_get_preserved_page(), which ensures all struct pages in a preserved region are initialized by calling init_deferred_page() which is a no-op when the struct page is already initialized. Signed-off-by: Evangelos Petrongonas <epetron@amazon.de> Co-developed-by: Michal Clapinski <mclapinski@google.com> Signed-off-by: Michal Clapinski <mclapinski@google.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260423122538.140993-3-mclapinski@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-06-01	kho: fix deferred initialization of scratch areas	Michal Clapinski
	Currently, if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, kho_release_scratch() will initialize the struct pages and set migratetype of KHO scratch. Unless the whole scratch fits below first_deferred_pfn, some of that will be overwritten either by deferred_init_pages() or memmap_init_reserved_range(). To fix it, make memmap_init_range(), deferred_init_memmap_chunk() and __init_page_from_nid() recognize KHO scratch regions and set migratetype of pageblocks in those regions to MIGRATE_CMA. Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Michal Clapinski <mclapinski@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav (Google) <pratyush@kernel.org> Link: https://patch.msgid.link/20260423122538.140993-2-mclapinski@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-05-31	sched_ext: Guard BPF arena helper calls to fix 32-bit build	Tejun Heo
	BPF arena (kernel/bpf/arena.c) is compiled only on MMU && 64BIT, while SCHED_CLASS_EXT depends on BPF_SYSCALL && BPF_JIT && DEBUG_INFO_BTF with no 64BIT requirement. On a 32-bit arch with a BPF JIT, SCX builds while the arena helpers are absent, so the cid-form code's unconditional calls to bpf_prog_arena() and bpf_arena_map_kern_vm_start() fail to link: build_policy.o: undefined reference to `bpf_prog_arena' build_policy.o: undefined reference to `bpf_arena_map_kern_vm_start' Guard the three call sites with the same MMU && 64BIT condition that gates arena.o. A cid-form scheduler needs a BPF arena, which isn't available on such builds, so it can't run there regardless. cpu-form schedulers don't touch the arena and are unaffected. This is a quick workaround to get past the build errors. A fuller fix may make the whole cid-form path conditional on the same condition, or drop 32-bit support outright. Fixes: 0e2819cba977 ("sched_ext: Require an arena for cid-form schedulers") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202605310454.U9iByL2n-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202605310926.APXMc0RJ-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-31	bpf: fix BPF_PROG_QUERY OOB write and cgroup backward compat	Yuyang Huang
	BPF_PROG_QUERY writes back the 'query.revision' field unconditionally to userspace. If userspace passes a smaller 'bpf_attr' structure (e.g. 40 bytes, which was the layout before the addition of 'query.revision'), the kernel performs an out-of-bounds write. Fix this by propagating the user-provided attribute size 'uattr_size' down to the cgroup query handlers, and conditionally skipping writing the revision field to userspace when the provided buffer size is insufficient. query.revision in bpf_mprog_query is structurally identical to the cgroup case: a late tail field, written unconditionally. But the backward-compat hazard is not the same. The min-historical-size test is per command, and bpf_mprog_query only serves attach types that were born with revision in the struct: - tcx_prog_query -> BPF_TCX_INGRESS/EGRESS - netkit_prog_query -> BPF_NETKIT_PRIMARY/PEER tcx, netkit, the revision field, and bpf_mprog_query itself all landed in the same v6.6 merge window (053c8e1f235d added the mprog query API + revision; tcx in e420bed02507, netkit in 35dfaad7188c). There has never been a tcx/netkit BPF_PROG_QUERY userspace that doesn't know about revision. So for these commands the minimum legitimate struct already covers offset 56-64 — no old binary can be broken here. Contrast with cgroup: BPF_PROG_QUERY on cgroup attach types shipped in 2017; revision write-back was bolted on years later (120933984460). That path has a real population of pre-revision callers. Fixes: 120933984460 ("bpf: Implement mprog API on top of existing cgroup progs") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://lore.kernel.org/r/20260531075600.4058207-2-yuyanghuang@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
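The size-aware write-back described in the BPF_PROG_QUERY fix can be sketched in plain C. The struct layout, field offsets, and function names below are hypothetical and much smaller than the real `bpf_attr`; only the idea (write a late tail field only when the caller's buffer covers it) comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute layout: 'revision' is a tail field added later. */
struct example_query_attr {
	uint32_t target_fd;
	uint32_t attach_type;
	uint32_t query_flags;
	uint32_t prog_cnt;
	uint64_t revision;
};

#define example_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * Write results into a caller buffer of usize bytes.
 * Returns -1 if even the original fields do not fit, 0 if 'revision'
 * was skipped for an old (smaller) caller, 1 if it was written.
 */
static int example_query_writeback(void *ubuf, size_t usize,
				   uint32_t prog_cnt, uint64_t revision)
{
	char *p = ubuf;

	assert(ubuf != NULL);
	if (usize < example_offsetofend(struct example_query_attr, prog_cnt))
		return -1;
	memcpy(p + offsetof(struct example_query_attr, prog_cnt),
	       &prog_cnt, sizeof(prog_cnt));

	/* Old callers passed a struct that ends before 'revision'. */
	if (usize < example_offsetofend(struct example_query_attr, revision))
		return 0;
	memcpy(p + offsetof(struct example_query_attr, revision),
	       &revision, sizeof(revision));
	return 1;
}
```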
2026-05-30	Merge tag 'liveupdate-fixes-2026-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux	Linus Torvalds
	Pull liveupdate fixes from Mike Rapoport: "Two kexec handover regression fixes: - fix order calculation for kho_unpreserve_pages() to make sure that it matches the order calculation in kho_preserve_pages(). - fix math in calculation of KHO_TREE_MAX_DEPTH to make it work with 16KB pages" * tag 'liveupdate-fixes-2026-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: kho: fix order calculation for kho_unpreserve_pages() kho: fix KHO_TREE_MAX_DEPTH for non-4KB page sizes
2026-05-30	tracing/probes: Point the error offset correctly for eprobe argument error	Masami Hiramatsu (Google)
	Fix the error offset so that it points to the correct position for an eprobe argument error. In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser"), due to incorrect backward compatibility aimed at conforming to the test specifications, the error location was set to 0 when a non-existent formal parameter was specified for an eprobe. However, this should be corrected in both the test and the implementation to point to the correct error position. Link: https://lore.kernel.org/all/177967567399.209006.1451571244515632097.stgit@devnote2/ Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-29	cgroup/cpuset: Free sched domains on rebuild guard failure	Guopeng Zhang
	generate_sched_domains() returns sched-domain masks and optional attributes that are normally handed to partition_sched_domains(), which takes ownership of them. rebuild_sched_domains_locked() has a WARN guard after generate_sched_domains() and before partition_sched_domains() to avoid passing offline CPUs into the scheduler domain rebuild path. If that guard fires, the function currently returns directly without freeing the generated doms and attr. Free the generated sched-domain masks and attributes before returning from the guard failure path. Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29	workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present	Marco Crivellari
	Currently there are no checks to enforce the use of exactly one of WQ_PERCPU or WQ_UNBOUND. So act as follows: - if neither of them is present, set WQ_PERCPU - if both are present, remove WQ_PERCPU Along with this change, add a WARN_ONCE() so that code which still uses both or neither of them can be changed. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
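The two fallback rules in the workqueue commit above amount to a small flag normalization. The sketch below uses made-up bit values (`EX_WQ_UNBOUND`, `EX_WQ_PERCPU`) and a hypothetical function name; the real flags are defined in the kernel's workqueue header, and the warning the kernel emits is omitted here.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this example. */
#define EX_WQ_UNBOUND	(1u << 1)
#define EX_WQ_PERCPU	(1u << 8)

/* Ensure exactly one of EX_WQ_PERCPU / EX_WQ_UNBOUND is set. */
static unsigned int example_normalize_wq_flags(unsigned int flags)
{
	int percpu = !!(flags & EX_WQ_PERCPU);
	int unbound = !!(flags & EX_WQ_UNBOUND);

	if (!percpu && !unbound)
		flags |= EX_WQ_PERCPU;		/* neither: default to per-CPU */
	else if (percpu && unbound)
		flags &= ~EX_WQ_PERCPU;		/* both: unbound wins */

	/* Postcondition: exactly one of the two flags remains. */
	assert(!!(flags & EX_WQ_PERCPU) != !!(flags & EX_WQ_UNBOUND));
	return flags;
}
```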
2026-05-29	workqueue: Add warnings and fallback if system_{unbound}_wq is used	Marco Crivellari
	Currently many users have already transitioned to the newly introduced workqueues (system_percpu_wq, system_dfl_wq), but there are new users who still use the older system_wq and system_unbound_wq. This change tries to push this transition forward by warning when the old workqueues are used. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29	perf/ftrace: Fix WARNING in __unregister_ftrace_function	Rik van Riel
	perf_ftrace_function_unregister() unconditionally calls unregister_ftrace_function() without checking whether the ftrace_ops was ever successfully registered. This triggers a WARN_ON in __unregister_ftrace_function() when the ops doesn't have FTRACE_OPS_FL_ENABLED set. This can happen during perf_event_alloc() error cleanup when perf_trace_destroy() is called via __free_event() on an event whose ftrace_ops registration failed or was already torn down by perf_try_init_event()'s err_destroy path. The call path is: perf_event_alloc() error cleanup -> __free_event() -> event->destroy() [tp_perf_event_destroy] -> perf_trace_destroy() -> perf_trace_event_close() -> TRACE_REG_PERF_CLOSE -> perf_ftrace_function_unregister() -> unregister_ftrace_function() -> __unregister_ftrace_function() -> WARN_ON(!(ops->flags & FTRACE_OPS_FL_ENABLED)) Fix this by checking FTRACE_OPS_FL_ENABLED before attempting to unregister. If the ops is not enabled, just free the filter and return success. Link: https://patch.msgid.link/20260527111301.2d0d8256@fangorn Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-29	tracing: Disable KCOV instrumentation for trace_irqsoff.o	Karl Mehltretter
	When KCOV runs its boot selftest with whole-kernel instrumentation enabled, it sets current->kcov_mode to KCOV_MODE_TRACE_PC without installing a coverage area. Any instrumented code accepted as task-context coverage in that window dereferences current->kcov_area and crashes. On ARMv5 Versatile PB with CONFIG_KCOV_SELFTEST=y, CONFIG_KCOV_INSTRUMENT_ALL=y and CONFIG_IRQSOFF_TRACER=y, boot hits a NULL pointer fault during the selftest: kcov: running self test Internal error: Oops: 5 [#1] ARM PC is at __sanitizer_cov_trace_pc+0x4c/0x90 Kernel panic - not syncing: Fatal exception A diagnostic run showed the unwanted coverage comes from the IRQs-off tracer callbacks reached from ARM IRQ entry before hardirq context is visible to KCOV: __sanitizer_cov_trace_pc from tracer_hardirqs_off+0x18/0x1cc tracer_hardirqs_off from trace_hardirqs_off+0x34/0x54 trace_hardirqs_off from __irq_svc+0x58/0xb0 __irq_svc from kcov_init+0x7c/0xdc and similarly through tracer_hardirqs_on(). trace_preemptirq.o is already excluded because this tracing path can run from early interrupt code and produce coverage unrelated to syscall inputs. Exclude trace_irqsoff.o as well, instead of requiring users to turn off CONFIG_KCOV_INSTRUMENT_ALL=y, which is the default whole-kernel KCOV mode. With the exclusion in place, the same ARMv5 Versatile PB QEMU test boots through the KCOV selftest and reaches userspace. Tested on ARMv5 Versatile PB QEMU with CONFIG_KCOV_SELFTEST=y, CONFIG_KCOV_INSTRUMENT_ALL=y and CONFIG_IRQSOFF_TRACER=y. Link: https://patch.msgid.link/20260525170428.67211-1-kmehltretter@gmail.com Assisted-by: Codex:gpt-5 Signed-off-by: Karl Mehltretter <kmehltretter@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-29	tracing: Turn hist_elt_data field_var_str into a flexible array	Rosen Penev
	The field_var_str array was allocated separately via kcalloc() with its length already known at elt_data allocation time. Convert it to a flexible array member and fold the two allocations into a single kzalloc_flex(), reordering hist_trigger_elt_data_alloc() so n_str is computed and bounds-checked before the struct allocation. hist_elt_data is only reached through tracing_map_elt::private_data (a void *), never embedded, so adding a FAM imposes no tail-position constraint on any enclosing struct. Added __counted_by for extra runtime analysis. Link: https://patch.msgid.link/20260522214407.18120-1-rosenp@gmail.com Assisted-by: Claude:Opus-4.7 Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
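Folding a separately allocated array into a flexible array member, as the hist_elt_data commit above describes, looks roughly like this in userspace C. The struct, the bound `EXAMPLE_STR_MAX`, and the helper name are hypothetical; the kernel change uses `kzalloc_flex()` and `__counted_by`, which this sketch replaces with plain `calloc()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical upper bound on the number of string slots. */
#define EXAMPLE_STR_MAX 16

struct example_elt_data {
	unsigned int n_str;
	char *strs[];	/* flexible array member: must be the last field */
};

/*
 * Compute and bounds-check the element count first, then allocate the
 * header and the array in a single allocation.
 */
static struct example_elt_data *example_elt_alloc(unsigned int n_str)
{
	struct example_elt_data *d;
	unsigned int i;

	if (n_str > EXAMPLE_STR_MAX)
		return NULL;

	d = calloc(1, sizeof(*d) + (size_t)n_str * sizeof(d->strs[0]));
	if (d == NULL)
		return NULL;

	d->n_str = n_str;
	for (i = 0; i < n_str; i++)
		d->strs[i] = NULL;	/* explicit, portable null pointers */
	assert(d->n_str <= EXAMPLE_STR_MAX);
	return d;
}
```

One `free(d)` releases both the header and the array, which is the practical benefit of the single allocation.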
2026-05-29	tracing/osnoise: Array printk init and cleanup	Crystal Wood
	None of the calls to trace_array_printk_buf() will do anything if we don't initialize the buffer on instance creation (unless some other tracer called it), so do that. Add an osnoise_print() function to facilitate adding debug prints (without tainting). Use trace_array_printk() instead of trace_array_printk_buf(), as we're only writing to the main buffer (of a non-main instance) anyway -- and trace_array_printk_buf() skips the check to make sure we're not printing to the global instance. Link: https://patch.msgid.link/20260511223035.1475676-1-crwood@redhat.com Signed-off-by: Crystal Wood <crwood@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-29	sched/fair: Use rq_clock() in update_tg_load_avg() rate-limit	Rik van Riel
	update_tg_load_avg() is called once per leaf cfs_rq from the __update_blocked_fair() walk that runs inside the NOHZ idle-balance softirq, and again from update_load_avg() with UPDATE_TG. Its first operation after the trivial early-outs is unconditionally:
		now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
		if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
			return;
	Jakub ran into a system where nohz_idle_balance() was taking 75% of a CPU (which is handling network traffic and doing many irq_exit_cpu calls), with 35% of that CPU spent in update_load_avg, and 17% of the CPU in sched_clock_cpu(), reading the TSC. In a quick synthetic test, it looks like this patch reduces the CPU use of sched_balance_update_blocked_averages by about 20%. Switch the rate-limit to read rq_clock(rq_of(cfs_rq)) instead. This eliminates the rdtsc, and uses a fairly fresh timestamp, because all callers of update_tg_load_avg() and clear_tg_load_avg() hold rq->lock and have called update_rq_clock(rq) within microseconds:
		caller                                     pre-state
		__update_blocked_fair                      encloser did update_rq_clock(rq)
		update_load_avg's three UPDATE_TG sites    under rq->lock after enqueue/dequeue/update_curr
		attach_/detach_entity_cfs_rq               preceded by update_load_avg(...)
		clear_tg_load_avg via offline path         rq_clock_start_loop_update(rq) upfront
	so rq->clock is fresh at every call. Since cfs_rqs are per-CPU per-task_group, cfs_rq->last_update_tg_load_avg is always compared against the same rq's clock; no cross-rq drift. Signed-off-by: Rik van Riel <riel@surriel.com> Assisted-by: Claude (Anthropic) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260527110250.6a91718d@fangorn
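The rate-limit change in the sched/fair commit above can be modeled with a cached per-runqueue timestamp in place of a hardware clock read. The types and names below are hypothetical simplifications, not the scheduler's real structures; the sketch assumes the caller has already refreshed `rq->clock`, as the commit message states for all call sites.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NSEC_PER_MSEC 1000000ULL

/* Hypothetical stand-ins for the runqueue and the per-group cfs_rq. */
struct example_rq {
	uint64_t clock;		/* refreshed by the caller under the rq lock */
};

struct example_cfs_rq {
	struct example_rq *rq;
	uint64_t last_update;
	unsigned int updates;
};

/* Returns true if the (expensive) update ran, false if rate-limited. */
static bool example_update_tg_load_avg(struct example_cfs_rq *cfs_rq)
{
	uint64_t now;

	assert(cfs_rq != NULL && cfs_rq->rq != NULL);
	now = cfs_rq->rq->clock;	/* cached value, no clock hardware read */
	if (now - cfs_rq->last_update < EX_NSEC_PER_MSEC)
		return false;

	cfs_rq->last_update = now;
	cfs_rq->updates++;
	return true;
}
```

Because each cfs_rq is only ever compared against its own runqueue's clock, the sketch never mixes timestamps from different clocks.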
2026-05-29	sched_ext: Auto-register/unregister dl_server reservations	Andrea Righi
	Commit cd959a3562050d ("sched_ext: Add a DL server for sched_ext tasks") introduced an ext_server deadline server to protect sched_ext tasks from fair/RT starvation, mirroring the existing fair_server. Currently, both servers reserve their 50ms/1000ms bandwidth at boot, regardless of whether a BPF scheduler is loaded. Unused bandwidth is still reclaimed at runtime by other classes, but the static reservation prevents the RT class from implicitly using that headroom when one of the two classes is guaranteed to be empty. A sysadmin can work around this by writing /sys/kernel/debug/sched/{fair,ext}_server/cpu*/runtime, but that requires manual action and not all systems expose debugfs. A better approach is to make server bandwidth reservations dynamic: only the scheduling policy that is currently active should register its reservation, while the inactive one should not artificially hold capacity (keeping both reservations only when the BPF scheduler is running in partial mode):
	+----------------------------------------------+-------------+------------+
	| BPF scheduler state                          | fair server | ext server |
	+----------------------------------------------+-------------+------------+
	| not loaded (default boot)                    | reserved    | none       |
	| loaded full mode (!SCX_OPS_SWITCH_PARTIAL)   | none        | reserved   |
	| loaded partial mode (SCX_OPS_SWITCH_PARTIAL) | reserved    | reserved   |
	+----------------------------------------------+-------------+------------+
	To achieve this, introduce an "attached/detached" state for each deadline server, so the kernel can decide whether a server's bandwidth should be accounted in global bandwidth tracking. At boot, the system starts with only the fair server contributing to bandwidth accounting. When a BPF scheduler is enabled, the ext server is attached and may replace or complement the fair server depending on whether full or partial mode is used. 
When sched_ext is disabled, the system restores the previous deadline bandwidth values and behavior. The transition logic ensures that switching between scheduling modes is consistent and reversible, without losing runtime configuration or requiring manual intervention. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260526164420.638711-2-arighi@nvidia.com
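The reservation table in the dl_server commit above maps each BPF scheduler state to which servers hold a bandwidth reservation. A minimal sketch of that mapping follows; the enum, struct, and function names are hypothetical and only encode the three rows of the table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the three BPF scheduler states from the table. */
enum example_scx_state {
	EX_SCX_NOT_LOADED,	/* default boot */
	EX_SCX_FULL,		/* loaded, !SCX_OPS_SWITCH_PARTIAL */
	EX_SCX_PARTIAL,		/* loaded, SCX_OPS_SWITCH_PARTIAL */
};

struct example_reservations {
	bool fair;	/* fair server attached to bandwidth accounting */
	bool ext;	/* ext server attached to bandwidth accounting */
};

static struct example_reservations
example_dl_reservations(enum example_scx_state state)
{
	struct example_reservations r;

	assert(state >= EX_SCX_NOT_LOADED && state <= EX_SCX_PARTIAL);
	/* Fair tasks exist unless the BPF scheduler took over everything. */
	r.fair = (state != EX_SCX_FULL);
	/* The ext server only matters once a BPF scheduler is loaded. */
	r.ext = (state != EX_SCX_NOT_LOADED);
	return r;
}
```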
2026-05-29	sched/deadline: Reject debugfs dl_server writes for offline CPUs	Andrea Righi
	Writing runtime or period via the per-CPU dl_server debugfs files (/sys/kernel/debug/sched/{fair,ext}_server/cpu*/{runtime,period}) on an offline CPU can trigger two distinct kernel issues: 1) Divide-by-zero in dl_server_apply_params(): Oops: divide error: 0000 [#1] SMP NOPTI RIP: 0010:dl_server_apply_params+0x239/0x3a0 Call Trace: sched_server_write_common.isra.0+0x21a/0x3c0 full_proxy_write+0x78/0xd0 vfs_write+0xe7/0x6e0 Both __dl_sub() and __dl_add() divide by cpus internally, which can be 0 once the CPU has been removed from any active root-domain span (this has been latent since the debugfs interface was introduced). 2) WARN_ON_ONCE in dl_server_start(): WARNING: kernel/sched/deadline.c:1805 at dl_server_start+0x232/0x270 Commit ee6e44dfe6e5 ("sched/deadline: Stop dl_server before CPU goes offline") added this check to catch enqueueing the server on an offline rq. There's no meaningful semantics for re-configuring the per-CPU dl_server bandwidth while the CPU is offline, so simply reject the write with -EBUSY so userspace gets a clear error. Closes: https://lore.kernel.org/all/20260526092228.3B6891F00A3A@smtp.kernel.org/ Fixes: d741f297bcea ("sched/fair: Fair server interface") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: abaci-kreproducer <abaci@linux.alibaba.com> Link: https://patch.msgid.link/20260526100502.575774-1-arighi@nvidia.com
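The fix in the sched/deadline commit above is an early rejection: a write for an offline CPU returns -EBUSY before any bandwidth arithmetic runs. The sketch below models that with a hypothetical per-CPU struct and write helper; only the -EBUSY behavior is taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU server state. */
struct example_cpu_server {
	bool online;
	uint64_t runtime;
	uint64_t period;
};

/*
 * Apply new parameters. Offline CPUs are rejected up front, so later
 * code never divides by a zero CPU count or starts a server on an
 * offline runqueue.
 */
static int example_server_write(struct example_cpu_server *s,
				uint64_t runtime, uint64_t period)
{
	assert(s != NULL);
	if (!s->online)
		return -EBUSY;

	s->runtime = runtime;
	s->period = period;
	return 0;
}
```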
2026-05-29	sched/topology: Provide arch_llc_mask for cache aware scheduling	Shrikanth Hegde
	Venkat reported a kernel panic at boot on next-20260522. Git bisect pointed to b5ea300a17e3 ("sched/cache: Make LLC id continuous"). The stack trace points to llc_mask being NULL: NIP [c000000000e58504] _find_first_bit+0x44/0x130 LR [c000000000e58500] _find_first_bit+0x40/0x130 Call Trace: build_sched_domains+0xad8/0xe50 sched_init_smp+0xa8/0x164 kernel_init_freeable+0x250/0x370 ret_from_kernel_user_thread+0x14/0x1c On powerpc, cpu_coregroup_mask is available only when the underlying hardware supports coregroups. In a shared LPAR, a QEMU guest, power9, etc., coregroups aren't supported. In such cases llc_mask was dereferenced while NULL, leading to the panic. On powerpc, the LLC is at the SMT core level, so the assumption that the coregroup (MC) domain points to the LLC is wrong. Provide a way for archs to say where their LLC is if it is not at the MC domain. Fixes: b5ea300a17e3 ("sched/cache: Make LLC id continuous") Closes: https://lore.kernel.org/all/51154de7-3700-4cb4-82f2-1b3a8fa427f7@linux.ibm.com/ Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Co-developed-by: Chen, Yu C <yu.c.chen@intel.com> Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://patch.msgid.link/20260529075712.1181039-1-sshegde@linux.ibm.com
2026-05-28	kcov: allow simultaneous KCOV_ENABLE/KCOV_REMOTE_ENABLE	Jann Horn
	Allow the same userspace thread to simultaneously collect normal coverage in syscall context (KCOV_ENABLE) and remote coverage of asynchronous work created by the thread (KCOV_REMOTE_ENABLE). With this, remote KCOV coverage becomes useful for generic fuzzing and not just fuzzing of specific data injection interfaces. This requires that the task_struct::kcov_* fields are separated into ones that are used by the task that generates coverage, and ones that are used by the task that requested remote coverage. To split this up: - Split task_struct::kcov into kcov and kcov_remote. kcov_task_exit() now has to clean up both separately. - Only use task_struct::kcov_mode on the task that generates coverage. - Only reset task_struct::kcov_handle on the task that requested remote coverage. After this change, fields used by the task that generates coverage are: - kcov_mode - kcov_size - kcov_area - kcov - kcov_sequence - kcov_softirq Fields used by the task that requested remote coverage are: - kcov_remote - kcov_handle [jannh@google.com: remove unused constant KCOV_MODE_REMOTE, per Dmitry] Link: https://lore.kernel.org/20260515-kcov-simultaneous-remote-v2-1-56fde1cfa509@google.com [jannh@google.com: update documentation on remote coverage collection] Link: https://lore.kernel.org/20260519-kcov-docs-v1-1-5bb22f4cb20c@google.com [jannh@google.com: move and reword sentence on simultaneous normal/remote collection] Link: https://lore.kernel.org/20260520-kcov-docs-v2-1-819f78778763@google.com Link: https://lore.kernel.org/20260505-kcov-simultaneous-remote-v1-1-a670ba7cefd2@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	kcov: refactor common handle ID into kcov_common_handle_id	Jann Horn
	Store common handle IDs in "struct kcov_common_handle_id", which consumes no space in non-KCOV builds. This cleanup removes #ifdef boilerplate code from subsystems that integrate with KCOV (in particular in usbip_common.h and skbuff.h, see the diffstat). This should also make it easier to add KCOV remote coverage to more subsystems in the future. Link: https://lore.kernel.org/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Hongren (Zenithal) Zheng <i@zenithal.me> Cc: Jann Horn <jannh@google.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Valentina Manea <valentina.manea.m@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	taskstats: retain dead thread stats in TGID queries	Yiyang Chen
	Patch series "taskstats: fix TGID dead-thread stat retention", v3. This series fixes a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. The first patch keeps the cached TGID aggregate used for dead threads in step with the fields already accumulated for live threads, and also fixes the final TGID exit notification emitted when group_dead is true. The second patch adds a kselftest that verifies TGID CPU stats do not regress after a worker thread exits and has been reaped. This patch (of 2): fill_stats_for_tgid() builds TGID stats from two sources: the cached aggregate in signal->stats and a scan of the live threads in the group. However, fill_tgid_exit() only accumulates delay accounting into signal->stats. This means that once a thread exits, TGID queries lose the fields that fill_stats_for_tgid() adds for live threads. This gap was introduced incrementally by two earlier changes that extended fill_stats_for_tgid() but did not make the corresponding update to fill_tgid_exit(): - commit 8c733420bdd5 ("taskstats: add e/u/stime for TGID command") added ac_etime, ac_utime, and ac_stime to the TGID query path. - commit b663a79c1915 ("taskstats: add context-switch counters") added nvcsw and nivcsw to the TGID query path. As a result, those fields were accounted for live threads in TGID queries, but were dropped from the cached TGID aggregate after thread exit. The final TGID exit notification emitted when group_dead is true also copies that cached aggregate, so it loses the same fields. Factor the per-task TGID accumulation into tgid_stats_add_task() and use it in both fill_stats_for_tgid() and fill_tgid_exit(). This keeps the cached aggregate used for dead threads aligned with the live-thread accumulation used by TGID queries. 
Link: https://lore.kernel.org/cover.1776094300.git.cyyzero16@gmail.com Link: https://lore.kernel.org/abd2a15d33343636ab5ba43d540bcfe508bd66c7.1776094300.git.cyyzero16@gmail.com Fixes: 8c733420bdd5 ("taskstats: add e/u/stime for TGID command") Fixes: b663a79c1915 ("taskstats: add context-switch counters") Signed-off-by: Yiyang Chen <cyyzero16@gmail.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
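The taskstats fix above hinges on using one accumulation helper for both live threads and exiting threads, so the cached aggregate never lags behind the live scan. A minimal sketch of that structure follows; the struct and function names are hypothetical and cover only a few of the fields named in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the per-task counters named in the commit. */
struct example_stats {
	uint64_t utime;
	uint64_t stime;
	uint64_t nvcsw;
	uint64_t nivcsw;
};

/* Single place where per-task values are folded into an aggregate. */
static void example_stats_add(struct example_stats *sum,
			      const struct example_stats *task)
{
	assert(sum != NULL && task != NULL);
	sum->utime += task->utime;
	sum->stime += task->stime;
	sum->nvcsw += task->nvcsw;
	sum->nivcsw += task->nivcsw;
}

/* Thread exit: fold the dying thread into the cached aggregate. */
static void example_thread_exit(struct example_stats *cached,
				const struct example_stats *task)
{
	example_stats_add(cached, task);
}

/* TGID query: cached aggregate of dead threads plus all live threads. */
static struct example_stats
example_tgid_query(const struct example_stats *cached,
		   const struct example_stats *live, size_t n_live)
{
	struct example_stats total;
	size_t i;

	assert(cached != NULL);
	total = *cached;
	for (i = 0; i < n_live; i++)
		example_stats_add(&total, &live[i]);
	return total;
}
```

With the shared helper, a query made after a thread exits returns the same totals as one made just before, which is the property the accompanying kselftest checks for CPU stats.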
2026-05-28	kasan: skip HW tagging for all kernel thread stacks	Muhammad Usama Anjum
	HW-tag KASAN never checks kernel stacks because stack pointers carry the match-all tag, so setting/poisoning tags is pure overhead. - Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that uses it skips tagging (fork path plus arch users) - Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap stacks. - When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags are enabled. Software KASAN is unchanged; this only affects tag-based KASAN. Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	bpf: arena: use page_ref_count() instead of page_mapped() in arena_free_pages()	David Hildenbrand (Arm)
	Pages that BPF arena code maps are allocated through bpf_map_alloc_pages(), which does not allocate folios but pages. In the future, pages will not have a mapcount, only folios will. Converting the code to use folios and rely on folio_mapped() sounds like the wrong approach. Should BPF arena code allocate folios and use folio_mapped() here? But likely we would not want to use folios here longterm, as we don't really need folio information. Hard to tell. But in the meantime, we can simply use the page refcount instead, as a heuristic whether the page might be mapped to user space and we would want to try zapping it, so we can get rid of page_mapped(). Page allocation will give us a page with a refcount of 1. Any user space mapping adds a page reference. While there can be references from other subsystems (e.g., GUP), in the common case for this test here relying on the page count is good enough. Link: https://lore.kernel.org/20260427-page_mapped-v1-2-e89c3592c74c@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Harry Yoo <harry@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
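The refcount heuristic described above can be sketched in plain C. This is only an illustration: `struct page_sketch` and `page_might_be_mapped()` are hypothetical names, and the exact comparison used in arena_free_pages() is not quoted in the commit message, so the `> 1` test is an assumption derived from "allocation gives a refcount of 1, any user space mapping adds a reference".

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount matters here. */
struct page_sketch {
	int refcount;
};

/*
 * Heuristic: a freshly allocated page has refcount 1. Anything above that
 * may be a user space mapping (or another reference such as GUP), so the
 * caller would try zapping it. False positives are possible by design.
 */
static bool page_might_be_mapped(const struct page_sketch *page)
{
	return page->refcount > 1;
}
```

A page with only its allocation reference is skipped, while a page with any extra reference is treated as possibly mapped.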
2026-05-28	bpf: Fix race between bpf_map_new_fd() and close_fd()	Leon Hwang
	Because there is a time gap between bpf_map_new_fd() and close_fd(), a concurrent thread can close the new fd and open a new, unrelated file with the exact same fd number. The subsequent close_fd() might then inadvertently close the unrelated file. To avoid this, finalize the log before security_bpf_map_create(). To achieve that, move bpf_get_file_flag(), security_bpf_map_create(), bpf_map_alloc_id(), and bpf_map_new_fd() from __map_create() to map_create(), and rename __map_create() to map_create_alloc(). Then, in order to reuse the map and token when all checks pass in map_create_alloc(), pass "struct bpf_map " and "struct bpf_token " to map_create_alloc(). Fixes: 49f9b2b2a18c ("bpf: Add syscall common attributes support for map_create") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260521142909.95818-1-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-28	ring-buffer: Show persistent buffer dropped events in trace_pipe file	Steven Rostedt
	When the persistent ring buffer is validated on boot up, if a subbuffer is deemed invalid, it resets the buffer and continues. Have the code preserve the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header and pass that back so that the trace_pipe file can show the missed events like the trace file does. For example:
	  <...>-1242 [005] d.... 4429.120116: page_fault_user: address=0x7ffaebb6e728 ip=0x7ffaeb9d4960 error_code=0x7
	  <...>-1242 [005] ..... 4429.120124: mm_page_alloc: page=00000000055254f3 pfn=0x1373bd order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
	  <...>-1242 [005] d..2. 4429.120132: tlb_flush: pages:1 reason:local MM shootdown (3)
	  CPU:5 [LOST EVENTS]
	  <...>-1242 [005] d.... 4429.120661: page_fault_user: address=0x55ba7c2d0944 ip=0x55ba7c20cd02 error_code=0x7
	  <...>-1242 [005] ..... 4429.120669: mm_page_alloc: page=0000000005a02500 pfn=0x12b6e4 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
	  <...>-1242 [005] d..2. 4429.120680: tlb_flush: pages:1 reason:local MM shootdown (3)
	Link: https://patch.msgid.link/20260522171052.156419479@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Show persistent buffer dropped events in trace file	Steven Rostedt
	When the persistent ring buffer is validated on boot up, if a subbuffer is deemed invalid, it resets the buffer and continues. Currently, these lost events are not shown in the trace file output. Have the trace iterator look for subbuffers that have RB_MISSED_EVENTS set and set the iter->missed_events flag when one is detected. The trace file will then show "LOST EVENTS" when it reads across a subbuffer that was corrupted and invalidated. For example:
	  <...>-1016 [005] ...1. 6230.660403: preempt_disable: caller=__mod_memcg_state+0x1c8/0x200 parent=__mod_memcg_state+0x1c8/0x200
	  CPU:5 [LOST EVENTS]
	  <...>-1016 [005] ..... 6230.660673: kmem_cache_alloc: call_site=__anon_vma_prepare+0x1ad/0x1e0 ptr=000000006e40294c name=anon_vma bytes_req=200 bytes_alloc=208 gfp_flags=GFP_KERNEL node=-1 accounted=true
	Link: https://patch.msgid.link/20260522171052.006276604@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Have dropped subbuffers be persistent across reboots	Steven Rostedt
	When the persistent ring buffer detects a corrupted subbuffer, it will zero its size and report dropped pages in the dmesg, then it continues normally. But if a reboot happens without clearing or restarting tracing on the persistent ring buffer, the next boot will show no pages are dropped. If the persistent ring buffer is still the same, then it should still report dropped pages so the user knows that the buffer has missing events. Add the RB_MISSED_EVENTS flag to the commit value of the subbuffer so that the next boot will still show that pages were dropped. Link: https://patch.msgid.link/20260522171051.860780286@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
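A minimal sketch of how a flag can share a word with the sub-buffer's commit size, as the ring-buffer commits above describe. The names prefixed `SKETCH_` and the bit position (bit 31) are assumptions made for illustration; the commit messages do not state the layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: top bit flags missed events, the rest holds the size. */
#define SKETCH_MISSED_EVENTS	(1u << 31)
#define SKETCH_SIZE_MASK	(~SKETCH_MISSED_EVENTS)

/* Invalidate a corrupted sub-buffer: size goes to zero, the flag stays set. */
static uint32_t sketch_invalidate_commit(void)
{
	return SKETCH_MISSED_EVENTS;
}

/* Number of valid bytes recorded in the commit word. */
static uint32_t sketch_commit_size(uint32_t commit)
{
	return commit & SKETCH_SIZE_MASK;
}

/* True when a reader should print "[LOST EVENTS]" for this sub-buffer. */
static bool sketch_commit_missed(uint32_t commit)
{
	return (commit & SKETCH_MISSED_EVENTS) != 0;
}
```

Because the flag lives in the stored commit word rather than in a runtime descriptor, it survives a reboot as long as the persistent memory itself is preserved.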
2026-05-28	ring-buffer: Cleanup buffer_data_page related code	Masami Hiramatsu (Google)
	Code cleanup related to buffer_data_page for readability, which includes: - Introduce rb_data_page_commit() and rb_data_page_size() - Use 'dpage' for buffer_data_page, instead of 'bpage' because 'bpage' is used for buffer_page. Link: https://patch.msgid.link/20260522171051.722645963@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Cleanup persistent ring buffer validation	Masami Hiramatsu (Google)
	Clean up the rb_meta_validate_events() function to make it easier to read. This includes the following cleanups: - Introduce rb_validatation_state to hold working variables in validation. - Move repeated validation state updates into rb_validate_buffer(). - Move reader_page injection code outside of rb_meta_validate_events(). Link: https://patch.msgid.link/20260522171051.577231395@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Show commit numbers in buffer_meta file	Masami Hiramatsu (Google)
	In addition to the index number, show the commit numbers of each data page in the per_cpu buffer_meta file. This is useful for understanding the current status of the persistent ring buffer. (Note that this file is shown only for persistent ring buffer and its backup instance) Link: https://patch.msgid.link/20260522171051.424411323@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Add persistent ring buffer invalid-page inject test	Masami Hiramatsu (Google)
	Add a self-corrupting test for the persistent ring buffer. This will inject an erroneous value to some sub-buffer pages (where the index is even or multiples of 5) in the persistent ring buffer when the kernel panics, and checks whether the number of detected invalid pages and the total entry_bytes are the same as the recorded values after reboot. This ensures that the kernel can correctly recover a partially corrupted persistent ring buffer after a reboot or panic. The test only runs on the persistent ring buffer whose name is "ptracingtest". The user has to fill it with events before a kernel panic. To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT and add the following kernel cmdline:
	  reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace panic=1
	Run the following commands after the 1st boot:
	  cd /sys/kernel/tracing/instances/ptracingtest
	  echo 1 > tracing_on
	  echo 1 > events/enable
	  sleep 3
	  echo c > /proc/sysrq-trigger
	After panic message, the kernel will reboot and run the verification on the persistent ring buffer, e.g.
	  Ring buffer meta [2] invalid buffer page detected
	  Ring buffer meta [2] is from previous boot! (318 pages discarded)
	  Ring buffer testing [2] invalid pages: PASSED (318/318)
	  Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
	Link: https://patch.msgid.link/20260522171051.260140328@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
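The page-selection rule mentioned above ("index is even or multiples of 5") can be illustrated as follows. Both function names are hypothetical, and the real test may exclude pages that this sketch does not (for example special pages), so the count is only what the stated rule alone would give.

```c
#include <stdbool.h>

/* Selection rule as stated in the commit message: even, or a multiple of 5. */
static bool sketch_inject_selected(unsigned int idx)
{
	return idx % 2 == 0 || idx % 5 == 0;
}

/* Count how many of nr_pages sub-buffer indexes the rule would corrupt. */
static unsigned int sketch_count_selected(unsigned int nr_pages)
{
	unsigned int idx, count = 0;

	for (idx = 0; idx < nr_pages; idx++)
		if (sketch_inject_selected(idx))
			count++;
	return count;
}
```

Recording such a count before the panic and comparing it with the number of invalid pages found after reboot is the kind of check the "invalid pages: PASSED" line reports.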
2026-05-28	ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer	Masami Hiramatsu (Google)
	Skip invalid sub-buffers when rewinding the persistent ring buffer instead of stopping the rewind. The skipped buffers are cleared. To ensure the rewinding stops at the unused page, this also clears buffer_data_page::time_stamp when tracing resets the buffer. This allows us to distinguish unused pages from empty pages. Link: https://patch.msgid.link/20260522171051.091265852@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> [ SDR: Have reader_page still get evaluated if header_page fails ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer	Masami Hiramatsu (Google)
	Skip invalid sub-buffers when validating the persistent ring buffer instead of discarding the entire ring buffer. Only skipped buffers are invalidated (cleared). If the cache data in memory fails to be synchronized during a reboot, the persistent ring buffer may become partially corrupted, but other sub-buffers may still contain readable event data. Only discard the subbuffers that are found to be corrupted. Link: https://lore.kernel.org/all/20260520185018.051228084@kernel.org/ Link: https://patch.msgid.link/20260522171050.914418536@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> [SDR: Fixed max_loops in rb_iter_peek() as well ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	bpf: Cache build IDs in sleepable stackmap path	Ihor Solodrai
	Stack traces often contain adjacent IPs from the same VMA or from different VMAs backed by the same ELF file. Cache the last successfully parsed build id together with the resolved VMA range and backing file so the sleepable build id path can avoid repeated VMA locking and file parsing in common cases. Suggested-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260525223948.1920986-4-ihor.solodrai@linux.dev
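A rough sketch of the "remember the last hit" cache described above, assuming a single-entry cache keyed on the VMA address range. All names are hypothetical, and the backing-file comparison that the commit also mentions is left out to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BUILD_ID_LEN 20

/* Single-entry cache: last successfully parsed build ID and its VMA range. */
struct sketch_build_id_cache {
	bool valid;
	uint64_t vm_start;	/* inclusive */
	uint64_t vm_end;	/* exclusive */
	unsigned char build_id[SKETCH_BUILD_ID_LEN];
};

/* Return true and copy the cached ID if ip falls inside the cached range. */
static bool sketch_cache_lookup(const struct sketch_build_id_cache *c,
				uint64_t ip, unsigned char *out)
{
	if (!c->valid || ip < c->vm_start || ip >= c->vm_end)
		return false;
	memcpy(out, c->build_id, SKETCH_BUILD_ID_LEN);
	return true;
}

/* Remember a freshly parsed build ID for the range [start, end). */
static void sketch_cache_store(struct sketch_build_id_cache *c,
			       uint64_t start, uint64_t end,
			       const unsigned char *build_id)
{
	c->vm_start = start;
	c->vm_end = end;
	memcpy(c->build_id, build_id, SKETCH_BUILD_ID_LEN);
	c->valid = true;
}
```

On a hit, the caller can skip both the VMA lookup and the file parse for that IP, which is where the saving for adjacent IPs comes from.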
2026-05-28	bpf: Avoid faultable build ID reads under mm locks	Ihor Solodrai
	Sleepable build ID parsing can block in __kernel_read() [1], so the stackmap sleepable path must not call it while holding mmap_lock or a per-VMA read lock. The issue and the fix are conceptually similar to a recent procfs patch [2]. A similar VMA locking pattern has already been used in PROCMAP_QUERY [3]. Resolve each covered VMA with a stable read-side reference, preferring lock_vma_under_rcu() and falling back to mmap_read_trylock() only long enough to acquire the VMA read lock. Take a reference to the backing file, drop the VMA lock, and then parse the build ID through (sleepable) build_id_parse_file(). We have to use mmap_read_trylock() (and give up on failure) in this context because taking mmap_read_lock() is generally unsafe on code paths reachable from BPF programs [4], and may lead to deadlocks. [1] https://lore.kernel.org/all/20251218005818.614819-1-shakeel.butt@linux.dev/ [2] https://lore.kernel.org/all/20260128183232.2854138-1-andrii@kernel.org/ [3] https://lore.kernel.org/all/20250808152850.2580887-1-surenb@google.com/ [4] https://lore.kernel.org/bpf/2895ecd8-df1e-4cc0-b9f9-aef893dc2360@linux.dev/ Fixes: d4dd9775ec24 ("bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpers") Suggested-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260525223948.1920986-3-ihor.solodrai@linux.dev
2026-05-28	bpf: Factor out stack_map build ID helpers	Ihor Solodrai
	Factor out helpers from stack_map_get_build_id_offset() in preparation for adding a sleepable build ID resolution path: stack_map_build_id_set_ip(), stack_map_build_id_offset(), and stack_map_build_id_set_valid(). While here, refactor stack_map_get_build_id_offset(): * use continue-driven control flow in the main loop and remove build_id_valid label * update prev_vma and prev_build_id on the fall-back-to-IP branch so the cache reflects the actual VMA seen on the previous IP [1] * guard fetch_build_id() with vma_is_anonymous() [2] to skip parse attempts that would otherwise fail the ELF magic check [1] https://lore.kernel.org/bpf/CAEf4Bzac9uWWqBvzH0iFzKvJcq3vxscZ3pKm0sUHmN-F-z9wVQ@mail.gmail.com/ [2] https://lore.kernel.org/bpf/226398c1ff3f2b686c0aeb010408d85fb15df13f9ff60a045bee31e79b9e41e9@mail.kernel.org/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/bpf/20260525223948.1920986-2-ihor.solodrai@linux.dev
2026-05-28	timers/migration: Update stale @online doc to @available	Zhan Xusheng
	Commit 8312cab5ff47 ("timers/migration: Rename 'online' bit to 'available'") renamed the 'online' field of struct tmigr_cpu to 'available'. The kernel doc comment above the struct still describes the old field name. Update it to reflect the actual field name and use the 'available' wording in the description. Fixes: 8312cab5ff47 ("timers/migration: Rename 'online' bit to 'available'") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260526022106.1302279-1-zhanxusheng@xiaomi.com
2026-05-28	cgroup: pair max limit READ_ONCE() with WRITE_ONCE()	Ren Tamura
	cgroup.max.descendants and cgroup.max.depth are shown through seq_file. Their show callbacks read cgrp->max_descendants and cgrp->max_depth with READ_ONCE(), respectively. The corresponding write callbacks update the same scalar fields while holding the cgroup lock, but the seq_file show path does not serialize against those stores. This leaves the lockless show-side loads annotated with READ_ONCE(), while the corresponding stores remain plain stores. Use WRITE_ONCE() for the updates so the intended lockless access is marked consistently on both sides. This does not change locking, ordering, or user-visible semantics. Assisted-by: OpenAI-Codex:gpt-5.5 Signed-off-by: Ren Tamura <ren.tamura.oss@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
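The pairing described above might look like this in a userspace sketch. The two macros are simplified stand-ins for the kernel's READ_ONCE() and WRITE_ONCE() (assuming GCC or Clang for `__typeof__`), and `struct cgroup_limits` with its two helpers is a hypothetical cut-down of the real cgroup structure.

```c
/*
 * Simplified userspace stand-ins; the kernel's definitions are more
 * elaborate. The volatile access keeps the compiler from tearing,
 * fusing or re-reading the load or store.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in holding only the two limit fields. */
struct cgroup_limits {
	int max_descendants;
	int max_depth;
};

/* Show side: lockless load, as in the seq_file show callbacks. */
static int sketch_show_max_depth(struct cgroup_limits *cgrp)
{
	return READ_ONCE(cgrp->max_depth);
}

/* Write side: the caller is assumed to hold the cgroup lock already. */
static void sketch_set_max_depth(struct cgroup_limits *cgrp, int depth)
{
	WRITE_ONCE(cgrp->max_depth, depth);
}
```

Marking both sides documents that the field is intentionally accessed without serialization between reader and writer.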
2026-05-28	dma-contiguous: simplify numa cma area handling	Feng Tang
	Currently, there are two kernel command-line ways to set up a NUMA CMA area, "cma_pernuma=" and "numa_cma=", and two CMA arrays as well, although technically they do not differ. Robin suggested cleaning up the code and using only one array [1], given "the apparent intent that users only want one _or_ the other". Simplify the code by using a single array to store the NUMA CMA areas. In the rare case that a user sets both command-line parameters at the same time, let the per-node size setting 'numa_cma=' take priority over the global 'cma_pernuma=' setting. Link[1]: https://lore.kernel.org/lkml/43c5301c-fe6a-41e4-9482-ccfc7b62f2a7@arm.com/ Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260525015111.6267-1-feng.tang@linux.alibaba.com
2026-05-27	audit: fix recursive locking deadlock in audit_dupe_exe()	Ricardo Robaina
	A deadlock occurs in the audit subsystem when duplicating executable-related rules. When a file is moved (e.g., via do_renameat2()), the VFS layer locks the parent directory (I_MUTEX_PARENT), which synchronously triggers an fsnotify_move event. If an existing executable audit rule matches the file being moved, the audit subsystem catches this event and calls audit_dupe_exe() to duplicate the watch and update the rule. Then, audit_alloc_mark() would call kern_path_parent() to resolve the path, leading to a blind attempt to acquire the exact same I_MUTEX_PARENT lock already held by the task, resulting in the following recursive locking deadlock:
	  ============================================
	  WARNING: possible recursive locking detected
	  6.12.0-55.27.1.el10_0.x86_64+debug #1 Not tainted
	  --------------------------------------------
	  mv/5099 is trying to acquire lock:
	  ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3}, at: __kern_path_locked+0x10a/0x2f0
	  but task is already holding lock:
	  ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3}, at: lock_two_directories+0x13f/0x2b0
	  other info that might help us debug this:
	  Possible unsafe locking scenario:
	        CPU0
	        ----
	    lock(&inode->i_sb->s_type->i_mutex_dir_key/1);
	    lock(&inode->i_sb->s_type->i_mutex_dir_key/1);
	  *** DEADLOCK ***
	  May be due to missing lock nesting notation
	  6 locks held by mv/5099:
	  #0: ffff888112a9c440 (sb_writers#13) at: do_renameat2+0x34c/0xbc0
	  #1: ffff888112a9c790 (&type->s_vfs_rename_key#3) at: do_renameat2+0x415/0xbc0
	  #2: ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1) at: lock_two_directories+0x13f/0x2b0
	  #3: ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/5) at: lock_two_directories+0x175/0x2b0
	  #4: ffffffffb3a1fb10 (&fsnotify_mark_srcu) at: fsnotify+0x454/0x28a0
	  #5: ffffffffaf886230 (audit_filter_mutex) at: audit_update_watch+0x36/0x11e0
	  stack backtrace:
	  Call Trace:
	  <TASK>
	  dump_stack_lvl+0x6f/0xb0
	  print_deadlock_bug.cold+0xbd/0xca
	  validate_chain+0x83a/0xf00
	  __lock_acquire+0xcac/0x1d20
	  lock_acquire.part.0+0x11b/0x360
	  down_write_nested+0x9f/0x230
	  __kern_path_locked+0x10a/0x2f0
	  kern_path_locked+0x26/0x40
	  audit_alloc_mark+0xfb/0x4f0
	  audit_dupe_exe+0x6c/0xe0
	  audit_dupe_rule+0x6c2/0xc00
	  audit_update_watch+0x4cc/0x11e0
	  audit_watch_handle_event+0x12c/0x1b0
	  send_to_group+0x5d0/0x8b0
	  fsnotify+0x615/0x28a0
	  fsnotify_move+0x1d8/0x630
	  vfs_rename+0xdcd/0x1df0
	  do_renameat2+0x9d4/0xbc0
	  __x64_sys_renameat+0x192/0x260
	  do_syscall_64+0x92/0x180
	  entry_SYSCALL_64_after_hwframe+0x76/0x7e
	  RIP: 0033:0x7f0491fe8c4e
	  Code: 0f 1f 40 00 48 8b 15 c1 e1 16 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 08 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 89
	  RSP: 002b:00007ffc7210bf38 EFLAGS: 00000246 ORIG_RAX: 0000000000000108
	  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0491fe8c4e
	  RDX: 0000000000000003 RSI: 00007ffc7210e6c8 RDI: 00000000ffffff9c
	  RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
	  R10: 00005575eb2dae2a R11: 0000000000000246 R12: 00005575eb2dae2a
	  R13: 00007ffc7210e6c8 R14: 0000000000000003 R15: 00000000ffffff9c
	  </TASK>
	The aforementioned deadlock can be consistently reproduced by running the script below:
	  audit-dupe-exe-deadlock.sh
	  --------------------------
	  #!/bin/bash
	  auditctl -D
	  mkdir -p /tmp/foo
	  touch /tmp/file
	  auditctl -a always,exit -F exe=/tmp/file -F path=/tmp/file -S all -k dr
	  mv /tmp/file /tmp/foo/file
	  rm -Rf /tmp/foo
	This patch fixes the issue by introducing struct audit_watch_ctx to pass the fsnotify event context down to audit_alloc_mark(). By utilizing the already-resolved directory inode provided by the event, we bypass the kern_path_parent() path resolution entirely, safely avoiding the recursive lock. Furthermore, it explicitly allows duplicate fsnotify marks (allow_dups = 1) during the rename update, allowing the new rule's mark to safely coexist with the old rule's mark until the old rule is freed.
	P.S.: This issue was identified and reproduced during a comprehensive code coverage analysis of the audit subsystem. The full report is available at the link below: https://people.redhat.com/rrobaina/audit-code-coverage-analysis.pdf
	P.P.S: With the permission of both Ricardo and Nathan, I've squashed a fixup patch from Nathan that addresses a compile time error when CONFIG_AUDITSYSCALL=n.
	Cc: stable@kernel.org Fixes: 34d99af52ad4 ("audit: implement audit by executable") Acked-by: Waiman Long <longman@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> [PM: move link metadata into the msg, apply fix from NC] Signed-off-by: Paul Moore <paul@paul-moore.com>
2026-05-27	bpf: Fix bpf_arena_handle_page_fault() redefinition without CONFIG_BPF_SYSCALL	Tejun Heo
	On configs with CONFIG_BPF=y but CONFIG_BPF_SYSCALL=n (e.g. arm multi_v7_defconfig), kernel/bpf/core.c defines a __weak bpf_arena_handle_page_fault() while bpf_defs.h already supplies a static inline stub for it, causing a redefinition error. Build the __weak definition only under CONFIG_BPF_SYSCALL, matching the bpf_defs.h declaration and the CONFIG_BPF_SYSCALL-gated strong definition in arena.c. Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page") Reported-by: Mark Brown <broonie@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20260527192632.2109419-1-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-27	cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation	Sun Shaojie
	When sibling CPU exclusion occurs, a partition's user_xcpus may contain CPUs that were never actually granted to it. These CPUs are present in user_xcpus(cs) but not in cs->effective_xcpus. The partcmd_update path in update_parent_effective_cpumask() uses user_xcpus(cs) (via the local variable xcpus) to compute the addmask (CPUs to return to parent) and delmask (CPUs to request from parent). This is incorrect: 1) When newmask removes a CPU that was previously excluded by a sibling, addmask incorrectly includes that CPU and tries to return it to the parent even though the partition never actually owned it, causing CPU overlap with sibling partitions and triggering warnings in generate_sched_domains(). 2) When newmask adds a previously excluded CPU that is now available, delmask fails to request it from the parent because user_xcpus(cs) already includes it. Fix this by using cs->effective_xcpus instead of user_xcpus(cs) in all partcmd_update paths that calculate addmask or delmask, including the PERR_NOCPUS error handling paths. 
	Reproducers:
	Example 1 - Removing a sibling-excluded CPU incorrectly returns it:
	  # cd /sys/fs/cgroup
	  # echo "0-1" > a1/cpuset.cpus
	  # echo "root" > a1/cpuset.cpus.partition
	  # echo "0-2" > b1/cpuset.cpus
	  # echo "root" > b1/cpuset.cpus.partition
	  # echo "2" > b1/cpuset.cpus
	  # cat cpuset.cpus.effective
	  # Actual: 0-1,3 Expected: 3
	Example 2 - Expanding to a previously excluded CPU fails to request it:
	  # cd /sys/fs/cgroup
	  # echo "0-1" > a1/cpuset.cpus
	  # echo "root" > a1/cpuset.cpus.partition
	  # echo "0-2" > b1/cpuset.cpus
	  # echo "root" > b1/cpuset.cpus.partition
	  # echo "member" > a1/cpuset.cpus.partition
	  # echo "1-2" > b1/cpuset.cpus
	  # cat cpuset.cpus.effective
	  # Actual: 0-1,3 Expected: 0,3
	Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Suggested-by: Zhang Guopeng <zhangguopeng@kylinos.cn> Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
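The two reproducers can be worked through with plain bitmasks. This is a simplified sketch, not the kernel code: the type and function names are hypothetical, bit n stands for CPU n, and the real update_parent_effective_cpumask() applies further intersections with parent masks that are omitted here.

```c
#include <stdint.h>

typedef uint32_t sketch_cpumask;	/* bit n set means CPU n is in the mask */

/* CPUs the partition hands back to its parent: owned but not in newmask. */
static sketch_cpumask sketch_addmask(sketch_cpumask owned, sketch_cpumask newmask)
{
	return owned & ~newmask;
}

/* CPUs the partition requests from its parent: in newmask but not owned. */
static sketch_cpumask sketch_delmask(sketch_cpumask owned, sketch_cpumask newmask)
{
	return newmask & ~owned;
}
```

In Example 1, b1 has user_xcpus 0-2 (0x7) but effective_xcpus 2 (0x4). With newmask 2 (0x4), using user_xcpus as "owned" gives an addmask of 0x3, so CPUs 0-1 are wrongly returned to the parent; using effective_xcpus gives 0. In Example 2, with newmask 1-2 (0x6), user_xcpus gives a delmask of 0, so CPU 1 is never requested, while effective_xcpus gives 0x2.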
2026-05-27	workqueue: drop spurious '*' from print_worker_info() fn declaration	Breno Leitao
	print_worker_info() declares its local 'fn' as work_func_t * but worker->current_func has type work_func_t (a function pointer). The extra level of indirection is wrong and only happens to be harmless today because every supported Linux architecture has sizeof(work_func_t) == sizeof(work_func_t *): copy_from_kernel_nofault() reads the correct number of bytes by accident, and %ps still resolves the printed address because the stored value is the function address regardless of declared type. On any future ABI where sizeof(void (*)()) differs from sizeof(void *), the nofault copy would transfer the wrong number of bytes and the subsequent %ps would print an incorrect address. Match the field type so the intent is explicit and the code does not silently rely on equal pointer sizes. Fixes: 3d1cb2059d93 ("workqueue: include workqueue info when printing debug dump of a worker task") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
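A small userspace sketch of the type mismatch described above. `struct worker_sketch` and `sketch_snapshot_func()` are hypothetical, the work_func_t typedef is reproduced here as an assumption about its shape, and memcpy() stands in for copy_from_kernel_nofault().

```c
#include <stddef.h>
#include <string.h>

struct work_struct;				/* opaque for this sketch */
typedef void (*work_func_t)(struct work_struct *work);

/* Hypothetical cut-down worker: only the field under discussion. */
struct worker_sketch {
	work_func_t current_func;
};

static void sketch_demo_fn(struct work_struct *work)
{
	(void)work;
}

/*
 * Copy the field into a local of the SAME type. Declaring the local as
 * 'work_func_t *' would copy sizeof(work_func_t *) bytes instead, which
 * only matches when data and function pointers have equal size.
 */
static work_func_t sketch_snapshot_func(const struct worker_sketch *worker)
{
	work_func_t fn = NULL;

	memcpy(&fn, &worker->current_func, sizeof(fn));
	return fn;
}
```

With the matching declaration, `sizeof(fn)` is by construction the size of the field being copied, so no assumption about pointer sizes is needed.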
2026-05-27	sched_ext: idle: Fix errno loss in scx_idle_init()	Cheng-Yang Chou
	|| is a boolean operator, so any nonzero (error) return short-circuits to 1 rather than the actual errno. The caller in scx_init() logs and propagates this value, so the wrong code reaches upper layers. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
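The errno loss is easy to reproduce outside the kernel. In this sketch the init helpers are hypothetical stand-ins for the steps chained in scx_idle_init(), and the "fixed" variant is one possible shape of the fix, not necessarily the one the patch uses.

```c
#include <errno.h>

/* Hypothetical init steps: the first succeeds, the second fails. */
static int sketch_step_ok(void)
{
	return 0;
}

static int sketch_step_fail(void)
{
	return -ENOMEM;
}

/* Buggy pattern: '||' evaluates to 0 or 1, so -ENOMEM collapses to 1. */
static int sketch_init_buggy(void)
{
	return sketch_step_ok() || sketch_step_fail();
}

/* Possible fix: test each step and return its error code unchanged. */
static int sketch_init_fixed(void)
{
	int ret;

	ret = sketch_step_ok();
	if (ret)
		return ret;
	return sketch_step_fail();
}
```

The buggy variant still reports "something failed", but a caller that logs or propagates the value sees 1 instead of -ENOMEM.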
2026-05-26	audit: fix removal of dangling executable rules	Ricardo Robaina
	When an audited executable is deleted from the disk, its dentry becomes negative. Any later attempt to delete the associated audit rule will lead to audit_alloc_mark() encountering this negative dentry and immediately aborting, returning -ENOENT. This early abort prevents the subsystem from allocating the temporary fsnotify mark needed to construct the search key, meaning the kernel cannot find the existing rule in its own lists to delete it. This leaves a dangling rule in memory, resulting in the following error while attempting to delete the rule:
	  # ./audit-dupe-exe-deadlock.sh
	  No rules
	  Error deleting rule (No such file or directory)
	  There was an error while processing parameters
	  # auditctl -l
	  -a always,exit -S all -F exe=/tmp/file -F path=/tmp/file -F key=dr
	  # auditctl -D
	  Error deleting rule (No such file or directory)
	  There was an error while processing parameters
	This patch fixes this issue by removing the d_really_is_negative() check. By doing so, a dummy mark can be successfully generated for the deleted path, which allows the audit subsystem to properly match and flush the dangling rule. Cc: stable@kernel.org Fixes: 76a53de6f7ff ("VFS/audit: introduce kern_path_parent() for audit") Acked-by: Waiman Long <longman@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2026-05-26	audit: use 'unsigned int' instead of 'unsigned'	Ricardo Robaina
	Address checkpatch.pl warning below, across the audit subsystem: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Minor cleanup, no functional changes. Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>