linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
36 hours	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Alexei Starovoitov: - Fix bpf_throw() and global subprog combination (Kumar Kartikeya Dwivedi) - Fix out of bounds access in BPF interpreter (Yazhou Tang) - Fix potential out of bounds access in inner per-cpu array map (Guannan Wang) - Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: fix off-by-one in emit_signature_match jump offset bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature selftests/bpf: Cover global subprog exception leaks bpf: Check global subprog exception paths bpf: make bpf_session_is_return() reference optional bpf: Use array_map_meta_equal for percpu array inner map replacement selftests/bpf: Add test for large offset bpf-to-bpf call bpf: Fix s16 truncation for large bpf-to-bpf call offsets bpf: Fix out-of-bounds read in bpf_patch_call_args()
3 days	Merge tag 'sched_ext-for-7.1-rc4-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Spurious WARN in ops_dequeue() racing with concurrent dispatch - Self-deadlock between scheduler disable and a concurrent sub-sched enable * tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
3 days	Merge tag 'cgroup-for-7.1-rc4-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two rstat fixes: - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called with an unchecked user-supplied cpu - Over-strict NMI guard after the recent switch to try_cmpxchg left sparc and ppc64 unable to queue rstat updates from NMI" * tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: relax NMI guard after switch to try_cmpxchg cgroup/rstat: validate cpu before css_rstat_cpu() access
4 days	Merge tag 'dma-mapping-7.1-2026-05-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two minor updates for the DMA-mapping code, mainly fixing some rare corner cases (Petr Tesarik, Jianpeng Chang)" * tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: move dma_map_resource() sanity check into debug code dma-direct: fix use of max_pfn
4 days	Merge tag 'trace-v7.1-rc4' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Avoid NULL return from hist_field_name() The function hist_field_name() is directly passed to a strcat() which does not handle "NULL" characters. Return a zero length string when size is greater than the limit. This is used only to output already created histograms and no field currently is greater than the limit. But it should still not return NULL. - Do not call map->ops->elt_free() on allocation failure When elt_alloc() fails, it should not call the map->ops->elt_free() function if it exists, as that function may not be able to handle the free on allocation failures. The ->elt_free() should only be called when elt_alloc() succeeds. * tag 'trace-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Do not call map->ops->elt_free() if elt_alloc() fails tracing: Avoid NULL return from hist_field_name() on truncation
4 days	Merge tag 'trace-ringbuffer-v7.1-rc4' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix reporting MISSED EVENTS in trace iterator When the "trace" file is read with tracing enabled, if the writer were to pass the iterator reader, it resets, sets a "missed_events" flag and continues. The tracing output checks for missed events and if there are some, it prints out "[LOST EVENTS]" to let the user know events were dropped. But the clearing of the missed_events happened when the tracing system queried the ring buffer iterator about missed events. This was premature as the ring buffer is per CPU, and the tracing code reads all the CPU buffers and checks for missed events when it is read. If the CPU iterator that had missed events isn't printed next, the output for the LOST EVENTS is lost. Clear the missed_events flag when the iterator moves to the next event and not when the missed_events flag is queried. Also clear it on reset. - Flush and stop the persistent ring buffer on panic On panic the persistent ring buffer is used to debug what caused the panic. But on some architectures, it requires flushing the memory from cache, otherwise, the ring buffer persistent memory may not have the last events and this could also cause the ring buffer to be corrupted on the next boot. - Fix nr_subbufs initialization in simple_ring_buffer_init_mm The remote simple ring buffer meta data nr_subbufs is initialized too early and gets cleared later on, making it zero and not reflect the actual number of sub-buffers. - Fix unload_page for simple_ring_buffer init rollback On error, the pages loaded need to be unloaded. To unload a page it is expected that: page = load_page(va); -> unload_page(page). But the code was doing: unload_page(va) and not unload_page(page). - Create output file from cmd_check_undefined The check for undefined symbols checks if the file .o.checked exists and if so it skips doing the work. But the .o.checked file never was created making every build do the work even when it was already done previously. * tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Create output file from cmd_check_undefined tracing: Fix unload_page for simple_ring_buffer init rollback tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm() ring-buffer: Flush and stop persistent ring buffer on panic ring-buffer: Fix reporting of missed events in iterator
4 days	sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()	Samuele Mariotti
	ops_dequeue() can race with finish_dispatch() and spuriously trigger the "queued task must be in BPF scheduler's custody" warning. ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire() and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY is set. The two reads are not atomic w.r.t. a concurrent finish_dispatch() running on another CPU: CPU 1 CPU 2 ===== ===== dequeue_task_scx() ops_dequeue() opss = read_acquire(ops_state) = SCX_OPSS_QUEUED finish_dispatch() cmpxchg ops_state: SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds] dispatch_enqueue(SCX_DSQ_GLOBAL, SCX_ENQ_CLEAR_OPSS) call_task_dequeue() p->scx.flags &= ~SCX_TASK_IN_CUSTODY WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_IN_CUSTODY)) /* opss is stale: QUEUED, * but task already claimed */ set_release(ops_state, SCX_OPSS_NONE) The race has been observed via two distinct call chains: the most common goes through sched_setaffinity(), a rarer variant through sched_change_begin(). For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE (intentional, to avoid concurrent non-atomic RMW of p->scx.flags against ops_dequeue()). The window between those two writes is exactly what ops_dequeue() observes as "QUEUED without custody". The observed state is not actually inconsistent, it just means CPU 1 has already claimed the task and the QUEUED value held by CPU 2 is stale. Re-read ops_state in that case; the next read is guaranteed to return SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED would require re-enqueue under p's rq lock, which CPU 2 holds. Changes in v2: - Use READ_ONCE() for p->scx.flags to ensure fresh reads and prevent compiler reordering in the lockless path - Add cpu_relax() to reduce power consumption and improve performance during the spin-wait - Use unlikely() to optimize branch prediction for the common case - Expand the in-code comment to document the race condition and bounded retry guarantee Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics") Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Tejun Heo <tj@kernel.org>
5 days	tracing: Do not call map->ops->elt_free() if elt_alloc() fails	Masami Hiramatsu (Google)
	In paths where tracing_map_elt_alloc() failed to allocate objects, the map->ops->elt_alloc() call was never successful. In this case, map->ops->elt_free() should not be called. Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com Cc: stable@vger.kernel.org Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rosen Penev <rosenp@gmail.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map") Link: https://patch.msgid.link/177933895460.108746.5396070821443932634.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	tracing: Create output file from cmd_check_undefined	Thomas Weißschuh
	As the output file is currently never created, the check will run every time, even if the inputs have not changed. Create an empty output file which allows make to skip the execution when it is not necessary. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20260520-tracing-ringbuffer-check-v1-1-d979cfab1338@weissschuh.net Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer") Fixes: 58b4bd18390e ("tracing: Adjust cmd_check_undefined to show unexpected undefined symbols") Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	tracing: Fix unload_page for simple_ring_buffer init rollback	Vincent Donnefort
	The unload_page callback expects the return value of load_page() as its argument: ret = load_page(va); unload(ret). Fix the rollback code in simple_ring_buffer_init_mm() where the descriptor's VA is used instead of the loaded page address. Link: https://patch.msgid.link/20260512141614.1759430-1-vdonnefort@google.com Fixes: 635923081c79 ("tracing: load/unload page callbacks for simple_ring_buffer") Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()	David Carlier
	nr_subbufs in the ring buffer metadata is always initialized to zero because it is assigned from cpu_buffer->nr_pages before the page initialization loop has run. While nr_subbufs is not currently read by the kernel, it should reflect the actual buffer geometry in the meta page for correctness. Move the assignment after the page loop so that cpu_buffer->nr_pages holds the final count. Link: https://patch.msgid.link/20260512135420.99194-1-devnexen@gmail.com Fixes: 34e5b958bdad ("tracing: Introduce simple_ring_buffer") Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	ring-buffer: Flush and stop persistent ring buffer on panic	Masami Hiramatsu (Google)
	On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	ring-buffer: Fix reporting of missed events in iterator	Steven Rostedt
	When tracing is active while reading the trace file, if the iterator reading the buffer detects that the writer has passed the iterator head, it will reset and set a "missed events" flag. This flag is passed to the output processing to show the user that events were missed: CPU:4 [LOST EVENTS] The problem is that the flag is reset after it is checked in ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU ring buffers and it will check if they are dropped when figuring out which buffer to print next. This prematurely clears the missed_events flag if the CPU buffer with the missed events is not the one that is printed next. On the iteration where the CPU buffer with the missed events is printed, the check if it had missed events would return false and the output does not show that events were missed. Do not reset the missed_events flag when checking if there were missed events, but instead clear it when moving the iterator head to the next event. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260520220801.4fd09d13@fedora Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	tracing: Avoid NULL return from hist_field_name() on truncation	David Carlier
	hist_field_name() returns "" everywhere except the fully-qualified VAR_REF/EXPR case, where snprintf() truncation returns NULL early and bypasses the bottom NULL->"" guard. Callers don't expect NULL: strcat(expr, hist_field_name(field, 0)) at trace_events_hist.c:1758 and the strcmp() in the sort-key match loop at :4804 both deref it. system and event_name are bounded by MAX_EVENT_NAME_LEN, but the field name on a VAR_REF is kstrdup'd from a histogram variable name parsed out of the trigger string and has no length cap, so a long enough var name in a fully qualified reference can reach the truncation path. Keep the length check but leave field_name as "" on overflow. Link: https://patch.msgid.link/20260508195747.25492-1-devnexen@gmail.com Fixes: 5ec1d1e97de1 ("tracing: Rebuild full_name on each hist_field_name() call") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 days	cgroup: rstat: relax NMI guard after switch to try_cmpxchg	Cunlong Li
	Commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") used this_cpu_cmpxchg() for the lockless insertion, and therefore required both ARCH_HAVE_NMI_SAFE_CMPXCHG and ARCH_HAS_NMI_SAFE_THIS_CPU_OPS in the NMI guard: on archs without the latter, this_cpu_cmpxchg() falls back to "local_irq_save() + plain cmpxchg", and local_irq_save() cannot mask NMIs. Commit 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") later replaced this_cpu_cmpxchg() with plain try_cmpxchg() to fix cross-CPU lockless-list corruption, but left the NMI guard untouched. After that switch, css_rstat_updated() no longer performs any this_cpu_*() RMW operations and only relies on the arch having NMI-safe cmpxchg, so ARCH_HAS_NMI_SAFE_THIS_CPU_OPS is no longer required in the guard. Relax the guard accordingly so that archs which have HAVE_NMI and ARCH_HAVE_NMI_SAFE_CMPXCHG but not ARCH_HAS_NMI_SAFE_THIS_CPU_OPS (e.g. sparc, powerpc on PPC64/BOOK3S) can benefit from the existing CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC path. Without this, the css is never queued in NMI on those archs, and the atomics staged by account_{slab,kmem}_nmi_safe() are not drained by flush_nmi_stats(). Fixes: 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") Signed-off-by: Cunlong Li <shenxiaogll@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
6 days	Merge tag 'rcu-fixes.v7.1-20260519a' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fixes from Boqun Feng: "Fix a regression introduced by commit 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when non-preemptible"). SRCU may queue works on CPUs that are "possible" but never have been online. In such a case, the work callbacks may not be executed until the corresponding CPU gets online, and as the callbacks accumulates, workqueue lockups will fire. Fix this by avoiding queuing works on CPUs that have never been online" * tag 'rcu-fixes.v7.1-20260519a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: srcu: Don't queue workqueue handlers to never-online CPUs
6 days	bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature	KP Singh
	__bpf_dynptr_data() can return NULL (FILE dynptrs, any non-contiguous backing). bpf_verify_pkcs7_signature() forwards the pointer to verify_pkcs7_signature() unchecked, causing a NULL deref in asn1_ber_decoder() reachable from a sleepable BPF LSM at lsm.s/bpf. NULL-check both pointers and reject with -EINVAL. Mirrors the guards already in kernel/bpf/crypto.c. Fixes: 865b0566d8f1 ("bpf: Add bpf_verify_pkcs7_signature() kfunc") Reported-by: Xianrui Dong <dongxianrui1@gmail.com> Signed-off-by: KP Singh <kpsingh@kernel.org> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20260520024059.313468-1-kpsingh@kernel.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
7 days	cgroup/rstat: validate cpu before css_rstat_cpu() access	Qing Ming
	css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org>
7 days	srcu: Don't queue workqueue handlers to never-online CPUs	Paul E. McKenney
	While an srcu_struct structure is in the midst of switching from CPU-0 to all-CPUs state, it can attempt to invoke callbacks for CPUs that have never been online. Worse yet, it can attempt in invoke callbacks for CPUs that never will be online, even including imaginary CPUs not in cpu_possible_mask. This can cause hangs on s390, which is not set up to deal with workqueue handlers being scheduled on such CPUs. This commit therefore causes Tree SRCU to refrain from queueing workqueue handlers on CPUs that have not yet (and might never) come online. Because callbacks are not invoked on CPUs that have not been online, it is an error to invoke call_srcu(), synchronize_srcu(), or synchronize_srcu_expedited() on a CPU that is not yet fully online. However, it turns out to be less code to redirect the callbacks from too-early invocations of call_srcu() than to warn about such invocations. This commit therefore also redirects callbacks queued on not-yet-fully-online CPUs to the boot CPU. Reported-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 61bbcfb50514 ("srcu: Push srcu_node allocation to GP when non-preemptible") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Samir <samir@linux.ibm.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Boqun Feng <boqun@kernel.org>
8 days	dma-mapping: move dma_map_resource() sanity check into debug code	Jianpeng Chang
	dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. On ARM64 with SPARSEMEM (128MB section granularity), MMIO addresses that share a section with RAM will falsely trigger the WARN_ON_ONCE and cause dma_map_resource() to return DMA_MAPPING_ERROR. This causes a WARNING on Raspberry Pi 4 during spi_bcm2835 probe because the SPI FIFO register (0xfe204004) falls in the same sparsemem section as the end of RAM (0xf8000000-0xfbffffff), both in section 31 (0xf8000000-0xffffffff). Move the sanity check from dma_map_resource() into debug_dma_map_phys() and replace the unreliable pfn_valid() with pfn_valid() && !PageReserved(), which correctly identifies actual usable RAM without false positives for MMIO regions that happen to have struct pages. Since dma_map_resource() is dma_map_phys(DMA_ATTR_MMIO), the check applies equally to both APIs. Any non-reserved page represents kernel memory to a sufficient degree that using DMA_ATTR_MMIO on it is almost certainly wrong and risks breaking coherency on non-coherent platforms. ZONE_DEVICE pages used for PCI P2P DMA (MEMORY_DEVICE_PCI_P2PDMA) have PageReserved set, so they will not trigger a false positive. The check no longer blocks the mapping and uses err_printk() to integrate with dma-debug filtering. Fixes: f7326196a781 ("dma-mapping: export new dma_*map_phys() interface") Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260513072209.1486986-1-jianpeng.chang.cn@windriver.com
8 days	sched_ext: Fix deadlock between scx_root_disable() and concurrent forks	Tejun Heo
	scx_root_disable() enters SCX_DISABLING before it grabs scx_enable_mutex to clear __scx_switched_all and scx_switching_all. task_should_scx() short-circuits on DISABLING, so forks in that window land on fair while next_active_class() still skips fair - the new tasks stall. This can deadlock the disable path itself: scx_alloc_and_add_sched() runs under scx_enable_mutex and creates a helper kthread; if that new kthread is one of the stalled fair tasks, the mutex holder waits forever and scx_root_disable() can never make progress. Only sub-sched support exposes this, since sub-sched enables are the only path where scx_alloc_and_add_sched() can race the root's disable. Move the DISABLING check after @scx_switching_all. @scx_switching_all serves as a proxy for __scx_switched_all, so while it's set, forks keep going to scx. Once cleared, DISABLING applies normally. v2: Reword in-source comment and description. (Andrea) Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
8 days	Merge tag 'trace-v7.1-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add more functions to the remote allowed list randconfig found more functions that are allowed for the remote code for s390 and arm. Add them to the allowed list. - Fix remote_test error path If one of the simple ring buffers fails to load, the code is supposed to rollback its initialized buffers. Instead of rolling back the buffers for the failed load, it uses the global variable and rolls back all the successfully loaded buffers. * tag 'trace-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix desc in error path for the trace remote test module ring-buffer remote: Avoid unexpected symbol warnings (arm, s390)
8 days	bpf: Check global subprog exception paths	Kumar Kartikeya Dwivedi
	Global subprogs are verified independently and are not descended into when their callers are symbolically executed. This means a caller can hold references or locks across a global subprog call that may throw, while the verifier only checks the non-exceptional return path at the call site. Record whether a subprog might throw in the CFG summary pass, alongside the existing might_sleep and packet-data-changing summaries, and propagate that effect through reachable callees. When a global subprog is marked as possibly throwing, push the normal continuation and validate the exceptional path immediately at the call site, avoiding a synthetic exception state and associated special case in the pruning checks. Fixes: f18b03fabaa9 ("bpf: Implement BPF exceptions") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260517075530.3461166-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 days	Merge tag 'irq-urgent-2026-05-17' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull IRQ fixes from Ingo Molnar: - Fix use-after-free in irq_work_single() on PREEMPT_RT (Jiayuan Chen) - Don't call add_interrupt_randomness() for NMIs in handle_percpu_devid_irq() (Mark Rutland) - Remove unused function in the ath79-cpu irqchip driver causing LKP CI build warnings (Rosen Penev) - Fix IRQ allocation/teardown leakage regressions in the GICv5 irqchip driver (Sascha Bischoff) - Fix an IRQ trigger type regression in the Meson S4 SoC irqchip driver (Xianwei Zhao) - Fix CPU offlining regression in the RiscV IMSIC irqchip driver (Yong-Xuan Wang) * tag 'irq-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RT irqchip/riscv-imsic: Clear interrupt move state during CPU offlining irqchip/meson-gpio: Use the correct register in meson_s4_gpio_irq_set_type() irqchip/ath79-cpu: Remove unused function genirq/chip: Don't call add_interrupt_randomness() for NMIs irqchip/gic-v5: Allocate ITS parent LPIs as a range irqchip/gic-v5: Support range allocation for LPIs irqchip/gic-v5: Move LPI allocation into the LPI domain
9 days	tracing: Fix desc in error path for the trace remote test module	Vincent Donnefort
	During initialisation in remote_test_load(), if one of the simple_ring_buffer fails to initialise, the error path attempts to rollback initialised buffers. However, the rollback incorrectly uses the global pointer to the trace descriptor, which is only set upon successful load completion. Fix the error path by using the local pointer to the descriptor. Link: https://patch.msgid.link/20260515201616.337469-1-vdonnefort@google.com Fixes: ea908a2b79c8 ("tracing: Add a trace remote module for testing") Reported-by: Sashiko <sashiko-bot@kernel.org> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> base-commit: 5d6919055dec134de3c40167a490f33c74c12581 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
10 days	ring-buffer remote: Avoid unexpected symbol warnings (arm, s390)	Arnd Bergmann
	The now more verbose check found more architecture specific symbol missing from the whitelist, during randconfig testing on s390 and 32-bit arm: Unexpected symbols in kernel/trace/simple_ring_buffer.o: U __aeabi_unwind_cpp_pr1 Unexpected symbols in kernel/trace/simple_ring_buffer.o: U __s390_indirect_jump_r1 U __s390_indirect_jump_r10 U __s390_indirect_jump_r14 U __s390_indirect_jump_r2 U __s390_indirect_jump_r5 U __s390_indirect_jump_r7 U __s390_indirect_jump_r8 U __s390_indirect_jump_r9 make[6]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:160: kernel/trace/simple_ring_buffer.o.checked] Error 1 Add these to the list and keep it roughly sorted into sanitizer and architecture symbols. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://patch.msgid.link/20260515105717.1023007-1-arnd@kernel.org Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
10 days	bpf: make bpf_session_is_return() reference optional	Arnd Bergmann
	Building without CONFIG_BPF_EVENTS produces a build-time warning: WARN: resolve_btfids: unresolved symbol bpf_session_is_return The function is actually defined in kernel/trace/bpf_trace.o, which is built conditionally based on configuration. Make the reference to this function conditional as well, as is already done in the bpf verifier for other functions. Fixes: 8fe4dc4f6456 ("bpf: change prototype of bpf_session_{cookie,is_return}") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20260515113242.2706303-1-arnd@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	Merge tag 'audit-pr-20260513' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit Pull audit fixes from Paul Moore: - Correctly log the inheritable capabilities - Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands * tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit: audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV audit: fix incorrect inheritable capability in CAPSET records
12 days	ptrace: slightly saner 'get_dumpable()' logic	Linus Torvalds
	The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a bit more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
12 days	bpf: Use array_map_meta_equal for percpu array inner map replacement	Guannan Wang
	percpu_array_map_ops.map_meta_equal points to the generic bpf_map_meta_equal(), which does not compare max_entries. When a percpu array serves as an inner map, replacing it with one that has fewer max_entries bypasses the check. Since percpu_array_map_gen_lookup() inlines the original template's index_mask as a JIT immediate, a lookup on the replacement map can access pptrs[] out of bounds. Point percpu_array_map_ops.map_meta_equal to array_map_meta_equal(), which already enforces the max_entries equality check. Add a selftest to verify that replacing a percpu array inner map with a differently-sized one is rejected. Fixes: db69718b8efa ("bpf: inline bpf_map_lookup_elem() for PERCPU_ARRAY maps") Signed-off-by: Guannan Wang <wgnbuaa@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260514074454.77491-1-wgnbuaa@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 days	Merge tag 'sched_ext-for-7.1-rc3-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The bulk of this is hardening of the new sub-scheduler infrastructure. - UAFs and lifecycle bugs on the sub-sched attach/detach paths: parent sub_kset freed under a racing child, list_del_rcu on an uninitialized list head, ops->priv stomped by concurrent attach/detach, and a UAF in the init-failure error path - Task state-machine reorg closing concurrent enable-vs-dead races: a task exiting during the unlocked init window could trip NULL ops derefs or skip exit_task() cleanup - A scx_link_sched() self-deadlock on scx_sched_lock - isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN cpumask without RCU, and stop rejecting BPF schedulers when only cpuset isolated partitions are active - PREEMPT_RT: disable irq_work runs in hardirq context so dumps show the failing task rather than the irq_work kthread - Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build fixes" * tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched() sched_ext: Drop NONE early return in scx_disable_and_exit_task() sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths sched_ext: Fix ops->priv clobber on concurrent attach/detach selftests/sched_ext: Fix build error in dequeue selftest sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths sched_ext: Close sub-sched init race with post-init DEAD recheck sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state() sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings sched_ext: Drop unused scx_find_sub_sched() stub sched_ext: Move scx_error() out of scx_link_sched()'s lock region
12 days	Merge tag 'cgroup-for-7.1-rc3-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - cpuset fixes: - Partition invalidation could return CPUs still in use by sibling partitions, producing overlapping effective_cpus - cpuset_can_attach() over-reserved DL bandwidth on moves that stayed within the same root domain - Pending DL migration state leaked into later attaches when a later can_attach() check failed - Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can allocate from any node and exit quickly - dmem: propagate -ENOMEM instead of spinning forever when the fallback pool allocation also fails - selftests/cgroup: percpu test error-path leak, bogus numeric comparison of cpuset strings, and a zero-length read() that silently passed OOM-kill tests * tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Return only actually allocated CPUs during partition invalidation selftests/cgroup: Fix error path leaks in test_percpu_basic cgroup/cpuset: Reserve DL bandwidth only for root-domain moves cgroup/cpuset: Reset DL migration state on can_attach() failure selftests/cgroup: Fix string comparison in write_test selftests/cgroup: Fix cg_read_strcmp() empty string comparison cgroup/dmem: Return -ENOMEM on failed pool preallocation cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
12 days	Merge tag 'wq-for-7.1-rc3-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue fixes from Tejun Heo: - Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path - Fix a cancel_delayed_work_sync() livelock against drain_workqueue() caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING set with no owner * tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path workqueue: Release PENDING in __queue_work() drain/destroy reject path
12 days	sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation	Andrea Righi
	scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against cpu_possible_mask. Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected and dereferencing it requires either RCU read lock, the cpu_hotplug write lock, or the cpuset lock; scx_enable() holds none of these, so booting with isolcpus=domain and attaching any BPF scheduler triggers the following lockdep splat: ============================= WARNING: suspicious RCU usage ----------------------------- kernel/sched/isolation.c:60 suspicious rcu_dereference_check() usage! 1 lock held by scx_flash/281: #0: ffffffff8379fce0 (update_mutex){+.+.}-{4:4}, at: bpf_struct_ops_link_create+0x134/0x1c0 Call Trace: dump_stack_lvl+0x6f/0xb0 lockdep_rcu_suspicious.cold+0x37/0x70 housekeeping_cpumask+0xcd/0xe0 scx_enable.isra.0+0x17/0x120 bpf_scx_reg+0x5e/0x80 bpf_struct_ops_link_create+0x151/0x1c0 __sys_bpf+0x1e4b/0x33c0 __x64_sys_bpf+0x21/0x30 do_syscall_64+0x117/0xf80 entry_SYSCALL_64_after_hwframe+0x77/0x7f In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as well, which means the current check also rejects BPF schedulers when a cpuset partition is active. That contradicts the original intent of commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if isolcpus= domain isolation is in effect"), which explicitly noted that cpuset partitions are honored through per-task cpumasks and should not be rejected. Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only the housekeeping flag bit (no RCU dereference) and reflects exactly the boot-time isolcpus= configuration that the error message refers to. Fixes: 27c3a5967f05 ("sched/isolation: Convert housekeeping cpumasks to rcu pointers") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Frederic Weisbecker <frederic@kernel.org>
12 days	cgroup/cpuset: Return only actually allocated CPUs during partition invalidation	sunshaojie
	In update_parent_effective_cpumask() with partcmd_invalidate, the CPUs to return to the parent are computed as: adding = cpumask_and(tmp->addmask, xcpus, parent->effective_xcpus); where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set) or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can contain CPUs that were never actually granted to the partition due to sibling exclusion in compute_excpus(). Consequently, the invalidation may return CPUs to the parent that remain in use by sibling partitions, causing overlapping effective_cpus and triggering the WARN_ON_ONCE(1) in generate_sched_domains(). Use cs->effective_xcpus instead, which reflects the CPUs actually granted to this partition. Reproducer (on a 4-CPU machine): cd /sys/fs/cgroup mkdir a1 b1 # a1 becomes partition root with CPUs 0-1 echo "0-1" > a1/cpuset.cpus echo "root" > a1/cpuset.cpus.partition # b1 becomes partition root with CPUs 1-2, but sibling exclusion # reduces its effective_xcpus to CPU 2 only echo "1-2" > b1/cpuset.cpus echo "root" > b1/cpuset.cpus.partition # b1 changes cpus_allowed to 0-1 -> partition invalidation echo "0-1" > b1/cpuset.cpus # Expected: CPUs 2-3 (only CPU 2 returned from b1) # Actual: CPUs 1-3 (CPU 0-1 returned, overlapping with a1) cat cpuset.cpus.effective dmesg will also show a WARNING from generate_sched_domains() reporting overlapping partition root effective_cpus. Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict") Cc: stable@vger.kernel.org # v7.0+ Signed-off-by: sunshaojie <sunshaojie@kylinos.cn> Tested-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
13 days	Merge tag 'fixes-2026-05-13' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux Pull liveupdate fixes from Mike Rapoport: "A few fixes for kexec handover and liveupdate: - make sure KHO is skipped for crash kernel - fix error reporting in memfd preservation if it fails mid-loop - don't allow preserving memfds whose page count exceeds UINT_MAX - fix documentation of memfd seals preservation to match the code" * tag 'fixes-2026-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux: mm/memfd_luo: document preservation of file seals mm/memfd_luo: reject memfds whose page count exceeds UINT_MAX mm/memfd_luo: report error when restoring a folio fails mid-loop kho: skip KHO for crash kernel
13 days	sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work	Tejun Heo
	scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock, but that does not pin parent->sub_kset. Concurrent disable can kset_unregister and free sub_kset before scx_alloc_and_add_sched() dereferences it. Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT. Fixes: 411d3ef1a705 ("sched_ext: Unregister sub_kset on scheduler disable") Signed-off-by: Tejun Heo <tj@kernel.org>
13 days	sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()	Tejun Heo
	On scx_link_sched() error paths (parent disabled, hash insert failure), &sch->all is never added to scx_sched_all. The cleanup path runs scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a list_head that was never initialized triggering a corruption warning. Initialize &sch->all. Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>
13 days	sched_ext: Drop NONE early return in scx_disable_and_exit_task()	Tejun Heo
	d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks. After scx_fail_parent() parks a task at NONE/sched=parent and the parent is later freed via queue_rcu_work() during root_disable, the preserved p->scx.sched dangles - print_scx_info() from sched_show_task() reads sch->ops.name from freed memory. Drop the early return. __scx_disable_and_exit_task() already short- circuits on NONE and the SUB_INIT block was cleared by scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only work left - and the one thing the path actually needs. v2: Extend the SUB_INIT block comment to note that the flag is only set on the sub-enable path, so it's always clear on the NONE re-entry (Andrea). Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
13 days	audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV	Sergio Correia
	AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This allows a process with CAP_AUDIT_CONTROL to modify directory tree watches and equivalence mappings even when the audit configuration has been locked, undermining the purpose of the lock. Add AUDIT_LOCKED checks to both commands. Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
13 days	audit: fix incorrect inheritable capability in CAPSET records	Sergio Correia
	__audit_log_capset() records the effective capability set into the inheritable field due to a copy-paste error. Every CAPSET audit record therefore reports cap_pi (process inheritable) with the value of cap_effective instead of cap_inheritable. This silently corrupts audit data used for compliance and forensic analysis: an attacker who modifies inheritable capabilities to prepare for a privilege-escalating exec would have the change masked in the audit trail. The bug has been present since the original introduction of CAPSET audit records in 2008. Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
13 days	Merge tag 'probes-fixes-v7.1-rc3' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist() Since the ftrace adds its NOPs at .kprobes.text section (which stores an array), a wrong entry is added when loading a module which uses "__kprobes" attribute. To solve this, add "notrace" to __kprobes functions - test_kprobes: clear kprobes between test runs Clear all kprobes in the test program after running a test set, because Kunit test can run several times - fprobe: Fix unregister_fprobe() to wait for RCU grace period Since the fprobe data structure is removed with hlist_del_rcu(), it should wait for the RCU grace period. If the caller waits for RCU, we can use the async variant (e.g. eBPF) * tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fprobe: Fix unregister_fprobe() to wait for RCU grace period test_kprobes: clear kprobes between test runs kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
2026-05-11	sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path	Tejun Heo
	In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11	cgroup/cpuset: Reserve DL bandwidth only for root-domain moves	Guopeng Zhang
	cpuset_can_attach() currently adds the bandwidth of all migrating SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination cpuset effective CPU masks do not overlap, the whole sum is then reserved in the destination root domain. set_cpus_allowed_dl(), however, subtracts bandwidth from the source root domain only when the affinity change really moves the task between root domains. A DL task can move between cpusets that are still in the same root domain, so including that task in sum_migrate_dl_bw can reserve destination bandwidth without a matching source-side subtraction. Share the root-domain move test with set_cpus_allowed_dl(). Keep nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL task accounting, but add to sum_migrate_dl_bw only for tasks that need a root-domain bandwidth move. Keep using the destination cpuset effective CPU mask and leave the broader can_attach()/attach() transaction model unchanged. Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-11	exit: prevent preemption of oopsing TASK_DEAD task	Jann Horn
	When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-05-11	bpf: Fix s16 truncation for large bpf-to-bpf call offsets	Yazhou Tang
	Currently, the BPF instruction set allows bpf-to-bpf calls (or internal calls, pseudo calls) to use a 32-bit imm field to represent the relative jump offset. However, when JIT is disabled or falls back to the interpreter, the verifier invokes bpf_patch_call_args() to rewrite the call instruction. In this function, the 32-bit imm is downcast to s16 and stored in the off field. void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth) { stack_depth = max_t(u32, stack_depth, 1); insn->off = (s16) insn->imm; insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] - __bpf_call_base_args; insn->code = BPF_JMP \| BPF_CALL_ARGS; } If the original imm exceeds the s16 range (i.e., a jump offset greater than 32767 instructions), this downcast silently truncates the offset, resulting in an incorrect call target. Fix this by: 1. In bpf_patch_call_args(), keeping the imm field unchanged and using the off field to store the index of the interpreter function. 2. In ___bpf_prog_run() for the JMP_CALL_ARGS case, retrieving the interpreter function pointer from the interpreters_args array using the off field as the index, and passing the original imm to calculate the last argument of the interpreter function. After these changes, the truncation issue is resolved, and __bpf_call_base_args is also no longer needed and can be removed, which makes the code cleaner. Performance: In ___bpf_prog_run() for the JMP_CALL_ARGS case, changing the retrieval of the interpreter function pointer from pointer addition to direct array indexing improves performance. The possible reason is that the latter has better instruction-level parallelism. See the v5 discussion [1] for more details. [1] https://lore.kernel.org/bpf/f120c3c4-6999-414a-b514-518bb64b4758@zju.edu.cn/ To avoid requiring bpftool changes, keep the new imm/off encoding internal and restore the legacy xlated dump layout in bpf_insn_prepare_dump(). For bpf-to-bpf call offsets that do not fit in s16, export off as 0 instead of a truncated and misleading value. Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter") Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump") Suggested-by: Xu Kuohai <xukuohai@huaweicloud.com> Suggested-by: Puranjay Mohan <puranjay@kernel.org> Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Link: https://lore.kernel.org/r/20260506094714.419842-3-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-11	bpf: Fix out-of-bounds read in bpf_patch_call_args()	Yazhou Tang
	The interpreters_args array only accommodates stack depths up to MAX_BPF_STACK (512 bytes). However, do_misc_fixups() may allow a larger stack depth if JIT is requested. If JIT compilation later fails and falls back to the interpreter, the verifier invokes bpf_patch_call_args() with this oversized stack depth. This causes a load-time out-of-bounds (OOB) read when calculating the interpreter function pointer index. Fix this by changing bpf_patch_call_args() to return an int and explicitly rejecting the JIT fallback (returning -EINVAL) if the stack depth exceeds MAX_BPF_STACK. Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter") Co-developed-by: Tianci Cao <ziye@zju.edu.cn> Signed-off-by: Tianci Cao <ziye@zju.edu.cn> Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com> Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260506094714.419842-2-tangyazhou@zju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-11	irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RT	Jiayuan Chen
	On PREEMPT_RT, non-HARD irq_work runs in per-CPU kthreads via run_irq_workd(), so irq_work_sync() uses rcuwait() to wait for BUSY==0. After irq_work_single() clears BUSY via atomic_cmpxchg(), it still dereferences @work for irq_work_is_hard() and rcuwait_wake_up(). An irq_work_sync() caller on another CPU that enters after BUSY is cleared can observe BUSY==0 immediately, return, and free the work before those accesses complete — causing a use-after-free. Fix this by wrapping run_irq_workd() in guard(rcu)() so that the entire irq_work_single() execution is within an RCU read-side critical section. Then add synchronize_rcu() in irq_work_sync() after rcuwait_wait_event() to ensure the caller waits for the RCU grace period before returning, preventing premature frees. Fixes: 810979682ccc ("irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.") Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260330073234.303732-1-jiayuan.chen@linux.dev
2026-05-11	genirq/chip: Don't call add_interrupt_randomness() for NMIs	Mark Rutland
	Recently handle_percpu_devid_irq() was changed to call add_interrupt_randomness(). This introduced a potential deadlock when handle_percpu_devid_irq() is used to handle an NMI, which can be detected with lockdep, e.g. ================================ WARNING: inconsistent lock state 7.1.0-rc2-pnmi #465 Not tainted -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. perf/695 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff00837dfd3a18 (&base->lock){-.-.}-{2:2}, at: lock_timer_base+0x6c/0xac {INITIAL USE} state was registered at: _raw_spin_lock_irqsave+0x68/0xb0 lock_timer_base+0x6c/0xac __mod_timer+0x100/0x32c add_timer_global+0x2c/0x40 __queue_delayed_work+0xf0/0x140 queue_delayed_work_on+0x134/0x138 mem_cgroup_css_online+0x30c/0x310 online_css+0x34/0x10c cgroup_init_subsys+0x158/0x1c8 cgroup_init+0x440/0x524 start_kernel+0x888/0x998 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&base->lock); <Interrupt> lock(&base->lock); * DEADLOCK * Call trace: _raw_spin_lock_irqsave+0x68/0xb0 lock_timer_base+0x6c/0xac add_timer_on+0x78/0x16c add_interrupt_randomness+0x124/0x134 handle_percpu_devid_irq+0xd4/0x16c handle_irq_desc+0x40/0x58 generic_handle_domain_nmi+0x28/0x50 __gic_handle_nmi.isra.0+0x4c/0xa0 gic_handle_irq+0x38/0x2bc call_on_irq_stack+0x30/0x48 do_interrupt_handler+0x80/0x98 el1_interrupt+0x90/0xac el1h_64_irq_handler+0x18/0x24 el1h_64_irq+0x80/0x84 [...] During review, Thomas pointed out it wouldn't be safe for handle_percpu_devid_irq() to call add_interrupt_randomness() if it was used to handle NMIs: https://lore.kernel.org/lkml/87bjgik042.ffs@tglx/ ... but evidently people missed that handle_percpu_devid_irq() is used for NMIs. While it might seem that NMIs should be handled with a separate handle_percpu_devid_nmi() function, for various structural reasons this was impractical, and handle_percpu_devid_irq() has been expected to be used for NMIs since commits: 21bbbc50f398f ("irqchip/gic-v3: Switch high priority PPIs over to handle_percpu_devid_irq()") 5ff78c8de9d83 ("genirq: Kill handle_percpu_devid_fasteoi_nmi()") Taking the above into account, avoid the deadlock by not calling add_interrupt_randomness() when handle_percpu_devid_irq() is called in an NMI context. This is consistent with other NNI handling flows, which do not call add_interrupt_randomness(). At the same time, update the kernel-doc comment to make it clear that handle_percpu_devid_irq() can be called in NMI context. The rest of handle_percpu_devid_irq() is currently NMI safe and doesn't need to change. Fixes: fd7400cfcbaa ("genirq/chip: Invoke add_interrupt_randomness() in handle_percpu_devid_irq()") Reported-by: Ada Couprie Diaz <ada.coupriediaz@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://patch.msgid.link/20260507110518.3128248-1-mark.rutland@arm.com
2026-05-11	fprobe: Fix unregister_fprobe() to wait for RCU grace period	Masami Hiramatsu (Google)
	Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") changed fprobe to register struct fprobe to an rcu-hlist, but it forgot to wait for RCU GP. Thus there can be use-after-free if the fprobe is released right after unregistering. This can be happened on fprobe event and sample module code. To fix this issue, add synchronize_rcu() in unregister_fprobe(). Note that BPF is OK because fprobe is used as a part of bpf_kprobe_multi_link. This unregisters its fprobe in bpf_kprobe_multi_link_release() and it is deallocated via bpf_kprobe_multi_link_dealloc(), which is invoked from bpf_link_defer_dealloc_rcu_gp() RCU callback. For BPF, this also introduced unregister_fprobe_async() which does NOT wait for RCU grace priod. Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>