path: root/kernel
Age | Commit message | Author
2026-03-26Merge tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linuxLinus Torvalds
Pull dma-mapping fixes from Marek Szyprowski: "A set of fixes for DMA-mapping subsystem, which resolve false-positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"

* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: add missing `inline` for `dma_free_attrs`
  mm/hmm: Indicate that HMM requires DMA coherency
  RDMA/umem: Tell DMA mapping that UMEM requires coherency
  iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
  dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
  dma-mapping: Introduce DMA require coherency attribute
  dma-mapping: Clarify valid conditions for CPU cache line overlap
  dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
  dma-debug: Allow multiple invocations of overlapping entries
  dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
2026-03-26futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()Hao-Yu Yang
During futex_key_to_node_opt() execution, vma->vm_policy is read under the speculative mmap lock and RCU. Concurrently, mbind() may call vma_replace_policy(), which frees the old mempolicy immediately via kmem_cache_free(). This creates a race where __futex_key_to_node() dereferences a freed mempolicy pointer, causing a use-after-free read of mpol->mode.

[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732]  __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777]  __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822]  get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)

Fix by deferring the mempolicy free in __mpol_put() to an RCU grace period, so RCU readers cannot observe a freed mempolicy. Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL") Reported-by: Hao-Yu Yang <naup96721@gmail.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Hao-Yu Yang <naup96721@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
2026-03-26futex: Require sys_futex_requeue() to have identical flagsPeter Zijlstra
Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: 0f4b5f972216 ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2026-03-26timens: Remove dependency on the vDSOThomas Weißschuh
Previously, missing time namespace support in the vDSO meant that time namespaces needed to be disabled globally. This was expressed in a hard dependency on the generic vDSO library. This also meant that architectures without any vDSO or only a stub vDSO could not enable time namespaces. Now that all architectures using a real vDSO are using the generic library, that dependency is not necessary anymore. Remove the dependency and let all architectures enable time namespaces. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-2-c82693a7775f@linutronix.de
2026-03-26vdso/timens: Move functions to new fileThomas Weißschuh
As a preparation of the untangling of time namespaces and the vDSO, move the glue functions between those subsystems into a new file. While at it, switch the mutex lock and mmap_read_lock() in the vDSO namespace code to guard(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-1-c82693a7775f@linutronix.de
2026-03-26tracing: Move snapshot code out of trace.c and into trace_snapshot.cSteven Rostedt
The trace.c file was a dumping ground for most tracing code. Start organizing it better by moving various functions out into their own files. Move all the snapshot code, including the max trace code into its own trace_snapshot.c file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260324140145.36352d6a@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-26nsproxy: Add FOR_EACH_NS_TYPE() X-macro and CLONE_NS_ALLMickaël Salaün
Introduce the FOR_EACH_NS_TYPE(X) macro as the single source of truth for the set of (struct type, CLONE_NEW* flag) pairs that define Linux namespace types. Currently, the list of CLONE_NEW* flags is duplicated inline in multiple call sites and would need another copy in each new consumer. This makes it easy to miss one when a new namespace type is added. Derive two things from the X-macro:

- CLONE_NS_ALL: Bitmask of all known CLONE_NEW* flags, usable as a validity mask or iteration bound.

- ns_common_type(): Rewritten to use the X-macro via a leading-comma _Generic pattern, so the struct-to-flag mapping stays in sync with the flag set automatically.

Replace the inline flag enumerations in copy_namespaces(), unshare_nsproxy_namespaces(), check_setns_flags(), and ksys_unshare() with CLONE_NS_ALL. When a new namespace type is added, only FOR_EACH_NS_TYPE needs to be updated; CLONE_NS_ALL, ns_common_type(), and all the call sites pick up the change automatically. Cc: Christian Brauner <brauner@kernel.org> Cc: Günther Noack <gnoack@google.com> Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://patch.msgid.link/20260312100444.2609563-4-mic@digikod.net Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26kernel: acct: fix duplicate word in commenthaoyu.lu
Fix the duplicate word "kernel" in the comment on line 247. Signed-off-by: haoyu.lu <hechushiguitu666@gmail.com> Link: https://patch.msgid.link/20260326055628.10773-1-hechushiguitu666@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26kernel: Use trace_call__##name() at guarded tracepoint call sitesVineeth Pillai (Google)
Replace trace_foo() with the new trace_call__foo() at sites already guarded by trace_foo_enabled(), avoiding a redundant static_branch_unlikely() re-evaluation inside the tracepoint. trace_call__foo() calls the tracepoint callbacks directly without utilizing the static branch again. Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: "Yury Norov [NVIDIA]" <yury.norov@gmail.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Kisel <romank@linux.microsoft.com> Cc: Joel Fernandes <joelagnelf@nvidia.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/20260323160052.17528-3-vineeth@bitbyteword.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org> Assisted-by: Claude:claude-sonnet-4-6 Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglx@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-26sysctl: fix uninitialized variable in proc_do_large_bitmapMarc Buerg
proc_do_large_bitmap() does not initialize variable c, which is expected to be set to a trailing character by proc_get_long(). However, proc_get_long() only sets c when the input buffer contains a trailing character after the parsed value. If c is not initialized it may happen to contain a '-'. If this is the case proc_do_large_bitmap() expects to be able to parse a second part of the input buffer. If there is no second part an unjustified -EINVAL will be returned. Initialize c to 0 to prevent returning -EINVAL on valid input. Fixes: 9f977fb7ae9d ("sysctl: add proc_do_large_bitmap") Signed-off-by: Marc Buerg <buermarc@googlemail.com> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2026-03-25sched_ext: Fix missing SCX_EV_SUB_BYPASS_DISPATCH aggregation in scx_read_events()Cheng-Yang Chou
Commit 025b1bd41965 introduced SCX_EV_SUB_BYPASS_DISPATCH to track scheduling of bypassed descendant tasks, and correctly incremented it per-CPU and displayed it in sysfs and dump output. However, scx_read_events(), which aggregates the per-CPU counters into a summary, was not updated to include this event, causing it to always read as zero in sysfs, in debug dumps, and via the scx_bpf_events() kfunc. Add the missing scx_agg_event() call for SCX_EV_SUB_BYPASS_DISPATCH. Fixes: 025b1bd41965 ("sched_ext: Implement hierarchical bypass mode") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25sched_ext: Fix missing return after scx_error() in scx_dsq_move()Cheng-Yang Chou
When scx_bpf_dsq_move[_vtime]() is called on a task that belongs to a different scheduler, scx_error() is invoked to flag the violation. scx_error() schedules an asynchronous scheduler teardown via irq_work and returns immediately, so execution falls through and the DSQ move proceeds on a cross-scheduler task regardless, potentially corrupting DSQ state. Add the missing return false so the function exits right after reporting the error, consistent with the other early-exit checks in the same function (e.g. scx_vet_enq_flags() failure at the top). Fixes: bb4d9fd55158 ("sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25Merge tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linuxLinus Torvalds
Pull RCU fixes from Boqun Feng: "Fix a regression introduced by commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast"): BPF contexts can run with preemption disabled or scheduler locks held, so call_srcu() must work in all such contexts. Fix this by converting SRCU's spinlocks to raw spinlocks and avoiding scheduler lock acquisition in call_srcu() by deferring to an irq_work (similar to call_rcu_tasks_generic()), for both tree SRCU and tiny SRCU. Also fix a follow-on lockdep splat caused by srcu_node allocation under the newly introduced raw spinlock by deferring the allocation to grace-period worker context"

* tag 'rcu-fixes.v7.0-20260325a' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
  srcu: Use irq_work to start GP in tiny SRCU
  rcu: Use an intermediate irq_work to start process_srcu()
  srcu: Push srcu_node allocation to GP when non-preemptible
  srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()
2026-03-25cgroup: Fix cgroup_drain_dying() testing the wrong conditionTejun Heo
cgroup_drain_dying() was using cgroup_is_populated() to test whether there are dying tasks to wait for. cgroup_is_populated() tests nr_populated_csets, nr_populated_domain_children and nr_populated_threaded_children, but cgroup_drain_dying() only needs to care about this cgroup's own tasks - whether there are children is cgroup_destroy_locked()'s concern. This caused hangs during shutdown. When systemd tried to rmdir a cgroup that had no direct tasks but had a populated child, cgroup_drain_dying() would enter its wait loop because cgroup_is_populated() was true from nr_populated_domain_children. The task iterator found nothing to wait for, yet the populated state never cleared because it was driven by live tasks in the child cgroup. Fix it by using cgroup_has_tasks() which only tests nr_populated_csets. v3: Fix cgroup_is_populated() -> cgroup_has_tasks() (Sebastian). v2: https://lore.kernel.org/r/20260323200205.1063629-1-tj@kernel.org Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: 1b164b876c36 ("cgroup: Wait for dying tasks to leave on rmdir") Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2026-03-25smp: Improve smp_call_function_single() CSD-lock diagnosticsPaul E. McKenney
Both smp_call_function() and smp_call_function_single() use per-CPU call_single_data_t variable to hold the infamous CSD lock. However, while smp_call_function() acquires the destination CPU's CSD lock, smp_call_function_single() instead uses the source CPU's CSD lock. (These are two separate sets of CSD locks, cfd_data and csd_data, respectively.) This otherwise inexplicable pair of choices is explained by their respective queueing properties. If smp_call_function() were to use the sending CPU's CSD lock, that would serialize the destination CPUs' IPI handlers and result in long smp_call_function() latencies, especially on systems with large numbers of CPUs. For its part, if smp_call_function_single() were to use the (single) destination CPU's CSD lock, this would similarly serialize in the case where many CPUs are sending IPIs to a single "victim" CPU. Plus it would result in higher levels of memory contention. Except that if there is no NMI-based stack tracing on a weakly ordered system where remote unsynchronized stack traces are especially unreliable, the improved debugging beats the improved queueing. This improved queueing only matters if a bunch of CPUs are calling smp_call_function_single() concurrently for a single "victim" CPU, which is not the common case. Therefore, make smp_call_function_single() use the destination CPU's csd_data instance in kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG=y where csdlock_debug_enabled is also set. Otherwise, continue to use the source CPU's csd_data. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://patch.msgid.link/25c2eb97-77c8-49a5-80ac-efe78dea272c@paulmck-laptop
2026-03-25smp: Get this_cpu once in smp_call_functionShrikanth Hegde
smp_call_function_single() and smp_call_function_many_cond() disable preemption and cache the CPU number via get_cpu(). Use this cached value throughout the function instead of invoking smp_processor_id() again. [ tglx: Make the copy&pasta'ed change log match the patch ] Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Link: https://patch.msgid.link/20260323193630.640311-4-sshegde@linux.ibm.com
2026-03-25smp: Add missing kernel-doc commentsRandy Dunlap
Add missing kernel-doc comments and rearrange the order of others to prevent all kernel-doc warnings:

- add function Returns: sections or format existing comments as kernel-doc
- add missing function parameter comments
- use "/**" for smp_call_function_any() and on_each_cpu_cond_mask()
- correct the commented function name for on_each_cpu_cond_mask()
- use correct format for function short descriptions
- add all kernel-doc comments for smp_call_on_cpu()
- remove kernel-doc comments for raw_smp_processor_id() since there is no prototype for it here (other than !SMP)
- in smp.h, rearrange some lines so that the kernel-doc comments for smp_processor_id() are immediately before the macro (to prevent kernel-doc warnings)
- remove "Returns" from smp_call_function() since it doesn't return a value

Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260310061726.1153764-1-rdunlap@infradead.org
2026-03-25Merge tag 'dma-mapping-7.0-2026-03-25' into dma-mapping-for-nextMarek Szyprowski
dma-mapping fixes for Linux 7.0 A set of fixes for DMA-mapping subsystem, which resolve false-positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda). Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2026-03-25srcu: Use irq_work to start GP in tiny SRCUJoel Fernandes
Tiny SRCU's srcu_gp_start_if_needed() directly calls schedule_work(), which acquires the workqueue pool->lock. This causes a lockdep splat when call_srcu() is called with a scheduler lock held, due to:

call_srcu() [holding pi_lock]
  srcu_gp_start_if_needed()
    schedule_work() -> pool->lock

workqueue_init() / create_worker() [holding pool->lock]
  wake_up_process() -> try_to_wake_up() -> pi_lock

Also add irq_work_sync() to cleanup_srcu_struct() to prevent a use-after-free if a queued irq_work fires after cleanup begins. Tested with rcutorture SRCU-T and no lockdep warnings. [ Thanks to Boqun for similar fix in patch "rcu: Use an intermediate irq_work to start process_srcu()" ] Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25rcu: Use an intermediate irq_work to start process_srcu()Boqun Feng
Since commit c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") we switched to SRCU in BPF. However, as BPF instrumentation can happen basically everywhere (including where a scheduler lock is held), call_srcu() now needs to avoid acquiring scheduler locks, because otherwise it could cause deadlock [1]. Fix this by following what the previous RCU Tasks Trace did: use an irq_work to delay the queuing of the work that starts process_srcu(). [boqun: Apply Joel's feedback] [boqun: Apply Andrea's test feedback] Reported-by: Andrea Righi <arighi@nvidia.com> Closes: https://lore.kernel.org/all/abjzvz_tL_siV17s@gpd4/ Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Link: https://lore.kernel.org/rcu/3c4c5a29-24ea-492d-aeee-e0d9605b4183@nvidia.com/ [1] Suggested-by: Zqiang <qiang.zhang@linux.dev> Tested-by: Andrea Righi <arighi@nvidia.com> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25srcu: Push srcu_node allocation to GP when non-preemptiblePaul E. McKenney
When the srcutree.convert_to_big and srcutree.big_cpu_lim kernel boot parameters specify initialization-time allocation of the srcu_node tree for statically allocated srcu_struct structures (for example, in DEFINE_SRCU() at build time instead of init_srcu_struct() at runtime), init_srcu_struct_nodes() will attempt to dynamically allocate this tree at the first run-time update-side use of this srcu_struct structure, but while holding a raw spinlock. Because the memory allocator can acquire non-raw spinlocks, this can result in lockdep splats. This commit therefore uses the same SRCU_SIZE_ALLOC trick that is used when the first run-time update-side use of this srcu_struct structure happens before srcu_init() is called. The actual allocation then takes place from workqueue context at the ends of upcoming SRCU grace periods. [boqun: Adjust the sha1 of the Fixes tag] Fixes: 175b45ed343a ("srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org>
2026-03-25srcu: Use raw spinlocks so call_srcu() can be used under preempt_disable()Paul E. McKenney
Tree SRCU has used non-raw spinlocks for many years, motivated by a desire to avoid unnecessary real-time latency and the absence of any reason to use raw spinlocks. However, the recent use of SRCU in tracing as the underlying implementation of RCU Tasks Trace means that call_srcu() is invoked from preemption-disabled regions of code, which in turn requires that any locks acquired by call_srcu() or its callees must be raw spinlocks. This commit therefore converts SRCU's spinlocks to raw spinlocks. [boqun: Add Fixes tag] Reported-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Fixes: c27cea4416a3 ("rcu: Re-implement RCU Tasks Trace in terms of SRCU-fast") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2026-03-25workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error pathBreno Leitao
When alloc_and_link_pwqs() fails partway through the per-cpu allocation loop, some pool_workqueues may have already been linked into wq->pwqs via link_pwq(). The error path frees these pwqs with kmem_cache_free() but never removes them from the wq->pwqs list, leaving dangling pointers in the list. Currently this is not exploitable because the workqueue was never added to the global workqueues list and the caller frees the wq immediately after. However, this makes sure that alloc_and_link_pwqs() doesn't leave any half-baked structure, which may have side effects if not properly cleaned up. Fix this by unlinking each pwq from wq->pwqs before freeing it. No locking is needed as the workqueue has not been published yet, thus no concurrency is possible. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25workqueue: Better describe stall checkPetr Mladek
Try to be more explicit about why the workqueue watchdog does not take pool->lock by default. Spin locks are full memory barriers, and taking them would delay other operations; primarily, they would delay operations on the related worker pools. Explain why re-checking the timestamp under pool->lock is enough to prevent the false positive. Finally, make it clear what the alternative solution would be in __queue_work(), which is a hotter path. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-24sched_ext: Choose the right sch->ops.name to output in the print_scx_info()Zqiang
print_scx_info() always outputs the scx_root structure's ops.name, but in kernels built with CONFIG_EXT_SUB_SCHED=y, tasks may be attached to a sub scx_sched structure. Use scx_task_sched_rcu() to obtain the scx_sched structure whose ops.name should actually be printed, and drop the state check. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-24randomize_kstack: Unify random source across archesRyan Roberts
Previously different architectures were using random sources of differing strength and cost to decide the random kstack offset. A number of architectures (loongarch, powerpc, s390, x86) were using their timestamp counter, at whatever the frequency happened to be. Other arches (arm64, riscv) were using entropy from the crng via get_random_u16(). There have been concerns that in some cases the timestamp counters may be too weak, because they can be easily guessed or influenced by user space. And get_random_u16() has been shown to be too costly for the level of protection kstack offset randomization provides. So let's use a common, architecture-agnostic source of entropy; a per-cpu prng, seeded at boot-time from the crng. This has a few benefits:

- We can remove choose_random_kstack_offset(); that was only there to try to make the timestamp counter value a bit harder to influence from user space [*].

- The architecture code is simplified. All it has to do now is call add_random_kstack_offset() in the syscall path.

- The strength of the randomness can be reasoned about independently of the architecture.

- Arches previously using get_random_u16() now have much faster syscall paths, see below results.

[*] Additionally, this gets rid of some redundant work on s390 and x86. Before this patch, those architectures called choose_random_kstack_offset() under arch_exit_to_user_mode_prepare(), which is also called for exception returns to userspace which were *not* syscalls (e.g. regular interrupts). Getting rid of choose_random_kstack_offset() avoids a small amount of redundant work for the non-syscall cases.

In some configurations, add_random_kstack_offset() will now call instrumentable code, so for a couple of arches, I have moved the call a bit later to the first point where instrumentation is allowed. This doesn't impact the efficacy of the mechanism.

There have been some claims that a prng may be less strong than the timestamp counter if not regularly reseeded. But the prng has a period of about 2^113. So as long as the prng state remains secret, it should not be possible to guess. If the prng state can be accessed, we have bigger problems. Additionally, we are only consuming 6 bits to randomize the stack, so there are only 64 possible random offsets. I assert that it would be trivial for an attacker to brute force by repeating their attack and waiting for the random stack offset to be the desired one. The prng approach seems entirely proportional to this level of protection.

Performance data are provided below. The baseline is v6.18 with rndstack on for each respective arch. (I)/(R) indicate statistically significant improvement/regression. arm64 platform is AWS Graviton3 (m7g.metal). x86_64 platform is AWS Sapphire Rapids (m7i.24xlarge):

+-----------------+--------------+---------------+---------------+
| Benchmark       | Result Class | per-cpu-prng  | per-cpu-prng  |
|                 |              | arm64 (metal) | x86_64 (VM)   |
+=================+==============+===============+===============+
| syscall/getpid  | mean (ns)    | (I)  -9.50%   | (I) -17.65%   |
|                 | p99 (ns)     | (I) -59.24%   | (I) -24.41%   |
|                 | p99.9 (ns)   | (I) -59.52%   | (I) -28.52%   |
+-----------------+--------------+---------------+---------------+
| syscall/getppid | mean (ns)    | (I)  -9.52%   | (I) -19.24%   |
|                 | p99 (ns)     | (I) -59.25%   | (I) -25.03%   |
|                 | p99.9 (ns)   | (I) -59.50%   | (I) -28.17%   |
+-----------------+--------------+---------------+---------------+
| syscall/invalid | mean (ns)    | (I) -10.31%   | (I) -18.56%   |
|                 | p99 (ns)     | (I) -60.79%   | (I) -20.06%   |
|                 | p99.9 (ns)   | (I) -61.04%   | (I) -25.04%   |
+-----------------+--------------+---------------+---------------+

I tested an earlier version of this change on x86 bare metal and it showed a smaller but still significant improvement. The bare metal system wasn't available this time around so testing was done in a VM instance. I'm guessing the cost of rdtsc is higher for VMs.
Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://patch.msgid.link/20260303150840.3789438-3-ryan.roberts@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2026-03-24randomize_kstack: Maintain kstack_offset per taskRyan Roberts
kstack_offset was previously maintained per-cpu, but this caused a couple of issues, so let's instead make it per-task.

Issue 1: add_random_kstack_offset() and choose_random_kstack_offset() were expected and required to be called with interrupts and preemption disabled so that they could manipulate per-cpu state. But arm64, loongarch and risc-v are calling them with interrupts and preemption enabled. I don't _think_ this causes any functional issues, but it's certainly unexpected and could lead to manipulating the wrong cpu's state, which could cause a minor performance degradation due to bouncing the cache lines. By maintaining the state per-task those functions can safely be called in preemptible context.

Issue 2: add_random_kstack_offset() is called before executing the syscall and expands the stack using a previously chosen random offset. choose_random_kstack_offset() is called after executing the syscall and chooses and stores a new random offset for the next syscall. With per-cpu storage for this offset, an attacker could force cpu migration during the execution of the syscall and prevent the offset from being updated for the original cpu such that it is predictable for the next syscall on that cpu. By maintaining the state per-task, this problem goes away because the per-task random offset is updated after the syscall regardless of which cpu it is executing on.

Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall") Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/ Cc: stable@vger.kernel.org Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com Signed-off-by: Kees Cook <kees@kernel.org>
2026-03-24ring-buffer: Show what clock function is used on timestamp errorsSteven Rostedt
The testing for tracing was triggering a timestamp count issue that was always off by one. This has been happening for some time but has never been reported by anyone else. It was finally discovered to be an issue with the "uptime" (jiffies) clock that happened to be traced and the internal recursion caused the discrepancy. This would have been much easier to solve if the clock function being used was displayed when the error was detected. Add the clock function to the error output. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260323202212.479bb288@gandalf.local.home Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-24Merge commit 'f35dbac6942171dc4ce9398d1d216a59224590a9' into trace/ring-buffer/coreSteven Rostedt
The commit f35dbac69421 ("ring-buffer: Fix to update per-subbuf entries of persistent ring buffer") was a fix and merged upstream. It is needed for some other work in the ring buffer. The current branch has the remote buffer code that is shared with the Arm64 subsystem and can't be rebased. Merge in the upstream commit to allow continuing of the ring buffer work. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-24bpf: Fix variable length stack write over spilled pointersAlexei Starovoitov
Scrub stack slots if a variable-offset stack write goes over spilled pointers. Otherwise is_spilled_reg() may return true while spilled_ptr.type == NOT_INIT, and a valid program is rejected by check_stack_read_fixed_off() with the obscure "invalid size of register fill" message. Fixes: 01f810ace9ed ("bpf: Allow variable-offset stack access") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260324215938.81733-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24timers: Get this_cpu once while clearing the idle stateShrikanth Hegde
smp_processor_id() behaves as follows: with CONFIG_DEBUG_PREEMPT=y it prints no warning when preemption or interrupts are disabled, and with CONFIG_DEBUG_PREEMPT=n it does nothing beyond reading __smp_processor_id(). So with either setting, in a preemption-disabled section it is better to cache the value: it saves a few cycles which, though tiny, add up when repeated. timer_clear_idle() is called with interrupts disabled, so cache the value once. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Link: https://patch.msgid.link/20260323193630.640311-5-sshegde@linux.ibm.com
2026-03-24bpf: Use RCU-safe iteration in dev_map_redirect_multi() SKB pathDavid Carlier
The DEVMAP_HASH branch in dev_map_redirect_multi() uses hlist_for_each_entry_safe() to iterate hash buckets, but this function runs under RCU protection (called from xdp_do_generic_redirect_map() in softirq context). Concurrent writers (__dev_map_hash_update_elem, dev_map_hash_delete_elem) modify the list using RCU primitives (hlist_add_head_rcu, hlist_del_rcu). hlist_for_each_entry_safe() performs plain pointer dereferences without rcu_dereference(), missing the acquire barrier needed to pair with writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64, POWER), a reader can observe a partially-constructed node. It also defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race detection. Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held() as the lockdep condition, consistent with the rcu_dereference_check() used in the DEVMAP (non-hash) branch of the same functions. Also fix the same incorrect lockdep_is_held(&dtab->index_lock) condition in dev_map_enqueue_multi(), where the lock is not held either. Fixes: e624d4ed4aa8 ("xdp: Extend xdp_redirect_map with broadcast support") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com
2026-03-24alarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng
alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com
2026-03-24module: Give MODULE_SIG_STRING a more descriptive nameThomas Weißschuh
The purpose of the constant is not entirely clear from its name. As this constant is going to be exposed in a UAPI header, give it a more specific name for clarity. As all its users call it 'marker', use that wording in the constant itself. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Nicolas Schier <nsc@kernel.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-03-24module: Give 'enum pkey_id_type' a more specific nameThomas Weißschuh
This enum originates in generic cryptographic code and has a very generic name. Nowadays it is only used for module signatures. As this enum is going to be exposed in a UAPI header, give it a more specific name for clarity and consistency. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Nicolas Schier <nsc@kernel.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-03-24bpf: update outdated comment for refactored btf_check_kfunc_arg_match()Kexin Sun
The function btf_check_kfunc_arg_match() was refactored into check_kfunc_args() by commit 00b85860feb8 ("bpf: Rewrite kfunc argument handling"). Update the comment accordingly. Assisted-by: unnamed:deepseek-v3.2 coccinelle Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260321105658.6006-1-kexinsun@smail.nju.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24bpf: Support pointer param types via SCALAR_VALUE for trampolinesSlava Imameev
Add BPF verifier support for single- and multi-level pointer parameters and return values in BPF trampolines by treating these parameters as SCALAR_VALUE. This extends the existing support for int and void pointers that are already treated as SCALAR_VALUE. This provides consistent logic for single and multi-level pointers: if a type is treated as SCALAR for a single-level pointer, the same applies to multi-level pointers. The exception is pointer-to-struct, which is currently PTR_TO_BTF_ID for single-level but treated as scalar for multi-level pointers since the verifier lacks context to infer the size of target memory regions. Safety is ensured by existing BTF verification, which rejects invalid pointer types at the BTF verification stage. Signed-off-by: Slava Imameev <slava.imameev@crowdstrike.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260314082127.7939-2-slava.imameev@crowdstrike.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24cgroup: Wait for dying tasks to leave on rmdirTejun Heo
a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") hid PF_EXITING tasks from cgroup.procs so that systemd doesn't see tasks that have already been reaped via waitpid(). However, the populated counter (nr_populated_csets) is only decremented when the task later passes through cgroup_task_dead() in finish_task_switch(). This means cgroup.procs can appear empty while the cgroup is still populated, causing rmdir to fail with -EBUSY. Fix this by making cgroup_rmdir() wait for dying tasks to fully leave. If the cgroup is populated but all remaining tasks have PF_EXITING set (the task iterator returns none due to the existing filter), wait for a kick from cgroup_task_dead() and retry. The wait is brief as tasks are removed from the cgroup's css_set between PF_EXITING assertion in do_exit() and cgroup_task_dead() in finish_task_switch(). v2: cgroup_is_populated() true to false transition happens under css_set_lock not cgroup_mutex, so retest under css_set_lock before sleeping to avoid missed wakeups (Sebastian). Fixes: a72f73c4dd9b ("cgroup: Don't expose dead tasks in cgroup") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202603222104.2c81684e-lkp@intel.com Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Michal Koutny <mkoutny@suse.com> Cc: cgroups@vger.kernel.org
2026-03-24bpf: Support 32-bit scalar spills in stacksafe()Alexei Starovoitov
v1->v2: updated comments v1: https://lore.kernel.org/bpf/20260322225124.14005-1-alexei.starovoitov@gmail.com/ The commit 6efbde200bf3 ("bpf: Handle scalar spill vs all MISC in stacksafe()") in stacksafe() only recognized full 64-bit scalar spills when comparing stack states for equivalence during state pruning and missed 32-bit scalar spills. When a 32-bit scalar is spilled, check_stack_write_fixed_off() -> save_register_state() calls mark_stack_slot_misc() for slot[0-3], which preserves STACK_INVALID and STACK_ZERO (on a fresh stack slot[0-3] remain STACK_INVALID), sets slot[4-7] = STACK_SPILL, and updates spilled_ptr. The im=4 path is only reached when im=0 fails: the loop at im=0 already attempts the 64-bit scalar-spill/all-MISC check. If it matches, i advances by 7, skipping the entire 8-byte slot. So im=4 is only reached when bytes 0-3 are neither a scalar spill nor all-MISC — they must pass individual byte-by-byte comparison first. Then bytes 4-7 get the scalar-unit treatment. is_spilled_scalar_after(stack, 4): slot_type[4] == STACK_SPILL from a 64-bit spill would have been caught at im=0 (unless it's a pointer spill, in which case spilled_ptr.type != SCALAR_VALUE -> returns false at im=4 too). A partial overwrite of a 64-bit spill invalidates the entire slot in check_stack_write_fixed_off(). is_stack_misc_after(stack, 4): Only checks bytes 4-7 are MISC/INVALID, returns &unbound_reg. Comparing two unbound regs via regsafe() is safe.
Changes to cilium programs:

File             Program                            Insns (A)  Insns (B)  Insns (DIFF)
_______________  _________________________________  _________  _________  ________________
bpf_host.o       cil_host_policy                        49351      45811   -3540 (-7.17%)
bpf_host.o       cil_to_host                             2384       2270    -114 (-4.78%)
bpf_host.o       cil_to_netdev                         112051     100269  -11782 (-10.51%)
bpf_host.o       tail_handle_ipv4_cont_from_host        61175      60910    -265 (-0.43%)
bpf_host.o       tail_handle_ipv4_cont_from_netdev       9381       8873    -508 (-5.42%)
bpf_host.o       tail_handle_ipv4_from_host             12994       7066   -5928 (-45.62%)
bpf_host.o       tail_handle_ipv4_from_netdev           85015      59875  -25140 (-29.57%)
bpf_host.o       tail_handle_ipv6_cont_from_host        24732      23527   -1205 (-4.87%)
bpf_host.o       tail_handle_ipv6_cont_from_netdev       9463       8953    -510 (-5.39%)
bpf_host.o       tail_handle_ipv6_from_host             12477      11787    -690 (-5.53%)
bpf_host.o       tail_handle_ipv6_from_netdev           30814      30017    -797 (-2.59%)
bpf_host.o       tail_handle_nat_fwd_ipv4                8943       8860     -83 (-0.93%)
bpf_host.o       tail_handle_snat_fwd_ipv4              64716      61625   -3091 (-4.78%)
bpf_host.o       tail_handle_snat_fwd_ipv6              48299      30797  -17502 (-36.24%)
bpf_host.o       tail_ipv4_host_policy_ingress          21591      20017   -1574 (-7.29%)
bpf_host.o       tail_ipv6_host_policy_ingress          21177      20693    -484 (-2.29%)
bpf_host.o       tail_nodeport_nat_egress_ipv4          16588      16543     -45 (-0.27%)
bpf_host.o       tail_nodeport_nat_ingress_ipv4         39200      36116   -3084 (-7.87%)
bpf_host.o       tail_nodeport_nat_ingress_ipv6         50102      48003   -2099 (-4.19%)
bpf_lxc.o        tail_handle_ipv4_cont                 113092      96891  -16201 (-14.33%)
bpf_lxc.o        tail_handle_ipv6                        6727       6701     -26 (-0.39%)
bpf_lxc.o        tail_handle_ipv6_cont                  25567      21805   -3762 (-14.71%)
bpf_lxc.o        tail_ipv4_ct_egress                    28843      15970  -12873 (-44.63%)
bpf_lxc.o        tail_ipv4_ct_ingress                   16691      10213   -6478 (-38.81%)
bpf_lxc.o        tail_ipv4_ct_ingress_policy_only       16691      10213   -6478 (-38.81%)
bpf_lxc.o        tail_ipv4_policy                        6776       6622    -154 (-2.27%)
bpf_lxc.o        tail_ipv4_to_endpoint                   7523       7219    -304 (-4.04%)
bpf_lxc.o        tail_ipv6_ct_egress                    10275       9999    -276 (-2.69%)
bpf_lxc.o        tail_ipv6_ct_ingress                    6466       6438     -28 (-0.43%)
bpf_lxc.o        tail_ipv6_ct_ingress_policy_only        6466       6438     -28 (-0.43%)
bpf_lxc.o        tail_ipv6_policy                        6859       5159   -1700 (-24.78%)
bpf_lxc.o        tail_ipv6_to_endpoint                   7039       4427   -2612 (-37.11%)
bpf_lxc.o        tail_nodeport_ipv6_dsr                  1175       1033    -142 (-12.09%)
bpf_lxc.o        tail_nodeport_nat_egress_ipv4          16318      16292     -26 (-0.16%)
bpf_lxc.o        tail_nodeport_nat_ingress_ipv4         18907      18490    -417 (-2.21%)
bpf_lxc.o        tail_nodeport_nat_ingress_ipv6         14624      14556     -68 (-0.46%)
bpf_lxc.o        tail_nodeport_rev_dnat_ipv4             4776       4588    -188 (-3.94%)
bpf_overlay.o    tail_handle_inter_cluster_revsnat      15733      15498    -235 (-1.49%)
bpf_overlay.o    tail_handle_ipv4                      124682     105717  -18965 (-15.21%)
bpf_overlay.o    tail_handle_ipv6                       16201      15801    -400 (-2.47%)
bpf_overlay.o    tail_handle_snat_fwd_ipv4              21280      19323   -1957 (-9.20%)
bpf_overlay.o    tail_handle_snat_fwd_ipv6              20824      20822      -2 (-0.01%)
bpf_overlay.o    tail_nodeport_ipv6_dsr                  1175       1033    -142 (-12.09%)
bpf_overlay.o    tail_nodeport_nat_egress_ipv4          16293      16267     -26 (-0.16%)
bpf_overlay.o    tail_nodeport_nat_ingress_ipv4         20841      20737    -104 (-0.50%)
bpf_overlay.o    tail_nodeport_nat_ingress_ipv6         14678      14629     -49 (-0.33%)
bpf_sock.o       cil_sock4_connect                       1678       1623     -55 (-3.28%)
bpf_sock.o       cil_sock4_sendmsg                       1791       1736     -55 (-3.07%)
bpf_sock.o       cil_sock6_connect                       3641       3600     -41 (-1.13%)
bpf_sock.o       cil_sock6_recvmsg                       2048       1899    -149 (-7.28%)
bpf_sock.o       cil_sock6_sendmsg                       3755       3721     -34 (-0.91%)
bpf_wireguard.o  tail_handle_ipv4                       31180      27484   -3696 (-11.85%)
bpf_wireguard.o  tail_handle_ipv6                       12095      11760    -335 (-2.77%)
bpf_wireguard.o  tail_nodeport_ipv6_dsr                  1232       1094    -138 (-11.20%)
bpf_wireguard.o  tail_nodeport_nat_egress_ipv4          16071      16061     -10 (-0.06%)
bpf_wireguard.o  tail_nodeport_nat_ingress_ipv4         20804      20565    -239 (-1.15%)
bpf_wireguard.o  tail_nodeport_nat_ingress_ipv6         13490      12224   -1266 (-9.38%)
bpf_xdp.o        tail_lb_ipv4                           49695      42673   -7022 (-14.13%)
bpf_xdp.o        tail_lb_ipv6                          122683      87896  -34787 (-28.36%)
bpf_xdp.o        tail_nodeport_ipv6_dsr                  1833       1862     +29 (+1.58%)
bpf_xdp.o        tail_nodeport_nat_egress_ipv4           6999       6990      -9 (-0.13%)
bpf_xdp.o        tail_nodeport_nat_ingress_ipv4         28903      28780    -123 (-0.43%)
bpf_xdp.o        tail_nodeport_nat_ingress_ipv6        200361     197771   -2590 (-1.29%)
bpf_xdp.o        tail_nodeport_rev_dnat_ipv4             4606       4454    -152 (-3.30%)

Changes to sched-ext:

File                       Program           Insns (A)  Insns (B)  Insns (DIFF)
_________________________  ________________  _________  _________  _______________
scx_arena_selftests.bpf.o  arena_selftest       236305     236251     -54 (-0.02%)
scx_chaos.bpf.o            chaos_dispatch        12282       8013   -4269 (-34.76%)
scx_chaos.bpf.o            chaos_enqueue         11398       7126   -4272 (-37.48%)
scx_chaos.bpf.o            chaos_init             3854       3828     -26 (-0.67%)
scx_flash.bpf.o            flash_init             1015        979     -36 (-3.55%)
scx_flatcg.bpf.o           fcg_dispatch           1143       1100     -43 (-3.76%)
scx_lavd.bpf.o             lavd_enqueue          35487      35472     -15 (-0.04%)
scx_lavd.bpf.o             lavd_init             21127      21107     -20 (-0.09%)
scx_p2dq.bpf.o             p2dq_enqueue          10210       7854   -2356 (-23.08%)
scx_p2dq.bpf.o             p2dq_init              3233       3207     -26 (-0.80%)
scx_qmap.bpf.o             qmap_init             20285      20230     -55 (-0.27%)
scx_rusty.bpf.o            rusty_select_cpu       1165       1148     -17 (-1.46%)
scxtop.bpf.o               on_sched_switch        2369       2355     -14 (-0.59%)

Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260323022410.75444-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24bpf: Fix refcount check in check_struct_ops_btf_id()Keisuke Nishimura
The current implementation only checks whether the first argument is refcounted. Fix this by iterating over all arguments. Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Fixes: 38f1e66abd184 ("bpf: Do not allow tail call in strcut_ops program with __ref argument") Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260320130219.63711-1-keisuke.nishimura@inria.fr Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24bpf: propagate kvmemdup_bpfptr errors from bpf_prog_verify_signatureWeixie Cui
kvmemdup_bpfptr() returns -EFAULT when the user pointer cannot be copied, and -ENOMEM on allocation failure. The error path always returned -ENOMEM, misreporting bad addresses as out-of-memory. Return PTR_ERR(sig) so user space gets the correct errno. Signed-off-by: Weixie Cui <cuiweixie@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/tencent_C9C5B2B28413D6303D505CD02BFEA4708C07@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-25tracing: fprobe: do not zero out unused fgraph_dataMartin Kaiser
If fprobe_entry does not fill the allocated fgraph_data completely, the unused part does not have to be zeroed. fgraph_data is a short-lived part of the shadow stack. The preceding length field allows locating the end regardless of the content. Link: https://lore.kernel.org/all/20260324084804.375764-1-martin@kaiser.cx/ Signed-off-by: Martin Kaiser <martin@kaiser.cx> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-24bpf: Simplify tnum_step()Hao Sun
Simplify tnum_step() from a 10-variable algorithm into a straight-line sequence of bitwise operations. Problem Reduction: tnum_step(): Given a tnum `(tval, tmask)` where `tval & tmask == 0`, and a value `z` with `tval ≤ z < (tval | tmask)`, find the smallest `r > z`, a tnum-satisfying value, i.e., `r & ~tmask == tval`. Every tnum-satisfying value has the form tval | s where s is a subset of tmask bits (s & ~tmask == 0). Since tval and tmask are disjoint: tval | s = tval + s Similarly z = tval + d where d = z - tval, so r > z becomes: tval + s > tval + d s > d The problem reduces to: find the smallest s, a subset of tmask, such that s > d. Since `s` must be a subset of tmask, the problem is now simplified. Algorithm: The mask bits of `d` form a "counter" that we want to increment by one, but the counter has gaps at the fixed-bit positions. A normal +1 would stop at the first 0-bit it meets; we need it to skip over fixed-bit gaps and land on the next mask bit. Step 1 -- plug the gaps: d | carry_mask | ~tmask - ~tmask fills all fixed-bit positions with 1. - carry_mask = (1 << fls64(d & ~tmask)) - 1 fills all positions (including mask positions) below the highest non-mask bit of d. After this, the only remaining 0s are mask bits above the highest non-mask bit of d where d is also 0 -- exactly the positions where the carry can validly land. Step 2 -- increment: (d | carry_mask | ~tmask) + 1 Adding 1 flips all trailing 1s to 0 and sets the first 0 to 1. Since every gap has been plugged, that first 0 is guaranteed to be a mask bit above all non-mask bits of d. Step 3 -- mask: ((d | carry_mask | ~tmask) + 1) & tmask Strip the scaffolding, keeping only mask bits. Call the result inc. Step 4 -- result: tval | inc Reattach the fixed bits.
A simple 8-bit example: tmask: 1 1 0 1 0 1 1 0 d: 1 0 1 0 0 0 1 0 (d = 162) ^ non-mask 1 at bit 5 With carry_mask = 0b00111111 (smeared from bit 5): d|carry|~tm 1 0 1 1 1 1 1 1 + 1 1 1 0 0 0 0 0 0 & tmask 1 1 0 0 0 0 0 0 The patch passes my local test: test_verifier, test_progs for `-t verifier` and `-t reg_bounds`. CBMC shows the new code is equiv to original one[1], and a lean4 proof of correctness is available[2]: theorem tnumStep_correct (tval tmask z : BitVec 64) -- Precondition: valid tnum and input z (h_consistent : (tval &&& tmask) = 0) (h_lo : tval ≤ z) (h_hi : z < (tval ||| tmask)) : -- Postcondition: r must be: -- (1) tnum member -- (2) z < r -- (3) for any other member w > z, r <= w let r := tnumStep tval tmask z satisfiesTnum64 r tval tmask ∧ tval ≤ r ∧ r ≤ (tval ||| tmask) ∧ z < r ∧ ∀ w, satisfiesTnum64 w tval tmask → z < w → r ≤ w := by -- unfold definition unfold tnumStep satisfiesTnum64 simp only [] refine ⟨?_, ?_, ?_, ?_, ?_⟩ -- the solver proves each conjunct · bv_decide · bv_decide · bv_decide · bv_decide · intro w hw1 hw2; bv_decide [1] https://github.com/eddyz87/tnum-step-verif/blob/master/main.c [2] https://pastebin.com/raw/czHKiyY0 Signed-off-by: Hao Sun <hao.sun@inf.ethz.ch> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reviewed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com> Link: https://lore.kernel.org/r/20260320162336.166542-1-hao.sun@inf.ethz.ch Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24bpf: Switch CONFIG_CFI_CLANG to CONFIG_CFICarlos Llamas
This was renamed in commit 23ef9d439769 ("kcfi: Rename CONFIG_CFI_CLANG to CONFIG_CFI") as it is now a compiler-agnostic option. Using the wrong name results in the code getting compiled out. Meaning the CFI failures for btf_dtor_kfunc_t would still trigger. Fixes: 99fde4d06261 ("bpf, btf: Enforce destructor kfunc type with CFI") Signed-off-by: Carlos Llamas <cmllamas@google.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260312183818.2721750-1-cmllamas@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24bpf: Remove inclusions of crypto/sha1.hEric Biggers
Since commit 603b44162325 ("bpf: Update the bpf_prog_calc_tag to use SHA256") made BPF program tags use SHA-256 instead of SHA-1, the header <crypto/sha1.h> no longer needs to be included. Remove the relevant inclusions so that they no longer unnecessarily come up in searches for which kernel code is still using the obsolete SHA-1 algorithm. Since net/ipv6/addrconf.c was relying on the transitive inclusion of <crypto/sha1.h> (for an unrelated purpose) via <linux/filter.h>, make it include <crypto/sha1.h> explicitly in order to keep that file building. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260314214555.112386-1-ebiggers@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-24sched/core: Get this cpu once in ttwu_queue_cond()Shrikanth Hegde
Calling smp_processor_id(): - With CONFIG_DEBUG_PREEMPT=y, if preemption/irqs are disabled, it does not print any warning. - With CONFIG_DEBUG_PREEMPT=n, it does nothing apart from reading __smp_processor_id. So with both CONFIG_DEBUG_PREEMPT=y/n, it is better to cache the value inside a preemption-disabled section. It could save a few cycles; though tiny, repeated calls add up. ttwu_queue_cond() is called with interrupts disabled, so preemption is disabled. Hence cache the value once. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Link: https://patch.msgid.link/20260323193630.640311-3-sshegde@linux.ibm.com
2026-03-24sched/fair: Get this cpu once in find_new_ilb()Shrikanth Hegde
Calling smp_processor_id(): - With CONFIG_DEBUG_PREEMPT=y, if preemption/irqs are disabled, it does not print any warning. - With CONFIG_DEBUG_PREEMPT=n, it does nothing apart from reading __smp_processor_id. So with both CONFIG_DEBUG_PREEMPT=y/n, it is better to cache the value inside a preemption-disabled section. It could save a few cycles; though tiny, repeated calls in a loop add up. find_new_ilb() is called in interrupt context, so preemption is disabled. Hence hoist the this_cpu read out of the loop. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260323193630.640311-2-sshegde@linux.ibm.com
2026-03-23tracing: Pretty-print enum parameters in function argumentsDonglin Peng
Currently, print_function_args() prints enum parameter values in decimal format, reducing trace log readability. Use BTF information to resolve enum parameters and print their symbolic names (where available). This improves readability by showing meaningful identifiers instead of raw numbers. Before: mod_memcg_lruvec_state(lruvec=0xffff..., idx=5, val=320) After: mod_memcg_lruvec_state(lruvec=0xffff..., idx=5 [NR_SLAB_RECLAIMABLE_B], val=320) Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Link: https://patch.msgid.link/20260209071949.4040193-1-dolinux.peng@gmail.com Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-23tracing: Free up file->private_data for use by individual eventsPetr Pavlu
The tracing_open_file_tr() function currently copies the trace_event_file pointer from inode->i_private to file->private_data when the file is successfully opened. This duplication is not particularly useful, as all event code should utilize event_file_file() or event_file_data() to retrieve a trace_event_file pointer from a file struct and these access functions read file->f_inode->i_private. Moreover, this setup requires the code for opening hist files to explicitly clear file->private_data before calling single_open(), since this function expects the private_data member to be set to NULL and uses it to store a pointer to a seq_file. Remove the unnecessary setting of file->private_data in tracing_open_file_tr() and simplify the hist code. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-6-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-23tracing: Clean up access to trace_event_file from a file pointerPetr Pavlu
The tracing code provides two functions event_file_file() and event_file_data() to obtain a trace_event_file pointer from a file struct. The primary method to use is event_file_file(), as it checks for the EVENT_FILE_FL_FREED flag to determine whether the event is being removed. The second function event_file_data() is an optimization for retrieving the same data when the event_mutex is still held. In the past, when removing an event directory in remove_event_file_dir(), the code set i_private to NULL for all event files and readers were expected to check for this state to recognize that the event is being removed. In the case of event_id_read(), the value was read using event_file_data() without acquiring the event_mutex. This required event_file_data() to use READ_ONCE() when retrieving the i_private data. With the introduction of eventfs, i_private is assigned when an eventfs inode is allocated and remains set throughout its lifetime. Remove the now unnecessary READ_ONCE() access to i_private in both event_file_file() and event_file_data(). Inline the access to i_private in remove_event_file_dir(), which allows event_file_data() to handle i_private solely as a trace_event_file pointer. Add a check in event_file_data() to ensure that the event_mutex is held and that file->flags doesn't have the EVENT_FILE_FL_FREED flag set. Finally, move event_file_data() immediately after event_file_file() since the latter provides a comment explaining how both functions should be used together. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Tom Zanussi <zanussi@kernel.org> Link: https://patch.msgid.link/20260219162737.314231-5-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>