path: root/kernel
2026-03-18  sched/core: Check for rcu_read_lock_any_held() in idle_get_state()  (K Prateek Nayak)

Similar to commit 71fedc41c23b ("sched/fair: Switch to rcu_dereference_all()"), switch to checking for rcu_read_lock_any_held() in idle_get_state() to allow removing superfluous rcu_read_lock() regions in the fair task's wakeup path, where the pi_lock is held and IRQs are disabled.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-7-kprateek.nayak@amd.com
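As an illustration, a minimal sketch of the kind of check this describes; rcu_read_lock_any_held() accepts any RCU reader context, including IRQ-disabled regions, so an explicit rcu_read_lock() around the lookup becomes unnecessary. The function body below is hedged and illustrative, not the exact patched code:

    /* Illustrative sketch only; the real helper lives in kernel/sched/sched.h. */
    static inline struct cpuidle_state *idle_get_state(struct rq *rq)
    {
        /* Accept any RCU reader context (classic, bh, sched, or an
         * IRQ-disabled region) rather than only rcu_read_lock(). */
        WARN_ON_ONCE(!rcu_read_lock_any_held());

        return rq->idle_state;
    }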
2026-03-18  sched/topology: Remove sched_domain_shared allocation with sd_data  (K Prateek Nayak)

Now that "sd->shared" assignments are using the sched_domain_shared objects allocated with s_data, remove the sd_data based allocations.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-6-kprateek.nayak@amd.com
2026-03-18  sched/topology: Switch to assigning "sd->shared" from s_data  (K Prateek Nayak)

Use the "sched_domain_shared" object allocated in s_data for "sd->shared" assignments. Assign "sd->shared" for the topmost SD_SHARE_LLC domain before degeneration and rely on the degeneration path to correctly pass down the shared object to "sd_llc". sd_degenerate_parent() ensures degenerating domains must have the same sched_domain_span(), which ensures 1:1 passing down of the shared object. If the topmost SD_SHARE_LLC domain degenerates, the shared object is freed from destroy_sched_domain() when the last reference is dropped.

claim_allocations() NULLs out the objects that have been assigned as "sd->shared" and the unassigned ones are freed from the __sds_free() path. To keep all the claim_allocations() bits in one place, claim_allocations() has been extended to accept "s_data" and iterate the domains internally to free both "sched_domain_shared" and the per-topology-level data for the particular CPU in one place.

Post cpu_attach_domain(), all reclaims of "sd->shared" are handled via call_rcu() on the sched_domain object via destroy_sched_domains_rcu().

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-5-kprateek.nayak@amd.com
2026-03-18  sched/topology: Allocate per-CPU sched_domain_shared in s_data  (K Prateek Nayak)

The "sched_domain_shared" object is allocated for every topology level in __sdt_alloc() and is freed post sched domain rebuild if it isn't assigned during sd_init(). "sd->shared" is only assigned for SD_SHARE_LLC domains, and out of all the assigned objects, only "sd_llc_shared" is ever used by the scheduler.

Since only "sd_llc_shared" is ever used, and since SD_SHARE_LLC domains never overlap, allocate only a single range of per-CPU "sched_domain_shared" objects with s_data instead of doing it per topology level. The subsequent commit uses the degeneration path to correctly assign "sd->shared" to the topmost SD_SHARE_LLC domain.

No functional changes are expected at this point.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-4-kprateek.nayak@amd.com
2026-03-18  sched/topology: Extract "imb_numa_nr" calculation into a separate helper  (K Prateek Nayak)

Subsequent changes to assign "sd->shared" from "s_data" would necessitate finding the topmost SD_SHARE_LLC domain to assign the shared object to. This is very similar to the "imb_numa_nr" computation loop, except that "imb_numa_nr" cares about the first domain without the SD_SHARE_LLC flag (the immediate parent of sd_llc) whereas the "sd->shared" assignment requires sd_llc itself.

Extract the "imb_numa_nr" calculation into a helper, adjust_numa_imbalance(), and use the current loop in build_sched_domains() to find the sd_llc. While at it, guard the call behind CONFIG_NUMA since "imb_numa_nr" only makes sense on NUMA-enabled configs with SD_NUMA domains.

No functional changes intended.

Suggested-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-3-kprateek.nayak@amd.com
2026-03-18  sched/topology: Compute sd_weight considering cpuset partitions  (K Prateek Nayak)

The "sd_weight" used for calculating the load balancing interval, and its limits, considers the span weight of the entire topology level without accounting for cpuset partitions. For example, consider a large system of 128 CPUs divided into 8 partitions of 16 CPUs, which is typical when deploying virtual machines:

    [                    PKG Domain: 128 CPUs                    ]
    [Partition0: 16 CPUs][Partition1: 16 CPUs] ... [Partition7: 16 CPUs]

Although each partition only contains 16 CPUs, the load balancing interval is set to a minimum of 128 jiffies considering the span of the entire domain with 128 CPUs, which can lead to longer imbalances within a partition even though balancing within its 16 CPUs is cheaper.

Compute the "sd_weight" after computing the "sd_span" considering the cpu_map covered by the partition, and set the load balancing interval, and its limits, accordingly. For the above example, the balancing intervals for the partitions' PKG domain change as follows:

                        before   after
    balance_interval       128      16
    min_interval           128      16
    max_interval           256      32

Intervals are now proportional to the CPUs in the partitioned domain, as was intended by the original formula.

Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://patch.msgid.link/20260312044434.1974-2-kprateek.nayak@amd.com
2026-03-18  tracing: Restore accidentally removed SPDX tag  (Marc Zyngier)

Restore the SPDX tag that was accidentally dropped.

Fixes: 7e4b6c94300e3 ("tracing: add more symbols to whitelist")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Arnd Bergmann <arnd@kernel.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260317194252.1890568-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-17  workqueue: Remove NULL wq WARN in __queue_delayed_work()  (Tejun Heo)

Remove the WARN_ON_ONCE(!wq) which doesn't serve any useful purpose.

Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17  sched_ext: Fix typos in comments  (zhidao su)

Fix five typos across three files:

- kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55)
- kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56)
- kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881)
- kernel/sched/ext_idle.c: 'don't perfectly overlaps' -> 'don't perfectly overlap' (line 371)
- tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21)

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17  genirq/matrix, LoongArch: Delete IRQ_MATRIX_BITS leftovers  (Nam Cao)

Delete IRQ_MATRIX_BITS leftovers after commit 5b98d210ac1e ("genirq/matrix: Dynamic bitmap allocation") has made IRQ_MATRIX_BITS obsolete.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260316072850.467995-1-namcao@linutronix.de
2026-03-17  tracing: Generate undef symbols allowlist for simple_ring_buffer  (Vincent Donnefort)

Compiler- and tooling-generated symbols are difficult to maintain across all supported architectures. Make the allowlist more robust by replacing the hardcoded list with a mechanism that automatically detects these symbols. This mechanism generates a C function designed to trigger common compiler-inserted symbols.

Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260316092845.3367411-1-vdonnefort@google.com
[maz: added __msan prefix to allowlist as pointed out by Arnd]
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-17  Merge tag 'v7.0-rc4' into sched/core, to pick up scheduler fixes  (Ingo Molnar)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-03-16  sched: idle: Consolidate the handling of two special cases  (Rafael J. Wysocki)

There are two special cases in the idle loop that are handled inconsistently even though they are analogous.

The first one is when a cpuidle driver is absent and the default CPU idle time power management implemented by the architecture code is used. In that case, the scheduler tick is stopped every time before invoking default_idle_call(). The second one is when a cpuidle driver is present, but there is only one idle state in its table. In that case, the scheduler tick is never stopped at all.

Since each of these approaches has its drawbacks, reconcile them with the help of one simple heuristic. Namely, stop the tick if the CPU has been woken up by it in the previous iteration of the idle loop, or let it tick otherwise.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Qais Yousef <qyousef@layalina.io>
Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Fixes: ed98c3491998 ("sched: idle: Do not stop the tick before cpuidle_idle_call()")
[ rjw: Added Fixes tag, changelog edits ]
Link: https://patch.msgid.link/4741364.LvFx2qVVIh@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
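The heuristic reduces to a two-way decision per idle-loop iteration. A hedged sketch follows; woken_by_tick is an illustrative name for whatever state the patch records, not necessarily the real variable:

    if (woken_by_tick)
        tick_nohz_idle_stop_tick();    /* the tick woke us last time: stop it */
    else
        tick_nohz_idle_retain_tick();  /* otherwise keep the tick running */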
2026-03-16  Merge tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)

Pull misc fixes from Andrew Morton:
 "6 hotfixes. 4 are cc:stable. 3 are for MM. All are singletons - please see the changelogs for details"

* tag 'mm-hotfixes-stable-2026-03-16-12-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: update email address for Ignat Korchagin
  mm/huge_memory: fix early failure try_to_migrate() when split huge pmd for shared THP
  mm/rmap: fix incorrect pte restoration for lazyfree folios
  mm/huge_memory: fix use of NULL folio in move_pages_huge_pmd()
  build_bug.h: correct function parameters names in kernel-doc
  crash_dump: don't log dm-crypt key bytes in read_key_from_user_keying
2026-03-16  bpf: Only enforce 8 frame call stack limit for all-static stacks  (Emil Tsalapatis)

The BPF verifier currently enforces a call stack depth of 8 frames, regardless of the actual stack space consumption of those frames. The limit is necessary for static call stacks, because the bookkeeping data structures used by the verifier when stepping into static functions during verification only support 8 stack frames. However, this limitation only matters for static stack frames: global subprogs are verified by themselves and do not require limiting the call depth.

Relax this limitation to only apply to static stack frames. Verification now only fails when there is a sequence of 8 calls to non-global subprogs. Calling into a global subprog resets the counter. This allows deeper call stacks, provided all frames still fit in the stack. The change does not increase the maximum size of the call stack, only the maximum number of frames we can place in it.

Also change the progs/test_global_func3.c selftest to use static functions, since with the new patch it would otherwise unexpectedly pass verification.

Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260316161225.128011-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
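Conceptually the relaxed check might look like this hedged sketch; the names (static_depth in particular) are illustrative, not the verifier's actual bookkeeping:

    if (subprog_is_global(env, subprog)) {
        static_depth = 0;    /* global subprogs are verified separately;
                              * a call into one resets the counter */
    } else if (++static_depth > MAX_CALL_FRAMES) {
        /* only runs of consecutive static frames are limited to 8 */
        verbose(env, "too many consecutive static subprog calls\n");
        return -E2BIG;
    }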
2026-03-16  sched_ext: Fix slab-out-of-bounds in scx_alloc_and_add_sched()  (Cheng-Yang Chou)

ancestors[] is a flexible array member that needs level + 1 slots to hold all ancestors including self (indices 0..level), but kzalloc_flex() only allocates `level` slots:

    sch = kzalloc_flex(*sch, ancestors, level);
    ...
    sch->ancestors[level] = sch; /* one past the end */

For the root scheduler (level = 0), zero slots are allocated and ancestors[0] is written immediately past the end of the object. KASAN reports:

    BUG: KASAN: slab-out-of-bounds in scx_alloc_and_add_sched+0x1c17/0x1d10
    Write of size 8 at addr ffff888066b56538 by task scx_enable_help/667
    The buggy address is located 0 bytes to the right of
    allocated 1336-byte region [ffff888066b56000, ffff888066b56538)

Fix by passing level + 1 to kzalloc_flex(). Tested with vng + scx_lavd, KASAN no longer triggers.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
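The fix is a one-token change. As a hedged sketch, the equivalent expressed with the kernel's standard struct_size() helper (the commit itself uses kzalloc_flex()):

    /* ancestors[0..level] needs level + 1 slots, even for level == 0. */
    sch = kzalloc(struct_size(sch, ancestors, level + 1), GFP_KERNEL);
    if (!sch)
        return ERR_PTR(-ENOMEM);

    sch->ancestors[level] = sch;    /* now within the allocation */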
2026-03-16  locking: Add lock context annotations in the spinlock implementation  (Bart Van Assche)

Make the spinlock implementation compatible with lock context analysis (CONTEXT_ANALYSIS := 1) by adding lock context annotations to the _raw_##op##_...() macros.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260313171510.230998-4-bvanassche@acm.org
2026-03-16  jump_label: use ATOMIC_INIT() for initialization of .enabled  (Thomas Weißschuh)

Currently ATOMIC_INIT() is not used because, in the past, that macro was provided by linux/atomic.h, which is not usable from linux/jump_label.h. However, since commit 7ca8cf5347f7 ("locking/atomic: Move ATOMIC_INIT into linux/types.h"), the macro only requires linux/types.h.

Remove the now unnecessary workaround and the associated assertions.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260313-jump_label-cleanup-v2-1-35d3c0bde549@linutronix.de
2026-03-16  futex: Convert to compiler context analysis  (Peter Zijlstra)

Convert the sparse annotations over to the new compiler context analysis stuff.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260121111213.950376128@infradead.org
2026-03-16  locking/rwsem: Fix logic error in rwsem_del_waiter()  (Andrei Vagin)

Commit 1ea4b473504b ("locking/rwsem: Remove the list_head from struct rw_semaphore") introduced a logic error in rwsem_del_waiter(). The root cause of this issue is an inconsistency in the return values of __rwsem_del_waiter() and rwsem_del_waiter(). Specifically, __rwsem_del_waiter() returns true when the wait list becomes empty, whereas rwsem_del_waiter() is supposed to return true if the wait list is NOT empty.

This caused a NULL pointer dereference in rwsem_mark_wake() because it was being called when sem->first_waiter was NULL.

Fixes: 1ea4b473504b ("locking/rwsem: Remove the list_head from struct rw_semaphore")
Reported-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: syzbot+3d2ff92c67127d337463@syzkaller.appspotmail.com
Link: https://patch.msgid.link/20260314182607.3343346-1-avagin@google.com
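A hedged sketch of the inverted convention (the real rwsem internals are more involved):

    /* __rwsem_del_waiter() reports "wait list became empty": */
    bool emptied = __rwsem_del_waiter(sem, waiter);

    /* ...but rwsem_del_waiter()'s contract is "waiters remain", so the
     * result must be negated before callers such as rwsem_mark_wake()
     * act on it: */
    return !emptied;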
2026-03-16  dma: swiotlb: add KMSAN annotations to swiotlb_bounce()  (Shigeru Yoshida)

When a device performs DMA to a bounce buffer, KMSAN is unaware of the write and does not mark the data as initialized. When swiotlb_bounce() later copies the bounce buffer back to the original buffer, memcpy propagates the uninitialized shadow to the original buffer, causing false-positive uninit-value reports.

Fix this by calling kmsan_unpoison_memory() on the bounce buffer before copying it back in the DMA_FROM_DEVICE path, so that memcpy naturally propagates initialized shadow to the destination.

Suggested-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/CAG_fn=WUGta-paG1BgsGRoAR+fmuCgh3xo=R3XdzOt_-DqSdHw@mail.gmail.com/
Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings")
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260315082750.2375581-1-syoshida@redhat.com
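A hedged, simplified sketch of the DMA_FROM_DEVICE path in swiotlb_bounce() after the fix (variable names illustrative):

    if (dir == DMA_FROM_DEVICE) {
        /* The device wrote this data, so mark it initialized; the
         * memcpy() below then propagates clean KMSAN shadow. */
        kmsan_unpoison_memory(bounce_vaddr, size);
        memcpy(orig_vaddr, bounce_vaddr, size);
    }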
2026-03-15  sched_ext: Use kobject_put() for kobject_init_and_add() failure in scx_alloc_and_add_sched()  (Tejun Heo)

kobject_init_and_add() failure requires kobject_put() for proper cleanup, but the error paths were using kfree(sch), possibly leaking the kobject name. The kset_create_and_add() failure was already using kobject_put() correctly. Switch the kobject_init_and_add() error paths to use kobject_put().

As the release path puts the cgroup ref, make scx_alloc_and_add_sched() always consume @cgrp via a new err_put_cgrp label at the bottom of the error chain and update scx_sub_enable_workfn() accordingly.

Fixes: 17108735b47d ("sched_ext: Use dynamic allocation for scx_sched")
Reported-by: David Carlier <devnexen@gmail.com>
Link: https://lore.kernel.org/r/20260314134457.46216-1-devnexen@gmail.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
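A hedged sketch of the corrected error path; the label follows the description above, while the ktype, parent, and name arguments are illustrative:

    ret = kobject_init_and_add(&sch->kobj, &scx_ktype, parent_kobj, "root");
    if (ret) {
        /* kobject_put(), not kfree(): the kobject may already own its
         * name, and ->release() must run to free everything. */
        kobject_put(&sch->kobj);
        goto err_put_cgrp;    /* error chain also drops the cgroup ref */
    }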
2026-03-15  sched_ext: Fix cgroup double-put on sub-sched abort path  (Tejun Heo)

The abort path in scx_sub_enable_workfn() fell through to out_put_cgrp, double-putting the cgroup ref already owned by sch->cgrp. It also skipped the kthread_flush_work() needed to flush the disable path. Relocate the abort block above err_unlock_and_disable so it falls through to err_disable.

Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-15  Merge tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds)

Pull probes fixes from Masami Hiramatsu:

 - Avoid crash when rmmod/insmod after ftrace killed

   This fixes a kernel crash caused by kprobes on a symbol in a module which is unloaded after ftrace_kill() is called.

 - Remove unneeded warnings from __arm_kprobe_ftrace()

   Remove unneeded WARN messages which can be triggered if the kprobe is using ftrace and it fails to enable the ftrace. Since kprobes correctly handles such failure, we don't need to warn about it.

* tag 'probes-fixes-v7.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()
  kprobes: avoid crash when rmmod/insmod after ftrace killed
2026-03-15  Merge tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull timer fix from Ingo Molnar:
 "Fix function tracer recursion bug by marking jiffies_64_to_clock_t() notrace"

* tag 'timers-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/jiffies: Mark jiffies_64_to_clock_t() notrace
2026-03-15  Merge tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)

Pull scheduler fixes from Ingo Molnar:
 "More MM-CID fixes, mostly fixing hangs/races:

  - Fix CID hangs due to a race between concurrent forks

  - Fix vfork()/CLONE_VM MMCID bug causing hangs

  - Remove pointless preemption guard

  - Fix CID task list walk performance regression on large systems by removing the known-flaky and slow counting logic using for_each_process_thread() in mm_cid_*fixup_tasks_to_cpus(), and implementing a simple sched_mm_cid::node list instead"

* tag 'sched-urgent-2026-03-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/mmcid: Avoid full tasklist walks
  sched/mmcid: Remove pointless preempt guard
  sched/mmcid: Handle vfork()/CLONE_VM correctly
  sched/mmcid: Prevent CID stalls due to concurrent forks
2026-03-13  sched_ext: Fix uninitialized ret in scx_alloc_and_add_sched()  (Cheng-Yang Chou)

Under CONFIG_EXT_SUB_SCHED, the kzalloc() and kstrdup() failure paths jump to err_stop_helper without first setting ret. The function then returns ERR_PTR(ret) with ret uninitialized, which can produce ERR_PTR(0) (NULL), causing the caller's IS_ERR() check to pass and leading to a NULL pointer dereference. Set ret = -ENOMEM before each goto to fix the error path.

Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
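A hedged sketch of the fixed error path (the label follows the description above; surrounding code simplified):

    sch = kzalloc(sizeof(*sch), GFP_KERNEL);
    if (!sch) {
        ret = -ENOMEM;    /* previously left unset, so ERR_PTR(ret)
                           * could be ERR_PTR(0) == NULL and slip past
                           * the caller's IS_ERR() check */
        goto err_stop_helper;
    }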
2026-03-13  bpf: Avoid one round of bounds deduction  (Paul Chaignon)

In commit 5dbb19b16ac49 ("bpf: Add third round of bounds deduction"), I added a new round of bounds deduction because two rounds were not enough to converge to a fixed point. This commit slightly refactors the bounds deduction logic such that two rounds are enough.

In [1], Eduard noticed that after we improved the refinement logic, a third call to the bounds deduction (__reg_deduce_bounds) was needed to converge to a fixed point. More specifically, we needed this third call to improve the s64 range using the s32 range. We added the third call and postponed a more detailed analysis of the refinement logic. I've been looking into this more recently.

The register refinement consists of the following calls:

    __update_reg_bounds();
    3 x __reg_deduce_bounds() {
        deduce_bounds_32_from_64();
        deduce_bounds_32_from_32();
        deduce_bounds_64_from_64();
        deduce_bounds_64_from_32();
    };
    __reg_bound_offset();
    __update_reg_bounds();

From this, we can observe that we first improve the 32-bit ranges from the 64-bit ranges in deduce_bounds_32_from_64, then improve the 64-bit ranges on their own in deduce_bounds_64_from_64. Intuitively, if we were to improve the 64-bit ranges on their own *before* we use them to improve the 32-bit ranges, we may reach a fixed point earlier. In a similar manner, using CBMC, Eduard found that it's best to improve the 32-bit ranges on their own *after* we've improved them using the 64-bit ranges, that is, running deduce_bounds_32_from_32 after deduce_bounds_32_from_64. These changes allow us to lose one call to __reg_deduce_bounds.

Without this reordering, the test "verifier_bounds/bounds deduction cross sign boundary, negative overlap" fails when removing one call to __reg_deduce_bounds. In some cases, this change can even improve precision a little bit, as illustrated in the new selftest in the next patch. As expected, this change didn't have any impact on the number of instructions processed when running it through the Cilium complexity test suite [2].

Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/1b00d2749ec4c774c3ada84e265ac7fda72cfe56.1773401138.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
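Per the reasoning above, the reordered deduction runs 64-from-64 first and 32-from-32 after 32-from-64. A hedged sketch of that ordering (the actual function in kernel/bpf/verifier.c may differ in detail):

    static void __reg_deduce_bounds(struct bpf_reg_state *reg)
    {
        deduce_bounds_64_from_64(reg);  /* settle 64-bit ranges first   */
        deduce_bounds_32_from_64(reg);  /* then derive 32-bit from them */
        deduce_bounds_32_from_32(reg);  /* refine 32-bit on their own   */
        deduce_bounds_64_from_32(reg);  /* feed improvements back to 64 */
    }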
2026-03-13  bpf: better naming for __reg_deduce_bounds() parts  (Eduard Zingerman)

This renaming will also help reshuffle the different parts in the subsequent patch.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a988ecf2c57e265b97917136b14b421038534e8c.1773401138.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-13  dma-mapping: Support batch mode for dma_direct_{map,unmap}_sg  (Barry Song)

Extend these APIs with a flush argument: dma_direct_unmap_phys(), dma_direct_map_phys(), and dma_direct_sync_single_for_cpu(). For single-buffer cases, flush=true is used, while for SG cases flush=false is used, followed by a single flush after all cache operations are issued in dma_direct_{map,unmap}_sg(). This ultimately benefits dma_map_sg() and dma_unmap_sg().

Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260228221337.59951-1-21cnbao@gmail.com
2026-03-13  dma-mapping: Separate DMA sync issuing and completion waiting  (Barry Song)

Currently, arch_sync_dma_for_cpu and arch_sync_dma_for_device always wait for the completion of each DMA buffer. That is, issuing the DMA sync and waiting for completion is done in a single API call. For scatter-gather lists with multiple entries, this means issuing and waiting is repeated for each entry, which can hurt performance. Architectures like ARM64 may be able to issue all DMA sync operations for all entries first and then wait for completion together.

To address this, arch_sync_dma_for_* now batches DMA operations and performs a flush afterward. On ARM64, the flush is implemented with a dsb instruction in arch_sync_dma_flush(). On other architectures, arch_sync_dma_flush() is currently a nop.

Cc: Leon Romanovsky <leon@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Tangquan Zheng <zhengtangquan@oppo.com>
Reviewed-by: Juergen Gross <jgross@suse.com> # drivers/xen/swiotlb-xen.c
Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com>
Signed-off-by: Barry Song <baohua@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260228221316.59934-1-21cnbao@gmail.com
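For a scatter-gather list, the issue/flush split described above might look like this hedged sketch; arch_sync_dma_flush() is the new hook the commit describes, and the loop is simplified:

    int i;
    struct scatterlist *sg;

    for_each_sg(sgl, sg, nents, i)
        arch_sync_dma_for_device(sg_phys(sg), sg->length, dir); /* issue only */

    arch_sync_dma_flush();    /* one dsb on arm64; a nop elsewhere */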
2026-03-13  Merge tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq  (Linus Torvalds)

Pull workqueue fixes from Tejun Heo:

 - Improve workqueue stall diagnostics: dump all busy workers (not just running ones), show wall-clock duration of in-flight work items, and add a sample module for reproducing stalls

 - Fix POOL_BH vs WQ_BH flag namespace mismatch in pr_cont_worker_id()

 - Rename pool->watchdog_ts to pool->last_progress_ts and related functions for clarity

* tag 'wq-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope
  workqueue: Add stall detector sample module
  workqueue: Show all busy workers in stall diagnostics
  workqueue: Show in-flight work item duration in stall diagnostics
  workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
  workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
2026-03-13  Merge tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds)

Pull cgroup fixes from Tejun Heo:

 - Hide PF_EXITING tasks from cgroup.procs to avoid exposing dead tasks that haven't been removed yet, fixing a systemd timeout issue on PREEMPT_RT

 - Call rebuild_sched_domains() directly in CPU hotplug instead of deferring to a workqueue, fixing a race where online/offline CPUs could briefly appear in stale sched domains

* tag 'cgroup-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Don't expose dead tasks in cgroup
  cgroup/cpuset: Call rebuild_sched_domains() directly in hotplug
2026-03-13  Merge tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds)

Pull sched_ext fixes from Tejun Heo:

 - Fix data races flagged by KCSAN: add missing READ_ONCE()/WRITE_ONCE() annotations for lock-free accesses to module parameters and dsq->seq

 - Fix silent truncation of upper 32 enqueue flags (SCX_ENQ_PREEMPT and above) when passed through the int sched_class interface

 - Documentation updates: scheduling class precedence, task ownership state machine, example scheduler descriptions, config list cleanup

 - Selftest fix for format specifier and buffer length in file_write_long()

* tag 'sched_ext-for-7.0-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Use WRITE_ONCE() for the write side of scx_enable helper pointer
  sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags
  sched_ext: Documentation: Update sched-ext.rst
  sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()
  sched_ext: Documentation: Mention scheduling class precedence
  sched_ext: Document task ownership state machine
  sched_ext: Use READ_ONCE() for lock-free reads of module param variables
  sched_ext/selftests: Fix format specifier and buffer length in file_write_long()
  sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update
2026-03-13  sched_ext: Use schedule_deferred_locked() in schedule_dsq_reenq()  (Tejun Heo)

schedule_dsq_reenq() always uses schedule_deferred(), which falls back to irq_work. However, callers like schedule_reenq_local() already hold the target rq lock, and scx_bpf_dsq_reenq() may hold it via the ops callback. Add a locked_rq parameter so schedule_dsq_reenq() can use schedule_deferred_locked() when the target rq is already held. The locked variant can use cheaper paths (balance callbacks, wakeup hooks) instead of always bouncing through irq_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag  (Tejun Heo)

SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion.

Add the SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues, including through scx_bpf_dsq_move_to_local(). scx_qmap is updated with a -I option to test the feature and a -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ.

v2:
- Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()  (Tejun Heo)

scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the current CPU's local DSQ. This is an indirect way of dispatching to a local DSQ and should support enq_flags like direct dispatches do, e.g. SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate execution guarantees.

Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The original becomes a v1 compat wrapper passing 0. The compat macro is updated to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume (pre-rename). All in-tree BPF schedulers are updated to pass 0.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Plumb enq_flags through the consume path  (Tejun Heo)

Add an enq_flags parameter to consume_dispatch_q() and consume_remote_task(), passing it through to move_{local,remote}_task_to_local_dsq(). All callers pass 0. No functional change. This prepares for SCX_ENQ_IMMED support on the consume path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Implement SCX_ENQ_IMMED  (Tejun Heo)

Add the SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets re-enqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class.

rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers a reenqueue when a higher-priority class task arrives. These capture all higher-class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered. SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ.

This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use.

v2:
- Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Add scx_vet_enq_flags() and plumb dsq_id into preamble  (Tejun Heo)

Add a scx_vet_enq_flags() stub and call it from scx_dsq_insert_preamble() and scx_dsq_move(). Pass dsq_id into the preamble so the vetting function can validate flag and DSQ combinations. No functional change. This prepares for SCX_ENQ_IMMED, which will populate the vetting function.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Split task_should_reenq() into local and user variants  (Tejun Heo)

Split task_should_reenq() into local_task_should_reenq() and user_task_should_reenq(). The local variant takes reenq_flags by pointer. No functional change. This prepares for SCX_ENQ_IMMED, which will add IMMED-specific logic to the local variant.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  workqueue: fix parse_affn_scope() prefix matching bug  (Breno Leitao)

parse_affn_scope() uses strncasecmp() with the length of the candidate name, which means it only checks whether the input *starts with* a known scope name. Given that an upcoming change creates a "cache_shard" affinity scope, writing "cache_shard" to a workqueue's affinity_scope sysfs attribute always matches "cache" first, making it impossible to select "cache_shard" via sysfs. This fix makes it possible to distinguish "cache" from "cache_shard".

Fix by replacing the hand-rolled prefix matching loop with sysfs_match_string(), which uses sysfs_streq() for exact matching (modulo trailing newlines). Also add the missing const qualifier to the wq_affn_names[] array declaration.

Note that sysfs_streq() is case-sensitive, unlike the previous strncasecmp() approach. This is intentional and consistent with how other sysfs attributes handle string matching in the kernel.

Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
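A hedged sketch of the exact-match version, with wq_affn_names as described above:

    static int parse_affn_scope(const char *val)
    {
        /* sysfs_match_string() compares via sysfs_streq(): exact match
         * modulo a trailing newline, so "cache" no longer shadows
         * "cache_shard". Returns the matching index or -EINVAL. */
        return sysfs_match_string(wq_affn_names, val);
    }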
2026-03-13  kprobes: Remove unneeded warnings from __arm_kprobe_ftrace()  (Masami Hiramatsu (Google))

Remove unneeded warnings for handled errors from __arm_kprobe_ftrace(), because all callers handle the error correctly.

Link: https://lore.kernel.org/all/177261531182.1312989.8737778408503961141.stgit@mhiramat.tok.corp.google.com/
Reported-by: Zw Tang <shicenci@gmail.com>
Closes: https://lore.kernel.org/all/CAPHJ_V+J6YDb_wX2nhXU6kh466Dt_nyDSas-1i_Y8s7tqY-Mzw@mail.gmail.com/
Fixes: 9c89bb8e3272 ("kprobes: treewide: Cleanup the error messages for kprobes")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2026-03-13  kprobes: avoid crash when rmmod/insmod after ftrace killed  (Masami Hiramatsu (Google))

After ftrace is killed by some error, the kernel crashes if we remove a module in which kprobes are installed:

    BUG: unable to handle page fault for address: fffffbfff805000d
    PGD 817fcc067 P4D 817fcc067 PUD 817fc8067 PMD 101555067 PTE 0
    Oops: Oops: 0000 [#1] SMP KASAN PTI
    CPU: 4 UID: 0 PID: 2012 Comm: rmmod Tainted: G W OE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    RIP: 0010:kprobes_module_callback+0x89/0x790
    RSP: 0018:ffff88812e157d30 EFLAGS: 00010a02
    RAX: 1ffffffff805000d RBX: dffffc0000000000 RCX: ffffffff86a8de90
    RDX: ffffed1025c2af9b RSI: 0000000000000008 RDI: ffffffffc0280068
    RBP: 0000000000000000 R08: 0000000000000001 R09: ffffed1025c2af9a
    R10: ffff88812e157cd7 R11: 205d323130325420 R12: 0000000000000002
    R13: ffffffffc0290488 R14: 0000000000000002 R15: ffffffffc0280040
    FS: 00007fbc450dd740(0000) GS:ffff888420331000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffbfff805000d CR3: 000000010f624000 CR4: 00000000000006f0
    Call Trace:
    <TASK>
    notifier_call_chain+0xc6/0x280
    blocking_notifier_call_chain+0x60/0x90
    __do_sys_delete_module.constprop.0+0x32a/0x4e0
    do_syscall_64+0x5d/0xfa0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

This is because kprobes on ftrace do not correctly handle the kprobe_ftrace_disabled flag set by ftrace_kill(). To prevent this error, check kprobe_ftrace_disabled in __disarm_kprobe_ftrace() and skip all ftrace-related operations.

Link: https://lore.kernel.org/all/176473947565.1727781.13110060700668331950.stgit@mhiramat.tok.corp.google.com/
Reported-by: Ye Bin <yebin10@huawei.com>
Closes: https://lore.kernel.org/all/20251125020536.2484381-1-yebin@huaweicloud.com/
Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
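A hedged, simplified sketch of the guard; the surrounding unregister logic is omitted and the signature may differ from the actual kernel/kprobes.c code:

    static int __disarm_kprobe_ftrace(struct kprobe *p,
                                      struct ftrace_ops *ops, int *cnt)
    {
        if (kprobe_ftrace_disabled)
            return 0;    /* ftrace was killed: skip all ftrace operations */

        /* ... normal ftrace_set_filter_ip() / unregister path ... */
    }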
2026-03-12  tracing: add more symbols to whitelist  (Arnd Bergmann)

Randconfig builds show a number of cryptic build errors from hitting undefined symbols in simple_ring_buffer.o:

    make[7]: *** [/home/arnd/arm-soc/kernel/trace/Makefile:147: kernel/trace/simple_ring_buffer.o.checked] Error 1

These happen with CONFIG_TRACE_BRANCH_PROFILING, CONFIG_KASAN_HW_TAGS, CONFIG_STACKPROTECTOR, CONFIG_DEBUG_IRQFLAGS and indirectly from WARN_ON(). Add exceptions for each one that I have hit so far on arm64, x86_64 and arm randconfig builds. Other architectures likely hit additional ones, so it would be nice to produce a little more verbose output that includes the names of the missing symbols directly.

Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260312123601.625063-2-arnd@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-12  tracing: Update undefined symbols allow list for simple_ring_buffer  (Vincent Donnefort)

Undefined symbols are not allowed for simple_ring_buffer.c, but some compiler-emitted symbols are missing from the allowlist. Update it.

Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Fixes: a717943d8ecc ("tracing: Check for undefined symbols in simple_ring_buffer")
Closes: https://lore.kernel.org/all/20260311221816.GA316631@ax162/
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Link: https://patch.msgid.link/20260312113535.2213350-1-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-12  namespace: allow creating empty mount namespaces  (Christian Brauner)

Add support for creating a mount namespace that contains only a copy of the root mount from the caller's mount namespace, with none of the child mounts. This is useful for containers and sandboxes that want to start with a minimal mount table and populate it from scratch rather than inheriting and then tearing down the full mount tree.

Two new flags are introduced:

- CLONE_EMPTY_MNTNS for clone3(), using the 64-bit flag space.
- UNSHARE_EMPTY_MNTNS for unshare(), reusing the CLONE_PARENT_SETTID bit which has no meaning for unshare.

Both flags imply CLONE_NEWNS. For the unshare path, UNSHARE_EMPTY_MNTNS is converted to CLONE_EMPTY_MNTNS in unshare_nsproxy_namespaces() before it reaches copy_mnt_ns(), so the mount namespace code only needs to handle a single flag.

In copy_mnt_ns(), when CLONE_EMPTY_MNTNS is set, clone_mnt() is used instead of copy_tree() to clone only the root mount. The caller's root and working directory are both reset to the root dentry of the new mount.

The cleanup variables are changed from vfsmount pointers with __free(mntput) to struct path with __free(path_put) because the empty mount namespace path needs to release both mount and dentry references when replacing the caller's root and pwd. In the normal (non-empty) path only the mount component is set, and dput(NULL) is a no-op, so path_put remains correct there as well.

Link: https://patch.msgid.link/20260306-work-empty-mntns-consolidated-v1-1-6eb30529bbb0@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
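A hedged userspace usage sketch, assuming the flag lands as described; the CLONE_EMPTY_MNTNS value comes from the UAPI headers once merged, and the wrapper function name is illustrative:

    #define _GNU_SOURCE
    #include <linux/sched.h>    /* struct clone_args, CLONE_* flags */
    #include <sys/syscall.h>
    #include <signal.h>
    #include <unistd.h>

    static pid_t spawn_with_empty_mntns(void)
    {
        struct clone_args args = {
            .flags       = CLONE_EMPTY_MNTNS,  /* implies CLONE_NEWNS */
            .exit_signal = SIGCHLD,
        };

        /* The child starts with only a copy of the caller's root mount,
         * no child mounts, per the description above. */
        return syscall(SYS_clone3, &args, sizeof(args));
    }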
2026-03-12  clocksource: Don't use non-continuous clocksources as watchdog  (Thomas Gleixner)

Using a non-continuous, aka untrusted, clocksource as a watchdog for another untrusted clocksource is equivalent to putting the fox in charge of the henhouse. That's especially true with the jiffies clocksource, which depends on interrupt delivery based on a periodic timer. Neither the frequency of that timer is trustworthy nor the kernel's ability to react to it in a timely manner and rearm it if it is not self-rearming.

Just don't bother to deal with this. It's not worth the trouble and only relevant to museum-piece hardware.

Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260123231521.858743259@kernel.org
2026-03-12  hrtimer: Add a helper to retrieve a hrtimer from its timerqueue node  (Thomas Weißschuh (Schneider Electric))

The container_of() call is open-coded multiple times. Add a helper macro. Use container_of_const() to preserve constness.

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-12-095357392669@linutronix.de
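Such a helper might look like the following hedged sketch; the macro name is illustrative, not necessarily the one the patch introduces:

    /* struct hrtimer embeds a struct timerqueue_node named 'node';
     * container_of_const() preserves the pointer's constness. */
    #define hrtimer_of_node(__node) \
        container_of_const(__node, struct hrtimer, node)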
2026-03-12  hrtimer: Drop unnecessary pointer indirection in hrtimer_expire_entry event  (Thomas Weißschuh (Schneider Electric))

This pointer indirection is a remnant from when ktime_t was a struct; today it is pointless. Drop the pointer indirection.

Signed-off-by: Thomas Weißschuh (Schneider Electric) <thomas.weissschuh@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260311-hrtimer-cleanups-v1-9-095357392669@linutronix.de