summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-02-27sched/fair: Make hrtick resched hardPeter Zijlstra (Intel)
Since the tick causes hard preemption, the hrtick should too. Letting the hrtick do lazy preemption completely defeats the purpose, since it will then still be delayed until a old tick and be dependent on CONFIG_HZ. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163428.933894105@kernel.org
2026-02-27sched/fair: Simplify hrtick_update()Peter Zijlstra (Intel)
hrtick_update() was needed when the slice depended on nr_running, all that code is gone. All that remains is starting the hrtick when nr_running becomes more than 1. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163428.866374835@kernel.org
2026-02-27sched/eevdf: Fix HRTICK durationPeter Zijlstra
The nominal duration for an EEVDF task to run is until its deadline. At which point the deadline is moved ahead and a new task selection is done. Try and predict the time 'lost' to higher scheduling classes. Since this is an estimate, the timer can be both early or late. In case it is early task_tick_fair() will take the !need_resched() path and restarts the timer. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260224163428.798198874@kernel.org
2026-02-26Merge tag 'mm-hotfixes-stable-2026-02-26-14-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "12 hotfixes. 7 are cc:stable. 8 are for MM. All are singletons - please see the changelogs for details" * tag 'mm-hotfixes-stable-2026-02-26-14-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: MAINTAINERS: update Yosry Ahmed's email address mailmap: add entry for Daniele Alessandrelli mm: fix NULL NODE_DATA dereference for memoryless nodes on boot mm/tracing: rss_stat: ensure curr is false from kthread context mm/kfence: fix KASAN hardware tag faults during late enablement mm/damon/core: disallow non-power of two min_region_sz Squashfs: check metadata block offset is within range MAINTAINERS, mailmap: update e-mail address for Vlastimil Babka liveupdate: luo_file: remember retrieve() status mm: thp: deny THP for files on anonymous inodes mm: change vma_alloc_folio_noprof() macro to inline function mm/kfence: disable KFENCE upon KASAN HW tags enablement
2026-02-26Merge tag 'dma-mapping-7.0-2026-02-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two DMA-mapping fixes for the recently merged API rework (Jiri Pirko and Stian Halseth)" * tag 'dma-mapping-7.0-2026-02-26' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: sparc: Fix page alignment in dma mapping dma-mapping: avoid random addr value print out on error path
2026-02-26sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flagDavid Carlier
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no explicit value, so the compiler assigns it 0. This makes the bitwise OR in scx_ops_init() a no-op: sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */ As a result, BPF schedulers cannot distinguish whether ops.init() completed successfully by inspecting exit_info->flags. Assign the value 1LLU << 0 so the flag is actually set. Fixes: f3aec2adce8d ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-26bpf: Fix stack-out-of-bounds write in devmapKohei Enju
get_upper_ifindexes() iterates over all upper devices and writes their indices into an array without checking bounds. Also the callers assume that the max number of upper devices is MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack, but that assumption is not correct and the number of upper devices could be larger than MAX_NEST_DEV (e.g., many macvlans), causing a stack-out-of-bounds write. Add a max parameter to get_upper_ifindexes() to avoid the issue. When there are too many upper devices, return -EOVERFLOW and abort the redirect. To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS. Then send a packet to the device to trigger the XDP redirect path. Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/ Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Kohei Enju <kohei@enjuk.jp> Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26bpf: Fix kprobe_multi cookies access in show_fdinfo callbackJiri Olsa
We don't check if cookies are available on the kprobe_multi link before accessing them in show_fdinfo callback, we should. Cc: stable@vger.kernel.org Fixes: da7e9c0a7fbc ("bpf: Add show_fdinfo for kprobe_multi") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260225111249.186230-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-26Merge tag 'kmalloc_obj-v7.0-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj fixes from Kees Cook: - Fix pointer-to-array allocation types for ubd and kcsan - Force size overflow helpers to __always_inline - Bump __builtin_counted_by_ref to Clang 22.1 from 22.0 (Nathan Chancellor) * tag 'kmalloc_obj-v7.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kcsan: test: Adjust "expect" allocation type for kmalloc_obj overflow: Make sure size helpers are always inlined init/Kconfig: Adjust fixed clang version for __builtin_counted_by_ref ubd: Use pointer-to-pointers for io_thread_req arrays
2026-02-26kcsan: test: Adjust "expect" allocation type for kmalloc_objKees Cook
The call to kmalloc_obj(observed.lines) returns "char (*)[3][512]", a pointer to the whole 2D array. But "expect" wants to be "char (*)[512]", the decayed pointer type, as if it were observed.lines itself (though without the "3" bounds). This produces the following build error: ../kernel/kcsan/kcsan_test.c: In function '__report_matches': ../kernel/kcsan/kcsan_test.c:171:16: error: assignment to 'char (*)[512]' from incompatible pointer type 'char (*)[3][512]' [-Wincompatible-pointer-types] 171 | expect = kmalloc_obj(observed.lines); | ^ Instead of changing the "expect" type to "char (*)[3][512]" and requiring a dereference at each use (e.g. "(expect*)[0]"), just explicitly cast the return to the desired type. Note that I'm intentionally not switching back to byte-based "kmalloc" here because I cannot find a way for the Coccinelle script (which will be used going forward to catch future conversions) to exclude this case. Tested with: $ ./tools/testing/kunit/kunit.py run \ --kconfig_add CONFIG_DEBUG_KERNEL=y \ --kconfig_add CONFIG_KCSAN=y \ --kconfig_add CONFIG_KCSAN_KUNIT_TEST=y \ --arch=x86_64 --qemu_args '-smp 2' kcsan Reported-by: Nathan Chancellor <nathan@kernel.org> Fixes: 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for non-scalar types") Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-26kthread: consolidate kthread exit paths to prevent use-after-freeChristian Brauner
Guillaume reported crashes via corrupted RCU callback function pointers during KUnit testing. The crash was traced back to the pidfs rhashtable conversion which replaced the 24-byte rb_node with an 8-byte rhash_head in struct pid, shrinking it from 160 to 144 bytes. struct kthread (without CONFIG_BLK_CGROUP) is also 144 bytes. With CONFIG_SLAB_MERGE_DEFAULT and SLAB_HWCACHE_ALIGN both round up to 192 bytes and share the same slab cache. struct pid.rcu.func and struct kthread.affinity_node both sit at offset 0x78. When a kthread exits via make_task_dead() it bypasses kthread_exit() and misses the affinity_node cleanup. free_kthread_struct() frees the memory while the node is still linked into the global kthread_affinity_list. A subsequent list_del() by another kthread writes through dangling list pointers into the freed and reused memory, corrupting the pid's rcu.func pointer. Instead of patching free_kthread_struct() to handle the missed cleanup, consolidate all kthread exit paths. Turn kthread_exit() into a macro that calls do_exit() and add kthread_do_exit() which is called from do_exit() for any task with PF_KTHREAD set. This guarantees that kthread-specific cleanup always happens regardless of the exit path - make_task_dead(), direct do_exit(), or kthread_exit(). Replace __to_kthread() with a new tsk_is_kthread() accessor in the public header. Export do_exit() since module code using the kthread_exit() macro now needs it directly. Reported-by: Guillaume Tucker <gtucker@gtucker.io> Tested-by: Guillaume Tucker <gtucker@gtucker.io> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: David Gow <davidgow@google.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/20260224-mittlerweile-besessen-2738831ae7f6@brauner Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 4d13f4304fa4 ("kthread: Implement preferred affinity") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-02-25sched_ext: Fix out-of-bounds access in scx_idle_init_masks()David Carlier
scx_idle_node_masks is allocated with num_possible_nodes() elements but indexed by NUMA node IDs via for_each_node(). On systems with non-contiguous NUMA node numbering (e.g. nodes 0 and 4), node IDs can exceed the array size, causing out-of-bounds memory corruption. Use nr_node_ids instead, which represents the maximum node ID range and is the correct size for arrays indexed by node ID. Fixes: 7c60329e3521 ("sched_ext: Add NUMA-awareness to the default idle selection policy") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-25Merge tag 'vfs-7.0-rc2.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix an uninitialized variable in file_getattr(). The flags_valid field wasn't initialized before calling vfs_fileattr_get(), triggering KMSAN uninit-value reports in fuse - Fix writeback wakeup and logging timeouts when DETECT_HUNG_TASK is not enabled. sysctl_hung_task_timeout_secs is 0 in that case causing spurious "waiting for writeback completion for more than 1 seconds" warnings - Fix a null-ptr-deref in do_statmount() when the mount is internal - Add missing kernel-doc description for the @private parameter in iomap_readahead() - Fix mount namespace creation to hold namespace_sem across the mount copy in create_new_namespace(). The previous drop-and-reacquire pattern was fragile and failed to clean up mount propagation links if the real rootfs was a shared or dependent mount - Fix /proc mount iteration where m->index wasn't updated when m->show() overflows, causing a restart to repeatedly show the same mount entry in a rapidly expanding mount table - Return EFSCORRUPTED instead of ENOSPC in minix_new_inode() when the inode number is out of range - Fix unshare(2) when CLONE_NEWNS is set and current->fs isn't shared. copy_mnt_ns() received the live fs_struct so if a subsequent namespace creation failed the rollback would leave pwd and root pointing to detached mounts. Always allocate a new fs_struct when CLONE_NEWNS is requested - fserror bug fixes: - Remove the unused fsnotify_sb_error() helper now that all callers have been converted to fserror_report_metadata - Fix a lockdep splat in fserror_report() where igrab() takes inode::i_lock which can be held in IRQ context. Replace igrab() with a direct i_count bump since filesystems should not report inodes that are about to be freed or not yet exposed - Handle error pointer in procfs for try_lookup_noperm() - Fix an integer overflow in ep_loop_check_proc() where recursive calls returning INT_MAX would overflow when +1 is added, breaking the recursion depth check - Fix a misleading break in pidfs * tag 'vfs-7.0-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: pidfs: avoid misleading break eventpoll: Fix integer overflow in ep_loop_check_proc() proc: Fix pointer error dereference fserror: fix lockdep complaint when igrabbing inode fsnotify: drop unused helper unshare: fix unshare_fs() handling minix: Correct errno in minix_new_inode namespace: fix proc mount iteration mount: hold namespace_sem across copy in create_new_namespace() iomap: Describe @private in iomap_readahead() statmount: Fix the null-ptr-deref in do_statmount() writeback: Fix wakeup and logging timeouts for !DETECT_HUNG_TASK fs: init flags_valid before calling vfs_fileattr_get
2026-02-25cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslockedChen Ridong
A null-pointer-dereference bug was reported by syzbot: Oops: general protection fault, probably for address 0xdffffc0000000000: KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] RIP: 0010:bitmap_subset include/linux/bitmap.h:433 [inline] RIP: 0010:cpumask_subset include/linux/cpumask.h:836 [inline] RIP: 0010:rebuild_sched_domains_locked kernel/cgroup/cpuset.c:967 RSP: 0018:ffffc90003ecfbc0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000020 RDX: ffff888028de0000 RSI: ffffffff8200f003 RDI: ffffffff8df14f28 RBP: 0000000000000000 R08: 0000000000000cc0 R09: 00000000ffffffff R10: ffffffff8e7d95b3 R11: 0000000000000001 R12: 0000000000000000 R13: 00000000000f4240 R14: dffffc0000000000 R15: 0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2f463fff CR3: 000000003704c000 CR4: 00000000003526f0 Call Trace: <TASK> rebuild_sched_domains_cpuslocked kernel/cgroup/cpuset.c:983 [inline] rebuild_sched_domains+0x21/0x40 kernel/cgroup/cpuset.c:990 sched_rt_handler+0xb5/0xe0 kernel/sched/rt.c:2911 proc_sys_call_handler+0x47f/0x5a0 fs/proc/proc_sysctl.c:600 new_sync_write fs/read_write.c:595 [inline] vfs_write+0x6ac/0x1070 fs/read_write.c:688 ksys_write+0x12a/0x250 fs/read_write.c:740 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0x106/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The issue occurs when generate_sched_domains() returns ndoms = 1 and doms = NULL due to a kmalloc failure. This leads to a null-pointer dereference when accessing doms in rebuild_sched_domains_locked(). Fix this by adding a NULL check for doms before accessing it. Fixes: 6ee43047e8ad ("cpuset: Remove unnecessary checks in rebuild_sched_domains_locked") Reported-by: syzbot+460792609a79c085f79f@syzkaller.appspotmail.com Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-25sched/deadline: Add reporting of runtime left & abs deadline to ↵Tommaso Cucinotta
sched_getattr() for DEADLINE tasks The SCHED_DEADLINE scheduler allows reading the statically configured run-time, deadline, and period parameters through the sched_getattr() system call. However, there is no immediate way to access, from user space, the current parameters used within the scheduler: the instantaneous runtime left in the current cycle, as well as the current absolute deadline. The `flags' sched_getattr() parameter, so far mandated to contain zero, now supports the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 flag, to request retrieval of the leftover runtime and absolute deadline, converted to a CLOCK_MONOTONIC reference, instead of the statically configured parameters. This feature is useful for adaptive SCHED_DEADLINE tasks that need to modify their behavior depending on whether or not there is enough runtime left in the current period, and/or what is the current absolute deadline. Notes: - before returning the instantaneous parameters, the runtime is updated; - the abs deadline is returned shifted from rq_clock() to ktime_get_ns(), in CLOCK_MONOTONIC reference; this causes multiple invocations from the same period to return values that may differ for a few ns (showing some small drift), albeit the deadline doesn't move, in rq_clock() reference; - the abs deadline value returned to user-space, as unsigned 64-bit value, can represent nearly 585 years since boot time; - setting flags=0 provides the old behavior (retrieve static parameters). See also the notes from discussion held at OSPM 2025 on the topic "Making user space aware of current deadline-scheduler parameters". Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://patch.msgid.link/20250912053937.31636-2-tommaso.cucinotta@santannapisa.it
2026-02-25perf: Fix __perf_event_overflow() vs perf_remove_from_context() racePeter Zijlstra
Make sure that __perf_event_overflow() runs with IRQs disabled for all possible callchains. Specifically the software events can end up running it with only preemption disabled. This opens up a race vs perf_event_exit_event() and friends that will go and free various things the overflow path expects to be present, like the BPF program. Fixes: 592903cdcbf6 ("perf_counter: add an event_list") Reported-by: Simond Hu <cmdhh1767@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Simond Hu <cmdhh1767@gmail.com> Link: https://patch.msgid.link/20260224122909.GV1395416@noisy.programming.kicks-ass.net
2026-02-24sched_ext: Disable preemption between scx_claim_exit() and kicking helper workTejun Heo
scx_claim_exit() atomically sets exit_kind, which prevents scx_error() from triggering further error handling. After claiming exit, the caller must kick the helper kthread work which initiates bypass mode and teardown. If the calling task gets preempted between claiming exit and kicking the helper work, and the BPF scheduler fails to schedule it back (since error handling is now disabled), the helper work is never queued, bypass mode never activates, tasks stop being dispatched, and the system wedges. Disable preemption across scx_claim_exit() and the subsequent work kicking in all callers - scx_disable() and scx_vexit(). Add lockdep_assert_preemption_disabled() to scx_claim_exit() to enforce the requirement. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-24liveupdate: luo_file: remember retrieve() statusPratyush Yadav (Google)
LUO keeps track of successful retrieve attempts on a LUO file. It does so to avoid multiple retrievals of the same file. Multiple retrievals cause problems because once the file is retrieved, the serialized data structures are likely freed and the file is likely in a very different state from what the code expects. The retrieve boolean in struct luo_file keeps track of this, and is passed to the finish callback so it knows what work was already done and what it has left to do. All this works well when retrieve succeeds. When it fails, luo_retrieve_file() returns the error immediately, without ever storing anywhere that a retrieve was attempted or what its error code was. This results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace, but nothing prevents it from trying this again. The retry is problematic for much of the same reasons listed above. The file is likely in a very different state than what the retrieve logic normally expects, and it might even have freed some serialization data structures. Attempting to access them or free them again is going to break things. For example, if memfd managed to restore 8 of its 10 folios, but fails on the 9th, a subsequent retrieve attempt will try to call kho_restore_folio() on the first folio again, and that will fail with a warning since it is an invalid operation. Apart from the retry, finish() also breaks. Since on failure the retrieved bool in luo_file is never touched, the finish() call on session close will tell the file handler that retrieve was never attempted, and it will try to access or free the data structures that might not exist, much in the same way as the retry attempt. There is no sane way of attempting the retrieve again. Remember the error retrieve returned and directly return it on a retry. Also pass this status code to finish() so it can make the right decision on the work it needs to do. This is done by changing the bool to an integer. A value of 0 means retrieve was never attempted, a positive value means it succeeded, and a negative value means it failed and the error code is the value. Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org Fixes: 7c722a7f44e0 ("liveupdate: luo_file: implement file systems callbacks") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24bpf: Do not increment tailcall count when prog is NULLHari Bathini
Currently, tailcall count is incremented in the interpreter even when tailcall fails due to non-existent prog. Fix this by holding off on the tailcall count increment until after NULL check on the prog. Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Link: https://lore.kernel.org/r/20260220062959.195101-1-hbathini@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-24time/kunit: Add .kunitconfigRyota Sakamoto
Add .kunitconfig file to the time directory to enable easy execution of KUnit tests. With the .kunitconfig, developers can run the tests: $ ./tools/testing/kunit/kunit.py run --kunitconfig kernel/time Also, add the new .kunitconfig file to the TIMEKEEPING section in the MAINTAINERS file. Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260223-add-time-kunitconfig-v1-1-1801eeb33ece@gmail.com
2026-02-23bpf: allow using bpf_kptr_xchg even if the MEM_RCU flag is setKaitao Cheng
For the following scenario: struct tree_node { struct bpf_refcount ref; struct bpf_rb_node node; struct node_data __kptr * node_data; u64 key; }; This means node_data would have the type PTR_TO_BTF_ID | MEM_ALLOC | NON_OWN_REF | MEM_RCU. When traversing an rbtree using bpf_rbtree_left/right, if we need to use bpf_kptr_xchg to read the __kptr pointer, we still need to follow the remove-read-add sequence. This patch allows us to use bpf_kptr_xchg to directly read the __kptr pointer without any prior operations. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-5-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-23bpf: allow using bpf_kptr_xchg even if the NON_OWN_REF flag is setKaitao Cheng
When traversing an rbtree using bpf_rbtree_left/right, if bpf_kptr_xchg is used to access the __kptr pointer contained in a node, it currently requires first removing the node with bpf_rbtree_remove and clearing the NON_OWN_REF flag, then re-adding the node to the original rbtree with bpf_rbtree_add after usage. This process significantly degrades rbtree traversal performance. The patch enables accessing __kptr pointers with the NON_OWN_REF flag set while holding the lock, eliminating the need for this remove-read-add sequence. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-3-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-23bpf: allow calling bpf_kptr_xchg while holding a lockKaitao Cheng
For the following scenario: struct tree_node { struct bpf_rb_node node; struct request __kptr *req; u64 key; }; struct bpf_rb_root tree_root __contains(tree_node, node); struct bpf_spin_lock tree_lock; If we need to traverse all nodes in the rbtree, retrieve the __kptr pointer from each node, and read kernel data from the referenced object, using bpf_kptr_xchg appears unavoidable. This patch skips the BPF verifier checks for bpf_kptr_xchg when called while holding a lock. Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Link: https://lore.kernel.org/r/20260214124042.62229-2-pilgrimtao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-23cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lockWaiman Long
The current cpuset partition code is able to dynamically update the sched domains of a running system and the corresponding HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the "isolcpus=domain,..." boot command line feature at run time. The housekeeping cpumask update requires flushing a number of different workqueues which may not be safe with cpus_read_lock() held as the workqueue flushing code may acquire cpus_read_lock() or acquiring locks which have locking dependency with cpus_read_lock() down the chain. Below is an example of such circular locking problem. ====================================================== WARNING: possible circular locking dependency detected 6.18.0-test+ #2 Tainted: G S ------------------------------------------------------ test_cpuset_prs/10971 is trying to acquire lock: ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180 but task is already holding lock: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (cpuset_mutex){+.+.}-{4:4}: -> #3 (cpu_hotplug_lock){++++}-{0:0}: -> #2 (rtnl_mutex){+.+.}-{4:4}: -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}: -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}: Chain exists of: (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex 5 locks held by test_cpuset_prs/10971: #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0 #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0 #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0 #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130 #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130 Call Trace: <TASK> : touch_wq_lockdep_map+0x93/0x180 __flush_workqueue+0x111/0x10b0 housekeeping_update+0x12d/0x2d0 update_parent_effective_cpumask+0x595/0x2440 update_prstate+0x89d/0xce0 cpuset_partition_write+0xc5/0x130 cgroup_file_write+0x1a5/0x680 kernfs_fop_write_iter+0x3df/0x5f0 vfs_write+0x525/0xfd0 ksys_write+0xf9/0x1d0 do_syscall_64+0x95/0x520 entry_SYSCALL_64_after_hwframe+0x76/0x7e To avoid such a circular locking dependency problem, we have to call housekeeping_update() without holding the cpus_read_lock() and cpuset_mutex. The current set of wq's flushed by housekeeping_update() may not have work functions that call cpus_read_lock() directly, but we are likely to extend the list of wq's that are flushed in the future. Moreover, the current set of work functions may hold locks that may have cpu_hotplug_lock down the dependency chain. So housekeeping_update() is now called after releasing cpus_read_lock and cpuset_mutex at the end of a cpuset operation. These two locks are then re-acquired later before calling rebuild_sched_domains_locked(). To enable mutual exclusion between the housekeeping_update() call and other cpuset control file write actions, a new top level cpuset_top_mutex is introduced. This new mutex will be acquired first to allow sharing variables used by both code paths. However, cpuset update from CPU hotplug can still happen in parallel with the housekeeping_update() call, though that should be rare in production environment. As cpus_read_lock() is now no longer held when tmigr_isolated_exclude_cpumask() is called, it needs to acquire it directly. The lockdep_is_cpuset_held() is also updated to return true if either cpuset_top_mutex or cpuset_mutex is held. Fixes: 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask from cpuset") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueueWaiman Long
The cpuset_handle_hotplug() may need to invoke housekeeping_update(), for instance, when an isolated partition is invalidated because its last active CPU has been put offline. As we are going to enable dynamic update to the nozh_full housekeeping cpumask (HK_TYPE_KERNEL_NOISE) soon with the help of CPU hotplug, allowing the CPU hotplug path to call into housekeeping_update() directly from update_isolation_cpumasks() will likely cause deadlock. So we have to defer any call to housekeeping_update() after the CPU hotplug operation has finished. This is now done via the workqueue where the update_hk_sched_domains() function will be invoked via the hk_sd_workfn(). An concurrent cpuset control file write may have executed the required update_hk_sched_domains() function before the work function is called. So the work function call may become a no-op when it is invoked. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() togetherWaiman Long
With the latest changes in sched/isolation.c, rebuild_sched_domains*() requires the HK_TYPE_DOMAIN housekeeping cpumask to be properly updated first, if needed, before the sched domains can be rebuilt. So the two naturally fit together. Do that by creating a new update_hk_sched_domains() helper to house both actions. The name of the isolated_cpus_updating flag to control the call to housekeeping_update() is now outdated. So change it to update_housekeeping to better reflect its purpose. Also move the call to update_hk_sched_domains() to the end of cpuset and hotplug operations before releasing the cpuset_mutex. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changedWaiman Long
As cpuset is updating HK_TYPE_DOMAIN housekeeping mask when there is a change in the set of isolated CPUs, making this change is now more costly than before. Right now, the isolated_cpus_updating flag can be set even if there is no real change in isolated_cpus. Put in additional checks to make sure that isolated_cpus_updating is set only if there is a real change in isolated_cpus. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Clarify exclusion rules for cpuset internal variablesWaiman Long
Clarify the locking rules associated with file level internal variables inside the cpuset code. There is no functional change. Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in ↵Waiman Long
update_cpumasks_hier() Commit e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") incorrectly changed the 2nd parameter of cpuset_update_tasks_cpumask() from tmp->new_cpus to cp->effective_cpus. This second parameter is just a temporary cpumask for internal use. The cpuset_update_tasks_cpumask() function was originally called update_tasks_cpumask() before commit 381b53c3b549 ("cgroup/cpuset: rename functions shared between v1 and v2"). This mistake can incorrectly change the effective_cpus of the cpuset when it is the top_cpuset or in arm64 architecture where task_cpu_possible_mask() may differ from cpu_possible_mask. So far top_cpuset hasn't been passed to update_cpumasks_hier() yet, but arm64 arch can still be impacted. Fix it by reverting the incorrect change. Fixes: e2ffe502ba45 ("cgroup/cpuset: Add cpuset.cpus.exclusive for v2") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()Waiman Long
The effective_xcpus of a cpuset can contain offline CPUs. In partition_xcpus_del(), the xcpus parameter is incorrectly used as a temporary cpumask to mask out offline CPUs. As xcpus can be the effective_xcpus of a cpuset, this can result in unexpected changes in that cpumask. Fix this problem by not making any changes to the xcpus parameter. Fixes: 11e5f407b64a ("cgroup/cpuset: Keep track of CPUs in isolated partitions") Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23sched_ext: Fix ops.dequeue() semanticsAndrea Righi
Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state. Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.). BPF scheduler custody concept: a task is considered to be in the BPF scheduler's custody when the scheduler is responsible for managing its lifecycle. This includes tasks dispatched to user-created DSQs or stored in the BPF scheduler's internal data structures from ops.enqueue(). Custody ends when the task is dispatched to a terminal DSQ (such as the local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a property change. Tasks directly dispatched to terminal DSQs bypass the BPF scheduler entirely and are never in its custody. Terminal DSQs include: - Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues where tasks go directly to execution. - Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the BPF scheduler is considered "done" with the task. As a result, ops.dequeue() is not invoked for tasks directly dispatched to terminal DSQs. To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change. New ops.dequeue() semantics: - ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases: a) regular dispatch: a task dispatched to a user DSQ or stored in internal BPF data structures is moved to a terminal DSQ (ops.dequeue() called without any special flags set), b) core scheduling dispatch: core-sched picks task before dispatch (ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set), c) property change: task properties modified before dispatch, (ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set). This allows BPF schedulers to: - reliably track task ownership and lifecycle, - maintain accurate accounting of managed tasks, - update internal state when tasks change properties. Cc: Tejun Heo <tj@kernel.org> Cc: Emil Tsalapatis <emil@etsalapatis.com> Cc: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23sched_ext: Add rq parameter to dispatch_enqueue()Andrea Righi
This prepares for a later commit fixing the ops.dequeue() semantics. No functional change intended. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23sched_ext: Properly mark SCX-internal migrations via sticky_cpuAndrea Righi
Reposition the setting and clearing of sticky_cpu to better define the scope of SCX-internal migrations. This ensures @sticky_cpu is set for the entire duration of an internal migration (from dequeue through enqueue), making it a reliable indicator that an SCX-internal migration is in progress. The dequeue and enqueue paths can then use @sticky_cpu to identify internal migrations and skip BPF scheduler notifications accordingly. This prepares for a later commit fixing the ops.dequeue() semantics. No functional change intended. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23module: Fix kernel panic when a symbol st_shndx is out of boundsIhor Solodrai
The module loader doesn't check for bounds of the ELF section index in simplify_symbols(): for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) { const char *name = info->strtab + sym[i].st_name; switch (sym[i].st_shndx) { case SHN_COMMON: [...] default: /* Divert to percpu allocation if a percpu var. */ if (sym[i].st_shndx == info->index.pcpu) secbase = (unsigned long)mod_percpu(mod); else /** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr; sym[i].st_value += secbase; break; } } A symbol with an out-of-bounds st_shndx value, for example 0xffff (known as SHN_XINDEX or SHN_HIRESERVE), may cause a kernel panic: BUG: unable to handle page fault for address: ... RIP: 0010:simplify_symbols+0x2b2/0x480 ... Kernel panic - not syncing: Fatal exception This can happen when module ELF is legitimately using SHN_XINDEX or when it is corrupted. Add a bounds check in simplify_symbols() to validate that st_shndx is within the valid range before using it. This issue was discovered due to a bug in llvm-objcopy, see relevant discussion for details [1]. [1] https://lore.kernel.org/linux-modules/20251224005752.201911-1-ihor.solodrai@linux.dev/ Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23Remove WARN_ALL_UNSEEDED_RANDOM kernel config optionLinus Torvalds
This config option goes way back - it used to be an internal debug option to random.c (at that point called DEBUG_RANDOM_BOOT), then was renamed and exposed as a config option as CONFIG_WARN_UNSEEDED_RANDOM, and then further renamed to the current CONFIG_WARN_ALL_UNSEEDED_RANDOM. It was all done with the best of intentions: the more limited rate-limited reports were reporting some cases, but if you wanted to see all the gory details, you'd enable this "ALL" option. However, it turns out - perhaps not surprisingly - that when people don't care about and fix the first rate-limited cases, they most certainly don't care about any others either, and so warning about all of them isn't actually helping anything. And the non-ratelimited reporting causes problems, where well-meaning people enable debug options, but the excessive flood of messages that nobody cares about will hide actual real information when things go wrong. I just got a kernel bug report (which had nothing to do with randomness) where two thirds of the the truncated dmesg was just variations of random: get_random_u32 called from __get_random_u32_below+0x10/0x70 with crng_init=0 and in the process early boot messages had been lost (in addition to making the messages that _hadn't_ been lost harder to read). The proper way to find these things for the hypothetical developer that cares - if such a person exists - is almost certainly with boot time tracing. That gives you the option to get call graphs etc too, which is likely a requirement for fixing any problems anyway. See Documentation/trace/boottime-trace.rst for that option. And if we for some reason do want to re-introduce actual printing of these things, it will need to have some uniqueness filtering rather than this "just print it all" model. Fixes: cc1e127bfa95 ("random: remove ratelimiting for in-kernel unseeded randomness") Acked-by: Jason Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-23module: Fix the modversions and signing submenusPetr Pavlu
The module Kconfig file contains a set of options related to "Module versioning support" (depends on MODVERSIONS) and "Module signature verification" (depends on MODULE_SIG). The Kconfig tool automatically creates submenus when an entry for a symbol is followed by consecutive items that all depend on the symbol. However, this functionality doesn't work for the mentioned module options. The MODVERSIONS options are interleaved with ASM_MODVERSIONS, which has no 'depends on MODVERSIONS' but instead uses 'default HAVE_ASM_MODVERSIONS && MODVERSIONS'. Similarly, the MODULE_SIG options are interleaved by a comment warning not to forget signing modules with scripts/sign-file, which uses the condition 'depends on MODULE_SIG_FORCE && !MODULE_SIG_ALL'. The result is that the options are confusingly shown when using a menuconfig tool, as follows: [*] Module versioning support Module versioning implementation (genksyms (from source code)) ---> [ ] Extended Module Versioning Support [*] Basic Module Versioning Support [*] Source checksum for all modules [*] Module signature verification [ ] Require modules to be validly signed [ ] Automatically sign all modules Hash algorithm to sign modules (SHA-256) ---> Fix the issue by using if/endif to group related options together in kernel/module/Kconfig, similarly to how the MODULE_DEBUG options are already grouped. Note that the signing-related options depend on 'MODULE_SIG || IMA_APPRAISE_MODSIG', with the exception of MODULE_SIG_FORCE, which is valid only for MODULE_SIG and is therefore kept separately. For consistency, do the same for the MODULE_COMPRESS entries. The options are then properly placed into submenus, as follows: [*] Module versioning support Module versioning implementation (genksyms (from source code)) ---> [ ] Extended Module Versioning Support [*] Basic Module Versioning Support [*] Source checksum for all modules [*] Module signature verification [ ] Require modules to be validly signed [ ] Automatically sign all modules Hash algorithm to sign modules (SHA-256) ---> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23module: Remove duplicate freeing of lockdep classesPetr Pavlu
In the error path of load_module(), under the free_module label, the code calls lockdep_free_key_range() to release lock classes associated with the MOD_DATA, MOD_RODATA and MOD_RO_AFTER_INIT module regions, and subsequently invokes module_deallocate(). Since commit ac3b43283923 ("module: replace module_layout with module_memory"), the module_deallocate() function calls free_mod_mem(), which releases the lock classes as well and considers all module regions. Attempting to free these classes twice is unnecessary. Remove the redundant code in load_module(). Fixes: ac3b43283923 ("module: replace module_layout with module_memory") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Song Liu <song@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2026-02-23sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE taskChristian Loehle
CPUs whose rq only have SCHED_IDLE tasks running are considered to be equivalent to truly idle CPUs during wakeup path. For fork and exec SCHED_IDLE is even preferred. This is based on the assumption that the SCHED_IDLE CPU is not in an idle state and might be in a higher P-state, allowing the task/wakee to run immediately without sharing the rq. However this assumption doesn't hold if the wakee has SCHED_IDLE policy itself, as it will share the rq with existing SCHED_IDLE tasks. In this case, we are better off continuing to look for a truly idle CPU. On a Intel Xeon 2-socket with 64 logical cores in total this yields for kernel compilation using SCHED_IDLE: +---------+----------------------+----------------------+--------+ | workers | mainline (seconds) | patch (seconds) | delta% | +=========+======================+======================+========+ | 1 | 4384.728 ± 21.085 | 3843.250 ± 16.235 | -12.35 | | 2 | 2242.513 ± 2.099 | 1971.696 ± 2.842 | -12.08 | | 4 | 1199.324 ± 1.823 | 1033.744 ± 1.803 | -13.81 | | 8 | 649.083 ± 1.959 | 559.123 ± 4.301 | -13.86 | | 16 | 370.425 ± 0.915 | 325.906 ± 4.623 | -12.02 | | 32 | 234.651 ± 2.255 | 217.266 ± 0.253 | -7.41 | | 64 | 202.286 ± 1.452 | 197.977 ± 2.275 | -2.13 | | 128 | 217.092 ± 1.687 | 212.164 ± 1.138 | -2.27 | +---------+----------------------+----------------------+--------+ Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260203184939.2138022-1-christian.loehle@arm.com
2026-02-23sched: Replace use of system_unbound_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. For more details see the Link tag below. This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") Switch to using system_dfl_wq because system_unbound_wq is going away as part of a workqueue restructuring. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Link: https://patch.msgid.link/20251107092452.43399-1-marco.crivellari@suse.com
2026-02-23sched: Fix incorrect schedstats for rt and dl threadDengjun Su
For RT and DL thread, only 'set_next_task_(rt/dl)' will call 'update_stats_wait_end_(rt/dl)' to update schedstats information. However, during the migration process, 'update_stats_wait_start_(rt/dl)' will be called twice, which will cause the values of wait_max and wait_sum to be incorrect. The specific output as follows: $ cat /proc/6046/task/6046/sched | grep wait wait_start : 0.000000 wait_max : 496717.080029 wait_sum : 7921540.776553 A complete schedstats information update flow of migrate should be __update_stats_wait_start() [enter queue A, stage 1] -> __update_stats_wait_end() [leave queue A, stage 2] -> __update_stats_wait_start() [enter queue B, stage 3] -> __update_stats_wait_end() [start running on queue B, stage 4] Stage 1: prev_wait_start is 0, and in the end, wait_start records the time of entering the queue. Stage 2: task_on_rq_migrating(p) is true, and wait_start is updated to the waiting time on queue A. Stage 3: prev_wait_start is the waiting time on queue A, wait_start is the time of entering queue B, and wait_start is expected to be greater than prev_wait_start. Under this condition, wait_start is updated to (the moment of entering queue B) - (the waiting time on queue A). Stage 4: the final wait time = (time when starting to run on queue B) - (time of entering queue B) + (waiting time on queue A) = waiting time on queue B + waiting time on queue A. The current problem is that stage 2 does not call __update_stats_wait_end to update wait_start, which causes the final computed wait time = waiting time on queue B + the moment of entering queue A, leading to incorrect wait_max and wait_sum. Add 'update_stats_wait_end_(rt/dl)' in 'update_stats_dequeue_(rt/dl)' to update schedstats information when dequeue_task. Signed-off-by: Dengjun Su <dengjun.su@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260204115959.3183567-1-dengjun.su@mediatek.com
2026-02-23sched/fair: Filter false overloaded_group case for EASVincent Guittot
With EAS, a group should be set overloaded if at least 1 CPU in the group is overutilized but it can happen that a CPU is fully utilized by tasks because of clamping the compute capacity of the CPU. In such case, the CPU is not overutilized and as a result should not be set overloaded as well. group_overloaded being a higher priority than group_misfit, such group can be selected as the busiest group instead of a group with a mistfit task and prevents load_balance to select the CPU with the misfit task to pull the latter on a fitting CPU. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Pierre Gondois <pierre.gondois@arm.com> Link: https://patch.msgid.link/20260206095454.1520619-1-vincent.guittot@linaro.org
2026-02-23sched/fair: Update overutilized detectionVincent Guittot
Checking uclamp_min is useless and counterproductive for overutilized state as misfit can now happen without being in overutilized state. Since commit e5ed0550c04c ("sched/fair: unlink misfit task from cpu overutilized") util_fits_cpu returns -1 when uclamp_min is above capacity which is not considered as cpu overutilized. Remove the useless rq_util_min parameter. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/20260213101751.3121899-1-vincent.guittot@linaro.org
2026-02-23sched/fair: Use full weight to __calc_delta()Peter Zijlstra
Since we now use the full weight for avg_vruntime(), also make __calc_delta() use the full value. Since weight is effectively NICE_0_LOAD, this is 20 bits on 64bit. This leaves 44 bits for delta_exec, which is ~16k seconds, way longer than any one tick would ever be, so no worry about overflow. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080625.183283814%40infradead.org
2026-02-23sched/fair: Revert 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug ↵Peter Zijlstra
causing scheduling lag") Zicheng Qu reported that, because avg_vruntime() always includes cfs_rq->curr, when ->on_rq, place_entity() doesn't work right. Specifically, the lag scaling in place_entity() relies on avg_vruntime() being the state *before* placement of the new entity. However in this case avg_vruntime() will actually already include the entity, which breaks things. Also, Zicheng Qu argues that avg_vruntime should be invariant under reweight. IOW commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") was wrong! The issue reported in 6d71a9c61604 could possibly be explained by rounding artifacts -- notably the extreme weight '2' is outside of the range of avg_vruntime/sum_w_vruntime, since that uses scale_load_down(). By scaling vruntime by the real weight, but accounting it in vruntime with a factor 1024 more, the average moves significantly. However, that is now cured. Tested by reverting 66951e4860d3 ("sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE") and tracing vruntime and vlag figures again. Reported-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080625.066102672%40infradead.org
2026-02-23sched/fair: Increase weight bits for avg_vruntimePeter Zijlstra
Due to the zero_vruntime patch, the deltas are now a lot smaller and measurement with kernel-build and hackbench runs show about 45 bits used. This ensures avg_vruntime() tracks the full weight range, reducing numerical artifacts in reweight and the like. Also, lets keep the paranoid debug code around fow now. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com> Link: https://patch.msgid.link/20260219080624.942813440%40infradead.org
2026-02-23sched/fair: More complex proportional newidle balancePeter Zijlstra
It turns out that a few workloads (easyWave, fio) have a fairly low success rate on newidle balance, but still benefit greatly from having it anyway. Luckliky these workloads have a faily low newidle rate, so the cost if doing the newidle is relatively low, even if unsuccessfull. Add a simple rate based part to the newidle ratio compute, such that low rate newidle will still have a high newidle ratio. This cures the easyWave and fio workloads while not affecting the schbench numbers either (which have a very high newidle rate). Reported-by: Mario Roy <marioeroy@gmail.com> Reported-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Mario Roy <marioeroy@gmail.com> Tested-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com> Link: https://patch.msgid.link/20260127151748.GA1079264@noisy.programming.kicks-ass.net
2026-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 7.0-rc1Alexei Starovoitov
Cross-merge trees after 7.0-rc1. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-02-23perf/core: Fix refcount bug and potential UAF in perf_mmapHaocheng Yu
Syzkaller reported a refcount_t: addition on 0; use-after-free warning in perf_mmap. The issue is caused by a race condition between a failing mmap() setup and a concurrent mmap() on a dependent event (e.g., using output redirection). In perf_mmap(), the ring_buffer (rb) is allocated and assigned to event->rb with the mmap_mutex held. The mutex is then released to perform map_range(). If map_range() fails, perf_mmap_close() is called to clean up. However, since the mutex was dropped, another thread attaching to this event (via inherited events or output redirection) can acquire the mutex, observe the valid event->rb pointer, and attempt to increment its reference count. If the cleanup path has already dropped the reference count to zero, this results in a use-after-free or refcount saturation warning. Fix this by extending the scope of mmap_mutex to cover the map_range() call. This ensures that the ring buffer initialization and mapping (or cleanup on failure) happens atomically effectively, preventing other threads from accessing a half-initialized or dying ring buffer. Closes: https://lore.kernel.org/oe-kbuild-all/202602020208.m7KIjdzW-lkp@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Haocheng Yu <yuhaocheng035@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260202162057.7237-1-yuhaocheng035@gmail.com
2026-02-23perf/core: Fix invalid wait context in ctx_sched_in()Namhyung Kim
Lockdep found a bug in the event scheduling when a pinned event was failed and wakes up the threads in the ring buffer like below. It seems it should not grab a wait-queue lock under perf-context lock. Let's do it with irq_work. [ 39.913691] ============================= [ 39.914157] [ BUG: Invalid wait context ] [ 39.914623] 6.15.0-next-20250530-next-2025053 #1 Not tainted [ 39.915271] ----------------------------- [ 39.915731] repro/837 is trying to lock: [ 39.916191] ffff88801acfabd8 (&event->waitq){....}-{3:3}, at: __wake_up+0x26/0x60 [ 39.917182] other info that might help us debug this: [ 39.917761] context-{5:5} [ 39.918079] 4 locks held by repro/837: [ 39.918530] #0: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: __perf_event_task_sched_in+0xd1/0xbc0 [ 39.919612] #1: ffff88806ca3c6f8 (&cpuctx_lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1a7/0xbc0 [ 39.920748] #2: ffff88800d91fc18 (&ctx->lock){....}-{2:2}, at: __perf_event_task_sched_in+0x1f9/0xbc0 [ 39.921819] #3: ffffffff8725cd00 (rcu_read_lock){....}-{1:3}, at: perf_event_wakeup+0x6c/0x470 Fixes: f4b07fd62d4d ("perf/core: Use POLLHUP for a pinned event in error") Closes: https://lore.kernel.org/lkml/aD2w50VDvGIH95Pf@ly-workstation Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: "Lai, Yi" <yi1.lai@linux.intel.com> Link: https://patch.msgid.link/20250603045105.1731451-1-namhyung@kernel.org
2026-02-23rseq: slice ext: Ensure rseq feature size differs from original rseq sizeMathieu Desnoyers
Before rseq became extensible, its original size was 32 bytes even though the active rseq area was only 20 bytes. This had the following impact in terms of userspace ecosystem evolution: * The GNU libc between 2.35 and 2.39 expose a __rseq_size symbol set to 32, even though the size of the active rseq area is really 20. * The GNU libc 2.40 changes this __rseq_size to 20, thus making it express the active rseq area. * Starting from glibc 2.41, __rseq_size corresponds to the AT_RSEQ_FEATURE_SIZE from getauxval(3). This means that users of __rseq_size can always expect it to correspond to the active rseq area, except for the value 32, for which the active rseq area is 20 bytes. Exposing a 32 bytes feature size would make life needlessly painful for userspace. Therefore, add a reserved field at the end of the rseq area to bump the feature size to 33 bytes. This reserved field is expected to be replaced with whatever field will come next, expecting that this field will be larger than 1 byte. The effect of this change is to increase the size from 32 to 64 bytes before we actually have fields using that memory. Clarify the allocation size and alignment requirements in the struct rseq uapi comment. Change the value returned by getauxval(AT_RSEQ_ALIGN) to return the value of the active rseq area size rounded up to next power of 2, which guarantees that the rseq structure will always be aligned on the nearest power of two large enough to contain it, even as it grows. Change the alignment check in the rseq registration accordingly. This will minimize the amount of ABI corner-cases we need to document and require userspace to play games with. The rule stays simple when __rseq_size != 32: #define rseq_field_available(field) (__rseq_size >= offsetofend(struct rseq_abi, field)) Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260220200642.1317826-3-mathieu.desnoyers@efficios.com