summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
35 hoursbpf: use kvfree() for replaced sysctl write bufferDawei Feng
commit 4c21b5927d4364bfe7365f2700da5fea0ed0d004 upstream. proc_sys_call_handler() allocates its temporary sysctl buffer with kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since kvzalloc() may fall back to vmalloc() for large allocations, freeing that buffer with kfree() is wrong and can corrupt memory. Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc allocations. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. Manual inspection confirms that the bug is still present in v7.1-rc5. Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the test tree also included the accompanying fix for the stale ret == 1 check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines failslab injections to the proc_sys_call_handler() range, uses stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to /proc/sys/kernel/domainname from a task in the target cgroup. Under that setup, fail-nth=1 triggered the fault: BUG: unable to handle page fault for address: ffffeb0200024d48 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 SMP KASAN NOPTI CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4 PREEMPT(lazy) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:kfree+0x6e/0x510 ... Call Trace: <TASK> ? __cgroup_bpf_run_filter_sysctl+0x626/0xc30 __cgroup_bpf_run_filter_sysctl+0x74d/0xc30 ? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? __kvmalloc_node_noprof+0x345/0x870 ? proc_sys_call_handler+0x250/0x480 ? srso_return_thunk+0x5/0x5f proc_sys_call_handler+0x3a2/0x480 ? __pfx_proc_sys_call_handler+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? selinux_file_permission+0x39f/0x500 ? srso_return_thunk+0x5/0x5f ? lock_is_held_type+0x9e/0x120 vfs_write+0x98e/0x1000 ... </TASK> With this fix applied on top of the same test setup, rerunning the reproducer with fail-nth=1 yields no corresponding Oops reports. Fixes: 4508943794ef ("proc: use kvzalloc for our kernel buffer") Cc: stable@vger.kernel.org Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Link: https://lore.kernel.org/r/20260603105317.944304-3-dawei.feng@seu.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hoursblock: invalidate cached plug timestamp after task switchUsama Arif
commit fad156c2af227f42ca796cbb20ddc354a6dd9932 upstream. blk_time_get_ns() caches ktime_get_ns() in current->plug->cur_ktime and marks the task with PF_BLOCK_TS. That cache is only valid while the task keeps running; if the task is switched out, wall-clock time advances and the cached value must not be reused when the task runs again. The existing invalidation covers explicit plug flushes through __blk_flush_plug(), and the schedule() / rtmutex paths through sched_update_worker(). It does not cover in-kernel preemption paths such as preempt_schedule(), preempt_schedule_notrace(), and preempt_schedule_irq(), which enter __schedule(SM_PREEMPT) directly and return without calling sched_update_worker(). As a result, a task preempted while holding a plug with PF_BLOCK_TS set can reuse a stale plug->cur_ktime after it is scheduled back in. blk-iocost then consumes that stale timestamp through ioc_now(), producing stale vnow values for throttle decisions, and through ioc_rqos_done(), inflating on-queue time and feeding false missed-QoS samples into vrate adjustment. Move the schedule-side invalidation to finish_task_switch(), which runs for the scheduled-in task after every actual context switch regardless of which schedule entry point was used. Keep __blk_flush_plug() as the explicit flush/finish-plug invalidation path, and remove only the PF_BLOCK_TS handling from sched_update_worker(). Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-3-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
35 hourskernel/fork: clear PF_BLOCK_TS in copy_process()Usama Arif
commit fd38b75c4b43295b10d69772a46d1c74dbd6fc81 upstream. PF_BLOCK_TS is only set in blk_time_get_ns() when current->plug is non-NULL, and blk_finish_plug() clears it via __blk_flush_plug() before NULLing the plug pointer. copy_process() breaks the invariant by inheriting PF_BLOCK_TS from the parent while resetting the child's plug to NULL. Clear PF_BLOCK_TS alongside that assignment so callers can rely on "PF_BLOCK_TS set implies current->plug != NULL" and dereference current->plug unguarded. Fixes: 06b23f92af87 ("block: update cached timestamp post schedule/preemption") Cc: stable@vger.kernel.org Signed-off-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/20260616141604.328820-2-usama.arif@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task()Tejun Heo
commit 02e545c4297a26dbbc41df81b831e7f605bcd306 upstream. A WARN fires when systemd's user manager writes "+cpu +memory +pids" to its own subtree_control while a sched_ext scheduler is loaded: WARNING: at kernel/sched/ext.c:3227 scx_cgroup_move_task+0xa8/0xb0 scx_cgroup_move_task+0xa8/0xb0 sched_move_task+0x134/0x290 cpu_cgroup_attach+0x39/0x70 cgroup_migrate_execute+0x37d/0x450 cgroup_update_dfl_csses+0x1e3/0x270 cgroup_subtree_control_write+0x3e7/0x440 scx_cgroup_can_attach() arms cgrp_moving_from only when a task's cpu cgroup changes. It can still be NULL when scx_cgroup_move_task() runs, through this sequence: Step Result --------------------------------- ---------------------------------- 1. cpu enabled on cgroup G cpu css = A 2. cpu toggled off then on for G A killed, B created (same cgroup) 3. an exiting task keeps A alive migration skips it, A now stale 4. +memory migrates G stale A vs current B pulls cpu in 5. cpu attach runs for all tasks hits a live, cpu-unchanged task 6. scx_cgroup_move_task() on it cgrp_moving_from NULL -> WARN The mismatch is that scx_cgroup_can_attach() keys on cgroup identity while migration drives the move on css identity, so a NULL cgrp_moving_from here is a legitimate css-only migration, not a missing prep. The call is already gated on cgrp_moving_from, so just drop the warning. ops.cgroup_prep_move() and ops.cgroup_move() stay paired. Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Matt Fleming <mfleming@cloudflare.com> Closes: https://lore.kernel.org/all/20260601124156.2205704-1-mfleming@cloudflare.com/ Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> [ mfleming: keep the 6.18.y SCX_KF_REST argument in the SCX_CALL_OP_TASK() call. ] Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-19locking/rtmutex: Skip remove_waiter() when waiter is not enqueuedDavidlohr Bueso
commit 40a25d59e85b3c8709ac2424d44f65610467871e upstream. syzbot triggered the following splat in remove_waiter() via FUTEX_CMP_REQUEUE_PI: KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f] class_raw_spinlock_constructor remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561 rt_mutex_start_proxy_lock+0x103/0x120 futex_requeue+0x10e4/0x20d0 __x64_sys_futex+0x34f/0x4d0 task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection, leaving waiter->task nil, where 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()") made this fatal. Furthermore, rt_mutex_start_proxy_lock() should not be calling into remove_waiter() upon a successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock return correctly"), moved the remove_waiter() out of __rt_mutex_start_proxy_lock() (where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to account for try_to_take_rt_mutex(). Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()") Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/ Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19futex/requeue: Prevent NULL pointer dereference in remove_waiter() on ↵Ji'an Zhou
self-deadlock commit 74e144274af39935b0f410c0ee4d2b91c3730414 upstream. When FUTEX_CMP_REQUEUE_PI requeues a non-top waiter that already owns the target PI futex, task_blocks_on_rt_mutex() returns -EDEADLK before setting waiter->task. The subsequent remove_waiter() in rt_mutex_start_proxy_lock() dereferences the NULL waiter->task, causing a kernel crash. Add a self-deadlock check for non-top waiters before calling rt_mutex_start_proxy_lock(), analogous to the top-waiter check in futex_lock_pi_atomic(). Fixes: 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 ("rtmutex: Use waiter::task instead of current in remove_waiter()") Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19pidfd: refuse access to tasks that have started exiting harderChristian Brauner
commit 62c4d31d78294bd61cf3403626b789e854357177 upstream. The recent ptrace fix closed a hole where someone could rely on task->mm becoming NULL during do_exit() to bypass dumpability checks. This api here leans on on the very same check and so inherits the fix. But there is no good reason to let it succeed at all once the target has entered do_exit(). PF_EXITING is set by exit_signals() at the very top of do_exit(), before exit_mm() and exit_files() run. Once we observe it, the task is committed to dying and exit_files() will release the fdtable shortly. Fixes: 8649c322f75c ("pid: Implement pidfd_getfd syscall") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260518-obgleich-petersilie-2d77ccccf9b9@brauner Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19timers/migration: Fix livelock in tmigr_handle_remote_up()Amit Matityahu
commit d486b4934a8e504376b85cdb3766f306d57aff5b upstream. tmigr_handle_remote_cpu() skips timer_expire_remote() when cpu == smp_processor_id(), assuming the local softirq path already handled this CPU's timers. This assumption is wrong because jiffies can advance after the handling of the CPU's global timers in run_timer_base(BASE_GLOBAL) and before tmigr_handle_remote() evaluates the expiry times. As a consequence a timer which expires after the CPU local timer wheel advanced and becomes expired in the remote handling is ignored and the callback is never invoked and removed from the timer wheel. What's worse is that fetch_next_timer_interrupt_remote() keeps reporting it as expired, and the event is re-queued with expires == now on each iteration. The goto-again loop spins indefinitely. Fix this by calling timer_expire_remote() unconditionally. That's minimal overhead for the common case as __run_timer_base() returns immediately if there is nothing to expire in the local wheel. [ tglx: Amend change log and add a comment ] Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Reported-by: Alon Kariv <alonka@amazon.com> Signed-off-by: Amit Matityahu <amitmat@amazon.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260603170139.33628-1-amitmat@amazon.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19tracing/probes: Point the error offset correctly for eprobe argument errorMasami Hiramatsu (Google)
commit 85e0f27dd1396307913ffc5745b0c05137e9beac upstream. Fix to point the error offset correctly for eprobe argument error. In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser"), due to incorrect backward compatibility aimed at conforming to the test specifications, the error location was set to 0 when a non-existent formal parameter was specified for Eprobe. However, this should be corrected in both the test and the implementation to point correct error position. Link: https://lore.kernel.org/all/177967567399.209006.1451571244515632097.stgit@devnote2/ Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-19dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_deviceLi RongQing
[ Upstream commit 9bfaa86b405381326c971984fd6da184c289713f ] In debug_dma_sync_sg_for_device(), when iterating over a scatterlist, the debug entry population mistakenly uses the head of the scatterlist 'sg' to fetch the physical address via sg_phys(), instead of using the current iterator variable 's'. This causes dma-debug to track the physical address of the very first scatterlist entry for all subsequent entries in the list. Fix this by passing the correct loop iterator 's' to sg_phys() Fixes: 9d4f645a1fd49ee ("dma-debug: store a phys_addr_t in struct dma_debug_entry") Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260603123708.1665-1-lirongqing@baidu.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-19dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segmentsLi RongQing
[ Upstream commit 560000d619ef162568746ce287f0c725e24ea967 ] In dma_direct_map_sg(), the case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE incorrectly used 'break' instead of falling through to MAP_NONE. As a result, segments traversing the host bridge skipped the required dma_direct_map_phys() call entirely, leaving sg->dma_address uninitialized and leading to DMA failures. Fix this by using 'fallthrough;'. Fixes: a25e7962db0d79 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers") Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260603013723.2439-1-lirongqing@baidu.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-19time: Fix off-by-one in settimeofday() usec validationNaveen Kumar Chaudhary
[ Upstream commit ce4abda5e12622f33450159e76c8f56d28d7f03d ] The validation check uses '>' instead of '>=' when comparing tv_usec against USEC_PER_SEC, allowing the value 1000000 through. After conversion to nanoseconds (*= 1000), this produces tv_nsec == NSEC_PER_SEC, violating the timespec invariant that tv_nsec must be less than NSEC_PER_SEC. Use '>=' to reject tv_usec values that are not in the valid range of 0 to 999999. Fixes: 5e0fb1b57bea ("y2038: time: avoid timespec usage in settimeofday()") Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Acked-by: John Stultz <jstultz@google.com> Link: https://patch.msgid.link/4rikk44zew3s6577dugmx4jyblz7o5c57niuap6ct3td5yfm6w@gh7pcumg7qor Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-19signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()Aleksandr Nogikh
[ Upstream commit 90918794a4e2c3b440f8fcf3847765a8b1d81b25 ] When a multi-threaded process receives a stop signal (e.g., SIGSTOP), do_signal_stop() sets JOBCTL_STOP_PENDING and JOBCTL_STOP_CONSUME on all threads and sets signal->group_stop_count to the number of threads. If one of the threads concurrently calls execve(), de_thread() invokes zap_other_threads() to kill all other threads. zap_other_threads() aborts the pending group stop by resetting signal->group_stop_count to 0 and clears the JOBCTL_PENDING_MASK for all other threads. However, it fails to clear the job control flags for the calling thread. When execve() completes, the calling thread returns to user mode and checks for pending signals. Seeing the stale JOBCTL_STOP_PENDING flag, it calls do_signal_stop(), which invokes task_participate_group_stop(). Since JOBCTL_STOP_CONSUME is still set, it attempts to decrement the already-zero signal->group_stop_count, triggering a warning: sig->group_stop_count == 0 WARNING: CPU: 1 PID: 6475 at kernel/signal.c:373 task_participate_group_stop+0x215/0x2d0 Call Trace: <TASK> do_signal_stop+0x3be/0x5c0 kernel/signal.c:2619 get_signal+0xa8c/0x1330 kernel/signal.c:2884 arch_do_signal_or_restart+0xbc/0x840 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop+0x8c/0x4d0 kernel/entry/common.c:98 do_syscall_64+0x33e/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Fix this race condition by clearing the JOBCTL_PENDING_MASK for the calling thread in zap_other_threads(), ensuring it does not retain any stale job control state after the thread group is destroyed. This aligns with other functions that tear down a thread group and abort group stops, such as zap_process() and complete_signal(), which correctly clear these flags for all threads including the current one. Fixes: 39efa3ef3a37 ("signal: Use GROUP_STOP_PENDING to stop once for a single group stop") Assisted-by: Gemini:gemini-3.1-pro-preview Gemini:gemini-3-flash-preview syzbot Reported-by: syzbot+b109633ea805cac54a61@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b109633ea805cac54a61 Link: https://syzkaller.appspot.com/ai_job?id=d70208cc-862b-4fe3-bf02-3031e10cd0b3 Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Link: https://patch.msgid.link/20260521142240.2973022-1-nogikh@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01tracing: Avoid NULL return from hist_field_name() on truncationDavid Carlier
[ Upstream commit 576ec047d20b368b43c4d5db98c4f2e0f3c101ec ] hist_field_name() returns "" everywhere except the fully-qualified VAR_REF/EXPR case, where snprintf() truncation returns NULL early and bypasses the bottom NULL->"" guard. Callers don't expect NULL: strcat(expr, hist_field_name(field, 0)) at trace_events_hist.c:1758 and the strcmp() in the sort-key match loop at :4804 both deref it. system and event_name are bounded by MAX_EVENT_NAME_LEN, but the field name on a VAR_REF is kstrdup'd from a histogram variable name parsed out of the trigger string and has no length cap, so a long enough var name in a fully qualified reference can reach the truncation path. Keep the length check but leave field_name as "" on overflow. Link: https://patch.msgid.link/20260508195747.25492-1-devnexen@gmail.com Fixes: 5ec1d1e97de1 ("tracing: Rebuild full_name on each hist_field_name() call") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01cgroup: rstat: relax NMI guard after switch to try_cmpxchgCunlong Li
[ Upstream commit 22572dbcd3486e6c4dced877125bbf50e4e24edf ] Commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe") used this_cpu_cmpxchg() for the lockless insertion, and therefore required both ARCH_HAVE_NMI_SAFE_CMPXCHG and ARCH_HAS_NMI_SAFE_THIS_CPU_OPS in the NMI guard: on archs without the latter, this_cpu_cmpxchg() falls back to "local_irq_save() + plain cmpxchg", and local_irq_save() cannot mask NMIs. Commit 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") later replaced this_cpu_cmpxchg() with plain try_cmpxchg() to fix cross-CPU lockless-list corruption, but left the NMI guard untouched. After that switch, css_rstat_updated() no longer performs any this_cpu_*() RMW operations and only relies on the arch having NMI-safe cmpxchg, so ARCH_HAS_NMI_SAFE_THIS_CPU_OPS is no longer required in the guard. Relax the guard accordingly so that archs which have HAVE_NMI and ARCH_HAVE_NMI_SAFE_CMPXCHG but not ARCH_HAS_NMI_SAFE_THIS_CPU_OPS (e.g. sparc, powerpc on PPC64/BOOK3S) can benefit from the existing CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC path. Without this, the css is never queued in NMI on those archs, and the atomics staged by account_{slab,kmem}_nmi_safe() are not drained by flush_nmi_stats(). Fixes: 3309b63a2281 ("cgroup: rstat: use LOCK CMPXCHG in css_rstat_updated") Signed-off-by: Cunlong Li <shenxiaogll@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01cgroup/rstat: validate cpu before css_rstat_cpu() accessQing Ming
[ Upstream commit 8817005efbdfdf5d4e4814cb5dc52b53d12917d7 ] css_rstat_updated() is exposed as a BPF kfunc and accepts a caller-provided cpu argument. The function uses cpu for per-cpu rstat lookups without checking whether it refers to a valid possible CPU. A BPF iter/cgroup program with CAP_BPF and CAP_PERFMON can pass an invalid cpu value. On an unfixed UBSCAN_BOUNDS test kernel, cpu == 0x7fffffff triggers: UBSAN: array-index-out-of-bounds in kernel/cgroup/rstat.c:31:9 index 2147483647 is out of range for type 'long unsigned int [64]' Call Trace: css_rstat_updated bpf_iter_run_prog cgroup_iter_seq_show bpf_seq_read Add cpu validation to the BPF-facing css_rstat_updated() kfunc and move the common implementation to __css_rstat_updated() for in-kernel callers. Fixes: a319185be9f5 ("cgroup: bpf: enable bpf programs to integrate with rstat") Signed-off-by: Qing Ming <a0yami@mailbox.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01dma-mapping: move dma_map_resource() sanity check into debug codeJianpeng Chang
[ Upstream commit af0c3f05866237f7592219bfe05387bc3bfc99b5 ] dma_map_resource() uses pfn_valid() to ensure the range is not RAM. However, pfn_valid() only checks for availability of the memory map for a PFN but it does not ensure that the PFN is actually backed by RAM. On ARM64 with SPARSEMEM (128MB section granularity), MMIO addresses that share a section with RAM will falsely trigger the WARN_ON_ONCE and cause dma_map_resource() to return DMA_MAPPING_ERROR. This causes a WARNING on Raspberry Pi 4 during spi_bcm2835 probe because the SPI FIFO register (0xfe204004) falls in the same sparsemem section as the end of RAM (0xf8000000-0xfbffffff), both in section 31 (0xf8000000-0xffffffff). Move the sanity check from dma_map_resource() into debug_dma_map_phys() and replace the unreliable pfn_valid() with pfn_valid() && !PageReserved(), which correctly identifies actual usable RAM without false positives for MMIO regions that happen to have struct pages. Since dma_map_resource() is dma_map_phys(DMA_ATTR_MMIO), the check applies equally to both APIs. Any non-reserved page represents kernel memory to a sufficient degree that using DMA_ATTR_MMIO on it is almost certainly wrong and risks breaking coherency on non-coherent platforms. ZONE_DEVICE pages used for PCI P2P DMA (MEMORY_DEVICE_PCI_P2PDMA) have PageReserved set, so they will not trigger a false positive. The check no longer blocks the mapping and uses err_printk() to integrate with dma-debug filtering. Fixes: f7326196a781 ("dma-mapping: export new dma_*map_phys() interface") Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Jianpeng Chang <jianpeng.chang.cn@windriver.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20260513072209.1486986-1-jianpeng.chang.cn@windriver.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01irq_work: Fix use-after-free in irq_work_single() on PREEMPT_RTJiayuan Chen
[ Upstream commit 91840be8f710370607f949a627e070896faeddb8 ] On PREEMPT_RT, non-HARD irq_work runs in per-CPU kthreads via run_irq_workd(), so irq_work_sync() uses rcuwait() to wait for BUSY==0. After irq_work_single() clears BUSY via atomic_cmpxchg(), it still dereferences @work for irq_work_is_hard() and rcuwait_wake_up(). An irq_work_sync() caller on another CPU that enters after BUSY is cleared can observe BUSY==0 immediately, return, and free the work before those accesses complete — causing a use-after-free. Fix this by wrapping run_irq_workd() in guard(rcu)() so that the entire irq_work_single() execution is within an RCU read-side critical section. Then add synchronize_rcu() in irq_work_sync() after rcuwait_wait_event() to ensure the caller waits for the RCU grace period before returning, preventing premature frees. Fixes: 810979682ccc ("irq_work: Allow irq_work_sync() to sleep if irq_work() no IRQ support.") Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260330073234.303732-1-jiayuan.chen@linux.dev Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01fprobe: Fix unregister_fprobe() to wait for RCU grace periodMasami Hiramatsu (Google)
[ Upstream commit 657b594b2084b39a4bc6d8493aa2140cb00cea49 ] Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") changed fprobe to register struct fprobe to an rcu-hlist, but it forgot to wait for RCU GP. Thus there can be use-after-free if the fprobe is released right after unregistering. This can be happened on fprobe event and sample module code. To fix this issue, add synchronize_rcu() in unregister_fprobe(). Note that BPF is OK because fprobe is used as a part of bpf_kprobe_multi_link. This unregisters its fprobe in bpf_kprobe_multi_link_release() and it is deallocated via bpf_kprobe_multi_link_dealloc(), which is invoked from bpf_link_defer_dealloc_rcu_gp() RCU callback. For BPF, this also introduced unregister_fprobe_async() which does NOT wait for RCU grace priod. Link: https://lore.kernel.org/all/177813998919.256460.2809243930741138224.stgit@mhiramat.tok.corp.google.com/ Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01tracing: Do not call map->ops->elt_free() if elt_alloc() failsMasami Hiramatsu (Google)
commit 8f0f5c4fb9df0e19a341e0c6ed8dc4fda9124f03 upstream. In paths where tracing_map_elt_alloc() failed to allocate objects, the map->ops->elt_alloc() call was never successful. In this case, map->ops->elt_free() should not be called. Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com Cc: stable@vger.kernel.org Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Rosen Penev <rosenp@gmail.com> Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map") Link: https://patch.msgid.link/177933895460.108746.5396070821443932634.stgit@devnote2 Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01ring-buffer: Flush and stop persistent ring buffer on panicMasami Hiramatsu (Google)
commit a494d3c8d5392bcdff83c2a593df0c160ff9f322 upstream. On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01ring-buffer: Fix reporting of missed events in iteratorSteven Rostedt
commit a254b6d13b0edd6272926674d2afc46d46e496b7 upstream. When tracing is active while reading the trace file, if the iterator reading the buffer detects that the writer has passed the iterator head, it will reset and set a "missed events" flag. This flag is passed to the output processing to show the user that events were missed: CPU:4 [LOST EVENTS] The problem is that the flag is reset after it is checked in ring_buffer_iter_dropped(). But the "trace" file iterates over all the CPU ring buffers and it will check if they are dropped when figuring out which buffer to print next. This prematurely clears the missed_events flag if the CPU buffer with the missed events is not the one that is printed next. On the iteration where the CPU buffer with the missed events is printed, the check if it had missed events would return false and the output does not show that events were missed. Do not reset the missed_events flag when checking if there were missed events, but instead clear it when moving the iterator head to the next event. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260520220801.4fd09d13@fedora Fixes: c9b7a4a72ff64 ("ring-buffer/tracing: Have iterator acknowledge dropped events") Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01cgroup/cpuset: Reset DL migration state on can_attach() failureGuopeng Zhang
[ Upstream commit 4a39eda5fdd867fc39f3c039714dd432cee00268 ] cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration state in the destination cpuset while walking the taskset. If a later task_can_attach() or security_task_setscheduler() check fails, cgroup_migrate_execute() treats cpuset as the failing subsystem and does not call cpuset_cancel_attach() for it. The partially accumulated state is then left behind and can be consumed by a later attach, corrupting cpuset DL task accounting and pending DL bandwidth accounting. Reset the pending DL migration state from the common error exit when ret is non-zero. Successful can_attach() keeps the state for cpuset_attach() or cpuset_cancel_attach(). Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com> [ omitted upstream context line `cs->dl_bw_cpu = cpu;` ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01tracing/fprobe: Check the same type fprobe on table as the unregistered oneMasami Hiramatsu (Google)
[ Upstream commit 0ac0058a74ac5765c7ce09ea630f4fdeaf4d80fa ] Commit 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") introduced a different ftrace_ops for entry-only fprobes. However, when unregistering an fprobe, the kernel only checks if another fprobe exists at the same address, without checking which type of fprobe it is. If different fprobes are registered at the same address, the same address will be registered in both fgraph_ops and ftrace_ops, but only one of them will be deleted when unregistering. (the one removed first will not be deleted from the ops). This results in junk entries remaining in either fgraph_ops or ftrace_ops. For example: ======= cd /sys/kernel/tracing # 'Add entry and exit events on the same place' echo 'f:event1 vfs_read' >> dynamic_events echo 'f:event2 vfs_read%return' >> dynamic_events # 'Enable both of them' echo 1 > events/fprobes/enable cat enabled_functions vfs_read (2) ->arch_ftrace_ops_list_func+0x0/0x210 # 'Disable and remove exit event' echo 0 > events/fprobes/event2/enable echo -:event2 >> dynamic_events # 'Disable and remove all events' echo 0 > events/fprobes/enable echo > dynamic_events # 'Add another event' echo 'f:event3 vfs_open%return' > dynamic_events cat dynamic_events f:fprobes/event3 vfs_open%return echo 1 > events/fprobes/enable cat enabled_functions vfs_open (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} vfs_read (1) tramp: 0xffffffffa0001000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 subops: {ent:fprobe_fgraph_entry+0x0/0x620 ret:fprobe_return+0x0/0x150} ======= As you can see, an entry for the vfs_read remains. To fix this issue, when unregistering, the kernel should also check if there is the same type of fprobes still exist at the same address, and if not, delete its entry from either fgraph_ops or ftrace_ops. Link: https://lore.kernel.org/all/177669367993.132053.10553046138528674802.stgit@mhiramat.tok.corp.google.com/ Fixes: 2c67dc457bc6 ("tracing: fprobe: optimization for entry only case") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01tracing/fprobe: Avoid kcalloc() in rcu_read_lock sectionMasami Hiramatsu (Google)
[ Upstream commit aa72812b49104bb5a38272fc9541feb62ca6fd32 ] fprobe_remove_node_in_module() is called under RCU read locked, but this invokes kcalloc() if there are more than 8 fprobes installed on the module. Sashiko warns it because kcalloc() can sleep [1]. [1] https://sashiko.dev/#/patchset/177552432201.853249.5125045538812833325.stgit%40mhiramat.tok.corp.google.com To fix this issue, expand the batch size to 128 and do not expand the fprobe_addr_list, but just cancel walking on fprobe_ip_table, update fgraph/ftrace_ops and retry the loop again. Link: https://lore.kernel.org/all/177669367206.132053.1493637946869032744.stgit@mhiramat.tok.corp.google.com/ Fixes: 0de4c70d04a4 ("tracing: fprobe: use rhltable for fprobe_ip_table") Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 0ac0058a74ac ("tracing/fprobe: Check the same type fprobe on table as the unregistered one") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01tracing: fprobe: use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_ARGSMenglong Dong
[ Upstream commit cd06078a38aaedfebbf8fa0c009da0f99f4473fb ] For now, we will use ftrace for the fprobe if fp->exit_handler not exists and CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled. However, CONFIG_DYNAMIC_FTRACE_WITH_REGS is not supported by some arch, such as arm. What we need in the fprobe is the function arguments, so we can use ftrace for fprobe if CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled. Therefore, use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_REGS or CONFIG_DYNAMIC_FTRACE_WITH_ARGS enabled. Link: https://lore.kernel.org/all/20251103063434.47388-1-dongml2@chinatelecom.cn/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Stable-dep-of: 0ac0058a74ac ("tracing/fprobe: Check the same type fprobe on table as the unregistered one") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01tracing: fprobe: Remove unused local variableMasami Hiramatsu (Google)
[ Upstream commit 90e69d291d195d35215b578d210fd3ce0e5a3f42 ] The 'ret' local variable in fprobe_remove_node_in_module() was used for checking the error state in the loop, but commit dfe0d675df82 ("tracing: fprobe: use rhltable for fprobe_ip_table") removed the loop. So we don't need it anymore. Link: https://lore.kernel.org/all/175867358989.600222.6175459620045800878.stgit@devnote2/ Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Menglong Dong <menglong8.dong@gmail.com> Stable-dep-of: 0ac0058a74ac ("tracing/fprobe: Check the same type fprobe on table as the unregistered one") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01sched_ext: Avoid UAF in scx_root_enable_workfn() init failure pathTejun Heo
[ Upstream commit 9a415cc53711f2238e0f0ca8a6bcc796c003b127 ] In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error() dereferences p->comm and p->pid. If the iterator's reference is the last drop, the task is freed synchronously and the deref becomes a UAF. Move put_task_struct() past scx_error(). Reported-by: Sashiko <sashiko-bot@kernel.org> Closes: https://lore.kernel.org/all/20260511214031.AF5E9C2BCB0@smtp.kernel.org/ Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org> [ kept `scx_init_task()` call site instead of `__scx_init_task()`/`task_rq_lock` ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01sched_ext: Fix missing warning in scx_set_task_state() default caseSamuele Mariotti
[ Upstream commit b905ee77d5f557a83a485b4146210f54f13365fc ] In scx_set_task_state(), the default case was setting the warn flag, but then returning immediately. This is problematic because the only purpose of the warn flag is to trigger WARN_ONCE, but the early return prevented it from ever firing, leaving invalid task states undetected and untraced. To fix this, a WARN_ONCE call is now added directly in the default case. The fix addresses two aspects: - Guarantees the invalid task states are properly logged and traced. - Provides a distinct warning message ("sched_ext: Invalid task state") specifically for states outside the defined scx_task_state enum values, making it easier to distinguish from other transition warnings. This ensures proper detection and reporting of invalid states. Signed-off-by: Samuele Mariotti <smariotti@disroot.org> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org> Stable-dep-of: 9a415cc53711 ("sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-06-01sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boostingJuri Lelli
[ Upstream commit d658686a1331db3bb108ca079d76deb3208ed949 ] Running stress-ng --schedpolicy 0 on an RT kernel on a big machine might lead to the following WARNINGs (edited). sched: DL de-boosted task PID 22725: REPLENISH flag missing WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8 ... (running_bw underflow) Call trace: dequeue_task_dl+0x15c/0x1f8 (P) dequeue_task+0x80/0x168 deactivate_task+0x24/0x50 push_dl_task+0x264/0x2e0 dl_task_timer+0x1b0/0x228 __hrtimer_run_queues+0x188/0x378 hrtimer_interrupt+0xfc/0x260 ... The problem is that when a SCHED_DEADLINE task (lock holder) is changed to a lower priority class via sched_setscheduler(), it may fail to properly inherit the parameters of potential DEADLINE donors if it didn't already inherit them in the past (shorter deadline than donor's at that time). This might lead to bandwidth accounting corruption, as enqueue_task_dl() won't recognize the lock holder as boosted. The scenario occurs when: 1. A DEADLINE task (donor) blocks on a PI mutex held by another DEADLINE task (holder), but the holder doesn't inherit parameters (e.g., it already has a shorter deadline) 2. sched_setscheduler() changes the holder from DEADLINE to a lower class while still holding the mutex 3. The holder should now inherit DEADLINE parameters from the donor and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen Fix the issue by introducing __setscheduler_dl_pi(), which detects when a DEADLINE (proper or boosted) task gets setscheduled to a lower priority class. In case, the function makes the task inherit DEADLINE parameters of the donoer (pi_se) and sets ENQUEUE_REPLENISH flag to ensure proper bandwidth accounting during the next enqueue operation. Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-06-01sched: Employ sched_change guardsPeter Zijlstra
[ Upstream commit e9139f765ac7048cadc9981e962acdf8b08eabf3 ] As proposed a long while ago -- and half done by scx -- wrap the scheduler's 'change' pattern in a guard helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Stable-dep-of: d658686a1331 ("sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_beforeTejun Heo
[ Upstream commit 4155fb489fa175ec74eedde7d02219cf2fe74303 ] scx_prio_less() runs from core-sched's pick_next_task() path with rq locked but invokes ops.core_sched_before() with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass task_rq(a). Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> [ adapted call to use stable's single `sch`/`SCX_KF_REST` mask and `scx_rq_bypassing(task_rq(a))` signature ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_newTejun Heo
[ Upstream commit 4fda9f0e7c950da4fe03cedeb2ac818edf5d03e9 ] bpf_iter_scx_dsq_new() clears kit->dsq on failure and bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't - it dereferences kit->dsq immediately, so a BPF program that calls scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel. Return false if kit->dsq is NULL. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> [ dropped upstream `sch = src_dsq->sched` reordering since stable initializes `sch` from `scx_root` instead ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIVSergio Correia
commit f9e1c1324b4d98d591a6f7568fdebf5cf456dfc2 upstream. AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This allows a process with CAP_AUDIT_CONTROL to modify directory tree watches and equivalence mappings even when the audit configuration has been locked, undermining the purpose of the lock. Add AUDIT_LOCKED checks to both commands. Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23cgroup/dmem: Return -ENOMEM on failed pool preallocationGuopeng Zhang
commit 796ad622040f7f955ccc3973085e953415920496 upstream. get_cg_pool_unlocked() handles allocation failures under dmemcg_lock by dropping the lock, preallocating a pool with GFP_KERNEL, and retrying the locked lookup and creation path. If the fallback allocation fails too, pool remains NULL. Since the loop condition is while (!pool), the function can keep retrying instead of propagating the allocation failure to the caller. Set pool to ERR_PTR(-ENOMEM) when the fallback allocation fails so the loop exits through the existing common return path. The callers already handle ERR_PTR() from get_cg_pool_unlocked(), so this restores the expected error path. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Cc: stable@vger.kernel.org # v6.14+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23audit: fix incorrect inheritable capability in CAPSET recordsSergio Correia
commit e4a640475e43f406fdfd56d370b1f34b0cbbc18d upstream. __audit_log_capset() records the effective capability set into the inheritable field due to a copy-paste error. Every CAPSET audit record therefore reports cap_pi (process inheritable) with the value of cap_effective instead of cap_inheritable. This silently corrupts audit data used for compliance and forensic analysis: an attacker who modifies inheritable capabilities to prepare for a privilege-escalating exec would have the change masked in the audit trail. The bug has been present since the original introduction of CAPSET audit records in 2008. Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND pathBreno Leitao
commit 0143033dc22cdff912cfc13419f5db92fea3b4cb upstream. For WQ_UNBOUND workqueues, alloc_and_link_pwqs() allocates wq->cpu_pwq via alloc_percpu() and then calls apply_workqueue_attrs_locked(). On failure it returns the error directly, bypassing the enomem: label which holds the only free_percpu(wq->cpu_pwq) in this function. The caller's error path kfree()s wq without touching wq->cpu_pwq, leaking one percpu pointer table (nr_cpu_ids * sizeof(void *) bytes) per failed call. If kmemleak is enabled, we can see: unreferenced object (percpu) 0xc0fffa5b121048 (size 8): comm "insmod", pid 776, jiffies 4294682844 backtrace (crc 0): pcpu_alloc_noprof+0x665/0xac0 __alloc_workqueue+0x33f/0xa20 alloc_workqueue_noprof+0x60/0x100 Route the error through the existing enomem: cleanup and any error before this one. Cc: stable@kernel.org Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-05-23sched/fair: Revert force wakeup preemptionVincent Guittot
[ Upstream commit 15257cc2f905dbf5813c0bfdd3c15885f28093c4 ] This agressively bypasses run_to_parity and slice protection with the assumpiton that this is what waker wants but there is no garantee that the wakee will be the next to run. It is a better choice to use yield_to_task or WF_SYNC in such case. This increases the number of resched and preemption because a task becomes quickly "ineligible" when it runs; We update the task vruntime periodically and before the task exhausted its slice or at least quantum. Example: 2 tasks A and B wake up simultaneously with lag = 0. Both are eligible. Task A runs 1st and wakes up task C. Scheduler updates task A's vruntime which becomes greater than average runtime as all others have a lag == 0 and didn't run yet. Now task A is ineligible because it received more runtime than the other task but it has not yet exhausted its slice nor a min quantum. We force preemption, disable protection but Task B will run 1st not task C. Sidenote, DELAY_ZERO increases this effect by clearing positive lag at wake up. Fixes: e837456fdca8 ("sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260123102858.52428-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched/fair: Fix wakeup_preempt_fair() for not waking up taskVincent Guittot
[ Upstream commit 9f6d929ee2c6f0266edb564bcd2bd47fd6e884a8 ] Make sure to only call pick_next_entity() on an non-empty cfs_rq. The assumption that p is always enqueued and not delayed, is only true for wakeup. If p was moved while delayed, pick_next_entity() will dequeue it and the cfs might become empty. Test if there are still queued tasks before trying again to determine if p could be the next one to be picked. There are at least 2 cases: When cfs becomes idle, it tries to pull tasks but if those pulled tasks are delayed, they will be dequeued when attached to cfs. attach_tasks() -> attach_task() -> wakeup_preempt(rq, p, 0); A misfit task running on cfs A triggers a load balance to be pulled on a better cpu, the load balance on cfs B starts an active load balance to pulled the running misfit task. If there is a delayed dequeue task on cfs A, it can be pulled instead of the previously running misfit task. attach_one_task() -> attach_task() -> wakeup_preempt(rq, p, 0); Fixes: ac8e69e69363 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260503104503.1732682-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23bpf: Fix sync_linked_regs regarding BPF_ADD_CONST32 zext propagationDaniel Borkmann
[ Upstream commit bc308be380c136800e1e94c6ce49cb53141d6506 ] Jenny reported that in sync_linked_regs() the BPF_ADD_CONST32 flag is checked on known_reg (the register narrowed by a conditional branch) instead of reg (the linked target register created by an alu32 operation). Example case with reg: 1. r6 = bpf_get_prandom_u32() 2. r7 = r6 (linked, same id) 3. w7 += 5 (alu32 -- r7 gets BPF_ADD_CONST32, zero-extended by CPU) 4. if w6 < 0xFFFFFFFC goto safe (narrows r6 to [0xFFFFFFFC, 0xFFFFFFFF]) 5. sync_linked_regs() propagates to r7 but does NOT call zext_32_to_64() 6. Verifier thinks r7 is [0x100000001, 0x100000004] instead of [1, 4] Since known_reg above does not have BPF_ADD_CONST32 set above, zext_32_to_64() is never called on alu32-derived linked registers. This causes the verifier to track incorrect 64-bit bounds, while the CPU correctly zero-extends the 32-bit result. The code checking known_reg->id was correct however (see scalars_alu32_wrap selftest case), but the real fix needs to handle both directions - zext propagation should be done when either register has BPF_ADD_CONST32, since the linked relationship involves a 32-bit operation regardless of which side has the flag. Example case with known_reg (exercised also by scalars_alu32_wrap): 1. r1 = r0; w1 += 0x100 (alu32 -- r1 gets BPF_ADD_CONST32) 2. if r1 > 0x80 - known_reg = r1 (has BPF_ADD_CONST32), reg = r0 (doesn't) Hence, fix it by checking for (reg->id | known_reg->id) & BPF_ADD_CONST32. Moreover, sync_linked_regs() also has a soundness issue when two linked registers used different ALU widths: one with BPF_ADD_CONST32 and the other with BPF_ADD_CONST64. The delta relationship between linked registers assumes the same arithmetic width though. When one register went through alu32 (CPU zero-extends the 32-bit result) and the other went through alu64 (no zero-extension), the propagation produces incorrect bounds. Example: r6 = bpf_get_prandom_u32() // fully unknown if r6 >= 0x100000000 goto out // constrain r6 to [0, U32_MAX] r7 = r6 w7 += 1 // alu32: r7.id = N | BPF_ADD_CONST32 r8 = r6 r8 += 2 // alu64: r8.id = N | BPF_ADD_CONST64 if r7 < 0xFFFFFFFF goto out // narrows r7 to [0xFFFFFFFF, 0xFFFFFFFF] At the branch on r7, sync_linked_regs() runs with known_reg=r7 (BPF_ADD_CONST32) and reg=r8 (BPF_ADD_CONST64). The delta path computes: r8 = r7 + (delta_r8 - delta_r7) = 0xFFFFFFFF + (2 - 1) = 0x100000000 Then, because known_reg->id has BPF_ADD_CONST32, zext_32_to_64(r8) is called, truncating r8 to [0, 0]. But r8 used a 64-bit ALU op -- the CPU does NOT zero-extend it. The actual CPU value of r8 is 0xFFFFFFFE + 2 = 0x100000000, not 0. The verifier now underestimates r8's 64-bit bounds, which is a soundness violation. Fix sync_linked_regs() by skipping propagation when the two registers have mixed ALU widths (one BPF_ADD_CONST32, the other BPF_ADD_CONST64). Lastly, fix regsafe() used for path pruning: the existing checks used "& BPF_ADD_CONST" to test for offset linkage, which treated BPF_ADD_CONST32 and BPF_ADD_CONST64 as equivalent. Fixes: 7a433e519364 ("bpf: Support negative offsets, BPF_SUB, and alu32 for linked register tracking") Reported-by: Jenny Guanni Qu <qguanni@gmail.com> Co-developed-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260319211507.213816-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23futex: Drop CLONE_THREAD requirement for private default hash allocDavidlohr Bueso
[ Upstream commit ee9dce44362b2d8132c32964656ab6dff7dfbc6a ] Currently need_futex_hash_allocate_default() depends on strict pthread semantics, abusing CLONE_THREAD. This breaks the non-concurrency assumptions when doing the mm->futex_ref pcpu allocations, leading to bugs[0] when sharing the mm in other ways; ie: BUG: KASAN: slab-use-after-free in futex_hash_put ... where the +1 bias can end up on a percpu counter that mm->futex_ref no longer points at. Loosen the check to cover any CLONE_VM clone, except vfork(). Excluding vfork keeps the existing paths untouched (no overhead), and we can't race in the first place: either the parent is suspended and the child runs alone, or mm->futex_ref is already allocated from an earlier CLONE_VM. Link: https://lore.kernel.org/all/CAL_bE8LsmCQ-FAtYDuwbJhOkt9p2wwYQwAbMh=PifC=VsiBM6A@mail.gmail.com/ [0] Fixes: d9b05321e21e ("futex: Move futex_hash_free() back to __mmput()") Reported-by: Yiming Qian <yimingqian591@gmail.com> Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23futex: Prevent lockup in requeue-PI during signal/ timeout wakeupSebastian Andrzej Siewior
[ Upstream commit bc7304f3ae20972d11db6e0b1b541c63feda5f05 ] During wait-requeue-pi (task A) and requeue-PI (task B) the following race can happen: Task A Task B futex_wait_requeue_pi() futex_setup_timer() futex_do_wait() futex_requeue() CLASS(hb, hb1)(&key1); CLASS(hb, hb2)(&key2); *timeout* futex_requeue_pi_wakeup_sync() requeue_state = Q_REQUEUE_PI_IGNORE *blocks on hb->lock* futex_proxy_trylock_atomic() futex_requeue_pi_prepare() Q_REQUEUE_PI_IGNORE => -EAGAIN double_unlock_hb(hb1, hb2) *retry* Task B acquires both hb locks and attempts to acquire the PI-lock of the top most waiter (task B). Task A is leaving early due to a signal/ timeout and started removing itself from the queue. It updates its requeue_state but can not remove it from the list because this requires the hb lock which is owned by task B. Usually task A is able to swoop the lock after task B unlocked it. However if task B is of higher priority then task A may not be able to wake up in time and acquire the lock before task B gets it again. Especially on a UP system where A is never scheduled. As a result task A blocks on the lock and task B busy loops, trying to make progress but live locks the system instead. Tragic. This can be fixed by removing the top most waiter from the list in this case. This allows task B to grab the next top waiter (if any) in the next iteration and make progress. Remove the top most waiter if futex_requeue_pi_prepare() fails. Let the waiter conditionally remove itself from the list in handle_early_requeue_pi_wakeup(). Fixes: 07d91ef510fb1 ("futex: Prevent requeue_pi() lock nesting issue on RT") Reported-by: Moritz Klammler <Moritz.Klammler@ferchau.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched/fair: Clear rel_deadline when initializing forked entitiesZicheng Qu
[ Upstream commit 3da56dc063cd77b9c0b40add930767fab4e389f3 ] A yield-triggered crash can happen when a newly forked sched_entity enters the fair class with se->rel_deadline unexpectedly set. The failing sequence is: 1. A task is forked while se->rel_deadline is still set. 2. __sched_fork() initializes vruntime, vlag and other sched_entity state, but does not clear rel_deadline. 3. On the first enqueue, enqueue_entity() calls place_entity(). 4. Because se->rel_deadline is set, place_entity() treats se->deadline as a relative deadline and converts it to an absolute deadline by adding the current vruntime. 5. However, the forked entity's deadline is not a valid inherited relative deadline for this new scheduling instance, so the conversion produces an abnormally large deadline. 6. If the task later calls sched_yield(), yield_task_fair() advances se->vruntime to se->deadline. 7. The inflated vruntime is then used by the following enqueue path, where the vruntime-derived key can overflow when multiplied by the entity weight. 8. This corrupts cfs_rq->sum_w_vruntime, breaks EEVDF eligibility calculation, and can eventually make all entities appear ineligible. pick_next_entity() may then return NULL unexpectedly, leading to a later NULL dereference. A captured trace shows the effect clearly. Before yield, the entity's vruntime was around: 9834017729983308 After yield_task_fair() executed: se->vruntime = se->deadline the vruntime jumped to: 19668035460670230 and the deadline was later advanced further to: 19668035463470230 This shows that the deadline had already become abnormally large before yield_task_fair() copied it into vruntime. rel_deadline is only meaningful when se->deadline really carries a relative deadline that still needs to be placed against vruntime. A freshly forked sched_entity should not inherit or retain this state. Clear se->rel_deadline in __sched_fork(), together with the other sched_entity runtime state, so that the first enqueue does not interpret the new entity's deadline as a stale relative deadline. Fixes: 82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'") Analyzed-by: Hui Tang <tanghui20@huawei.com> Analyzed-by: Zhang Qiao <zhangqiao22@huawei.com> Signed-off-by: Zicheng Qu <quzicheng@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260424071113.1199600-1-quzicheng@huawei.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched/fair: Fix wakeup_preempt_fair() vs delayed dequeueVincent Guittot
[ Upstream commit ac8e69e693631689d74d8f1ebee6f84f737f797f ] Similar to how pick_next_entity() must dequeue delayed entities, so too must wakeup_preempt_fair(). Any delayed task being found means it is eligible and hence past the 0-lag point, ready for removal. Worse, by not removing delayed entities from consideration, it can skew the preemption decision, with the end result that a short slice wakeup will not result in a preemption. tip/sched/core tip/sched/core +this patch cyclictest slice (ms) (default)2.8 8 8 hackbench slice (ms) (default)2.8 20 20 Total Samples | 22559 22595 22683 Average (us) | 157 64( 59%) 59( 8%) Median (P50) (us) | 57 57( 0%) 58(- 2%) 90th Percentile (us) | 64 60( 6%) 60( 0%) 99th Percentile (us) | 2407 67( 97%) 67( 0%) 99.9th Percentile (us) | 3400 2288( 33%) 727( 68%) Maximum (us) | 5037 9252(-84%) 7461( 19%) Fixes: f12e148892ed ("sched/fair: Prepare pick_next_task() for delayed dequeue") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260422093400.319251-1-vincent.guittot@linaro.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goalsMel Gorman
[ Upstream commit e837456fdca81899a3c8e47b3fd39e30eae6e291 ] Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected. Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption. For all other wakeups, preemption happens if the wakee has a earlier deadline than the waker and eligible to run. While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because the are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption. First is the dbench throughput data even though it is a poor metric but it is the default metric. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems. dbench4 Throughput (misleading but traditional) 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Hmean 1 1268.80 ( 0.00%) 1269.74 ( 0.07%) Hmean 4 3971.74 ( 0.00%) 3950.59 ( -0.53%) Hmean 7 5548.23 ( 0.00%) 5420.08 ( -2.31%) Hmean 12 7310.86 ( 0.00%) 7165.57 ( -1.99%) Hmean 21 8874.53 ( 0.00%) 9149.04 ( 3.09%) Hmean 30 9361.93 ( 0.00%) 10530.04 ( 12.48%) Hmean 48 9540.14 ( 0.00%) 11820.40 ( 23.90%) Hmean 79 9208.74 ( 0.00%) 12193.79 ( 32.42%) Hmean 110 8573.12 ( 0.00%) 11933.72 ( 39.20%) Hmean 141 7791.33 ( 0.00%) 11273.90 ( 44.70%) Hmean 160 7666.60 ( 0.00%) 10768.72 ( 40.46%) As throughput is misleading, the benchmark is modified to use a short loadfile report the completion time duration in milliseconds. dbench4 Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 14.62 ( 0.00%) 14.69 ( -0.46%) Amean 4 18.76 ( 0.00%) 18.85 ( -0.45%) Amean 7 23.71 ( 0.00%) 24.38 ( -2.82%) Amean 12 31.25 ( 0.00%) 31.87 ( -1.97%) Amean 21 45.12 ( 0.00%) 43.69 ( 3.16%) Amean 30 61.07 ( 0.00%) 54.33 ( 11.03%) Amean 48 95.91 ( 0.00%) 77.22 ( 19.49%) Amean 79 163.38 ( 0.00%) 123.08 ( 24.66%) Amean 110 243.91 ( 0.00%) 175.11 ( 28.21%) Amean 141 343.47 ( 0.00%) 239.10 ( 30.39%) Amean 160 401.15 ( 0.00%) 283.73 ( 29.27%) Stddev 1 0.52 ( 0.00%) 0.51 ( 2.45%) Stddev 4 1.36 ( 0.00%) 1.30 ( 4.04%) Stddev 7 1.88 ( 0.00%) 1.87 ( 0.72%) Stddev 12 3.06 ( 0.00%) 2.45 ( 19.83%) Stddev 21 5.78 ( 0.00%) 3.87 ( 33.06%) Stddev 30 9.85 ( 0.00%) 5.25 ( 46.76%) Stddev 48 22.31 ( 0.00%) 8.64 ( 61.27%) Stddev 79 35.96 ( 0.00%) 18.07 ( 49.76%) Stddev 110 59.04 ( 0.00%) 30.93 ( 47.61%) Stddev 141 85.38 ( 0.00%) 40.93 ( 52.06%) Stddev 160 96.38 ( 0.00%) 39.72 ( 58.79%) That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds does it take for all clients to complete a workfile. This one is tricky because dbench makes to effort to synchronise clients so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results. dbench4 All Clients Loadfile Execution Time 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 15.06 ( 0.00%) 15.07 ( -0.03%) Amean 4 603.81 ( 0.00%) 524.29 ( 13.17%) Amean 7 855.32 ( 0.00%) 1331.07 ( -55.62%) Amean 12 1890.02 ( 0.00%) 2323.97 ( -22.96%) Amean 21 3195.23 ( 0.00%) 2009.29 ( 37.12%) Amean 30 13919.53 ( 0.00%) 4579.44 ( 67.10%) Amean 48 25246.07 ( 0.00%) 5705.46 ( 77.40%) Amean 79 29701.84 ( 0.00%) 15509.26 ( 47.78%) Amean 110 22803.03 ( 0.00%) 23782.08 ( -4.29%) Amean 141 36356.07 ( 0.00%) 25074.20 ( 31.03%) Amean 160 17046.71 ( 0.00%) 13247.62 ( 22.29%) Stddev 1 0.47 ( 0.00%) 0.49 ( -3.74%) Stddev 4 395.24 ( 0.00%) 254.18 ( 35.69%) Stddev 7 467.24 ( 0.00%) 764.42 ( -63.60%) Stddev 12 1071.43 ( 0.00%) 1395.90 ( -30.28%) Stddev 21 1694.50 ( 0.00%) 1204.89 ( 28.89%) Stddev 30 7945.63 ( 0.00%) 2552.59 ( 67.87%) Stddev 48 14339.51 ( 0.00%) 3227.55 ( 77.49%) Stddev 79 16620.91 ( 0.00%) 8422.15 ( 49.33%) Stddev 110 12912.15 ( 0.00%) 13560.95 ( -5.02%) Stddev 141 20700.13 ( 0.00%) 14544.51 ( 29.74%) Stddev 160 9079.16 ( 0.00%) 7400.69 ( 18.49%) This is more of a mixed bag but it at least shows that fairness is not crippled. The hackbench results are more neutral but this is still important. It's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench. The WF_SYNC behaviour is important for these workloads and is why the WF_SYNC changes are not a separate patch. hackbench-process-pipes 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v5 Amean 1 0.2657 ( 0.00%) 0.2150 ( 19.07%) Amean 4 0.6107 ( 0.00%) 0.6060 ( 0.76%) Amean 7 0.7923 ( 0.00%) 0.7440 ( 6.10%) Amean 12 1.1500 ( 0.00%) 1.1263 ( 2.06%) Amean 21 1.7950 ( 0.00%) 1.7987 ( -0.20%) Amean 30 2.3207 ( 0.00%) 2.5053 ( -7.96%) Amean 48 3.5023 ( 0.00%) 3.9197 ( -11.92%) Amean 79 4.8093 ( 0.00%) 5.2247 ( -8.64%) Amean 110 6.1160 ( 0.00%) 6.6650 ( -8.98%) Amean 141 7.4763 ( 0.00%) 7.8973 ( -5.63%) Amean 172 8.9560 ( 0.00%) 9.3593 ( -4.50%) Amean 203 10.4783 ( 0.00%) 10.8347 ( -3.40%) Amean 234 12.4977 ( 0.00%) 13.0177 ( -4.16%) Amean 265 14.7003 ( 0.00%) 15.5630 ( -5.87%) Amean 296 16.1007 ( 0.00%) 17.4023 ( -8.08%) Processes using pipes are impacted but the variance (not presented) indicates it's close to noise and the results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented. Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes. hackbench-process-sockets 6.18-rc1 6.18-rc1 vanilla sched-preemptnext-v2 Amean 1 0.3073 ( 0.00%) 0.3263 ( -6.18%) Amean 4 0.7863 ( 0.00%) 0.7930 ( -0.85%) Amean 7 1.3670 ( 0.00%) 1.3537 ( 0.98%) Amean 12 2.1337 ( 0.00%) 2.1903 ( -2.66%) Amean 21 3.4683 ( 0.00%) 3.4940 ( -0.74%) Amean 30 4.7247 ( 0.00%) 4.8853 ( -3.40%) Amean 48 7.6097 ( 0.00%) 7.8197 ( -2.76%) Amean 79 14.7957 ( 0.00%) 16.1000 ( -8.82%) Amean 110 21.3413 ( 0.00%) 21.9997 ( -3.08%) Amean 141 29.0503 ( 0.00%) 29.0353 ( 0.05%) Amean 172 36.4660 ( 0.00%) 36.1433 ( 0.88%) Amean 203 39.7177 ( 0.00%) 40.5910 ( -2.20%) Amean 234 42.1120 ( 0.00%) 43.5527 ( -3.42%) Amean 265 45.7830 ( 0.00%) 50.0560 ( -9.33%) Amean 296 50.7043 ( 0.00%) 54.3657 ( -7.22%) As schbench has been mentioned in numerous bugs recently, the results are interesting. A test case that represents the default schbench behaviour is schbench Wakeup Latency (usec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v5 Amean Wakeup-50th-80 7.17 ( 0.00%) 6.00 ( 16.28%) Amean Wakeup-90th-80 46.56 ( 0.00%) 19.78 ( 57.52%) Amean Wakeup-99th-80 119.61 ( 0.00%) 89.94 ( 24.80%) Amean Wakeup-99.9th-80 3193.78 ( 0.00%) 328.22 ( 89.72%) schbench Requests Per Second (ops/sec) 6.18.0-rc1 6.18.0-rc1 vanilla sched-preemptnext-v5 Hmean RPS-20th-80 8900.91 ( 0.00%) 9176.78 ( 3.10%) Hmean RPS-50th-80 8987.41 ( 0.00%) 9217.89 ( 2.56%) Hmean RPS-90th-80 9123.73 ( 0.00%) 9273.25 ( 1.64%) Hmean RPS-max-80 9193.50 ( 0.00%) 9301.47 ( 1.17%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net Stable-dep-of: ac8e69e69363 ("sched/fair: Fix wakeup_preempt_fair() vs delayed dequeue") Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23tracing: branch: Fix inverted check on stat tracer registrationBreno Leitao
[ Upstream commit 3b75dd76e64a04771861bb5647951c264919e563 ] init_annotated_branch_stats() and all_annotated_branch_stats() check the return value of register_stat_tracer() with "if (!ret)", but register_stat_tracer() returns 0 on success and a negative errno on failure. The inverted check causes the warning to be printed on every successful registration, e.g.: Warning: could not register annotated branches stats while leaving real failures silent. The initcall also returned a hard-coded 1 instead of the actual error. Invert the check and propagate ret so that the warning fires on real errors and the initcall reports the correct status. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23cgroup: Increment nr_dying_subsys_* from rmdir contextPetr Malat
[ Upstream commit 13e786b64bd3fd81c7eb22aa32bf8305c32f2ccf ] Incrementing nr_dying_subsys_* in offline_css(), which is executed by cgroup_offline_wq worker, leads to a race where user can see the value to be 0 if he reads cgroup.stat after calling rmdir and before the worker executes. This makes the user wrongly expect resources released by the removed cgroup to be available for a new assignment. Increment nr_dying_subsys_* from kill_css(), which is called from the cgroup_rmdir() context. Fixes: ab0312526867 ("cgroup: Show # of subsystem CSSes in cgroup.stat") Signed-off-by: Petr Malat <oss@malat.biz> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23cgroup/rdma: fix integer overflow in rdmacg_try_charge()cuitao
[ Upstream commit c802f460dd485c1332b5a35e7adcfb2bc22536a2 ] The expression `rpool->resources[index].usage + 1` is computed in int arithmetic before being assigned to s64 variable `new`. When usage equals INT_MAX (the default "max" value), the addition overflows to INT_MIN. This negative value then passes the `new > max` check incorrectly, allowing a charge that should be rejected and corrupting usage to negative. Fix by casting usage to s64 before the addition so the arithmetic is done in 64-bit. Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller") Signed-off-by: cuitao <cuitao@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23sched/psi: fix race between file release and pressure writeEdward Adam Davis
[ Upstream commit a5b98009f16d8a5fb4a8ff9a193f5735515c38fa ] A potential race condition exists between pressure write and cgroup file release regarding the priv member of struct kernfs_open_file, which triggers the uaf reported in [1]. Consider the following scenario involving execution on two separate CPUs: CPU0 CPU1 ==== ==== vfs_rmdir() kernfs_iop_rmdir() cgroup_rmdir() cgroup_kn_lock_live() cgroup_destroy_locked() cgroup_addrm_files() cgroup_rm_file() kernfs_remove_by_name() kernfs_remove_by_name_ns() vfs_write() __kernfs_remove() new_sync_write() kernfs_drain() kernfs_fop_write_iter() kernfs_drain_open_files() cgroup_file_write() kernfs_release_file() pressure_write() cgroup_file_release() ctx = of->priv; kfree(ctx); of->priv = NULL; cgroup_kn_unlock() cgroup_kn_lock_live() cgroup_get(cgrp) cgroup_kn_unlock() if (ctx->psi.trigger) // here, trigger uaf for ctx, that is of->priv The cgroup_rmdir() is protected by the cgroup_mutex, it also safeguards the memory deallocation of of->priv performed within cgroup_file_release(). However, the operations involving of->priv executed within pressure_write() are not entirely covered by the protection of cgroup_mutex. Consequently, if the code in pressure_write(), specifically the section handling the ctx variable executes after cgroup_file_release() has completed, a uaf vulnerability involving of->priv is triggered. Therefore, the issue can be resolved by extending the scope of the cgroup_mutex lock within pressure_write() to encompass all code paths involving of->priv, thereby properly synchronizing the race condition occurring between cgroup_file_release() and pressure_write(). And, if an live kn lock can be successfully acquired while executing the pressure write operation, it indicates that the cgroup deletion process has not yet reached its final stage; consequently, the priv pointer within open_file cannot be NULL. Therefore, the operation to retrieve the ctx value must be moved to a point *after* the live kn lock has been successfully acquired. In another situation, specifically after entering cgroup_kn_lock_live() but before acquiring cgroup_mutex, there exists a different class of race condition: CPU0: write memory.pressure CPU1: write cgroup.pressure=0 =========================== ============================= kernfs_fop_write_iter() kernfs_get_active_of(of) pressure_write() cgroup_kn_lock_live(memory.pressure) cgroup_tryget(cgrp) kernfs_break_active_protection(kn) ... blocks on cgroup_mutex cgroup_pressure_write() cgroup_kn_lock_live(cgroup.pressure) cgroup_file_show(memory.pressure, false) kernfs_show(false) kernfs_drain_open_files() cgroup_file_release(of) kfree(ctx) of->priv = NULL cgroup_kn_unlock() ... acquires cgroup_mutex ctx = of->priv; // may now be NULL if (ctx->psi.trigger) // NULL dereference Consequently, there is a possibility that of->priv is NULL, the pressure write needs to check for this. Now that the scope of the cgroup_mutex has been expanded, the original explicit cgroup_get/put operations are no longer necessary, this is because acquiring/releasing the live kn lock inherently executes a cgroup get/put operation. [1] BUG: KASAN: slab-use-after-free in pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011 Call Trace: pressure_write+0xa4/0x210 kernel/cgroup/cgroup.c:4011 cgroup_file_write+0x36f/0x790 kernel/cgroup/cgroup.c:4311 kernfs_fop_write_iter+0x3b0/0x540 fs/kernfs/file.c:352 Allocated by task 9352: cgroup_file_open+0x90/0x3a0 kernel/cgroup/cgroup.c:4256 kernfs_fop_open+0x9eb/0xcb0 fs/kernfs/file.c:724 do_dentry_open+0x83d/0x13e0 fs/open.c:949 Freed by task 9353: cgroup_file_release+0xd6/0x100 kernel/cgroup/cgroup.c:4283 kernfs_release_file fs/kernfs/file.c:764 [inline] kernfs_drain_open_files+0x392/0x720 fs/kernfs/file.c:834 kernfs_drain+0x470/0x600 fs/kernfs/dir.c:525 Fixes: 0e94682b73bf ("psi: introduce psi monitor") Reported-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33e571025d88efd1312c Tested-by: syzbot+33e571025d88efd1312c@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-05-23bpf: Validate node_id in arena_alloc_pages()Puranjay Mohan
[ Upstream commit 2845989f2ebaf7848e4eccf9a779daf3156ea0a5 ] arena_alloc_pages() accepts a plain int node_id and forwards it through the entire allocation chain without any bounds checking. Validate node_id before passing it down the allocation chain in arena_alloc_pages(). Fixes: 317460317a02 ("bpf: Introduce bpf_arena.") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260417152135.1383754-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>