summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-12-18cpu: Make atomic hotplug callbacks run with interrupts disabled on UPSebastian Andrzej Siewior
[ Upstream commit c94291914b200e10c72cef23c8e4c67eb4fdbcd9 ] On SMP systems the CPU hotplug callbacks in the "starting" range are invoked while the CPU is brought up and interrupts are still disabled. Callbacks which are added later are invoked via the hotplug-thread on the target CPU and interrupts are explicitly disabled. In the UP case callbacks which are added later are invoked directly without the thread indirection. This is in principle okay since there is just one CPU but those callbacks are invoked with interrupt disabled code. That's incorrect as those callbacks assume interrupt disabled context. Disable interrupts before invoking the callbacks on UP if the state is atomic and interrupts are expected to be disabled. The "save" part is required because this is also invoked early in the boot process while interrupts are disabled and must not be enabled prematurely. Fixes: 06ddd17521bf1 ("sched/smp: Always define is_percpu_thread() and scheduler_ipi()") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251127144723.ev9DuXXR@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18perf/core: Fix missing read event generation on task exitThaumy Cheng
[ Upstream commit c418d8b4d7a43a86b82ee39cb52ece3034383530 ] For events with inherit_stat enabled, a "read" event will be generated to collect per task event counts on task exit. The call chain is as follows: do_exit -> perf_event_exit_task -> perf_event_exit_task_context -> perf_event_exit_event -> perf_remove_from_context -> perf_child_detach -> sync_child_event -> perf_event_read_event However, the child event context detaches the task too early in perf_event_exit_task_context, which causes sync_child_event to never generate the read event in this case, since child_event->ctx->task is always set to TASK_TOMBSTONE. Fix that by moving context lock section backward to ensure ctx->task is not set to TASK_TOMBSTONE before generating the read event. Because perf_event_free_task calls perf_event_exit_task_context with exit = false to tear down all child events from the context, and the task never lived, accessing the task PID can lead to a use-after-free. To fix that, let sync_child_event read task from argument and move the call to the only place it should be triggered to avoid the effect of setting ctx->task to TASK_TOMESTONE, and add a task parameter to perf_event_exit_event to trigger the sync_child_event properly when needed. This bug can be reproduced by running "perf record -s" and attaching to any program that generates perf events in its child tasks. If we check the result with "perf report -T", the last line of the report will leave an empty table like "# PID TID", which is expected to contain the per-task event counts by design. Fixes: ef54c1a476ae ("perf: Rework perf_event_exit_event()") Signed-off-by: Thaumy Cheng <thaumy.love@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: linux-perf-users@vger.kernel.org Link: https://patch.msgid.link/20251209041600.963586-1-thaumy.love@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18dma/pool: eliminate alloc_pages warning in atomic_pool_expandDave Kleikamp
[ Upstream commit 463d439becb81383f3a5a5d840800131f265a09c ] atomic_pool_expand iteratively tries the allocation while decrementing the page order. There is no need to issue a warning if an attempted allocation fails. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Fixes: d7e673ec2c8e ("dma-pool: Only allocate from CMA when in same memory zone") [mszyprow: fixed typo] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251202152810.142370-1-dave.kleikamp@oracle.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18sched/core: Fix psi_dequeue() for Proxy ExecutionJohn Stultz
[ Upstream commit c2ae8b0df2d1bb7a063f9e356e4e9a06cd4afe11 ] Currently, if the sleep flag is set, psi_dequeue() doesn't change any of the psi_flags. This is because psi_task_switch() will clear TSK_ONCPU as well as other potential flags (TSK_RUNNING), and the assumption is that a voluntary sleep always consists of a task being dequeued followed shortly there after with a psi_sched_switch() call. Proxy Execution changes this expectation, as mutex-blocked tasks that would normally sleep stay on the runqueue. But in the case where the mutex-owning task goes to sleep, or the owner is on a remote cpu, we will then deactivate the blocked task shortly after. In that situation, the mutex-blocked task will have had its TSK_ONCPU cleared when it was switched off the cpu, but it will stay TSK_RUNNING. Then if we later dequeue it (as currently done if we hit a case find_proxy_task() can't yet handle, such as the case of the owner being on another rq or a sleeping owner) psi_dequeue() won't change any state (leaving it TSK_RUNNING), as it incorrectly expects a psi_task_switch() call to immediately follow. Later on when the task get woken/re-enqueued, and psi_flags are set for TSK_RUNNING, we hit an error as the task is already TSK_RUNNING: psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4 To resolve this, extend the logic in psi_dequeue() so that if the sleep flag is set, we also check if psi_flags have TSK_ONCPU set (meaning the psi_task_switch is imminent) before we do the shortcut return. If TSK_ONCPU is not set, that means we've already switched away, and this psi_dequeue call needs to clear the flags. Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function") Reported-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Haiyue Wang <haiyuewa@163.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://patch.msgid.link/20251205012721.756394-1-jstultz@google.com Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@amd.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the ↵xupengbo
last task migrates out [ Upstream commit ca125231dd29fc0678dd3622e9cdea80a51dffe4 ] When a task is migrated out, there is a probability that the tg->load_avg value will become abnormal. The reason is as follows: 1. Due to the 1ms update period limitation in update_tg_load_avg(), there is a possibility that the reduced load_avg is not updated to tg->load_avg when a task migrates out. 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key function cfs_rq_is_decayed() does not check whether cfs->tg_load_avg_contrib is null. Consequently, in some cases, __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been updated to tg->load_avg. Add a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed(), which fixes the case (2.) mentioned above. Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg") Signed-off-by: xupengbo <xupengbo@oppo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Aaron Lu <ziqianlu@bytedance.com> Link: https://patch.msgid.link/20250827022208.14487-1-xupengbo@oppo.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18rqspinlock: Use trylock fallback when per-CPU rqnode is busyKumar Kartikeya Dwivedi
[ Upstream commit 81d5a6a438595e46be191d602e5c2d6d73992fdc ] In addition to deferring to the trylock fallback in NMIs, only do so when an rqspinlock waiter is queued on the current CPU. This is detected by noticing a non-zero node index. This allows NMI waiters to join the waiter queue if it isn't interrupting an existing rqspinlock waiter, and increase the chances of fairly obtaining the lock, performing deadlock detection as the head, and not being starved while attempting the trylock. The trylock path in particular is unlikely to succeed under contention, as it relies on the lock word becoming 0, which indicates no contention. This means that the most likely result for NMIs attempting a trylock is a timeout under contention if they don't hit an AA or ABBA case. The core problem being addressed through the fixed commit was removing the dependency edge between an NMI queue waiter and the queue waiter it is interrupting. Whenever a circular dependency forms, and with no way to break it (as non-head waiters don't poll for deadlocks or timeouts), we would enter into a deadlock. A trylock either breaks such an edge by probing for deadlocks, and finally terminating the waiting loop using a timeout. By excluding queueing on CPUs where the node index is non-zero for NMIs, this sort of dependency is broken. The CPU enters the trylock path for those cases, and falls back to deadlock checks and timeouts. However, in other case where it doesn't interrupt the CPU in the slow path while its queued on the lock, it can join the queue as a normal waiter, and avoid trylock associated starvation and subsequent timeouts. There are a few remaining cases here that matter: the NMI can still preempt the owner in its critical section, and if it queues as a non-head waiter, it can end up impeding the progress of the owner. While this won't deadlock, since the head waiter will eventually signal the NMI waiter to either stop (due to a timeout), it can still lead to long timeouts. These gaps will be addressed in subsequent commits. Note that while the node count detection approach is less conservative than simply deferring NMIs to trylock, it is going to return errors where attempts to lock B in NMI happen while waiters for lock A are in a lower context on the same CPU. However, this only occurs when the lower context is queued in the slow path, and the NMI attempt can proceed without failure in all other cases. To continue to prevent AA deadlocks (or ABBA in a similar NMI interrupting lower context pattern), we'd need a more fleshed out algorithm to unlink NMI waiters after they queue and detect such cases. However, all that complexity isn't appealing yet to reduce the failure rate in the small window inside the slow path. It is important to note that reentrancy in the slow path can also happen through trace_contention_{begin,end}, but in those cases, unlike an NMI, the forward progress of the head waiter (or the predecessor in general) is not being blocked. Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18rqspinlock: Enclose lock/unlock within lock entry acquisitionsKumar Kartikeya Dwivedi
[ Upstream commit beb7021a6003d9c6a463fffca0d6311efb8e0e66 ] Ritesh reported that timeouts occurred frequently for rqspinlock despite reentrancy on the same lock on the same CPU in [0]. This patch closes one of the races leading to this behavior, and reduces the frequency of timeouts. We currently have a tiny window between the fast-path cmpxchg and the grabbing of the lock entry where an NMI could land, attempt the same lock that was just acquired, and end up timing out. This is not ideal. Instead, move the lock entry acquisition from the fast path to before the cmpxchg, and remove the grabbing of the lock entry in the slow path, assuming it was already taken by the fast path. The TAS fallback is invoked directly without being preceded by the typical fast path, therefore we must continue to grab the deadlock detection entry in that case. Case on lock leading to missed AA: cmpxchg lock A <NMI> ... rqspinlock acquisition of A ... timeout </NMI> grab_held_lock_entry(A) There is a similar case when unlocking the lock. If the NMI lands between the WRITE_ONCE and smp_store_release, it is possible that we end up in a situation where the NMI fails to diagnose the AA condition, leading to a timeout. Case on unlock leading to missed AA: WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) <NMI> ... rqspinlock acquisition of A ... timeout </NMI> smp_store_release(A->locked, 0) The patch changes the order on unlock to smp_store_release() succeeded by WRITE_ONCE() of NULL. This avoids the missed AA detection described above, but may lead to a false positive if the NMI lands between these two statements, which is acceptable (and preferred over a timeout). The original intention of the reverse order on unlock was to prevent the following possible misdiagnosis of an ABBA scenario: grab entry A lock A grab entry B lock B unlock B smp_store_release(B->locked, 0) grab entry B lock B grab entry A lock A ! <detect ABBA> WRITE_ONCE(rqh->locks[rqh->cnt - 1], NULL) If the store release were is after the WRITE_ONCE, the other CPU would not observe B in the table of the CPU unlocking the lock B. However, since the threads are obviously participating in an ABBA deadlock, it is no longer appealing to use the order above since it may lead to a 250 ms timeout due to missed AA detection. [0]: https://lore.kernel.org/bpf/CAH6OuBTjG+N=+GGwcpOUbeDN563oz4iVcU3rbse68egp9wj9_A@mail.gmail.com Fixes: 0d80e7f951be ("rqspinlock: Choose trylock fallback for NMI waiters") Reported-by: Ritesh Oedayrajsingh Varma <ritesh@superluminal.eu> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251128232802.1031906-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"Ilias Stamatis
[ Upstream commit 6fb3acdebf65a72df0a95f9fd2c901ff2bc9a3a2 ] Commit 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") removed an optimization introduced by commit 756398750e11 ("resource: avoid unnecessary lookups in find_next_iomem_res()"). That was not called out in the message of the first commit explicitly so it's not entirely clear whether removing the optimization happened inadvertently or not. As the original commit message of the optimization explains there is no point considering the children of a subtree in find_next_iomem_res() if the top level range does not match. Reinstating the optimization results in performance improvements in systems where /proc/iomem is ~5k lines long. Calling mmap() on /dev/mem in such platforms takes 700-1500μs without the optimisation and 10-50μs with the optimisation. Note that even though commit 97523a4edb7b removed the 'sibling_only' parameter from next_resource(), newer kernels have basically reinstated it under the name 'skip_children'. Link: https://lore.kernel.org/all/20251124165349.3377826-1-ilstam@amazon.com/T/#u Fixes: 97523a4edb7b ("kernel/resource: remove first_lvl / siblings_only logic") Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix exclusive map memory leakEdward Adam Davis
[ Upstream commit 688b745401ab16e2e1a3b504863f0a45fd345638 ] When excl_prog_hash is 0 and excl_prog_hash_size is non-zero, the map also needs to be freed. Otherwise, the map memory will not be reclaimed, just like the memory leak problem reported by syzbot [1]. syzbot reported: BUG: memory leak backtrace (crc 7b9fb9b4): map_create+0x322/0x11e0 kernel/bpf/syscall.c:1512 __sys_bpf+0x3556/0x3610 kernel/bpf/syscall.c:6131 Fixes: baefdbdf6812 ("bpf: Implement exclusive map creation") Reported-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=cf08c551fecea9fd1320 Tested-by: syzbot+cf08c551fecea9fd1320@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/tencent_3F226F882CE56DCC94ACE90EED1ECCFC780A@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: properly verify tail call behaviorMartin Teichmann
[ Upstream commit e3245f8990431950d20631c72236d4e8cb2dcde8 ] A successful ebpf tail call does not return to the caller, but to the caller-of-the-caller, often just finishing the ebpf program altogether. Any restrictions that the verifier needs to take into account - notably the fact that the tail call might have modified packet pointers - are to be checked on the caller-of-the-caller. Checking it on the caller made the verifier refuse perfectly fine programs that would use the packet pointers after a tail call, which is no problem as this code is only executed if the tail call was unsuccessful, i.e. nothing happened. This patch simulates the behavior of a tail call in the verifier. A conditional jump to the code after the tail call is added for the case of an unsucessful tail call, and a return to the caller is simulated for a successful tail call. For the successful case we assume that the tail call returns an int, as tail calls are currently only allowed in functions that return and int. We always assume that the tail call modified the packet pointers, as we do not know what the tail call did. For the unsuccessful case we know nothing happened, so we do not need to add new constraints. This approach also allows to check other problems that may occur with tail calls, namely we are now able to check that precision is properly propagated into subprograms using tail calls, as well as checking the live slots in such a subprogram. Fixes: 1a4607ffba35 ("bpf: consider that tail calls invalidate packet pointers") Link: https://lore.kernel.org/bpf/20251029105828.1488347-1-martin.teichmann@xfel.eu/ Signed-off-by: Martin Teichmann <martin.teichmann@xfel.eu> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251119160355.1160932-2-martin.teichmann@xfel.eu Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18cpuset: Treat cpusets in attaching as populatedChen Ridong
[ Upstream commit b1bcaed1e39a9e0dfbe324a15d2ca4253deda316 ] Currently, the check for whether a partition is populated does not account for tasks in the cpuset of attaching. This is a corner case that can leave a task stuck in a partition with no effective CPUs. The race condition occurs as follows: cpu0 cpu1 //cpuset A with cpu N migrate task p to A cpuset_can_attach // with effective cpus // check ok // cpuset_mutex is not held // clear cpuset.cpus.exclusive // making effective cpus empty update_exclusive_cpumask // tasks_nocpu_error check ok // empty effective cpus, partition valid cpuset_attach ... // task p stays in A, with non-effective cpus. To fix this issue, this patch introduces cs_is_populated, which considers tasks in the attaching cpuset. This new helper is used in validate_change and partition_is_populated. Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix invalid prog->stats access when update_effective_progs failsPu Lehui
[ Upstream commit 7dc211c1159d991db609bdf4b0fb9033c04adcbc ] Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows: __cgroup_bpf_detach update_effective_progs compute_effective_progs bpf_prog_array_alloc <-- fault inject purge_effective_progs /* change to dummy_bpf_prog */ array->items[index] = &dummy_bpf_prog.prog ---softirq start--- __do_softirq ... __cgroup_bpf_run_filter_skb __bpf_prog_run_save_cb bpf_prog_run stats = this_cpu_ptr(prog->stats) /* invalid memory access */ flags = u64_stats_update_begin_irqsave(&stats->syncp) ---softirq end--- static_branch_dec(&cgroup_bpf_enabled_key[atype]) The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access. To fix it, skip updating stats when stats is NULL. Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Handle return value of ftrace_set_filter_ip in register_fentryMenglong Dong
[ Upstream commit fea3f5e83c5cd80a76d97343023a2f2e6bd862bf ] The error that returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Prevent nesting overflow in bpf_try_get_buffersSahil Chandna
[ Upstream commit c1da3df7191f1b4df9256bcd30d78f78201e1d17 ] bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/ Fixes: 4223bf833c849 ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Free special fields when update [lru_,]percpu_hash mapsLeon Hwang
[ Upstream commit 6af6e49a76c9af7d42eb923703e7648cb2bf401a ] As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18locktorture: Fix memory leak in param_set_cpumask()Wang Liang
[ Upstream commit e52b43883d084a9af263c573f2a1bd1ca5088389 ] With CONFIG_CPUMASK_OFFSTACK=y, the 'bind_writers' buffer is allocated via alloc_cpumask_var() in param_set_cpumask(). But it is not freed, when setting the module parameter multiple times by sysfs interface or removing module. Below kmemleak trace is seen for this issue: unreferenced object 0xffff888100aabff8 (size 8): comm "bash", pid 323, jiffies 4295059233 hex dump (first 8 bytes): 07 00 00 00 00 00 00 00 ........ backtrace (crc ac50919): __kmalloc_node_noprof+0x2e5/0x420 alloc_cpumask_var_node+0x1f/0x30 param_set_cpumask+0x26/0xb0 [locktorture] param_attr_store+0x93/0x100 module_attr_store+0x1b/0x30 kernfs_fop_write_iter+0x114/0x1b0 vfs_write+0x300/0x410 ksys_write+0x60/0xd0 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x77/0x7f This issue can be reproduced by: insmod locktorture.ko bind_writers=1 rmmod locktorture or: insmod locktorture.ko bind_writers=1 echo 2 > /sys/module/locktorture/parameters/bind_writers Considering that setting the module parameter 'bind_writers' or 'bind_readers' by sysfs interface has no real effect, set the parameter permissions to 0444. To fix the memory leak when removing module, free 'bind_writers' and 'bind_readers' memory in lock_torture_cleanup(). Fixes: 73e341242483 ("locktorture: Add readers_bind and writers_bind module parameters") Suggested-by: Zhang Changzhong <zhangchangzhong@huawei.com> Signed-off-by: Wang Liang <wangliang74@huawei.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18timers/migration: Fix imbalanced NUMA treesFrederic Weisbecker
[ Upstream commit 5eb579dfd46b4949117ecb0f1ba2f12d3dc9a6f2 ] When a CPU from a new node boots, the old root may happen to be connected to the new root even if their node mismatch, as depicted in the following scenario: 1) CPU 0 boots and creates the first group for node 0. [GRP0:0] node 0 | CPU 0 2) CPU 1 from node 1 boots and creates a new top that corresponds to node 1, but it also connects the old root from node 0 to the new root from node 1 by mistake. [GRP1:0] node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 3) This eventually leads to an imbalanced tree where some node 0 CPUs migrate node 1 timers (and vice versa) way before reaching the crossnode groups, resulting in more frequent remote memory accesses than expected. [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 1 node 0 / \ | / \ [...] [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... A balanced tree should only contain groups having children that belong to the same node: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:0] node 0 node 1 / \ / \ / \ / \ [GRP0:0] [...] [...] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... In order to fix this, the hierarchy must be unfolded up to the crossnode level as soon as a node mismatch is detected. For example the stage 2 above should lead to this layout: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 0 node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 This means that not only GRP1:0 must be created but also GRP1:1 and GRP2:0 in order to prepare a balanced tree for next CPUs to boot. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18timers/migration: Remove locking on group connectionFrederic Weisbecker
[ Upstream commit fa9620355d4192200f15cb3d97c6eb9c02442249 ] Initializing the tmc's group, the group's number of children and the group's parent can all be done without locking because: 1) Reading the group's parent and its group mask is done locklessly. 2) The connections prepared for a given CPU hierarchy are visible to the target CPU once online, thanks to the CPU hotplug enforced memory ordering. 3) In case of a newly created upper level, the new root and its connections and initialization are made visible by the CPU which made the connections. When that CPUs goes idle in the future, the new link is published by tmigr_inactive_up() through the atomic RmW on ->migr_state. 4) If CPUs were still walking up the active hierarchy, they could observe the new root earlier. In this case the ordering is enforced by an early initialization of the group mask and by barriers that maintain address dependency as explained in: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit") de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers") 5) Timers are propagated by a chain of group locking from the bottom to the top. And while doing so, the tree also propagates groups links and initialization. Therefore remote expiration, which also relies on group locking, will observe those links and initialization while holding the root lock before walking the tree remotely and update remote timers. This is especially important for migrators in the active hierarchy that may observe the new root early. Therefore the locking is unnecessary at initialization. If anything, it just brings confusion. Remove it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org Stable-dep-of: 5eb579dfd46b ("timers/migration: Fix imbalanced NUMA trees") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18timers/migration: Convert "while" loops to use "for"Frederic Weisbecker
[ Upstream commit 6c181b5667eea3e6564d334443536a5974190e15 ] Both the "do while" and "while" loops in tmigr_setup_groups() eventually mimic the behaviour of "for" loops. Simplify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org Stable-dep-of: 5eb579dfd46b ("timers/migration: Fix imbalanced NUMA trees") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18cgroup: add cgroup namespace to tree after owner is setChristian Brauner
[ Upstream commit 768b1565d9d1e1eebf7567f477f7f46c05a98a4d ] Otherwise we trip VFS_WARN_ON_ONC() in __ns_tree_add_raw(). Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-6-2e6f823ebdc0@kernel.org Fixes: 7c6059398533 ("cgroup: support ns lookup") Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18task_work: Fix NMI race conditionPeter Zijlstra
[ Upstream commit ef1ea98c8fffe227e5319215d84a53fa2a4bcebc ] __schedule() // disable irqs <NMI> task_work_add(current, work, TWA_NMI_CURRENT); </NMI> // current = next; // enable irqs <IRQ> task_work_set_notify_irq() test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task! </IRQ> // original task skips task work on its next return to user (or exit!) Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.") Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080118.425949403@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix stackmap overflow check in __bpf_get_stackid()Arnaud Lecomte
[ Upstream commit 23f852daa4bab4d579110e034e4d513f7d490846 ] Syzkaller reported a KASAN slab-out-of-bounds write in __bpf_get_stackid() when copying stack trace data. The issue occurs when the perf trace contains more stack entries than the stack map bucket can hold, leading to an out-of-bounds write in the bucket's data array. Fixes: ee2a098851bf ("bpf: Adjust BPF stack helper functions to accommodate skip > 0") Reported-by: syzbot+c9b724fbb41cf2538b7b@syzkaller.appspotmail.com Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192941.1500-1-contact@arnaud-lcm.com Closes: https://syzkaller.appspot.com/bug?extid=c9b724fbb41cf2538b7b Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Refactor stack map trace depth calculation into helper functionArnaud Lecomte
[ Upstream commit e17d62fedd10ae56e2426858bd0757da544dbc73 ] Extract the duplicated maximum allowed depth computation for stack traces stored in BPF stacks from bpf_get_stackid() and __bpf_get_stack() into a dedicated stack_map_calculate_max_depth() helper function. This unifies the logic for: - The max depth computation - Enforcing the sysctl_perf_event_max_stack limit No functional changes for existing code paths. Signed-off-by: Arnaud Lecomte <contact@arnaud-lcm.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20251025192858.31424-1-contact@arnaud-lcm.com Stable-dep-of: 23f852daa4ba ("bpf: Fix stackmap overflow check in __bpf_get_stackid()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18sched/fair: Forfeit vruntime on yieldFernand Sieber
[ Upstream commit 79104becf42baeeb4a3f2b106f954b9fc7c10a3c ] If a task yields, the scheduler may decide to pick it again. The task in turn may decide to yield immediately or shortly after, leading to a tight loop of yields. If there's another runnable task as this point, the deadline will be increased by the slice at each loop. This can cause the deadline to runaway pretty quickly, and subsequent elevated run delays later on as the task doesn't get picked again. The reason the scheduler can pick the same task again and again despite its deadline increasing is because it may be the only eligible task at that point. Fix this by making the task forfeiting its remaining vruntime and pushing the deadline one slice ahead. This implements yield behavior more authentically. We limit the forfeiting to eligible tasks. This is because core scheduling prefers running ineligible tasks rather than force idling. As such, without the condition, we can end up on a yield loop which makes the vruntime increase rapidly, leading to anomalous run delays later down the line. Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250401123622.584018-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250911095113.203439-1-sieberf@amazon.com Link: https://lore.kernel.org/r/20250916140228.452231-1-sieberf@amazon.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix handling maps with no BTF and non-constant offsets for the bpf_wqMykyta Yatsenko
[ Upstream commit 5f8d41172931a92339c5cce81a3142065fa56e45 ] Fix handling maps with no BTF and non-constant offsets for the bpf_wq. This de-duplicates logic with other internal structs (task_work, timer), keeps error reporting consistent, and makes future changes to the layout handling centralized. Fixes: d940c9b94d7e ("bpf: add support for KF_ARG_PTR_TO_WORKQUEUE") Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251010164606.147298-1-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Fix sleepable context for async callbacksKumar Kartikeya Dwivedi
[ Upstream commit 469d638d1520a9332cd0d034690e75e845610a51 ] Fix the BPF verifier to correctly determine the sleepable context of async callbacks based on the async primitive type rather than the arming program's context. The bug is in in_sleepable() which uses OR logic to check if the current execution context is sleepable. When a sleepable program arms a timer callback, the callback's state correctly has in_sleepable=false, but in_sleepable() would still return true due to env->prog->sleepable being true. This incorrectly allows sleepable helpers like bpf_copy_from_user() inside timer callbacks when armed from sleepable programs, even though timer callbacks always execute in non-sleepable context. Fix in_sleepable() to rely solely on env->cur_state->in_sleepable, and initialize state->in_sleepable to env->prog->sleepable in do_check_common() for the main program entry. This ensures the sleepable context is properly tracked per verification state rather than being overridden by the program's sleepability. The env->cur_state NULL check in in_sleepable() was only needed for do_misc_fixups() which runs after verification when env->cur_state is set to NULL. Update do_misc_fixups() to use env->prog->sleepable directly for the storage_get_function check, and remove the redundant NULL check from in_sleepable(). Introduce is_async_cb_sleepable() helper to explicitly determine async callback sleepability based on the primitive type: - bpf_timer callbacks are never sleepable - bpf_wq and bpf_task_work callbacks are always sleepable Add verifier_bug() check to catch unhandled async callback types, ensuring future additions cannot be silently mishandled. Move the is_task_work_add_kfunc() forward declaration to the top alongside other callback-related helpers. We update push_async_cb() to adjust to the new changes. At the same time, while simplifying in_sleepable(), we notice a problem in do_misc_fixups. Fix storage_get helpers to use GFP_ATOMIC when called from non-sleepable contexts within sleepable programs, such as bpf_timer callbacks. Currently, the check in do_misc_fixups assumes that env->prog->sleepable, previously in_sleepable(env) which only resolved to this check before last commit, holds across the program's execution, but that is not true. Instead, the func_atomic bit must be set whenever we see the function being called in an atomic context. Previously, this is being done when the helper is invoked in atomic contexts in sleepable programs, we can simply just set the value to true without doing an in_sleepable() check. We must also do a standalone in_sleepable() check to handle cases where the async callback itself is armed from a sleepable program, but is itself non-sleepable (e.g., timer callback) and invokes such a helper, thus needing the func_atomic bit to be true for the said call. Adjust do_misc_fixups() to drop any checks regarding sleepable nature of the program, and just depend on the func_atomic bit to decide which GFP flag to pass. Fixes: 81f1d7a583fa ("bpf: wq: add bpf_wq_set_callback_impl") Fixes: b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251007220349.3852807-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-18bpf: Cleanup unused func args in rqspinlock implementationSiddharth Chintamaneni
[ Upstream commit 56b4d162392dda2365fbc1f482184a24b489d07d ] cleanup unused function args in check_deadlock* functions. Fixes: 31158ad02ddb ("rqspinlock: Add deadlock detection and recovery") Signed-off-by: Siddharth Chintamaneni <sidchintamaneni@gmail.com> Reviewed-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251001172702.122838-1-sidchintamaneni@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-12-12locking/spinlock/debug: Fix data-race in do_raw_write_lockAlexander Sverdlin
commit c14ecb555c3ee80eeb030a4e46d00e679537f03a upstream. KCSAN reports: BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1: do_raw_write_lock+0x120/0x204 _raw_write_lock_irq do_exit call_usermodehelper_exec_async ret_from_fork read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0: do_raw_write_lock+0x88/0x204 _raw_write_lock_irq do_exit call_usermodehelper_exec_async ret_from_fork value changed: 0xffffffff -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111 Commit 1a365e822372 ("locking/spinlock/debug: Fix various data races") has adressed most of these races, but seems to be not consistent/not complete. >From do_raw_write_lock() only debug_write_lock_after() part has been converted to WRITE_ONCE(), but not debug_write_lock_before() part. Do it now. Fixes: 1a365e822372 ("locking/spinlock/debug: Fix various data races") Reported-by: Adrian Freihofer <adrian.freihofer@siemens.com> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-30Merge tag 'timers_urgent_for_v6.18_rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Borislav Petkov: - Have timekeeping aux clocks sysfs interface setup function return an error code on failure instead of success * tag 'timers_urgent_for_v6.18_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix error code in tk_aux_sysfs_init()
2025-11-27Merge tag 'dma-mapping-6.18-2025-11-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two last minute fixes for the recently modified DMA API infrastructure: - proper handling of DMA_ATTR_MMIO in dma_iova_unlink() function (me) - regression fix for the code refactoring related to P2PDMA (Pranjal Shrivastava)" * tag 'dma-mapping-6.18-2025-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappings iommu/dma: add missing support for DMA_ATTR_MMIO for dma_iova_unlink()
2025-11-26Merge tag 'trace-ringbuffer-v6.18-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fix from Steven Rostedt: - Do not allow mmapped ring buffer to be split When the ring buffer VMA is split by a partial munmap or a MAP_FIXED, the kernel calls vm_ops->close() on each portion. This causes the ring_buffer_unmap() to be called multiple times. This causes subsequent calls to return -ENODEV and triggers a warning. There's no reason to allow user space to split up memory mapping of the ring buffer. Have it return -EINVAL when that happens. * tag 'trace-ringbuffer-v6.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAs
2025-11-26dma-direct: Fix missing sg_dma_len assignment in P2PDMA bus mappingsPranjal Shrivastava
Prior to commit a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers"), P2P segments were mapped using the pci_p2pdma_map_segment() helper. This helper was responsible for populating sg->dma_address, marking the bus address, and also setting sg_dma_len(sg). The refactor[1] removed this helper and moved the mapping logic directly into the callers. While iommu_dma_map_sg() was correctly updated to set the length in the new flow, it was missed in dma_direct_map_sg(). Thus, in dma_direct_map_sg(), the PCI_P2PDMA_MAP_BUS_ADDR case sets the dma_address and marks the segment, but immediately executes 'continue', which causes the loop to skip the standard assignment logic at the end: sg_dma_len(sg) = sg->length; As a result, when CONFIG_NEED_SG_DMA_LENGTH is enabled, the dma_length field remains uninitialized (zero) for P2P bus address mappings. This breaks upper-layer drivers (for e.g. RDMA/IB) that rely on sg_dma_len() to determine the transfer size. Fix this by explicitly setting the DMA length in the PCI_P2PDMA_MAP_BUS_ADDR case before continuing to the next scatterlist entry. Fixes: a25e7962db0d7 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers") Reported-by: Jacob Moroni <jmoroni@google.com> Signed-off-by: Pranjal Shrivastava <praan@google.com> [1] https://lore.kernel.org/all/ac14a0e94355bf898de65d023ccf8a2ad22a3ece.1746424934.git.leon@kernel.org/ Reviewed-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Shivaji Kant <shivajikant@google.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251126114112.3694469-1-praan@google.com
2025-11-25tracing: Fix WARN_ON in tracing_buffers_mmap_close for split VMAsDeepanshu Kartikey
When a VMA is split (e.g., by partial munmap or MAP_FIXED), the kernel calls vm_ops->close on each portion. For trace buffer mappings, this results in ring_buffer_unmap() being called multiple times while ring_buffer_map() was only called once. This causes ring_buffer_unmap() to return -ENODEV on subsequent calls because user_mapped is already 0, triggering a WARN_ON. Trace buffer mappings cannot support partial mappings because the ring buffer structure requires the complete buffer including the meta page. Fix this by adding a may_split callback that returns -EINVAL to prevent VMA splits entirely. Cc: stable@vger.kernel.org Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Link: https://patch.msgid.link/20251119064019.25904-1-kartikey406@gmail.com Closes: https://syzkaller.appspot.com/bug?extid=a72c325b042aae6403c7 Tested-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Reported-by: syzbot+a72c325b042aae6403c7@syzkaller.appspotmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-25timekeeping: Fix error code in tk_aux_sysfs_init()Dan Carpenter
If kobject_create_and_add() fails on the first iteration, then the error code is set to -ENOMEM which is correct. But if it fails in subsequent iterations then "ret" is zero, which means success, but it should be -ENOMEM. Set the error code to -ENOMEM correctly. Fixes: 7b5ab04f035f ("timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Malaya Kumar Rout <mrout@redhat.com> Link: https://patch.msgid.link/aSW1R8q5zoY_DgQE@stanley.mountain
2025-11-23Merge tag 'timers-urgent-2025-11-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: - Fix a race in timer->function clearing in timer_shutdown_sync() - Fix a timekeeper sysfs-setup resource leak in error paths - Fix the NOHZ report_idle_softirq() syslog rate-limiting logic to have no side effects on the return value * tag 'timers-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers: Fix NULL function pointer race in timer_shutdown_sync() timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths tick/sched: Fix bogus condition in report_idle_softirq()
2025-11-23Merge tag 'perf-urgent-2025-11-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Fix perf CPU-clock counters, and address a static checker warning" * tag 'perf-urgent-2025-11-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix 0 count issue of cpu-clock perf/x86/intel/uncore: Remove superfluous check
2025-11-22timers: Fix NULL function pointer race in timer_shutdown_sync()Yipeng Zou
There is a race condition between timer_shutdown_sync() and timer expiration that can lead to hitting a WARN_ON in expire_timers(). The issue occurs when timer_shutdown_sync() clears the timer function to NULL while the timer is still running on another CPU. The race scenario looks like this: CPU0 CPU1 <SOFTIRQ> lock_timer_base() expire_timers() base->running_timer = timer; unlock_timer_base() [call_timer_fn enter] mod_timer() ... timer_shutdown_sync() lock_timer_base() // For now, will not detach the timer but only clear its function to NULL if (base->running_timer != timer) ret = detach_if_pending(timer, base, true); if (shutdown) timer->function = NULL; unlock_timer_base() [call_timer_fn exit] lock_timer_base() base->running_timer = NULL; unlock_timer_base() ... // Now timer is pending while its function set to NULL. // next timer trigger <SOFTIRQ> expire_timers() WARN_ON_ONCE(!fn) // hit ... lock_timer_base() // Now timer will detach if (base->running_timer != timer) ret = detach_if_pending(timer, base, true); if (shutdown) timer->function = NULL; unlock_timer_base() The problem is that timer_shutdown_sync() clears the timer function regardless of whether the timer is currently running. This can leave a pending timer with a NULL function pointer, which triggers the WARN_ON_ONCE(!fn) check in expire_timers(). Fix this by only clearing the timer function when actually detaching the timer. If the timer is running, leave the function pointer intact, which is safe because the timer will be properly detached when it finishes running. Fixes: 0cc04e80458a ("timers: Add shutdown mechanism to the internal functions") Signed-off-by: Yipeng Zou <zouyipeng@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com
2025-11-20Merge tag 'sched_ext-for-6.18-rc6-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fix from Tejun Heo: "One low risk and obvious fix: scx_enable() was dereferencing an error pointer on helper kthread creation failure. Fixed" * tag 'sched_ext-for-6.18-rc6-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix scx_enable() crash on helper kthread creation failure
2025-11-20sched_ext: Fix scx_enable() crash on helper kthread creation failureSaket Kumar Bhaskar
A crash was observed when the sched_ext selftests runner was terminated with Ctrl+\ while test 15 was running: NIP [c00000000028fa58] scx_enable.constprop.0+0x358/0x12b0 LR [c00000000028fa2c] scx_enable.constprop.0+0x32c/0x12b0 Call Trace: scx_enable.constprop.0+0x32c/0x12b0 (unreliable) bpf_struct_ops_link_create+0x18c/0x22c __sys_bpf+0x23f8/0x3044 sys_bpf+0x2c/0x6c system_call_exception+0x124/0x320 system_call_vectored_common+0x15c/0x2ec kthread_run_worker() returns an ERR_PTR() on failure rather than NULL, but the current code in scx_alloc_and_add_sched() only checks for a NULL helper. Incase of failure on SIGQUIT, the error is not handled in scx_alloc_and_add_sched() and scx_enable() ends up dereferencing an error pointer. Error handling is fixed in scx_alloc_and_add_sched() to propagate PTR_ERR() into ret, so that scx_enable() jumps to the existing error path, avoiding random dereference on failure. Fixes: bff3b5aec1b7 ("sched_ext: Move disable machinery into scx_sched") Cc: stable@vger.kernel.org # v6.16+ Reported-and-tested-by: Samir Mulani <samir@linux.ibm.com> Signed-off-by: Saket Kumar Bhaskar <skb99@linux.ibm.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Vishal Chourasia <vishalc@linux.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20timekeeping: Fix resource leak in tk_aux_sysfs_init() error pathsMalaya Kumar Rout
tk_aux_sysfs_init() returns immediately on error during the auxiliary clock initialization loop without cleaning up previously allocated kobjects and sysfs groups. If kobject_create_and_add() or sysfs_create_group() fails during loop iteration, the parent kobjects (tko and auxo) and any previously created child kobjects are leaked. Fix this by adding proper error handling with goto labels to ensure all allocated resources are cleaned up on failure. kobject_put() on the parent kobjects will handle cleanup of their children. Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks") Signed-off-by: Malaya Kumar Rout <mrout@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com
2025-11-20perf: Fix 0 count issue of cpu-clockDapeng Mi
Currently cpu-clock event always returns 0 count, e.g., perf stat -e cpu-clock -- sleep 1 Performance counter stats for 'sleep 1': 0 cpu-clock # 0.000 CPUs utilized 1.002308394 seconds time elapsed The root cause is the commit 'bc4394e5e79c ("perf: Fix the throttle error of some clock events")' adds PERF_EF_UPDATE flag check before calling cpu_clock_event_update() to update the count, however the PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in counting mode (pmu->dev() -> cpu_clock_event_del() -> cpu_clock_event_stop()). This leads to the cpu-clock event count is never updated. To fix this issue, force to set PERF_EF_UPDATE flag for cpu-clock event just like what task-clock does. Fixes: bc4394e5e79c ("perf: Fix the throttle error of some clock events") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
2025-11-19tick/sched: Fix bogus condition in report_idle_softirq()Wen Yang
In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from this form: if (A) { static int ratelimit; if (ratelimit < 10 && !C && A&D) { pr_warn("NOHZ tick-stop error: ..."); ratelimit++; } return false; } to a new function: static bool report_idle_softirq(void) { static int ratelimit; if (likely(!A)) return false; if (ratelimit < 10) return false; ... pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n", pending); ratelimit++; return true; } commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") realized ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued, but "fixed" it as: - if (ratelimit < 10) + if (ratelimit >= 10) return false; However, this fix introduced another issue: When ratelimit is greater than or equal 10, even if A is true, it will directly return false. While ratelimit in the original code was only used to control printing and will not affect the return value. Restore the original logic and restrict ratelimit to control the printk and not the return value. Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Fixes: a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") Signed-off-by: Wen Yang <wen.yang@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
2025-11-17Merge tag 'vfs-6.18-rc7.fixes' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix unitialized variable in statmount_string() - Fix hostfs mounting when passing host root during boot - Fix dynamic lookup to fail on cell lookup failure - Fix missing file type when reading bfs inodes from disk - Enforce checking of sb_min_blocksize() calls and update all callers accordingly - Restore write access before closing files opened by open_exec() in binfmt_misc - Always freeze efivarfs during suspend/hibernate cycles - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptor in addition to mount namespace ids - Fix tmpfs remount when noswap is specified - Switch Landlock to iput_not_last() to remove false-positives from might_sleep() annotations in iput() - Remove dead node_to_mnt_ns() code - Ensure that per-queue kobjects are successfully created * tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: landlock: fix splats from iput() after it started calling might_sleep() fs: add iput_not_last() shmem: fix tmpfs reconfiguration (remount) when noswap is set fs/namespace: correctly handle errors returned by grab_requested_mnt_ns power: always freeze efivarfs binfmt_misc: restore write access before closing files opened by open_exec() block: add __must_check attribute to sb_min_blocksize() virtio-fs: fix incorrect check for fsvq->kobj xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super isofs: check the return value of sb_min_blocksize() in isofs_fill_super exfat: check return value of sb_min_blocksize in exfat_read_boot_sector vfat: fix missing sb_min_blocksize() return value checks mnt: Remove dead code which might prevent from building bfs: Reconstruct file type when loading from disk afs: Fix dynamic lookup to fail on cell lookup failure hostfs: Fix only passing host root in boot stage with new mount fs: Fix uninitialized 'offp' in statmount_string()
2025-11-17Merge tag 'sched_ext-for-6.18-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "Five fixes addressing PREEMPT_RT compatibility and locking issues. Three commits fix potential deadlocks and sleeps in atomic contexts on RT kernels by converting locks to raw spinlocks and ensuring IRQ work runs in hard-irq context. The remaining two fix unsafe locking in the debug dump path and a variable dereference typo" * tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work sched_ext: Fix possible deadlock in the deferred_irq_workfn() sched/ext: convert scx_tasks_lock to raw spinlock sched_ext: Fix unsafe locking in the scx_dump_state() sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
2025-11-17sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_workZqiang
For PREEMPT_RT kernels, the kick_cpus_irq_workfn() be invoked in the per-cpu irq_work/* task context and there is no rcu-read critical section to protect. this commit therefore use IRQ_WORK_INIT_HARD() to initialize the per-cpu rq->scx.kick_cpus_irq_work in the init_sched_ext_class(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-16Merge tag 'mm-hotfixes-stable-2025-11-16-10-40' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "7 hotfixes. 5 are cc:stable, 4 are against mm/ All are singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm, swap: fix potential UAF issue for VMA readahead selftests/user_events: fix type cast for write_index packed member in perf_test lib/test_kho: check if KHO is enabled mm/huge_memory: fix folio split check for anon folios in swapcache MAINTAINERS: update David Hildenbrand's email address crash: fix crashkernel resource shrink mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
2025-11-15crash: fix crashkernel resource shrinkSourabh Jain
When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem | grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI <snip...> Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 <snip...> This happens because __crash_shrink_memory()/kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory. Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-15Merge tag 'timers-urgent-2025-11-15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a memory leak in the posix timer creation logic" * tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-timers: Plug potential memory leak in do_timer_create()
2025-11-14Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks. - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks. - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman) - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen) - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet) - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test widen_imprecise_scalars() with different stack depth bpf: account for current allocated stack depth in widen_imprecise_scalars() bpf: Add bpf_prog_run_data_pointers() selftests/bpf: Add mptcp test with sockmap mptcp: Fix proto fallback detection with BPF mptcp: Disallow MPTCP subflows from sockmap selftests/bpf: Add stacktrace ips test for raw_tp selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()" bpf: add _impl suffix for bpf_stream_vprintk() kfunc bpf:add _impl suffix for bpf_task_work_schedule* kfuncs selftests/bpf: Add tests for livepatch + bpf trampoline ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct() ftrace: Fix BPF fexit with livepatch
2025-11-14bpf: account for current allocated stack depth in widen_imprecise_scalars()Eduard Zingerman
The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, differnt param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for a second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>