summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
12 hoursaudit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIVSergio Correia
commit f9e1c1324b4d98d591a6f7568fdebf5cf456dfc2 upstream. AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This allows a process with CAP_AUDIT_CONTROL to modify directory tree watches and equivalence mappings even when the audit configuration has been locked, undermining the purpose of the lock. Add AUDIT_LOCKED checks to both commands. Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 hoursaudit: fix incorrect inheritable capability in CAPSET recordsSergio Correia
commit e4a640475e43f406fdfd56d370b1f34b0cbbc18d upstream. __audit_log_capset() records the effective capability set into the inheritable field due to a copy-paste error. Every CAPSET audit record therefore reports cap_pi (process inheritable) with the value of cap_effective instead of cap_inheritable. This silently corrupts audit data used for compliance and forensic analysis: an attacker who modifies inheritable capabilities to prepare for a privilege-escalating exec would have the change masked in the audit trail. The bug has been present since the original introduction of CAPSET audit records in 2008. Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 hoursworkqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND pathBreno Leitao
commit 0143033dc22cdff912cfc13419f5db92fea3b4cb upstream. For WQ_UNBOUND workqueues, alloc_and_link_pwqs() allocates wq->cpu_pwq via alloc_percpu() and then calls apply_workqueue_attrs_locked(). On failure it returns the error directly, bypassing the enomem: label which holds the only free_percpu(wq->cpu_pwq) in this function. The caller's error path kfree()s wq without touching wq->cpu_pwq, leaking one percpu pointer table (nr_cpu_ids * sizeof(void *) bytes) per failed call. If kmemleak is enabled, we can see: unreferenced object (percpu) 0xc0fffa5b121048 (size 8): comm "insmod", pid 776, jiffies 4294682844 backtrace (crc 0): pcpu_alloc_noprof+0x665/0xac0 __alloc_workqueue+0x33f/0xa20 alloc_workqueue_noprof+0x60/0x100 Route the error through the existing enomem: cleanup and any error before this one. Cc: stable@kernel.org Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
12 hoursfutex: Prevent lockup in requeue-PI during signal/ timeout wakeupSebastian Andrzej Siewior
[ Upstream commit bc7304f3ae20972d11db6e0b1b541c63feda5f05 ] During wait-requeue-pi (task A) and requeue-PI (task B) the following race can happen: Task A Task B futex_wait_requeue_pi() futex_setup_timer() futex_do_wait() futex_requeue() CLASS(hb, hb1)(&key1); CLASS(hb, hb2)(&key2); *timeout* futex_requeue_pi_wakeup_sync() requeue_state = Q_REQUEUE_PI_IGNORE *blocks on hb->lock* futex_proxy_trylock_atomic() futex_requeue_pi_prepare() Q_REQUEUE_PI_IGNORE => -EAGAIN double_unlock_hb(hb1, hb2) *retry* Task B acquires both hb locks and attempts to acquire the PI-lock of the top most waiter (task B). Task A is leaving early due to a signal/ timeout and started removing itself from the queue. It updates its requeue_state but can not remove it from the list because this requires the hb lock which is owned by task B. Usually task A is able to swoop the lock after task B unlocked it. However if task B is of higher priority then task A may not be able to wake up in time and acquire the lock before task B gets it again. Especially on a UP system where A is never scheduled. As a result task A blocks on the lock and task B busy loops, trying to make progress but live locks the system instead. Tragic. This can be fixed by removing the top most waiter from the list in this case. This allows task B to grab the next top waiter (if any) in the next iteration and make progress. Remove the top most waiter if futex_requeue_pi_prepare() fails. Let the waiter conditionally remove itself from the list in handle_early_requeue_pi_wakeup(). Fixes: 07d91ef510fb1 ("futex: Prevent requeue_pi() lock nesting issue on RT") Reported-by: Moritz Klammler <Moritz.Klammler@ferchau.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260428103425.dywXyPd3@linutronix.de Closes: https://lore.kernel.org/all/VE1PR06MB6894BE61C173D802365BE19DFF4CA@VE1PR06MB6894.eurprd06.prod.outlook.com Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourstracing: branch: Fix inverted check on stat tracer registrationBreno Leitao
[ Upstream commit 3b75dd76e64a04771861bb5647951c264919e563 ] init_annotated_branch_stats() and all_annotated_branch_stats() check the return value of register_stat_tracer() with "if (!ret)", but register_stat_tracer() returns 0 on success and a negative errno on failure. The inverted check causes the warning to be printed on every successful registration, e.g.: Warning: could not register annotated branches stats while leaving real failures silent. The initcall also returned a hard-coded 1 instead of the actual error. Invert the check and propagate ret so that the warning fires on real errors and the initcall reports the correct status. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourscgroup/rdma: fix integer overflow in rdmacg_try_charge()cuitao
[ Upstream commit c802f460dd485c1332b5a35e7adcfb2bc22536a2 ] The expression `rpool->resources[index].usage + 1` is computed in int arithmetic before being assigned to s64 variable `new`. When usage equals INT_MAX (the default "max" value), the addition overflows to INT_MIN. This negative value then passes the `new > max` check incorrectly, allowing a charge that should be rejected and corrupting usage to negative. Fix by casting usage to s64 before the addition so the arithmetic is done in 64-bit. Fixes: 39d3e7584a68 ("rdmacg: Added rdma cgroup controller") Signed-off-by: cuitao <cuitao@kylinos.cn> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: allow UTF-8 literals in bpf_bprintf_prepare()Yihan Ding
[ Upstream commit b960430ea8862ef37ce53c8bf74a8dc79d3f2404 ] bpf_bprintf_prepare() only needs ASCII parsing for conversion specifiers. Plain text can safely carry bytes >= 0x80, so allow UTF-8 literals outside '%' sequences while keeping ASCII control bytes rejected and format specifiers ASCII-only. This keeps existing parsing rules for format directives unchanged, while allowing helpers such as bpf_trace_printk() to emit UTF-8 literal text. Update test_snprintf_negative() in the same commit so selftests keep matching the new plain-text vs format-specifier split during bisection. Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf") Signed-off-by: Yihan Ding <dingyihan@uniontech.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260416120142.1420646-2-dingyihan@uniontech.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Fix NULL deref in map_kptr_match_type for scalar regsMykyta Yatsenko
[ Upstream commit 4d0a375887ab4d49e4da1ff10f9606cab8f7c3ad ] Commit ab6c637ad027 ("bpf: Fix a bpf_kptr_xchg() issue with local kptr") refactored map_kptr_match_type() to branch on btf_is_kernel() before checking base_type(). A scalar register stored into a kptr slot has no btf, so the btf_is_kernel(reg->btf) call dereferences NULL. Move the base_type() != PTR_TO_BTF_ID guard before any reg->btf access. Fixes: ab6c637ad027 ("bpf: Fix a bpf_kptr_xchg() issue with local kptr") Reported-by: Hiker Cl <clhiker365@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221372 Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260416-kptr_crash-v1-1-5589356584b4@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourstracing: Rebuild full_name on each hist_field_name() callPengpeng Hou
[ Upstream commit 5ec1d1e97de134beed3a5b08235a60fc1c51af96 ] hist_field_name() uses a static MAX_FILTER_STR_VAL buffer for fully qualified variable-reference names, but it currently appends into that buffer with strcat() without rebuilding it first. As a result, repeated calls append a new "system.event.field" name onto the previous one, which can eventually run past the end of full_name. Build the name with snprintf() on each call and return NULL if the fully qualified name does not fit in MAX_FILTER_STR_VAL. Link: https://patch.msgid.link/20260401112224.85582-1-pengpeng@iscas.ac.cn Fixes: 067fe038e70f ("tracing: Add variable reference handling to hist triggers") Reviewed-by: Tom Zanussi <zanussi@kernel.org> Tested-by: Tom Zanussi <zanussi@kernel.org> Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursunshare: fix nsproxy leak in ksys_unshare() on set_cred_ucounts() failureMichal Grzedzicki
[ Upstream commit a98621a0f187a934c115dcfe79a49520ae892111 ] When set_cred_ucounts() fails in ksys_unshare() new_nsproxy is leaked. Let's call put_nsproxy() if that happens. Link: https://lkml.kernel.org/r/20260213193959.2556730-1-mge@meta.com Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Signed-off-by: Michal Grzedzicki <mge@meta.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Gladkov (Intel) <legion@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourspadata: Put CPU offline callback in ONLINE section to allow failureDaniel Jordan
[ Upstream commit c8c4a2972f83c8b68ff03b43cecdb898939ff851 ] syzbot reported the following warning: DEAD callback error for CPU1 WARNING: kernel/cpu.c:1463 at _cpu_down+0x759/0x1020 kernel/cpu.c:1463, CPU#0: syz.0.1960/14614 at commit 4ae12d8bd9a8 ("Merge tag 'kbuild-fixes-7.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux") which tglx traced to padata_cpu_dead() given it's the only sub-CPUHP_TEARDOWN_CPU callback that returns an error. Failure isn't allowed in hotplug states before CPUHP_TEARDOWN_CPU so move the CPU offline callback to the ONLINE section where failure is possible. Fixes: 894c9ef9780c ("padata: validate cpumask without removed CPU during offline") Reported-by: syzbot+123e1b70473ce213f3af@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69af0a05.050a0220.310d8.002f.GAE@google.com/ Debugged-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourspadata: Remove cpu online check from cpu add and removalChuyi Zhou
[ Upstream commit 73117ea6470dca787f70f33c001f9faf437a1c0b ] During the CPU offline process, the dying CPU is cleared from the cpu_online_mask in takedown_cpu(). After this step, various CPUHP_*_DEAD callbacks are executed to perform cleanup jobs for the dead CPU, so this cpu online check in padata_cpu_dead() is unnecessary. Similarly, when executing padata_cpu_online() during the CPUHP_AP_ONLINE_DYN phase, the CPU has already been set in the cpu_online_mask, the action even occurs earlier than the CPUHP_AP_ONLINE_IDLE stage. Remove this unnecessary cpu online check in __padata_add_cpu() and __padata_remove_cpu(). Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Stable-dep-of: c8c4a2972f83 ("padata: Put CPU offline callback in ONLINE section to allow failure") Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Fix OOB in pcpu_init_valueLang Xu
[ Upstream commit 576afddfee8d1108ee299bf10f581593540d1a36 ] An out-of-bounds read occurs when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes. The issue happens when: 1. A CGROUP_STORAGE map is created with value_size not aligned to 8 bytes (e.g., 4 bytes) 2. A pcpu map is created with the same value_size (e.g., 4 bytes) 3. Update element in 2 with data in 1 pcpu_init_value assumes that all sources are rounded up to 8 bytes, and invokes copy_map_value_long to make a data copy, However, the assumption doesn't stand since there are some cases where the source may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data. the verifier verifies exactly the size that the source claims, not the size rounded up to 8 bytes by kernel, an OOB happens when the source has only 4 bytes while the copy size(4) is rounded up to 8. Fixes: d3bec0138bfb ("bpf: Zero-fill re-used per-cpu map element") Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn> Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/ Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Fix RCU stall in bpf_fd_array_map_clear()Sechang Lim
[ Upstream commit 4406942e65ca128c56c67443832988873c21d2e9 ] Add a missing cond_resched() in bpf_fd_array_map_clear() loop. For PROG_ARRAY maps with many entries this loop calls prog_array_map_poke_run() per entry which can be expensive, and without yielding this can cause RCU stalls under load: rcu: Stack dump where RCU GP kthread last ran: CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef) Workqueue: events prog_array_map_clear_deferred RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246 Call Trace: <TASK> prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096 __fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925 bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline] prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141 process_one_work+0x898/0x19d0 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x770/0x10b0 kernel/workqueue.c:3400 kthread+0x465/0x880 kernel/kthread.c:464 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Fixes: da765a2f5993 ("bpf: Add poke dependency tracking for prog array maps") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Link: https://lore.kernel.org/r/20260407103823.3942156-1-rhkrqnwk98@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooksJiayuan Chen
[ Upstream commit beaf0e96b1da74549a6cabd040f9667d83b2e97e ] bpf_lsm_task_to_inode() is called under rcu_read_lock() and bpf_lsm_inet_conn_established() is called from softirq context, so neither hook can be used by sleepable LSM programs. Fixes: 423f16108c9d8 ("bpf: Augment the set of sleepable LSM hooks") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Fix stale offload->prog pointer after constant blindingMingTao Huang
[ Upstream commit a1aa9ef47c299c5bbc30594d3c2f0589edf908e6 ] When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes JIT compilation with constant blinding enabled (bpf_jit_harden >= 2), bpf_jit_blind_constants() clones the program. The original prog is then freed in bpf_jit_prog_release_other(), which updates aux->prog to point to the surviving clone, but fails to update offload->prog. This leaves offload->prog pointing to the freed original program. When the network namespace is subsequently destroyed, cleanup_net() triggers bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls __bpf_prog_offload_destroy(offload->prog). Accessing the freed prog causes a page fault: BUG: unable to handle page fault for address: ffffc900085f1038 Workqueue: netns cleanup_net RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80 Call Trace: __bpf_offload_dev_netdev_unregister+0x257/0x350 bpf_dev_bound_netdev_unregister+0x4a/0x90 unregister_netdevice_many_notify+0x2a2/0x660 ... cleanup_net+0x21a/0x320 The test sequence that triggers this reliably is: 1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden) 2. Run xdp_metadata selftest, which creates a dev-bound-only XDP program on a veth inside a netns (./test_progs -t xdp_metadata) 3. cleanup_net -> page fault in __bpf_prog_offload_destroy Dev-bound-only programs are unique in that they have an offload structure but go through the normal JIT path instead of bpf_prog_offload_compile(). This means they are subject to constant blinding's prog clone-and-replace, while also having offload->prog that must stay in sync. Fix this by updating offload->prog in bpf_jit_prog_release_other(), alongside the existing aux->prog update. Both are back-pointers to the prog that must be kept in sync when the prog is replaced. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Signed-off-by: MingTao Huang <mintaohuang@tencent.com> Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: fix end-of-list detection in cgroup_storage_get_next_key()Weiming Shi
[ Upstream commit 5828b9e5b272ecff7cf5d345128d3de7324117f7 ] list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf: Use RCU-safe iteration in dev_map_redirect_multi() SKB pathDavid Carlier
[ Upstream commit 8ed82f807bb09d2c8455aaa665f2c6cb17bc6a19 ] The DEVMAP_HASH branch in dev_map_redirect_multi() uses hlist_for_each_entry_safe() to iterate hash buckets, but this function runs under RCU protection (called from xdp_do_generic_redirect_map() in softirq context). Concurrent writers (__dev_map_hash_update_elem, dev_map_hash_delete_elem) modify the list using RCU primitives (hlist_add_head_rcu, hlist_del_rcu). hlist_for_each_entry_safe() performs plain pointer dereferences without rcu_dereference(), missing the acquire barrier needed to pair with writers' rcu_assign_pointer(). On weakly-ordered architectures (ARM64, POWER), a reader can observe a partially-constructed node. It also defeats CONFIG_PROVE_RCU lockdep validation and KCSAN data-race detection. Replace with hlist_for_each_entry_rcu() using rcu_read_lock_bh_held() as the lockdep condition, consistent with the rcu_dereference_check() used in the DEVMAP (non-hash) branch of the same functions. Also fix the same incorrect lockdep_is_held(&dtab->index_lock) condition in dev_map_enqueue_multi(), where the lock is not held either. Fixes: e624d4ed4aa8 ("xdp: Extend xdp_redirect_map with broadcast support") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20260320072645.16731-1-devnexen@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursbpf, devmap: Remove unnecessary if check in for loopThorsten Blum
[ Upstream commit 2317dc2c22cc353b699c7d1db47b2fe91f54055c ] The iterator variable dst cannot be NULL and the if check can be removed. Remove it and fix the following Coccinelle/coccicheck warning reported by itnull.cocci: ERROR: iterator variable bound on line 762 cannot be NULL Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20240529101900.103913-2-thorsten.blum@toblux.com Stable-dep-of: 8ed82f807bb0 ("bpf: Use RCU-safe iteration in dev_map_redirect_multi() SKB path") Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursmodule: Fix freeing of charp module parameters when CONFIG_SYSFS=nPetr Pavlu
[ Upstream commit deffe1edba626d474fef38007c03646ca5876a0e ] When setting a charp module parameter, the param_set_charp() function allocates memory to store a copy of the input value. Later, when the module is potentially unloaded, the destroy_params() function is called to free this allocated memory. However, destroy_params() is available only when CONFIG_SYSFS=y, otherwise only a dummy variant is present. In the unlikely case that the kernel is configured with CONFIG_MODULES=y and CONFIG_SYSFS=n, this results in a memory leak of charp values when a module is unloaded. Fix this issue by making destroy_params() always available when CONFIG_MODULES=y. Rename the function to module_destroy_params() to clarify that it is intended for use by the module loader. Fixes: e180a6b7759a ("param: fix charp parameters set via sysfs") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hoursparams: Replace __modinit with __init_or_modulePetr Pavlu
[ Upstream commit 3cb0c3bdea5388519bc1bf575dca6421b133302b ] Remove the custom __modinit macro from kernel/params.c and instead use the common __init_or_module macro from include/linux/module.h. Both provide the same functionality. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Stable-dep-of: deffe1edba62 ("module: Fix freeing of charp module parameters when CONFIG_SYSFS=n") Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourshrtimer: Reduce trace noise in hrtimer_start()Thomas Gleixner
[ Upstream commit f2e388a019e4cf83a15883a3d1f1384298e9a6aa ] hrtimer_start() when invoked with an already armed timer traces like: <comm>-.. [032] d.h2. 5.002263: hrtimer_cancel: hrtimer= .... <comm>-.. [032] d.h1. 5.002263: hrtimer_start: hrtimer= .... Which is incorrect as the timer doesn't get canceled. Just the expiry time changes. The internal dequeue operation which is required for that is not really interesting for trace analysis. But it makes it tedious to keep real cancellations and the above case apart. Remove the cancel tracing in hrtimer_start() and add a 'was_armed' indicator to the hrtimer start tracepoint, which clearly indicates what the state of the hrtimer is when hrtimer_start() is invoked: <comm>-.. [032] d.h1. 6.200103: hrtimer_start: hrtimer= .... was_armed=0 <comm>-.. [032] d.h1. 6.200558: hrtimer_start: hrtimer= .... was_armed=1 Fixes: c6a2a1770245 ("hrtimer: Add tracepoint for hrtimers") Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163430.208491877@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourshrtimer: Avoid pointless reprogramming in __hrtimer_start_range_ns()Peter Zijlstra
[ Upstream commit d19ff16c11db38f3ee179d72751fb9b340174330 ] Much like hrtimer_reprogram(), skip programming if the cpu_base is running the hrtimer interrupt. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Reviewed-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260224163429.069535561@kernel.org Stable-dep-of: f2e388a019e4 ("hrtimer: Reduce trace noise in hrtimer_start()") Signed-off-by: Sasha Levin <sashal@kernel.org>
12 hourshrtimers: Update the return type of enqueue_hrtimer()Richard Clark
[ Upstream commit da7100d3bf7d6f5c49ef493ea963766898e9b069 ] The return type should be 'bool' instead of 'int' according to the calling context in the kernel, and its internal implementation, i.e. : return timerqueue_add(); which is a bool-return function. [ tglx: Adjust function arguments ] Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/Z2ppT7me13dtxm1a@MBC02GN1V4Q05P Stable-dep-of: f2e388a019e4 ("hrtimer: Reduce trace noise in hrtimer_start()") Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daystracing/probes: Limit size of event probe to 3KSteven Rostedt
[ Upstream commit b2aa3b4d64e460ac606f386c24e7d8a873ce6f1a ] There currently isn't a max limit an event probe can be. One could make an event greater than PAGE_SIZE, which makes the event useless because if it's bigger than the max event that can be recorded into the ring buffer, then it will never be recorded. A event probe should never need to be greater than 3K, so make that the max size. As long as the max is less than the max that can be recorded onto the ring buffer, it should be fine. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events") Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [ adjusted context to place MAX_PROBE_EVENT_SIZE near MAX_STRING_SIZE and appended EVENT_TOO_BIG after NEED_STRING_TYPE ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daystracepoint: balance regfunc() on func_add() failure in tracepoint_add_func()David Carlier
[ Upstream commit fad217e16fded7f3c09f8637b0f6a224d58b5f2e ] When a tracepoint goes through the 0 -> 1 transition, tracepoint_add_func() invokes the subsystem's ext->regfunc() before attempting to install the new probe via func_add(). If func_add() then fails (for example, when allocate_probes() cannot allocate a new probe array under memory pressure and returns -ENOMEM), the function returns the error without calling the matching ext->unregfunc(), leaving the side effects of regfunc() behind with no installed probe to justify them. For syscall tracepoints this is particularly unpleasant: syscall_regfunc() bumps sys_tracepoint_refcount and sets SYSCALL_TRACEPOINT on every task. After a leaked failure, the refcount is stuck at a non-zero value with no consumer, and every task continues paying the syscall trace entry/exit overhead until reboot. Other subsystems providing regfunc()/unregfunc() pairs exhibit similarly scoped persistent state. Mirror the existing 1 -> 0 cleanup and call ext->unregfunc() in the func_add() error path, gated on the same condition used there so the unwind is symmetric with the registration. Fixes: 8cf868affdc4 ("tracing: Have the reg function allow to fail") Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260413190601.21993-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> [ changed `tp->ext->unregfunc` to `tp->unregfunc` to match older struct layout ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 dayssched: Use u64 for bandwidth ratio calculationsJoseph Salisbury
[ Upstream commit c6e80201e057dfb7253385e60bf541121bf5dc33 ] to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and runtime values, but it returns unsigned long. tg_rt_schedulable() also stores the current group limit and the accumulated child sum in unsigned long. On 32-bit builds, large bandwidth ratios can be truncated and the RT group sum can wrap when enough siblings are present. That can let an overcommitted RT hierarchy pass the schedulability check, and it also narrows the helper result for other callers. Return u64 from to_ratio() and use u64 for the RT group totals so bandwidth ratios are preserved and compared at full width on both 32-bit and 64-bit builds. Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints") Assisted-by: Codex:GPT-5 Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com [ dropped `extern` keyword from `to_ratio()` declaration ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daysexit: Sleep at TASK_IDLE when waiting for application core dumpPaul E. McKenney
commit b8e753128ed074fcb48e9ceded940752f6b1c19f upstream. Currently, the coredump_task_exit() function sets the task state to TASK_UNINTERRUPTIBLE|TASK_FREEZABLE, which usually works well. But a combination of large memory and slow (and/or highly contended) mass storage can cause application core dumps to take more than two minutes, which can cause check_hung_task(), which is invoked by check_hung_uninterruptible_tasks(), to produce task-blocked splats. There does not seem to be any reasonable benefit to getting these splats. Furthermore, as Oleg Nesterov points out, TASK_UNINTERRUPTIBLE could be misleading because the task sleeping in coredump_task_exit() really is killable, albeit indirectly. See the check of signal->core_state in prepare_signal() and the check of fatal_signal_pending() in dump_interrupted(), which bypass the normal unkillability of TASK_UNINTERRUPTIBLE, resulting in coredump_finish() invoking wake_up_process() on any threads sleeping in coredump_task_exit(). Therefore, change that TASK_UNINTERRUPTIBLE to TASK_IDLE. Reported-by: Anhad Jai Singh <ffledgling@meta.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <brauner@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Rik van Riel <riel@surriel.com> Cc: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daysexit: prevent preemption of oopsing TASK_DEAD taskJann Horn
commit c1fa0bb633e4a6b11e83ffc57fa5abe8ebb87891 upstream. When an already-exiting task oopses, make_task_dead() currently calls do_task_dead() with preemption enabled. That is forbidden: do_task_dead() calls __schedule(), which has a comment saying "WARNING: must be called with preemption disabled!". If an oopsing task is preempted in do_task_dead(), between becoming TASK_DEAD and entering the scheduler explicitly, bad things happen: finish_task_switch() assumes that once the scheduler has switched away from a TASK_DEAD task, the task can never run again and its stack is no longer needed; but that assumption apparently doesn't hold if the dead task was preempted (the SM_PREEMPT case). This means that the scheduler ends up repeatedly dropping references on the dead task's stack, which can lead to use-after-free or double-free of the entire task stack; in other words, two tasks can end up running on the same stack, resulting in various kinds of memory corruption. (This does not just affect "recursively oopsing" tasks; it is enough to oops once during task exit, for example in a file_operations::release handler) Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daysbpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_miscKumar Kartikeya Dwivedi
[ Upstream commit 69772f509e084ec6bca12dbcdeeeff41b0103774 ] Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to STACK_MISC when allow_ptr_leaks is false, since invalid contents shouldn't be read unless the program has the relevant capabilities. The relaxation only makes sense when env->allow_ptr_leaks is true. However, such conversion in privileged mode becomes unnecessary, as invalid slots can be read without being upgraded to STACK_MISC. Currently, the condition is inverted (i.e. checking for true instead of false), simply remove it to restore correct behavior. Fixes: eaf18febd6eb ("bpf: preserve STACK_ZERO slots on partial reg spills") Acked-by: Andrii Nakryiko <andrii@kernel.org> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysbpf: handle fake register spill to stack with BPF_ST_MEM instructionAndrii Nakryiko
[ Upstream commit 482d548d40b0af9af730e4869903d4433e44f014 ] When verifier validates BPF_ST_MEM instruction that stores known constant to stack (e.g., *(u64 *)(r10 - 8) = 123), it effectively spills a fake register with a constant (but initially imprecise) value to a stack slot. Because read-side logic treats it as a proper register fill from stack slot, we need to mark such stack slot initialization as INSN_F_STACK_ACCESS instruction to stop precision backtracking from missing it. Fixes: 41f6f64e6999 ("bpf: support non-r10 register spill/fill to/from stack in precision tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231209010958.66758-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysbpf: track aligned STACK_ZERO cases as imprecise spilled registersAndrii Nakryiko
[ Upstream commit 18a433b62061e3d787bfc3e670fa711fecbd7cb4 ] Now that precision backtracing is supporting register spill/fill to/from stack, there is another oportunity to be exploited here: minimizing precise STACK_ZERO cases. With a simple code change we can rely on initially imprecise register spill tracking for cases when register spilled to stack was a known zero. This is a very common case for initializing on the stack variables, including rather large structures. Often times zero has no special meaning for the subsequent BPF program logic and is often overwritten with non-zero values soon afterwards. But due to STACK_ZERO vs STACK_MISC tracking, such initial zero initialization actually causes duplication of verifier states as STACK_ZERO is clearly different than STACK_MISC or spilled SCALAR_VALUE register. The effect of this (now) trivial change is huge, as can be seen below. These are differences between BPF selftests, Cilium, and Meta-internal BPF object files relative to previous patch in this series. You can see improvements ranging from single-digit percentage improvement for instructions and states, all the way to 50-60% reduction for some of Meta-internal host agent programs, and even some Cilium programs. For Meta-internal ones I left only the differences for largest BPF object files by states/instructions, as there were too many differences in the overall output. All the differences were improvements, reducting number of states and thus instructions validated. Note, Meta-internal BPF object file names are not printed below. Many copies of balancer_ingress are actually many different configurations of Katran, so they are different BPF programs, which explains state reduction going from -16% all the way to 31%, depending on BPF program logic complexity. I also tooked a closer look at a few small-ish BPF programs to validate the behavior. Let's take bpf_iter_netrlink.bpf.o (first row below). While it's just 8 vs 5 states, verifier log is still pretty long to include it here. But the reduction in states is due to the following piece of C code: unsigned long ino; ... sk = s->sk_socket; if (!sk) { ino = 0; } else { inode = SOCK_INODE(sk); bpf_probe_read_kernel(&ino, sizeof(ino), &inode->i_ino); } BPF_SEQ_PRINTF(seq, "%-8u %-8lu\n", s->sk_drops.counter, ino); return 0; You can see that in some situations `ino` is zero-initialized, while in others it's unknown value filled out by bpf_probe_read_kernel(). Before this change code after if/else branches have to be validated twice. Once with (precise) ino == 0, due to eager STACK_ZERO logic, and then again for when ino is just STACK_MISC. But BPF_SEQ_PRINTF() doesn't care about precise value of ino, so with the change in this patch verifier is able to prune states from after one of the branches, reducing number of total states (and instructions) required for successful validation. Similar principle applies to bigger real-world applications, just at a much larger scale. SELFTESTS ========= File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) --------------------------------------- ----------------------- --------- --------- --------------- ---------- ---------- ------------- bpf_iter_netlink.bpf.linked3.o dump_netlink 148 104 -44 (-29.73%) 8 5 -3 (-37.50%) bpf_iter_unix.bpf.linked3.o dump_unix 8474 8404 -70 (-0.83%) 151 147 -4 (-2.65%) bpf_loop.bpf.linked3.o stack_check 560 324 -236 (-42.14%) 42 24 -18 (-42.86%) local_storage_bench.bpf.linked3.o get_local 120 77 -43 (-35.83%) 9 6 -3 (-33.33%) loop6.bpf.linked3.o trace_virtqueue_add_sgs 10167 9868 -299 (-2.94%) 226 206 -20 (-8.85%) pyperf600_bpf_loop.bpf.linked3.o on_event 4872 3423 -1449 (-29.74%) 322 229 -93 (-28.88%) strobemeta.bpf.linked3.o on_event 180697 176036 -4661 (-2.58%) 4780 4734 -46 (-0.96%) test_cls_redirect.bpf.linked3.o cls_redirect 65594 65401 -193 (-0.29%) 4230 4212 -18 (-0.43%) test_global_func_args.bpf.linked3.o test_cls 145 136 -9 (-6.21%) 10 9 -1 (-10.00%) test_l4lb.bpf.linked3.o balancer_ingress 4760 2612 -2148 (-45.13%) 113 102 -11 (-9.73%) test_l4lb_noinline.bpf.linked3.o balancer_ingress 4845 4877 +32 (+0.66%) 219 221 +2 (+0.91%) test_l4lb_noinline_dynptr.bpf.linked3.o balancer_ingress 2072 2087 +15 (+0.72%) 97 98 +1 (+1.03%) test_seg6_loop.bpf.linked3.o __add_egr_x 12440 9975 -2465 (-19.82%) 364 353 -11 (-3.02%) test_tcp_hdr_options.bpf.linked3.o estab 2558 2572 +14 (+0.55%) 179 180 +1 (+0.56%) test_xdp_dynptr.bpf.linked3.o _xdp_tx_iptunnel 645 596 -49 (-7.60%) 26 24 -2 (-7.69%) test_xdp_noinline.bpf.linked3.o balancer_ingress_v6 3520 3516 -4 (-0.11%) 216 216 +0 (+0.00%) xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82661 81241 -1420 (-1.72%) 5073 5155 +82 (+1.62%) xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 84964 82297 -2667 (-3.14%) 5130 5157 +27 (+0.53%) META-INTERNAL ============= Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) -------------------------------------- --------- --------- ----------------- ---------- ---------- --------------- balancer_ingress 27925 23608 -4317 (-15.46%) 1488 1482 -6 (-0.40%) balancer_ingress 31824 27546 -4278 (-13.44%) 1658 1652 -6 (-0.36%) balancer_ingress 32213 27935 -4278 (-13.28%) 1689 1683 -6 (-0.36%) balancer_ingress 32213 27935 -4278 (-13.28%) 1689 1683 -6 (-0.36%) balancer_ingress 31824 27546 -4278 (-13.44%) 1658 1652 -6 (-0.36%) balancer_ingress 38647 29562 -9085 (-23.51%) 2069 1835 -234 (-11.31%) balancer_ingress 38647 29562 -9085 (-23.51%) 2069 1835 -234 (-11.31%) balancer_ingress 40339 30792 -9547 (-23.67%) 2193 1934 -259 (-11.81%) balancer_ingress 37321 29055 -8266 (-22.15%) 1972 1795 -177 (-8.98%) balancer_ingress 38176 29753 -8423 (-22.06%) 2008 1831 -177 (-8.81%) balancer_ingress 29193 20910 -8283 (-28.37%) 1599 1422 -177 (-11.07%) balancer_ingress 30013 21452 -8561 (-28.52%) 1645 1447 -198 (-12.04%) balancer_ingress 28691 24290 -4401 (-15.34%) 1545 1531 -14 (-0.91%) balancer_ingress 34223 28965 -5258 (-15.36%) 1984 1875 -109 (-5.49%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35868 26455 -9413 (-26.24%) 2140 1827 -313 (-14.63%) balancer_ingress 35868 26455 -9413 (-26.24%) 2140 1827 -313 (-14.63%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 35481 26158 -9323 (-26.28%) 2095 1806 -289 (-13.79%) balancer_ingress 34844 29485 -5359 (-15.38%) 2036 1918 -118 (-5.80%) fbflow_egress 3256 2652 -604 (-18.55%) 218 192 -26 (-11.93%) fbflow_ingress 1026 944 -82 (-7.99%) 70 63 -7 (-10.00%) sslwall_tc_egress 8424 7360 -1064 (-12.63%) 498 458 -40 (-8.03%) syar_accept_protect 15040 9539 -5501 (-36.58%) 364 220 -144 (-39.56%) syar_connect_tcp_v6 15036 9535 -5501 (-36.59%) 360 216 -144 (-40.00%) syar_connect_udp_v4 15039 9538 -5501 (-36.58%) 361 217 -144 (-39.89%) syar_connect_connect4_protect4 24805 15833 -8972 (-36.17%) 756 480 -276 (-36.51%) syar_lsm_file_open 167772 151813 -15959 (-9.51%) 1836 1667 -169 (-9.20%) syar_namespace_create_new 14805 9304 -5501 (-37.16%) 353 209 -144 (-40.79%) syar_python3_detect 17531 12030 -5501 (-31.38%) 391 247 -144 (-36.83%) syar_ssh_post_fork 16412 10911 -5501 (-33.52%) 405 261 -144 (-35.56%) syar_enter_execve 14728 9227 -5501 (-37.35%) 345 201 -144 (-41.74%) syar_enter_execveat 14728 9227 -5501 (-37.35%) 345 201 -144 (-41.74%) syar_exit_execve 16622 11121 -5501 (-33.09%) 376 232 -144 (-38.30%) syar_exit_execveat 16622 11121 -5501 (-33.09%) 376 232 -144 (-38.30%) syar_syscalls_kill 15288 9787 -5501 (-35.98%) 398 254 -144 (-36.18%) syar_task_enter_pivot_root 14898 9397 -5501 (-36.92%) 357 213 -144 (-40.34%) syar_syscalls_setreuid 16678 11177 -5501 (-32.98%) 429 285 -144 (-33.57%) syar_syscalls_setuid 16678 11177 -5501 (-32.98%) 429 285 -144 (-33.57%) syar_syscalls_process_vm_readv 14959 9458 -5501 (-36.77%) 364 220 -144 (-39.56%) syar_syscalls_process_vm_writev 15757 10256 -5501 (-34.91%) 390 246 -144 (-36.92%) do_uprobe 15519 10018 -5501 (-35.45%) 373 229 -144 (-38.61%) edgewall 179715 55783 -123932 (-68.96%) 12607 3999 -8608 (-68.28%) bictcp_state 7570 4131 -3439 (-45.43%) 496 269 -227 (-45.77%) cubictcp_state 7570 4131 -3439 (-45.43%) 496 269 -227 (-45.77%) tcp_rate_skb_delivered 447 272 -175 (-39.15%) 29 18 -11 (-37.93%) kprobe__bbr_set_state 4566 2615 -1951 (-42.73%) 209 124 -85 (-40.67%) kprobe__bictcp_state 4566 2615 -1951 (-42.73%) 209 124 -85 (-40.67%) inet_sock_set_state 1501 1337 -164 (-10.93%) 93 85 -8 (-8.60%) tcp_retransmit_skb 1145 981 -164 (-14.32%) 67 59 -8 (-11.94%) tcp_retransmit_synack 1183 951 -232 (-19.61%) 67 55 -12 (-17.91%) bpf_tcptuner 1459 1187 -272 (-18.64%) 99 80 -19 (-19.19%) tw_egress 801 776 -25 (-3.12%) 69 66 -3 (-4.35%) tw_ingress 795 770 -25 (-3.14%) 69 66 -3 (-4.35%) ttls_tc_ingress 19025 19383 +358 (+1.88%) 470 465 -5 (-1.06%) ttls_nat_egress 490 299 -191 (-38.98%) 33 20 -13 (-39.39%) ttls_nat_ingress 448 285 -163 (-36.38%) 32 21 -11 (-34.38%) tw_twfw_egress 511127 212071 -299056 (-58.51%) 16733 8504 -8229 (-49.18%) tw_twfw_ingress 500095 212069 -288026 (-57.59%) 16223 8504 -7719 (-47.58%) tw_twfw_tc_eg 511113 212064 -299049 (-58.51%) 16732 8504 -8228 (-49.18%) tw_twfw_tc_in 500095 212069 -288026 (-57.59%) 16223 8504 -7719 (-47.58%) tw_twfw_egress 12632 12435 -197 (-1.56%) 276 260 -16 (-5.80%) tw_twfw_ingress 12631 12454 -177 (-1.40%) 278 261 -17 (-6.12%) tw_twfw_tc_eg 12595 12435 -160 (-1.27%) 274 259 -15 (-5.47%) tw_twfw_tc_in 12631 12454 -177 (-1.40%) 278 261 -17 (-6.12%) tw_xdp_dump 266 209 -57 (-21.43%) 9 8 -1 (-11.11%) CILIUM ========= File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ------------- -------------------------------- --------- --------- ---------------- ---------- ---------- -------------- bpf_host.o cil_to_netdev 6047 4578 -1469 (-24.29%) 362 249 -113 (-31.22%) bpf_host.o handle_lxc_traffic 2227 1585 -642 (-28.83%) 156 103 -53 (-33.97%) bpf_host.o tail_handle_ipv4_from_netdev 2244 1458 -786 (-35.03%) 163 106 -57 (-34.97%) bpf_host.o tail_handle_nat_fwd_ipv4 21022 10479 -10543 (-50.15%) 1289 670 -619 (-48.02%) bpf_host.o tail_handle_nat_fwd_ipv6 15433 11375 -4058 (-26.29%) 905 643 -262 (-28.95%) bpf_host.o tail_ipv4_host_policy_ingress 2219 1367 -852 (-38.40%) 161 96 -65 (-40.37%) bpf_host.o tail_nodeport_nat_egress_ipv4 22460 19862 -2598 (-11.57%) 1469 1293 -176 (-11.98%) bpf_host.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_host.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_host.o tail_nodeport_nat_ipv6_egress 3702 3542 -160 (-4.32%) 215 205 -10 (-4.65%) bpf_lxc.o tail_handle_nat_fwd_ipv4 21022 10479 -10543 (-50.15%) 1289 670 -619 (-48.02%) bpf_lxc.o tail_handle_nat_fwd_ipv6 15433 11375 -4058 (-26.29%) 905 643 -262 (-28.95%) bpf_lxc.o tail_ipv4_ct_egress 5073 3374 -1699 (-33.49%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv4_ct_ingress 5093 3385 -1708 (-33.54%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv4_ct_ingress_policy_only 5093 3385 -1708 (-33.54%) 262 172 -90 (-34.35%) bpf_lxc.o tail_ipv6_ct_egress 4593 3878 -715 (-15.57%) 194 151 -43 (-22.16%) bpf_lxc.o tail_ipv6_ct_ingress 4606 3891 -715 (-15.52%) 194 151 -43 (-22.16%) bpf_lxc.o tail_ipv6_ct_ingress_policy_only 4606 3891 -715 (-15.52%) 194 151 -43 (-22.16%) bpf_lxc.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_lxc.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_overlay.o tail_handle_nat_fwd_ipv4 20524 10114 -10410 (-50.72%) 1271 638 -633 (-49.80%) bpf_overlay.o tail_nodeport_nat_egress_ipv4 22718 19490 -3228 (-14.21%) 1475 1275 -200 (-13.56%) bpf_overlay.o tail_nodeport_nat_ingress_ipv4 5526 3534 -1992 (-36.05%) 366 243 -123 (-33.61%) bpf_overlay.o tail_nodeport_nat_ingress_ipv6 5132 4256 -876 (-17.07%) 241 219 -22 (-9.13%) bpf_overlay.o tail_nodeport_nat_ipv6_egress 3638 3548 -90 (-2.47%) 209 203 -6 (-2.87%) bpf_overlay.o tail_rev_nodeport_lb4 4368 3820 -548 (-12.55%) 248 215 -33 (-13.31%) bpf_overlay.o tail_rev_nodeport_lb6 2867 2428 -439 (-15.31%) 167 140 -27 (-16.17%) bpf_sock.o cil_sock6_connect 1718 1703 -15 (-0.87%) 100 99 -1 (-1.00%) bpf_xdp.o tail_handle_nat_fwd_ipv4 12917 12443 -474 (-3.67%) 875 849 -26 (-2.97%) bpf_xdp.o tail_handle_nat_fwd_ipv6 13515 13264 -251 (-1.86%) 715 702 -13 (-1.82%) bpf_xdp.o tail_lb_ipv4 39492 36367 -3125 (-7.91%) 2430 2251 -179 (-7.37%) bpf_xdp.o tail_lb_ipv6 80441 78058 -2383 (-2.96%) 3647 3523 -124 (-3.40%) bpf_xdp.o tail_nodeport_ipv6_dsr 1038 901 -137 (-13.20%) 61 55 -6 (-9.84%) bpf_xdp.o tail_nodeport_nat_egress_ipv4 13027 12096 -931 (-7.15%) 868 809 -59 (-6.80%) bpf_xdp.o tail_nodeport_nat_ingress_ipv4 7617 5900 -1717 (-22.54%) 522 413 -109 (-20.88%) bpf_xdp.o tail_nodeport_nat_ingress_ipv6 7575 7395 -180 (-2.38%) 383 374 -9 (-2.35%) bpf_xdp.o tail_rev_nodeport_lb4 6808 6739 -69 (-1.01%) 403 396 -7 (-1.74%) bpf_xdp.o tail_rev_nodeport_lb6 16173 15847 -326 (-2.02%) 1010 990 -20 (-1.98%) Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysbpf: preserve constant zero when doing partial register restoreAndrii Nakryiko
[ Upstream commit e322f0bcb8d371f4606eaf141c7f967e1a79bcb7 ] Similar to special handling of STACK_ZERO, when reading 1/2/4 bytes from stack from slot that has register spilled into it and that register has a constant value zero, preserve that zero and mark spilled register as precise for that. This makes spilled const zero register and STACK_ZERO cases equivalent in their behavior. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysbpf: preserve STACK_ZERO slots on partial reg spillsAndrii Nakryiko
[ Upstream commit eaf18febd6ebc381aeb61543705148b3e28c7c47 ] Instead of always forcing STACK_ZERO slots to STACK_MISC, preserve it in situations where this is possible. E.g., when spilling register as 1/2/4-byte subslots on the stack, all the remaining bytes in the stack slot do not automatically become unknown. If we knew they contained zeroes, we can preserve those STACK_ZERO markers. Add a helper mark_stack_slot_misc(), similar to scrub_spilled_slot(), but that doesn't overwrite either STACK_INVALID nor STACK_ZERO. Note that we need to take into account possibility of being in unprivileged mode, in which case STACK_INVALID is forced to STACK_MISC for correctness, as treating STACK_INVALID as equivalent STACK_MISC is only enabled in privileged mode. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysbpf: support non-r10 register spill/fill to/from stack in precision trackingAndrii Nakryiko
[ Upstream commit 41f6f64e6999a837048b1bd13a2f8742964eca6b ] Use instruction (jump) history to record instructions that performed register spill/fill to/from stack, regardless if this was done through read-only r10 register, or any other register after copying r10 into it *and* potentially adjusting offset. To make this work reliably, we push extra per-instruction flags into instruction history, encoding stack slot index (spi) and stack frame number in extra 10 bit flags we take away from prev_idx in instruction history. We don't touch idx field for maximum performance, as it's checked most frequently during backtracking. This change removes basically the last remaining practical limitation of precision backtracking logic in BPF verifier. It fixes known deficiencies, but also opens up new opportunities to reduce number of verified states, explored in the subsequent patches. There are only three differences in selftests' BPF object files according to veristat, all in the positive direction (less states). File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) -------------------------------------- ------------- --------- --------- ------------- ---------- ---------- ------------- test_cls_redirect_dynptr.bpf.linked3.o cls_redirect 2987 2864 -123 (-4.12%) 240 231 -9 (-3.75%) xdp_synproxy_kern.bpf.linked3.o syncookie_tc 82848 82661 -187 (-0.23%) 5107 5073 -34 (-0.67%) xdp_synproxy_kern.bpf.linked3.o syncookie_xdp 85116 84964 -152 (-0.18%) 5162 5130 -32 (-0.62%) Note, I avoided renaming jmp_history to more generic insn_hist to minimize number of lines changed and potential merge conflicts between bpf and bpf-next trees. Notice also cur_hist_entry pointer reset to NULL at the beginning of instruction verification loop. This pointer avoids the problem of relying on last jump history entry's insn_idx to determine whether we already have entry for current instruction or not. It can happen that we added jump history entry because current instruction is_jmp_point(), but also we need to add instruction flags for stack access. In this case, we don't want to entries, so we need to reuse last added entry, if it is present. Relying on insn_idx comparison has the same ambiguity problem as the one that was fixed recently in [0], so we avoid that. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> [ Note: Adapted the expected log format for selftests as the map format in verifier logs was changed in commits 1db747d75b1d and 0c95c9fdb696. ] Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 daysrtmutex: Use waiter::task instead of current in remove_waiter()Keenan Dong
commit 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 upstream. remove_waiter() is used by the slowlock paths, but it is also used for proxy-lock rollback in rt_mutex_start_proxy_lock() when invoked from futex_requeue(). In the latter case waiter::task is not current, but remove_waiter() operates on current for the dequeue operation. That results in several problems: 1) the rbtree dequeue happens without waiter::task::pi_lock being held 2) the waiter task's pi_blocked_on state is not cleared, which leaves a dangling pointer primed for UAF around. 3) rt_mutex_adjust_prio_chain() operates on the wrong top priority waiter task Use waiter::task instead of current in all related operations in remove_waiter() to cure those problems. [ tglx: Fixup rt_mutex_adjust_prio_chain(), add a comment and amend the changelog ] Fixes: 8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock") Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Keenan Dong <keenanat2000@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daystaskstats: set version in TGID exit notificationsYiyang Chen
commit 16c4f0211aaa1ec1422b11b59f64f1abe9009fc0 upstream. delay accounting started populating taskstats records with a valid version field via fill_pid() and fill_tgid(). Later, commit ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") changed the TGID exit path to send the cached signal->stats aggregate directly instead of building the outgoing record through fill_tgid(). Unlike fill_tgid(), fill_tgid_exit() only accumulates accounting data and never initializes stats->version. As a result, TGID exit notifications can reach userspace with version == 0 even though PID exit notifications and TASKSTATS_CMD_GET replies carry a valid taskstats version. This is easy to reproduce with `tools/accounting/getdelays.c`. I have a small follow-up patch for that tool which: 1. increases the receive buffer/message size so the pid+tgid combined exit notification is not dropped/truncated 2. prints `stats->version`. With that patch, the reproducer is: Terminal 1: ./getdelays -d -v -l -m 0 Terminal 2: taskset -c 0 python3 -c 'import threading,time; t=threading.Thread(target=time.sleep,args=(0.1,)); t.start(); t.join()' That produces both PID and TGID exit notifications for the same process. The PID exit record reports a valid taskstats version, while the TGID exit record reports `version 0`. This patch (of 2): Set stats->version = TASKSTATS_VERSION after copying the cached TGID aggregate into the outgoing netlink payload so all taskstats records are self-describing again. Link: https://lkml.kernel.org/r/ba83d934e59edd431b693607de573eb9ca059309.1774810498.git.cyyzero16@gmail.com Fixes: ad4ecbcba728 ("[PATCH] delay accounting taskstats interface send tgid once") Signed-off-by: Yiyang Chen <cyyzero16@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de> Cc: Fan Yu <fan.yu9@zte.com.cn> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daysrandomize_kstack: Maintain kstack_offset per taskRyan Roberts
commit 37beb42560165869838e7d91724f3e629db64129 upstream. kstack_offset was previously maintained per-cpu, but this caused a couple of issues. So let's instead make it per-task. Issue 1: add_random_kstack_offset() and choose_random_kstack_offset() expected and required to be called with interrupts and preemption disabled so that it could manipulate per-cpu state. But arm64, loongarch and risc-v are calling them with interrupts and preemption enabled. I don't _think_ this causes any functional issues, but it's certainly unexpected and could lead to manipulating the wrong cpu's state, which could cause a minor performance degradation due to bouncing the cache lines. By maintaining the state per-task those functions can safely be called in preemptible context. Issue 2: add_random_kstack_offset() is called before executing the syscall and expands the stack using a previously chosen random offset. choose_random_kstack_offset() is called after executing the syscall and chooses and stores a new random offset for the next syscall. With per-cpu storage for this offset, an attacker could force cpu migration during the execution of the syscall and prevent the offset from being updated for the original cpu such that it is predictable for the next syscall on that cpu. By maintaining the state per-task, this problem goes away because the per-task random offset is updated after the syscall regardless of which cpu it is executing on. Fixes: 39218ff4c625 ("stack: Optionally randomize kernel stack offset each syscall") Closes: https://lore.kernel.org/all/dd8c37bc-795f-4c7a-9086-69e584d8ab24@arm.com/ Cc: stable@vger.kernel.org Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Link: https://patch.msgid.link/20260303150840.3789438-2-ryan.roberts@arm.com Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 daysregset: use kvzalloc() for regset_get_alloc()Douglas Anderson
commit 6b839b3b76cf17296ebd4a893841f32cae08229c upstream. While browsing through ChromeOS crash reports, I found one with an allocation failure that looked like this: chrome: page allocation failure: order:7, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=urgent,mems_allowed=0 CPU: 7 PID: 3295 Comm: chrome Not tainted 5.15.133-20574-g8044615ac35c #1 (HASH:1162 1) Hardware name: Google Lazor (rev3 - 8) with KB Backlight (DT) Call trace: ... warn_alloc+0x104/0x174 __alloc_pages+0x5f0/0x6e4 kmalloc_order+0x44/0x98 kmalloc_order_trace+0x34/0x124 __kmalloc+0x228/0x36c __regset_get+0x68/0xcc regset_get_alloc+0x1c/0x28 elf_core_dump+0x3d8/0xd8c do_coredump+0xeb8/0x1378 get_signal+0x14c/0x804 ... An order 7 allocation is (1 << 7) contiguous pages, or 512K. It's not a surprise that this allocation failed on a system that's been running for a while. More digging showed that it was fairly easy to see the order 7 allocation by just sending a SIGQUIT to chrome (or other processes) to generate a core dump. The actual amount being allocated was 279,584 bytes and it was for "core_note_type" NT_ARM_SVE. There was quite a bit of discussion [1] on the mailing lists in response to my v1 patch attempting to switch to vmalloc. The overall conclusion was that we could likely reduce the 279,584 byte allocation by quite a bit and Mark Brown has sent a patch to that effect [2]. However even with the 279,584 byte allocation gone there are still 65,552 byte allocations. These are just barely more than the 65,536 bytes and thus would require an order 5 allocation. An order 5 allocation is still something to avoid unless necessary and nothing needs the memory here to be contiguous. Change the allocation to kvzalloc() which should still be efficient for small allocations but doesn't force the memory subsystem to work hard (and maybe fail) at getting a large contiguous chunk. [1] https://lore.kernel.org/r/20240201171159.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid [2] https://lore.kernel.org/r/20240203-arm64-sve-ptrace-regset-size-v1-1-2c3ba1386b9e@kernel.org Link: https://lkml.kernel.org/r/20240205092626.v2.1.Id9ad163b60d21c9e56c2d686b0cc9083a8ba7924@changeid Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Dave Martin <Dave.Martin@arm.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Jan Kara <jack@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Brown <broonie@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Wen Yang <wen.yang@linux.dev> Signed-off-by: Sasha Levin <sashal@kernel.org>
6 dayspadata: Fix pd UAF once and for allHerbert Xu
[ Upstream commit 71203f68c7749609d7fc8ae6ad054bdedeb24f91 ] There is a race condition/UAF in padata_reorder that goes back to the initial commit. A reference count is taken at the start of the process in padata_do_parallel, and released at the end in padata_serial_worker. This reference count is (and only is) required for padata_replace to function correctly. If padata_replace is never called then there is no issue. In the function padata_reorder which serves as the core of padata, as soon as padata is added to queue->serial.list, and the associated spin lock released, that padata may be processed and the reference count on pd would go away. Fix this by getting the next padata before the squeue->serial lock is released. In order to make this possible, simplify padata_reorder by only calling it once the next padata arrives. Fixes: 16295bec6398 ("padata: Generic parallelization/serialization interface") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> [ Adjust context of padata_find_next(). Replace cpumask_next_wrap(cpu, pd->cpumask.pcpu) with cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1, false) in padata_reorder() in v6.6 according to dc5bb9b769c9 ("cpumask: deprecate cpumask_next_wrap()") and f954a2d37637 ("padata: switch padata_find_next() to using cpumask_next_wrap()") . ] Signed-off-by: Bin Lan <lanbincn@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
8 daysptrace: slightly saner 'get_dumpable()' logicLinus Torvalds
commit 31e62c2ebbfdc3fe3dbdf5e02c92a9dc67087a3a upstream. The 'dumpability' of a task is fundamentally about the memory image of the task - the concept comes from whether it can core dump or not - and makes no sense when you don't have an associated mm. And almost all users do in fact use it only for the case where the task has a mm pointer. But we have one odd special case: ptrace_may_access() uses 'dumpable' to check various other things entirely independently of the MM (typically explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for threads that no longer have a VM (and maybe never did, like most kernel threads). It's not what this flag was designed for, but it is what it is. The ptrace code does check that the uid/gid matches, so you do have to be uid-0 to see kernel thread details, but this means that the traditional "drop capabilities" model doesn't make any difference for this all. Make it all make a *bit* more sense by saying that if you don't have a MM pointer, we'll use a cached "last dumpability" flag if the thread ever had a MM (it will be zero for kernel threads since it is never set), and require a proper CAP_SYS_PTRACE capability to override. Reported-by: Qualys Security Advisory <qsa@qualys.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-27blktrace: fix __this_cpu_read/write in preemptible contextChaitanya Kulkarni
[ Upstream commit da46b5dfef48658d03347cda21532bcdbb521e67 ] tracing_record_cmdline() internally uses __this_cpu_read() and __this_cpu_write() on the per-CPU variable trace_cmdline_save, and trace_save_cmdline() explicitly asserts preemption is disabled via lockdep_assert_preemption_disabled(). These operations are only safe when preemption is off, as they were designed to be called from the scheduler context (probe_wakeup_sched_switch() / probe_wakeup()). __blk_add_trace() was calling tracing_record_cmdline(current) early in the blk_tracer path, before ring buffer reservation, from process context where preemption is fully enabled. This triggers the following using blktests/blktrace/002: blktrace/002 (blktrace ftrace corruption with sysfs trace) [failed] runtime 0.367s ... 0.437s something found in dmesg: [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 ... (See '/mnt/sda/blktests/results/nodev/blktrace/002.dmesg' for the entire message) [ 81.211018] run blktests blktrace/002 at 2026-02-25 22:24:33 [ 81.239580] null_blk: disk nullb1 created [ 81.357294] BUG: using __this_cpu_read() in preemptible [00000000] code: dd/2516 [ 81.362842] caller is tracing_record_cmdline+0x10/0x40 [ 81.362872] CPU: 16 UID: 0 PID: 2516 Comm: dd Tainted: G N 7.0.0-rc1lblk+ #84 PREEMPT(full) [ 81.362877] Tainted: [N]=TEST [ 81.362878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 [ 81.362881] Call Trace: [ 81.362884] <TASK> [ 81.362886] dump_stack_lvl+0x8d/0xb0 [ 81.362895] check_preemption_disabled+0xce/0xe0 [ 81.362902] tracing_record_cmdline+0x10/0x40 [ 81.362923] __blk_add_trace+0x307/0x5d0 [ 81.362934] ? lock_acquire+0xe0/0x300 [ 81.362940] ? iov_iter_extract_pages+0x101/0xa30 [ 81.362959] blk_add_trace_bio+0x106/0x1e0 [ 81.362968] submit_bio_noacct_nocheck+0x24b/0x3a0 [ 81.362979] ? lockdep_init_map_type+0x58/0x260 [ 81.362988] submit_bio_wait+0x56/0x90 [ 81.363009] __blkdev_direct_IO_simple+0x16c/0x250 [ 81.363026] ? __pfx_submit_bio_wait_endio+0x10/0x10 [ 81.363038] ? rcu_read_lock_any_held+0x73/0xa0 [ 81.363051] blkdev_read_iter+0xc1/0x140 [ 81.363059] vfs_read+0x20b/0x330 [ 81.363083] ksys_read+0x67/0xe0 [ 81.363090] do_syscall_64+0xbf/0xf00 [ 81.363102] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 81.363106] RIP: 0033:0x7f281906029d [ 81.363111] Code: 31 c0 e9 c6 fe ff ff 50 48 8d 3d 66 63 0a 00 e8 59 ff 01 00 66 0f 1f 84 00 00 00 00 00 80 3d 41 33 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec [ 81.363113] RSP: 002b:00007ffca127dd48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 81.363120] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281906029d [ 81.363122] RDX: 0000000000001000 RSI: 0000559f8bfae000 RDI: 0000000000000000 [ 81.363123] RBP: 0000000000001000 R08: 0000002863a10a81 R09: 00007f281915f000 [ 81.363124] R10: 00007f2818f77b60 R11: 0000000000000246 R12: 0000559f8bfae000 [ 81.363126] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000000000a [ 81.363142] </TASK> The same BUG fires from blk_add_trace_plug(), blk_add_trace_unplug(), and blk_add_trace_rq() paths as well. The purpose of tracing_record_cmdline() is to cache the task->comm for a given PID so that the trace can later resolve it. It is only meaningful when a trace event is actually being recorded. Ring buffer reservation via ring_buffer_lock_reserve() disables preemption, and preemption remains disabled until the event is committed :- __blk_add_trace() __trace_buffer_lock_reserve() __trace_buffer_lock_reserve() ring_buffer_lock_reserve() preempt_disable_notrace(); <--- With this fix blktests for blktrace pass: blktests (master) # ./check blktrace blktrace/001 (blktrace zone management command tracing) [passed] runtime 3.650s ... 3.647s blktrace/002 (blktrace ftrace corruption with sysfs trace) [passed] runtime 0.411s ... 0.384s Fixes: 7ffbd48d5cab ("tracing: Cache comms only after an event occurred") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rajani Kantha <681739313@139.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-27tracing/probe: reject non-closed empty immediate stringsPengpeng Hou
[ Upstream commit 4346be6577aaa04586167402ae87bbdbe32484a4 ] parse_probe_arg() accepts quoted immediate strings and passes the body after the opening quote to __parse_imm_string(). That helper currently computes strlen(str) and immediately dereferences str[len - 1], which underflows when the body is empty and not closed with double-quotation. Reject empty non-closed immediate strings before checking for the closing quote. Link: https://lore.kernel.org/all/20260401160315.88518-1-pengpeng@iscas.ac.cn/ Fixes: a42e3c4de964 ("tracing/probe: Add immediate string parameter support") Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11fork: defer linking file vma until vma is fully initializedMiaohe Lin
[ Upstream commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 ] Thorvald reported a WARNING [1]. And the root cause is below race: CPU 1 CPU 2 fork hugetlbfs_fallocate dup_mmap hugetlbfs_punch_hole i_mmap_lock_write(mapping); vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree. i_mmap_unlock_write(mapping); hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem! i_mmap_lock_write(mapping); hugetlb_vmdelete_list vma_interval_tree_foreach hugetlb_vma_trylock_write -- Vma_lock is cleared. tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem! hugetlb_vma_unlock_write -- Vma_lock is assigned!!! i_mmap_unlock_write(mapping); hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside i_mmap_rwsem lock while vma lock can be used in the same time. Fix this by deferring linking file vma until vma is fully initialized. Those vmas should be initialized first before they can be used. [tk: Adapted to 6.6 stable where vma_iter_bulk_store() can fail (unlike mainline which uses __mt_dup() for pre-allocation). Preserved error handling via goto fail_nomem_vmi_store. Previous backport (cec11fa2eb512) was reverted (dd782da470761) due to xfstests failures.] Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reported-by: Thorvald Natvig <thorvald@google.com> Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1] Reviewed-by: Jane Chu <jane.chu@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Tycho Andersen <tandersen@netflix.com> Cc: stable@vger.kernel.org Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Assisted-by: Claude:claude-opus-4.6 Suggested-by: David Nyström <david.nystrom@est.tech> Signed-off-by: Tugrul Kukul <tugrul.kukul@est.tech> Acked-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11bpf: reject direct access to nullable PTR_TO_BUF pointersQi Tang
[ Upstream commit b0db1accbc7395657c2b79db59fa9fae0d6656f3 ] check_mem_access() matches PTR_TO_BUF via base_type() which strips PTR_MAYBE_NULL, allowing direct dereference without a null check. Map iterator ctx->key and ctx->value are PTR_TO_BUF | PTR_MAYBE_NULL. On stop callbacks these are NULL, causing a kernel NULL dereference. Add a type_may_be_null() guard to the PTR_TO_BUF branch, matching the existing PTR_TO_BTF_ID pattern. Fixes: 20b2aff4bc15 ("bpf: Introduce MEM_RDONLY flag") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260402092923.38357-2-tpluszz77@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-11bpf: Fix regsafe() for pointers to packetAlexei Starovoitov
[ Upstream commit a8502a79e832b861e99218cbd2d8f4312d62e225 ] In case rold->reg->range == BEYOND_PKT_END && rcur->reg->range == N regsafe() may return true which may lead to current state with valid packet range not being explored. Fix the bug. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Amery Hung <ameryhung@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260331204228.26726-1-alexei.starovoitov@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-04-02futex: Clear stale exiting pointer in futex_lock_pi() retry pathDavidlohr Bueso
commit 210d36d892de5195e6766c45519dfb1e65f3eb83 upstream. Fuzzying/stressing futexes triggered: WARNING: kernel/futex/core.c:825 at wait_for_owner_exiting+0x7a/0x80, CPU#11: futex_lock_pi_s/524 When futex_lock_pi_atomic() sees the owner is exiting, it returns -EBUSY and stores a refcounted task pointer in 'exiting'. After wait_for_owner_exiting() consumes that reference, the local pointer is never reset to nil. Upon a retry, if futex_lock_pi_atomic() returns a different error, the bogus pointer is passed to wait_for_owner_exiting(). CPU0 CPU1 CPU2 futex_lock_pi(uaddr) // acquires the PI futex exit() futex_cleanup_begin() futex_state = EXITING; futex_lock_pi(uaddr) futex_lock_pi_atomic() attach_to_pi_owner() // observes EXITING *exiting = owner; // takes ref return -EBUSY wait_for_owner_exiting(-EBUSY, owner) put_task_struct(); // drops ref // exiting still points to owner goto retry; futex_lock_pi_atomic() lock_pi_update_atomic() cmpxchg(uaddr) *uaddr ^= WAITERS // whatever // value changed return -EAGAIN; wait_for_owner_exiting(-EAGAIN, exiting) // stale WARN_ON_ONCE(exiting) Fix this by resetting upon retry, essentially aligning it with requeue_pi. Fixes: 3ef240eaff36 ("futex: Prevent exit livelock") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260326001759.4129680-1-dave@stgolabs.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02tracing: Fix potential deadlock in cpu hotplug with osnoiseLuo Haiyang
[ Upstream commit 1f9885732248d22f788e4992c739a98c88ab8a55 ] The following sequence may leads deadlock in cpu hotplug: task1 task2 task3 ----- ----- ----- mutex_lock(&interface_lock) [CPU GOING OFFLINE] cpus_write_lock(); osnoise_cpu_die(); kthread_stop(task3); wait_for_completion(); osnoise_sleep(); mutex_lock(&interface_lock); cpus_read_lock(); [DEAD LOCK] Fix by swap the order of cpus_read_lock() and mutex_lock(&interface_lock). Cc: stable@vger.kernel.org Cc: <mathieu.desnoyers@efficios.com> Cc: <zhang.run@zte.com.cn> Cc: <yang.tao172@zte.com.cn> Cc: <ran.xiaokai@zte.com.cn> Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer") Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02tracing: Switch trace_osnoise.c code over to use guard() and __free()Steven Rostedt
[ Upstream commit 930d2b32c0af6895ba4c6ca6404e7f7b6dc214ed ] The osnoise_hotplug_workfn() grabs two mutexes and cpu_read_lock(). It has various gotos to handle unlocking them. Switch them over to guard() and let the compiler worry about it. The osnoise_cpus_read() has a temporary mask_str allocated and there's some gotos to make sure it gets freed on error paths. Switch that over to __free() to let the compiler worry about it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/20241225222931.517329690@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Stable-dep-of: 1f9885732248 ("tracing: Fix potential deadlock in cpu hotplug with osnoise") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-04-02alarmtimer: Fix argument order in alarm_timer_forward()Zhan Xusheng
commit 5d16467ae56343b9205caedf85e3a131e0914ad8 upstream. alarm_timer_forward() passes arguments to alarm_forward() in the wrong order: alarm_forward(alarm, timr->it_interval, now); However, alarm_forward() is defined as: u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); and uses the second argument as the current time: delta = ktime_sub(now, alarm->node.expires); Passing the interval as "now" results in incorrect delta computation, which can lead to missed expirations or incorrect overrun accounting. This issue has been present since the introduction of alarm_timer_forward(). Fix this by swapping the arguments. Fixes: e7561f1633ac ("alarmtimer: Implement forward callback") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260323061130.29991-1-zhanxusheng@xiaomi.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>