summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-06-14bpf: Raise maximum call chain depth to 16 framesAlexei Starovoitov
Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains that Rust-BPF requires and update selftests. Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-13posix-cpu-timers: Fix pid refcount leak in do_cpu_nanosleep() error pathrefs/merge-window/c00c858bfaf38e9c7aec8391ba34e0a51ccd96a2WenTao Liang
In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference via get_pid() and stores it in timer.it.cpu.pid. If the subsequent posix_cpu_timer_set() call fails, the function returns immediately without calling posix_cpu_timer_del() to release the pid reference, causing a leak. Fix it by calling posix_cpu_timer_del() before the unlock-and-return on the error path, consistent with the other exit paths in the same function. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: WenTao Liang <vulab@iscas.ac.cn> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn
2026-06-13time/jiffies: Register jiffies clocksource before usageThomas Gleixner
Teddy reported that a XEN HVM has a long boot delay, which was bisected to the recent enhancements to the negative motion detection. It turned out that the jiffies clocksource is used in early boot before it is registered, which leaves the max_delta_raw field at zero. That causes the read out to be clamped to the max delta of 0, which means time is not making progress. Cure it by ensuring that it is initialized before its first usage in timekeeping_init(). Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Teddy Astie <teddy.astie@vates.tech> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13 Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
2026-06-12bpf: Fix setting retval to -EPERM for cgroup hooks not returning errnoXu Kuohai
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets the hook return value to -EPERM if it is not a valid errno. This is correct for errno-based hooks, which return 0 on success and negative errno on failure, but wrong for boolean and void LSM hooks. Boolean LSM hooks should only return true or false, and void LSM hooks have no return value at all. Fix it by skipping setting -EPERM for hooks not returning errno. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-12bpf: Run generic devmap egress prog on private skbSun Jian
Generic XDP devmap multi redirect uses skb_clone() for intermediate destinations and sends the last destination with the original skb. This can leave multiple destinations sharing the same packet data. This becomes visible after generic devmap egress-program support was added: a devmap egress program may mutate packet data, and another destination sharing the same data can observe that mutation. Native XDP broadcast redirect does not have this issue because xdpf_clone() copies the frame data for each destination. Generic XDP should provide the same per-destination isolation before running a devmap egress program. Fix this by making cloned skbs private before running the generic devmap egress program. Use skb_copy() instead of skb_unshare() so allocation failure does not consume the skb and the existing caller error paths keep their ownership semantics. Fixes: 2ea5eabaf04a ("bpf: devmap: Implement devmap prog execution for generic XDP") Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Link: https://lore.kernel.org/r/20260612114032.244616-2-sun.jian.kdev@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-11Merge tag 'dma-mapping-7.1-2026-06-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fix from Marek Szyprowski: "Three more fixes for the DMA-mapping code, related to PCI P2PDMA, DMA debug and DMA link ranges API (Li RongQing and Jason Gunthorpe)" * tag 'dma-mapping-7.1-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: iommu/dma: Do not try to iommu_map a 0 length region in swiotlb dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segments
2026-06-11Merge branches 'pm-sleep', 'pm-powercap' and 'pm-tools'refs/merge-window/b7b9fef85ccd38cebf187688b321fb1bd2429d63Rafael J. Wysocki
Merge updates related to system sleep support, two updates of the intel_rapl power capping driver, and a pm-graph utility fix for 7.2-rc1: - Add sysctl interface for DPM watchdog timeouts (Tzung-Bi Shih) - Use complete() instead of complete_all() in device_pm_sleep_init() to avoid a false-positive warning from lockdep_assert_RT_in_threaded_ctx() when CONFIG_PROVE_RAW_LOCK_NESTING is enabled (Jiakai Xu) - Use a flexible array for CRC uncompressed buffers during hibernation image saving (Rosen Penev) - Make the LZ4 algorithm available for hibernation compression (l1rox3) - Move the preallocate_image() call during hibernation after the "prepare" phase of the "freeze" transition (Matthew Leach) - Fix a memory leak in rapl_add_package_cpuslocked() in the intel_rapl power capping driver and use sysfs_emit() in cpumask_show() in that driver (Sumeet Pawnikar, Yury Norov) - Fix ValueError when parsing incomplete device properties in the pm-graph utility (Gongwei Li) * pm-sleep: PM: dpm_watchdog: Add sysctl interface for DPM watchdog timeouts PM: hibernate: Use flexible array for CRC uncompressed buffers PM: hibernate: make LZ4 available for hibernation compression PM: sleep: Use complete() in device_pm_sleep_init() PM: hibernate: call preallocate_image() after freeze prepare * pm-powercap: powercap: intel_rapl: Use sysfs_emit() in cpumask_show() powercap: intel_rapl: Fix memory leak in rapl_add_package_cpuslocked() * pm-tools: PM: tools: pm-graph: fix ValueError when parsing incomplete device properties
2026-06-11Merge branches 'pm-cpuidle', 'pm-opp' and 'pm-qos'Rafael J. Wysocki
Merge cpuidle updates, OPP (operating performance points) updates and a PM QoS update for 7.2-rc1: - Allow the intel_idle driver to avoid exposing C-states that are redundant when PC6 is disabled (Artem Bityutskiy) - Fix memory leak and a potential race in the OPP core (Abdun Nihaal, Di Shen) - Mark Rust OPP methods as inline (Nicolás Antinori) - Fix misc device registration failure path in the PM QoS core (Yuho Choi) * pm-cpuidle: intel_idle: Drop C-states redundant when PC6 is disabled intel_idle: Introduce a helper for checking PC6 intel_idle: Add constants for MSR_PKG_CST_CONFIG_CONTROL * pm-opp: opp: rust: mark OPP methods as inline OPP: of: Fix potential memory leak in opp_parse_supplies() OPP: Fix race between OPP addition and lookup * pm-qos: PM: QoS: Fix misc device registration unwind
2026-06-11locking: Add contended_release tracepoint to sleepable locksDmitry Ilvokhin
Add the contended_release trace event. This tracepoint fires on the holder side when a contended lock is released, complementing the existing contention_begin/contention_end tracepoints which fire on the waiter side. This enables correlating lock hold time under contention with waiter events by lock address. Add trace_contended_release()/trace_call__contended_release() calls to the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore, rwsem, percpu-rwsem, and RT-specific rwbase locks. Where possible, trace_contended_release() fires before the lock is released and before the waiter is woken. For some lock types, the tracepoint fires after the release but before the wake. Making the placement consistent across all lock types is not worth the added complexity. For reader/writer locks, the tracepoint fires for every reader releasing while a writer is waiting, not only for the last reader. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com
2026-06-11locking/percpu-rwsem: Extract __percpu_up_read()Dmitry Ilvokhin
Move the percpu_up_read() slowpath out of the inline function into a new __percpu_up_read() to avoid binary size increase from adding a tracepoint to an inlined function. Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Usama Arif <usama.arif@linux.dev> Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com
2026-06-11futex: Optimize futex hash bucket access patternsPeter Zijlstra
Breno reported significant c2c HITM in a futex hash heavy workload. It turns out that the hash bucket to private hash table reverse pointer (futex_hash_bucket::priv) was to blame. Notably when the hash buckets are heavily contended, the: 'fph = bh->priv;' load in futex_hash() will typically miss and consequently become quite expensive. Since this load in particular is quite superfluous, removing it is fairly straight forward. However, removing it does not in fact achieve anything much. The pain moves to the next user, notably: futex_hash_put(). Therefore rework the whole private hash refcounting to avoid needing this back pointer (and removing it). Instead of passing around 'struct futex_hash_bucket *hb', pass around a new structure that contains it and the related 'struct futex_private_hash *fph' pointer in tandem. Funnily this turns out to remove more code than it adds and significantly improves futex hash performance (as measured by 'perf bench futex hash'): SKL dual socket 112 threads: Baseline Patched shared (16k) 1571857 1641435 + 4.4% autosize (512) 646390 903371 +39.7% -b 256 464395 587014 +26.4% -b 512 715687 995943 +39.2% -b 1024 995085 1396328 +40.3% -b 2048 1293114 1668395 +29.0% -b 4096 2124438 2240228 + 5.5% Zen3 dual socket 256 threads: Baseline Patched shared (16k) 1275840 1381279 + 8.2% autosize (512) 1252745 1482179 +18.3% -b 256 856274 955455 +11.5% -b 512 1267490 1544010 +21.8% -b 1024 1424013 1625424 +14.1% -b 2048 1505181 1669342 +10.9% -b 4096 1465993 1688932 +15.2% AMD EPYC 9D64 (Zen4, single socket) 176 threads: Baseline Patched Delta shared (16k) 1,230,599 1,368,655 +11.2% autosize (1024) 1,285,440 1,556,946 +21.1% -b 256 1,341,471 1,520,303 +13.3% -b 512 1,438,330 1,599,319 +11.2% -b 1024 1,443,772 1,622,493 +12.4% -b 2048 1,472,108 1,643,975 +11.7% -b 4096 1,333,098 1,570,897 +17.8% Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Breno Leitao <leitao@debian.org> Tested-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260610135510.GB1430057@noisy.programming.kicks-ass.net
2026-06-11sched/fair: Fix newidle vs core-schedAaron Lu
While testing Prateek's throttle series, I noticed a panic issue when coresched is enabled and bisected to this patch. I fed the panic log and this patch to an agent and its analysis looks correct to me(cpu56 and cpu57 are siblings in a VM): cpu57 (holds core-wide lock) pick_next_task() [core scheduling] for_each_cpu_wrap(i, smt_mask, 57): i=57: pick_task(rq_57) pick_task_fair(rq_57) -> picks task A rq_57->core_pick = task A // task_rq(A) == rq_57 i=56: pick_task(rq_56) pick_task_fair(rq_56) cfs_rq->nr_queued == 0 goto idle sched_balance_newidle(rq_56) raw_spin_rq_unlock(rq_56) // core-wide lock released newidle_balance() pulls task A: rq_57 -> rq_56 // task_rq(A) == rq_56 now raw_spin_rq_lock(rq_56) // core-wide lock re-acquired return > 0 goto again pick_task_fair(rq_56) -> picks task A rq_56->core_pick = task A // first loop done // rq_57->core_pick is still task A (set before lock release) // but task_rq(A) == rq_56 now next = rq_57->core_pick // = task A put_prev_set_next_task(rq_57, prev, task A) __set_next_task_fair(rq_57, task A) hrtick_start_fair(rq_57, task A) WARN_ON_ONCE(task_rq(task A) != rq_57) // task_rq(A) == rq_56 IOW: by allowing pick_task_fair() to do newidle_balance and not returning RETRY_TASK, it can end up selecting the same task on two CPUs. Restore the previous state by never doing newidle when core scheduling is enabled. Tested-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: "Aaron Lu" <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260603095108.GA1684319@bytedance.com
2026-06-10bpf: Tighten cgroup storage cookie checks for prog arraysDaniel Borkmann
The fix in commit abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") is still incomplete. The prog-array compatibility check treats a program with no cgroup storage as compatible with any stored storage cookie. This allows a storage-less program to bridge a tail call chain between an entry program and a storage-using callee even though cgroup local storage at runtime still follows the caller's context, that is, A -> B(no storage) -> C(storage) path. Requiring exact cookie equality would break the legitimate case of a storage-less leaf program being tail called from a storage-using one. Instead, only accept a zero storage cookie if the program cannot perform tail calls itself. This keeps A -> B(no storage) working while rejecting the A -> B(no storage) -> C(storage) bridge. Fixes: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") Reported-by: Lin Ma <malin89@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260610105539.705887-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-10vhost_task_create: kill unnecessary .exit_signal initializationOleg Nesterov
The only reason for this janitorial change is that this initialization adds unnecessary noise to "git grep exit_signal". args.exit_signal has no effect with CLONE_THREAD, not to mention it is zero-initialized by the compiler anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-ID: <acAIm732QPFZs15C@redhat.com>
2026-06-09bpf: Cancel special fields on map value recycleJustin Suess
Map update and delete paths currently call bpf_obj_free_fields() when a value is being replaced or recycled. That makes field destruction depend on the context of the update/delete operation. For tracing programs this can include NMI context, where referenced kptr destructors, uptr unpinning, and graph root destruction are not generally safe. Introduce bpf_obj_cancel_fields() for the reusable-value path. It only performs NMI-safe cleanup for timer, workqueue, and task_work fields. Fields that need full destruction are left attached to the recycled value and are destroyed by the final cleanup path instead. Switch array and hashtab update/delete/recycle paths to this cancel helper. Keep bpf_obj_free_fields() for final map destruction and for bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator destructors, so teardown continues to walk the normal and extra elements and fully destroy their fields. This deliberately relaxes the eager-free semantics of map update/delete for special fields. Programs that relied on a recycled map slot becoming empty immediately after update/delete were relying on behavior that cannot be implemented safely from every BPF execution context without offloading arbitrary destructors. There is a chance this change breaks programs making assumptions regarding the eager freeing of fields. If so, we can relax semantics to cancellation only when irqs_disabled() is true in the future. However, theoretically, map values that get reused eagerly already have weaker guarantees as parallel users can recreate freed fields before the new element becomes visible again. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Reject bpf_obj_drop() from tracing progsJustin Suess
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for program-allocated objects. When such an object contains NMI unsafe fields, tracing programs that can run from arbitrary instrumented context can reach that destruction from unsafe contexts, including NMI. NMI is likely one instance of this problem, and other instances would include possible unsafe reentrancy. Deferring bpf_obj_drop() is not appealing either: it would add delayed-free machinery to a release operation that otherwise has straightforward synchronous ownership semantics. Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs that may run from unsafe contexts unless every field in the object's BTF record is explicitly NMI safe. Do not reject sleepable BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI contexts that motivate the restriction. Note that while bpf_rb_root and bpf_list_head would be NMI safe on their own to free, the objects recursively held by them may not be; be conservative and just mark them as not NMI safe for now. Use a whitelist for the NMI-safe field set instead of listing only known NMI unsafe fields. Locks, async fields, unreferenced kptrs, and refcounts are known to be NMI safe because their destruction is either a no-op, simple state reset, or async cancellation. Referenced kptrs, percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future field type are rejected until audited for arbitrary tracing and NMI contexts. This is less susceptible to future changes in fields that were previously safe by exclusion, and to new fields being added without updating this check. Convert the existing recursive local-object drop success case to a syscall program in the same commit, since this verifier change makes the old tracing program form invalid. The test still exercises bpf_obj_drop() releasing a referenced task kptr from a safe program type. Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09Merge tag 'trace-rv-v7.1-rc6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull runtime verifier fixes from Steven Rostedt: - Fix reset ordering on per-task destruction Reset the task before dropping the slot instead of after, which was causing out-of-bound memory accesses. - Fix HA monitor synchronization and cleanup Ensure synchronous cleanup for HA monitors by running timer callbacks in RCU read-side critical sections and using synchronize_rcu() during destruction. - Avoid armed timers after tasks exit Add automatic cleanup for per-task HA monitors to prevent timers from firing after task exit. - Fix memory ordering for DA/HA monitors Fix race conditions during monitor start by using release-acquire semantics for the monitoring flag. - Fix initialization for DA/HA monitors Ensure monitors are not initialized relying on potentially corrupted state like the monitoring flag, that is not reset by all monitors type and may have an unknown state in monitors reusing the storage (per-task). - Fix memory safety in per-task and per-object monitors Prevent use-after-free and out-of-bounds access by synchronizing with in-flight tracepoint probes using tracepoint_synchronize_unregister() before freeing monitor storage or releasing task slots. - Adjust monitors for preemptible tracepoints Fix monitors that relied on tracepoints disabling preemption. Explicitly disable task migration when per-CPU monitors handle events to avoid accessing the wrong state and update the opid monitor logic. - Fix incorrect __user specifier usage Remove __user from a non-pointer variable in the extract_params() helper. - Fix bugs in the rv tool Ensure strings are NUL-terminated, fix substring matching in monitor searches, and improve cleanup and exit status handling. - Fix several bugs in rvgen Fix LTL literal stringification, subparsers' options handling, and suffix stripping in dot2k. * tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: verification/rvgen: Fix ltl2k writing True as a literal verification/rvgen: Fix options shared among commands verification/rvgen: Fix suffix strip in dot2k tools/rv: Fix cleanup after failed trace setup tools/rv: Fix substring match when listing container monitors tools/rv: Fix substring match bug in monitor name search tools/rv: Ensure monitor name and desc are NUL-terminated rv: Use 0 to check preemption enabled in opid rv: Prevent task migration while handling per-CPU events rv: Ensure synchronous cleanup for HA monitors rv: Add automatic cleanup handlers for per-task HA monitors rv: Do not rely on clean monitor when initialising HA rv: Fix monitor start ordering and memory ordering for monitoring flag rv: Ensure all pending probes terminate on per-obj monitor destroy rv: Prevent in-flight per-task handlers from using invalid slots rv: Reset per-task DA monitors before releasing the slot rv: Fix __user specifier usage in extract_params()
2026-06-09bpf: Allow sleepable programs to use LPM trie maps directlyVlad Poenaru
The previous change relaxed the rcu_dereference annotations in lpm_trie.c so the trie walks no longer trip lockdep when reached from a sleepable BPF program holding only rcu_read_lock_trace(). By itself that only helps tries reached as the inner map of a map-of-maps, or from the classic-RCU syscall path: a sleepable program that references an LPM trie directly is still rejected at load time by check_map_prog_compatibility(), whose sleepable whitelist omits BPF_MAP_TYPE_LPM_TRIE: Sleepable programs can only use array, hash, ringbuf and local storage maps LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period into a Tasks Trace grace period before the node -- and the value embedded in it that trie_lookup_elem() returns to the program -- is released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies on for sleepable access, so a value handed to a sleepable reader cannot be freed while the program is still running under rcu_read_lock_trace(). The writer paths take trie->lock across the walk and never relied on the RCU read-side lock to keep nodes alive. Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these programs can use LPM tries directly. Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-3-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Allow LPM map access from sleepable BPF programsVlad Poenaru
trie_lookup_elem() annotates its rcu_dereference_check() walks with only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c) resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and classic RCU readers but fails for sleepable BPF programs, which enter via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace(). trie_update_elem() and trie_delete_elem() have the same problem in a different form: they walk the trie with plain rcu_dereference(), which asserts rcu_read_lock_held() unconditionally. Both are reachable from sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem helpers, and from the syscall path under classic rcu_read_lock(). In the writer paths the trie is actually protected by trie->lock (an rqspinlock taken across the walk); we never relied on the RCU read-side lock to keep nodes alive there. A sleepable LSM hook that ends up touching an LPM trie therefore triggers lockdep on debug kernels: ============================= WARNING: suspicious RCU usage 7.1.0-... Tainted: G E ----------------------------- kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage! 1 lock held by net_tests/540: #0: (rcu_tasks_trace_srcu_struct){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x26/0x280 Call Trace: dump_stack_lvl lockdep_rcu_suspicious trie_lookup_elem bpf_prog_..._enforce_security_socket_connect bpf_trampoline_... security_socket_connect __sys_connect do_syscall_64 This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize against the trie's reclaim path -- but it spams the console once per distinct callsite on every debug kernel running a sleepable BPF LSM that touches an LPM trie, which is increasingly common. For the lookup path, switch the rcu_dereference_check() annotation from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all three contexts (classic, BH, Tasks Trace). Other map types already follow this convention. For trie_update_elem() and trie_delete_elem(), annotate the walks as rcu_dereference_protected(*p, 1) -- matching trie_free() in the same file -- since trie->lock is held across the walk. rqspinlock has no lockdep_map, so the predicate degenerates to '1' rather than lockdep_is_held(&trie->lock); the protection is real but not machine-verifiable. trie_get_next_key() also uses bare rcu_dereference() but is reachable only from the BPF syscall, which holds classic rcu_read_lock() before dispatching, so it is left untouched. Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") Cc: stable@vger.kernel.org Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-2-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09PM: QoS: Fix misc device registration unwindYuho Choi
cpu_latency_qos_init() registers cpu_dma_latency first and, when CONFIG_PM_QOS_CPU_SYSTEM_WAKEUP is enabled, registers cpu_wakeup_latency afterwards. The second registration overwrites the first return value. As a result, a failure to register cpu_dma_latency can be masked if the second registration succeeds. Conversely, if cpu_dma_latency succeeds and cpu_wakeup_latency fails, the function returns an error while leaving the first misc device registered. Return immediately on the first registration failure and deregister cpu_dma_latency if the second registration fails. Fixes: a4e6512a79d8 ("PM: QoS: Introduce a CPU system wakeup QoS limit") Signed-off-by: Yuho Choi <dbgh9129@gmail.com> Link: https://patch.msgid.link/20260608170748.82273-1-dbgh9129@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-06-09bpf: Enforce write checks for BTF pointer helper accessNuoqi Gui
check_mem_reg() verifies both read and write access for global subprogram memory arguments. When the caller register is PTR_TO_BTF_ID, check_helper_mem_access() currently forwards the access to check_ptr_to_btf_access() as BPF_READ regardless of the requested access type. This lets a BTF-backed kernel object field pointer pass the caller-side writable memory check for a global subprogram argument. The callee is then validated with a generic writable PTR_TO_MEM argument and can store through it, even though an equivalent direct BTF field store is rejected with "only read is supported". Forward the requested access type to check_ptr_to_btf_access(). This enforces existing BTF write restrictions for global subprogram memory arguments as well. Fixes: 3e30be4288b3 ("bpf: Allow helpers access trusted PTR_TO_BTF_ID.") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-1-f449cd970669@mails.tsinghua.edu.cn Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-09timers/migration: Temporarily disable per capacity hierarchiesFrederic Weisbecker
Some workloads with different CPU capacities consume more power with timer migration than before. The recently introduced per capacity hierarchies were supposed to alleviate this problem. However it appears to also regress other types of workloads, especially when plenty of capacities live together in the same machine. Disable the feature until a reasonable solution is found. Fixes: 098cbaad8e57 ("timers/migration: Split per-capacity hierarchies") Reported-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260609123356.28449-1-frederic@kernel.org Closes: https://lore.kernel.org/all/3b79338f-6cfc-4722-8062-9103db2c8ad1@arm.com
2026-06-09bpf: Validate BTF repeated field counts before expansionPaul Moses
btf_parse_struct_metas() walks user-supplied BTF during BPF_BTF_LOAD, and btf_repeat_fields() expands repeatable fields from array elements into the fixed BTF_FIELDS_MAX scratch array used by btf_parse_fields(). The remaining-capacity check performs the expanded field count calculation in u32. A malformed BTF can wrap that calculation, causing the check to pass even when the expanded field count exceeds the scratch array capacity. The following memcpy() can then write past the end of the array. Use checked addition and multiplication before copying repeated fields and reject impossible counts. Fixes: 797d73ee232d ("bpf: Check the remaining info_cnt before repeating btf fields") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260605234301.1109063-1-p@1g4.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-09sched/deadline: Use task_on_rq_migrating() helperLiang Luo
Replace the open-coded "p->on_rq == TASK_ON_RQ_MIGRATING" comparisons in enqueue_task_dl() and dequeue_task_dl() with the existing task_on_rq_migrating() helper, consistent with the rest of the scheduler code. No functional change. Signed-off-by: Liang Luo <luoliang@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260608075500.387271-1-luoliang@kylinos.cn
2026-06-09sched/core: Combine separate 'else' and 'if' statementsLiang Luo
The kernel coding style recommends using 'else if' instead of placing 'if' on a separate line after 'else'. This change makes the code consistent with the rest of the kernel codebase. Signed-off-by: Liang Luo <luoliang@kylinos.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260608071842.325159-1-luoliang@kylinos.cn
2026-06-09sched/fair: Fix cpu_util runnable_avg arithmeticHongyan Xia
If we take runnable_avg in max(runnable_avg, util_avg) in cpu_util(), we should then add or subtract task runnable_avg, but the arithmetic below is still with task util_avg. This mixes runnable_avg with util_avg which is incorrect. Fix by always doing arithmetic with runnable_avg and only take max(runnable_avg, util_avg) at the last step. Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'") Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260605094318.37931-1-hongyan.xia@transsion.com
2026-06-08mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap deviceYoungjun Park
Patch series "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device", v8. Currently, in the uswsusp path, only the swap type value is retrieved at lookup time without holding a reference. If swapoff races after the type is acquired, subsequent slot allocations operate on a stale swap device. Additionally, grabbing and releasing the swap device reference on every slot allocation is inefficient across the entire hibernation swap path. This patch series addresses these issues: - Patch 1: Fixes the swapoff race in uswsusp by pinning the swap device from the point it is looked up until the session completes. - Patch 2: Removes the overhead of per-slot reference counting in alloc/free paths and cleans up the redundant SWP_WRITEOK check. This patch (of 2): Hibernation via uswsusp (/dev/snapshot ioctls) has a race window: after selecting the resume swap area but before user space is frozen, swapoff may run and invalidate the selected swap device. Fix this by pinning the swap device with SWP_HIBERNATION while it is in use. The pin is exclusive, which is sufficient since hibernate_acquire() already prevents concurrent hibernation sessions. The kernel swsusp path (sysfs-based hibernate/resume) uses find_hibernation_swap_type() which is not affected by the pin. It freezes user space before touching swap, so swapoff cannot race. Introduce dedicated helpers: - pin_hibernation_swap_type(): Look up and pin the swap device. Used by the uswsusp path. - find_hibernation_swap_type(): Lookup without pinning. Used by the kernel swsusp path. - unpin_hibernation_swap_type(): Clear the hibernation pin. While a swap device is pinned, swapoff is prevented from proceeding. Link: https://lore.kernel.org/20260323160822.1409904-1-youngjun.park@lge.com Link: https://lore.kernel.org/20260323160822.1409904-2-youngjun.park@lge.com Signed-off-by: Youngjun Park <youngjun.park@lge.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-08kernel: param: initialize module_kset in a pure_initcallShashank Balaji
Commit "driver core: platform: set mod_name in driver registration" will set struct device_driver's mod_name member for platform driver registration. For a driver to be registered with its mod_name set, module_kset needs to be initialized, which currently happens in a subsys_initcall in param_sysfs_init(). The tegra cbb drivers register themselves before module_kset init, in a core_initcall. This works currently because lookup_or_create_module_kobject(), which dereferences module_kset via kset_find_obj(), is not called if mod_name is not set, which is the case now. So in preparation for the commit "driver core: platform: set mod_name in driver registration", move module_kset init to pure_initcall level, ensuring it happens before tegra cbb driver registration. Suggested-by: Gary Guo <gary@garyguo.net> Reviewed-by: Gary Guo <gary@garyguo.net> Co-developed-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Rahul Bukte <rahul.bukte@sony.com> Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com> Link: https://patch.msgid.link/20260601101942.4002661-1-shashank.mahadasyam@sony.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-06-08bpf: Fix NULL pointer dereference in bpf_task_from_vpid()Sechang Lim
bpf_task_from_vpid() looks up a task in the pid namespace of the current task, via find_task_by_vpid(): find_task_by_vpid(vpid) find_task_by_pid_ns(vpid, task_active_pid_ns(current)) find_pid_ns(nr, ns) -> idr_find(&ns->idr, nr) cgroup_skb programs run in softirq, which may interrupt a task that is itself in do_exit(). Once that task has passed exit_notify() -> release_task() -> __unhash_process(), its thread_pid is cleared, so task_active_pid_ns(current) returns NULL and find_pid_ns() dereferences &NULL->idr: BUG: kernel NULL pointer dereference, address: 0000000000000050 RIP: 0010:idr_find+0x11/0x30 lib/idr.c:176 Call Trace: <IRQ> find_pid_ns kernel/pid.c:370 [inline] find_task_by_pid_ns+0x3b/0xe0 kernel/pid.c:485 bpf_task_from_vpid+0x5b/0x200 kernel/bpf/helpers.c:2916 bpf_prog_run_array_cg+0x17e/0x530 kernel/bpf/cgroup.c:81 __cgroup_bpf_run_filter_skb+0x12b/0x250 kernel/bpf/cgroup.c:1612 sk_filter_trim_cap+0x1dc/0x4c0 net/core/filter.c:148 tcp_v4_rcv+0x18d1/0x2200 net/ipv4/tcp_ipv4.c:2223 </IRQ> <TASK> do_exit+0xa63/0x1270 kernel/exit.c:1010 get_signal+0x141c/0x1530 kernel/signal.c:3037 Bail out when current has no pid namespace. Fixes: 675c3596ff32 ("bpf: Add bpf_task_from_vpid() kfunc") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/bpf/20260608050001.2545245-1-rhkrqnwk98@gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-08bpf: Keep dynamic inner array lookups nullableNuoqi Gui
An ARRAY_OF_MAPS can use an array created with BPF_F_INNER_MAP as its inner map template. A concrete inner array with a different max_entries value can then replace the template. After a successful outer map lookup, the verifier represents the resulting map pointer using the inner map template. Const-key lookup nullness elision consequently uses the template max_entries even though the runtime helper uses the concrete inner map max_entries. Do not elide lookup result nullness for maps marked with BPF_F_INNER_MAP, because the template max_entries does not prove that the key is in bounds for the concrete runtime map. Fixes: d2102f2f5d75 ("bpf: verifier: Support eliding map lookup nullness") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20260607-f01-v2-v2-1-da48453146e8@mails.tsinghua.edu.cn Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-07bpf: Fix NMI/tracepoint re-entry deadlock on lru locksMykyta Yatsenko
NMI and tracepoint BPF programs can re-enter the per-CPU or global LRU lock that bpf_lru_pop_free()/push_free() already hold on the same CPU, AA-deadlocking. Lockdep reports "inconsistent {INITIAL USE} -> {IN-NMI}" on &l->lock (syzbot c69a0a2c816716f1e0d5) and "possible recursive locking detected" on &loc_l->lock (syzbot 18b26edb69b2e19f3b33). Prior trylock and rqspinlock based fixes (see links) were nacked because compromised on reliability. This patch converts every LRU lock site to rqspinlock_t and adds a recovery path for some failure windows to avoid node leaks. Failure recovery: - *_pop_free top-level: return NULL; prealloc_lru_pop() already treats that as no-free-element (-ENOMEM). - Cross-CPU steal: skip the victim's locked loc_l, try next CPU. - Post-steal local lock fail: publish stolen node to lockless per-CPU free_llist; next pop on this CPU picks it up. - push_free fail: mark node pending_free=1. __local_list_flush(), __local_list_pop_pending() reclaim the node from pending_list. __bpf_lru_list_shrink_inactive() reclaims the node from inactive list. Nodes from active list are reclaimed by __bpf_lru_list_shrink() or after __bpf_lru_list_rotate_active() demotes it to the inactive. Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: syzbot+c69a0a2c816716f1e0d5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c69a0a2c816716f1e0d5 Reported-by: syzbot+18b26edb69b2e19f3b33@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18b26edb69b2e19f3b33 Link: https://lore.kernel.org/bpf/CAPPBnEYO4R+m+SpVc2gNj_x31R6fo1uJvj2bK2YS1P09GWT6kQ@mail.gmail.com/ Link: https://lore.kernel.org/bpf/CAPPBnEZmFA3ab8Uc=PEm0bdojZy=7T_F5_+eyZSHyZR3MBG4Vw@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20251030030010.95352-1-dongml2@chinatelecom.cn/ Link: https://lore.kernel.org/bpf/20260119142120.28170-1-leon.hwang@linux.dev/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260607-lru_map_spin-v3-1-bcd9332e911b@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07Merge tag 'timers-urgent-2026-06-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: - Fix the arch_inlined_clockevent_set_next_coupled() prototype in the !CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST case (Naveen Kumar Chaudhary) - Fix an off-by-1 bug in the sys_settimeofday() usecs validation code (Naveen Kumar Chaudhary) - Mark vdso_k_*_data pointers as __ro_after_init (Thomas Weißschuh) - Fix livelock race in tmigr_handle_remote_up() (Amit Matityahu) * tag 'timers-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Fix livelock in tmigr_handle_remote_up() vdso/datastore: Mark vdso_k_*_data pointers as __ro_after_init time: Fix off-by-one in settimeofday() usec validation clockevents: Fix duplicate type specifier in stub function parameter
2026-06-07Merge tag 'locking-urgent-2026-06-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: - Fix a NULL pointer dereference bug in the FUTEX_CMP_REQUEUE_PI code (Ji'an Zhou) - Fix a NULL pointer dereference bug in the rtmutex code (Davidlohr Bueso) * tag 'locking-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: locking/rtmutex: Skip remove_waiter() when waiter is not enqueued futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock
2026-06-07bpf: Add support for tracing_multi link fdinfoJiri Olsa
Adding tracing_multi link fdinfo support with following output: pos: 0 flags: 02000000 mnt_id: 19 ino: 3087 link_type: tracing_multi link_id: 9 prog_tag: 599ba0e317244f86 prog_id: 94 attach_type: 59 cnt: 10 obj-id btf-id cookie func 1 91593 8 bpf_fentry_test1+0x4/0x10 1 91595 9 bpf_fentry_test2+0x4/0x10 1 91596 7 bpf_fentry_test3+0x4/0x20 1 91597 5 bpf_fentry_test4+0x4/0x20 1 91598 4 bpf_fentry_test5+0x4/0x20 1 91599 2 bpf_fentry_test6+0x4/0x20 1 91600 3 bpf_fentry_test7+0x4/0x10 1 91601 1 bpf_fentry_test8+0x4/0x10 1 91602 10 bpf_fentry_test9+0x4/0x10 1 91594 6 bpf_fentry_test10+0x4/0x10 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-17-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing_multi link sessionJiri Olsa
Adding support to use session attachment with tracing_multi link. Adding new BPF_TRACE_FSESSION_MULTI program attach type, that follows the BPF_TRACE_FSESSION behaviour but on the tracing_multi link. Such program is called on entry and exit of the attached function and allows to pass cookie value from entry to exit execution. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-16-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing_multi link cookiesJiri Olsa
Add support to specify cookies for tracing_multi link. Cookies are provided in array where each value is paired with provided BTF ID value with the same array index. Such cookie can be retrieved by bpf program with bpf_get_attach_cookie helper call. We need to sort cookies array together with ids array in check_dup_ids, to keep the id->cookie relation. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-15-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing multi linkJiri Olsa
Adding new link to allow to attach program to multiple function BTF IDs. The link is represented by struct bpf_tracing_multi_link. To configure the link, new fields are added to bpf_attr::link_create to pass array of BTF IDs; struct { __aligned_u64 ids; __u32 cnt; } tracing_multi; Each BTF ID represents function (BTF_KIND_FUNC) that the link will attach bpf program to. We use previously added bpf_trampoline_multi_attach/detach functions to attach/detach the link. The linkinfo/fdinfo callbacks will be implemented in following changes. Note this is supported only for archs (x86_64) with ftrace direct and have single ops support. CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS Note using sort_r (instead of plain sort) in check_dup_ids, because we will use the swap callback in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-14-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add bpf_trampoline_multi_attach/detach functionsJiri Olsa
Adding bpf_trampoline_multi_attach/detach functions that allows to attach/detach tracing program to multiple functions/trampolines. The attachment is defined with bpf_program and array of BTF ids of functions to attach the bpf program to. Adding bpf_tracing_multi_link object that holds all the attached trampolines and is initialized in attach and used in detach. The attachment allocates or uses currently existing trampoline for each function to attach and links it with the bpf program. The attach works as follows: - we get all the needed trampolines - lock them and add the bpf program to each (__bpf_trampoline_link_prog) - the trampoline_multi_ops passed in __bpf_trampoline_link_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_add/mod to update needed locations - we unlock all the trampolines The detach works as follows: - we lock all the needed trampolines - remove the program from each (__bpf_trampoline_unlink_prog) - the trampoline_multi_ops passed in __bpf_trampoline_unlink_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_del/mod to update needed locations - we unlock and put all the trampolines We store the old image/flags in the trampoline before the update and use it in case we need to rollback the attachment. We keep the ftrace_hash objects allocated during attach in the link so they can be used for detach as well. Adding trampoline_(un)lock_all functions to (un)lock all trampolines to gate the tracing_multi attachment. Note this is supported only for archs (x86_64) with ftrace direct and have single ops support. CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS It also needs CONFIG_BPF_SYSCALL enabled. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Move sleepable verification code to btf_id_allow_sleepableJiri Olsa
Move sleepable verification code to btf_id_allow_sleepable function. It will be used in following changes. Adding code to retrieve type's name instead of passing it from bpf_check_attach_target function, because this function will be called from another place in following changes and it's easier to retrieve the name directly in here. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-12-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add multi tracing attach typesJiri Olsa
Adding new program attach types multi tracing attachment: BPF_TRACE_FENTRY_MULTI BPF_TRACE_FEXIT_MULTI and their base support in verifier code. Programs with such attach type will use specific link attachment interface coming in following changes. This was suggested by Andrii some (long) time ago and turned out to be easier than having special program flag for that. Bpf programs with such types have 'bpf_multi_func' function set as their attach_btf_id and keep module reference when it's specified by attach_prog_fd. They are also accepted as sleepable programs during verification, and the real validation for specific BTF_IDs/functions will happen during the multi link attachment in following changes. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-11-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Factor fsession link to use struct bpf_tramp_nodeJiri Olsa
Now that we split trampoline attachment object (bpf_tramp_node) from the link object (bpf_tramp_link) we can use bpf_tramp_node as fsession's fexit attachment object and get rid of the bpf_fsession_link object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-10-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add struct bpf_tramp_node objectJiri Olsa
Adding struct bpf_tramp_node to decouple the link out of the trampoline attachment info. At the moment the object for attaching bpf program to the trampoline is 'struct bpf_tramp_link': struct bpf_tramp_link { struct bpf_link link; struct hlist_node tramp_hlist; u64 cookie; } The link holds the bpf_prog pointer and forces one link - one program binding logic. In following changes we want to attach program to multiple trampolines but we want to keep just one bpf_link object. Splitting struct bpf_tramp_link into: struct bpf_tramp_link { struct bpf_link link; struct bpf_tramp_node node; }; struct bpf_tramp_node { struct bpf_link *link; struct hlist_node tramp_hlist; u64 cookie; }; The 'struct bpf_tramp_link' defines standard single trampoline link and 'struct bpf_tramp_node' is the attachment trampoline object with pointer to the bpf_link object. This will allow us to define link for multiple trampolines, like: struct bpf_tracing_multi_link { struct bpf_link link; ... int nodes_cnt; struct bpf_tracing_multi_node nodes[] __counted_by(nodes_cnt); }; Cc: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-9-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add bpf_trampoline_add/remove_prog functionsJiri Olsa
Separate bpf_trampoline_add/remove_prog functions from __bpf_trampoline_link/unlink functions to be able to add/remove trampoline programs without the image being updated in following changes. No functional change is intended. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-8-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Move trampoline image setup into bpf_trampoline_ops callbacksJiri Olsa
Moving trampoline image setup into bpf_trampoline_ops callbacks, so we can have different image handling for multi attachment which is coming in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add struct bpf_trampoline_ops objectJiri Olsa
In following changes we will need to override ftrace direct attachment behaviour. In order to do that we are adding struct bpf_trampoline_ops object that defines callbacks for ftrace direct attachment: register_fentry unregister_fentry modify_fentry The new struct bpf_trampoline_ops object is passed as an argument to __bpf_trampoline_link/unlink_prog functions. At the moment the default trampoline_ops is set to the current ftrace direct attachment functions, so there's no functional change for the current code. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Use mutex lock pool for bpf trampolinesJiri Olsa
Adding mutex lock pool that replaces bpf trampolines mutex. For tracing_multi link coming in following changes we need to lock all the involved trampolines during the attachment. This could mean thousands of mutex locks, which is not convenient. As suggested by Andrii we can replace bpf trampolines mutex with mutex pool, where each trampoline is hash-ed to one of the locks from the pool. It's better to lock all the pool mutexes (32 at the moment) than thousands of them. There is 48 (MAX_LOCK_DEPTH) lock limit allowed to be simultaneously held by task, so we need to keep 32 mutexes (5 bits) in the pool, so when we lock them all in following changes the lockdep won't scream. Removing the mutex_is_locked in bpf_trampoline_put, because we removed the mutex from bpf_trampoline. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07ftrace: Add add_ftrace_hash_entry functionJiri Olsa
Renaming __add_hash_entry to add_ftrace_hash_entry and making it global, it will be used in following changes outside ftrace.c object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07ftrace: Add ftrace_hash_remove functionJiri Olsa
Adding ftrace_hash_remove function that removes all entries from struct ftrace_hash object without freeing them. It will be used in following changes where entries are allocated as part of another structure and are free-ed separately. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07ftrace: Add ftrace_hash_count functionJiri Olsa
Adding external ftrace_hash_count function so we could get hash count outside of ftrace object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Reject sleepable BPF_LSM_CGROUP programs at load timeDavid Windsor
The cgroup shim runs under rcu_read_lock_dont_migrate(), so we should not attach any sleepable BPF programs there. Add support to the verifier to explicitly reject attempts to load sleepable BPF programs destined for LSM cgroup attachment. Without this, we get the following splat from a BPF_LSM_CGROUP program marked BPF_F_SLEEPABLE attached to file_open when it calls bpf_get_dentry_xattr(): BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1567 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 34317, name: load preempt_count: 0, expected: 0 RCU nest depth: 2, expected: 0 Call Trace: down_read+0x76/0x480 ext4_xattr_get+0x11f/0x700 __vfs_getxattr+0xf0/0x150 bpf_get_dentry_xattr+0xbb/0xf0 bpf_prog_e76a298dac9218c6_test_open+0x6a/0x85 __cgroup_bpf_run_lsm_current+0x326/0x840 bpf_trampoline_6442534646+0x62/0x14d security_file_open+0x34/0x60 do_dentry_open+0x340/0x1260 vfs_open+0x7a/0x440 path_openat+0x1bac/0x30a0 libbpf provides a .s named section variant for every sleepable program type except lsm_cgroup, reflecting that per-cgroup LSM programs are intended to only run in a non-sleepable context. The above splat was obtained by bypassing libbpf by using bpf(2) directly. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: David Windsor <dwindsor@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20260605145707.608579-1-dwindsor@gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>