summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
7 daysbpf: Disable xfrm_decode_session hook attachmentBradley Morgan
BPF LSM programs can currently attach to xfrm_decode_session(). That hook may return an error, but security_skb_classify_flow() calls it from a void path and triggers BUG_ON() if an error is returned. Disable BPF attachment to the hook to prevent a BPF LSM program from turning packet classification into a full panic. Fixes: 9e4e01dfd325 ("bpf: lsm: Implement attach, detach and execution") Signed-off-by: Bradley Morgan <include@grrlz.net> Link: https://lore.kernel.org/r/20260619130305.27779-1-include@grrlz.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 daysbpf: Reset register bounds before narrowing retval range in check_mem_access()Tristan Madani
When the BPF verifier processes a context load of an LSM hook return value, it calls __mark_reg_s32_range() to narrow the register to the hook's valid range. However, __mark_reg_s32_range() intersects the new range with the register's existing bounds using max_t()/min_t() rather than replacing them. If the destination register carries stale bounds from a prior instruction (e.g. BPF_MOV64_IMM), the intersection can produce a range narrower than reality. The verifier then believes it knows the register's exact value, while at runtime the actual hook return value is loaded, creating a verifier/runtime mismatch that can be used to bypass BPF memory safety checks. The else branch already calls mark_reg_unknown() to reset register state before any narrowing. Apply the same reset in the is_retval path so stale bounds are cleared before __mark_reg_s32_range() intersects. Fixes: 5d99e198be27 ("bpf, lsm: Add check for BPF LSM return value") Cc: stable@vger.kernel.org Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260622230123.3695446-2-tristmd@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Preserve pointer spill metadata during half-slot cleanupNuoqi Gui
__clean_func_state() cleans dead stack slots in 4-byte halves. When the high half of a STACK_SPILL slot is dead and the low half remains live, cleanup converts the live low half to STACK_MISC or STACK_ZERO and clears the saved spilled_ptr metadata. That conversion is safe only for scalar spills. For a pointer spill, this metadata clear lets a later 32-bit fill from the still-live half avoid the normal non-scalar register-fill check and be treated as an ordinary scalar stack read. Leave non-scalar spill slots intact in this half-live shape. This is conservative for pruning and preserves the existing check_stack_read_fixed_off() rejection path for partial fills from pointer spills. Fixes: be23266b4a08 ("bpf: 4-byte precise clean_verifier_state") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/20260617-f01-06-half-slot-pointer-spill-v2-1-42b9cdc3cf64@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Fix effective prog array index with BPF_F_PREORDERAmery Hung
replace_effective_prog() and purge_effective_progs() located the slot in the effective array by walking the program hlist and counting entries linearly. That count does not match the array layout: compute_effective_ progs() places BPF_F_PREORDER programs at the front (ancestor cgroup first, attach order within a cgroup) and the rest after them (descendant cgroup first). So when a preorder program is present, the linear hlist position no longer equals the program's index in the effective array. For replace_effective_prog() (bpf_link_update()) this overwrote the wrong slot, corrupting the effective order. For purge_effective_progs(), it could dummy out a slot belonging to a different program and leave the detached program in the array while bpf_prog_put() drops its reference, i.e. a use-after-free. Fix both by replaying compute_effective_progs()'s placement (including the per-cgroup preorder reversal) in a shared effective_prog_pos() helper. Identify the entry by its struct bpf_prog_list pointer rather than by (prog, link) value, so the lookup resolves to exactly the attachment the syscall selected even when the same bpf_prog is attached to several cgroups in the hierarchy. Fixes: 4b82b181a26c ("bpf: Allow pre-ordering for bpf cgroup progs") Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260619063520.2690547-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Fix BPF_PROG_ASSOC_STRUCT_OPS last field checkThiébaud Weksteen
When struct prog_assoc_struct_ops was added, BPF_PROG_ASSOC_STRUCT_OPS_LAST_FIELD referenced prog_fd instead of the actual last field, flags. Fixes: b5709f6d26d6 ("bpf: Support associating BPF program with struct_ops") Signed-off-by: Thiébaud Weksteen <tweek@google.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260618040934.4113938-1-tweek@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Allow type tag BTF records to succeed other modifier recordsEmil Tsalapatis
llvm commit [1] allowed attaching type tag records to modifier BTF records. This is useful for using typedefs that encompass a base type and a type tag, e.g.: typedef struct rbtree __arena rbtree_t; Modify btf_check_type_tags() so that it allows this sequence of records. The function now only checks for record loops in BTF modifier record chains. Rename to btf_check_modifier_chain_length to reflect this. Also expand the BTF modifier traversal code to take into account that type record can be interleaved with other modifier records. In effect this means traversing all modifiers to collect the type tags. Also modify existing selftests to now accept modifier records (const, typedef) that point to type tag records. [1] https://github.com/llvm/llvm-project/pull/203089 Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260616061454.7869-1-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Emit verbose message when prog-specific btf_struct_access rejects a writeAlexei Starovoitov
When BPF_WRITE goes through a PTR_TO_BTF_ID register, check_ptr_to_btf_access() delegates to env->ops->btf_struct_access(). Most implementations (bpf_scx_btf_struct_access, tc_cls_act_btf_struct_access, etc.) return -EACCES for disallowed fields without logging anything, so the verifier rejects the program with an empty message. For example a scx program doing 1: R1=trusted_ptr_task_struct() ... 4: (7b) *(u64 *)(r1 +0) = r2 verification time 83 usec the program is rejected leaves the user guessing which field is off-limits. Emit verbose message. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260615232146.5491-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Fix build_id caching in stack_map_get_build_id_offset()Ihor Solodrai
This patch is a follow up to recent implementation of stack_map_get_build_id_offset_sleepable() [1]. stack_map_get_build_id_offset() and its sleepable variant each cached only the last successfully resolved VMA, with separate bookkeeping in each function. A run of IPs in a VMA with no usable build ID will repeat the lookup for every frame: find_vma() in the non-sleepable path, a VMA lock and a blocking build_id_parse_file() in the sleepable. Factor the per-call cache into a shared struct stack_map_build_id_cache with two independent slots [2][3], used by both functions: * resolved - last VMA that produced a build ID (file, build_id and range), reused to skip the lookup and the parse; * unresolved - last VMA with no usable build ID (range only), reused to emit a raw IP without another lookup or parse. Keeping the slots independent means a build-ID-less VMA no longer evicts the last resolved build ID, so a trace alternating between a binary and a region without one stops re-resolving the binary on every return. The shared lookup tests [vm_start, vm_end), matching the sleepable path; the non-sleepable path previously reused a build ID for ip == vm_end (range_in_vma() is inclusive) and now re-resolves it correctly. [1] https://lore.kernel.org/bpf/20260525223948.1920986-1-ihor.solodrai@linux.dev/ [2] https://lore.kernel.org/bpf/CAEf4Bza2fRDGhLQoPE-EzM7F34xaEJfi5Exmxb-iWVUN3F06=g@mail.gmail.com/ [3] https://lore.kernel.org/bpf/CAEf4BzZXJFr=1iiVx937ht=4PYQkQHg=eFk810zhMDzXQG3ihw@mail.gmail.com/ Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev> Link: https://lore.kernel.org/r/20260615195536.1065107-1-ihor.solodrai@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 daysbpf: Fix stack slot index in nospec checksNuoqi Gui
check_stack_write_fixed_off() computes the byte slot for a fixed-offset stack write as -off - 1, and records each written byte in slot_type[] with (slot - i) % BPF_REG_SIZE. The Spectre v4 sanitization pre-check uses slot_type[i] instead. For a 4-byte write at fp-8 after the lower half of fp-8 has been zeroed, the pre-check scans bytes 0..3 and sees STACK_ZERO while the actual write updates bytes 7..4. That can leave the second half-slot write without nospec_result even though the bytes being overwritten still require sanitization. Use the same slot index in the sanitization pre-check that the write path uses when updating slot_type[]. Fixes: 2039f26f3aca ("bpf: Fix leakage due to insufficient speculative store bypass mitigation") Acked-by: Luis Gerhorst <luis.gerhorst@fau.de> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/r/20260618-f01-11-stack-nospec-slot-index-v3-1-780297041721@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
9 daysMerge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: cuppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...
11 daysMerge tag 'mm-stable-2026-06-18-09-26' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "selftests/mm: clean up build output and verbosity" (Li Wang) Remove some noise from the MM selftests build - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts) Speed up the freeing of a batch of 0-order pages by first scanning them for coalescing opportunities. This is applicable to vfree() and to the releasing of frozen pages - "mm/damon: introduce DAMOS failed region quota charge ratio" (SeongJae Park) Address a DAMOS usability issue: The DAMOS quota often exhausts prematurely because it charges for all memory attempted, causing slow and inconsistent performance when actions fail on unreclaimable memory. To fix this, a new feature lets users set a smaller, flexible quota charge ratio (via a numerator and denominator) for failed regions. Since failed actions cause less overhead, reducing their quota cost ensures more predictable and efficient DAMOS processing - "selftests/cgroup: improve zswap tests robustness and support large page sizes" (Li Wang) Fix various spurious failures and improves the overall robustness of the cgroup zswap selftests - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga) Fix an issue in the mlock selftests on arm32 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao) Some maintenance work in the huge_memory code - "treewide: fixup gfp_t printks" (Brendan Jackman) Use the special vprintf() gfp_t conversion in various places - "mm: Fix vmemmap optimization accounting and initialization" (Muchun Song) Fix several bugs in the vmemmap optimization, mainly around incorrect page accounting and memmap initialization in the DAX and memory hotplug paths. It also fixes pageblock migratetype initialization and struct page initialization for ZONE_DEVICE compound pages - "mm/damon: repost non-hotfix reviewed patches in damon/next tree" A sprinkle of unrelated minor bugfixes for DAMON - "mm: remove page_mapped()" (David Hildenbrand) Remove this function from the tree, replacing it with folio_mapped() - "mm/damon: let DAMON be paused and resumed" (SeongJae Park) Allow DAMON to be paused and resumed without losing its current state - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad Usama Anjum) Simplify and speed up kasan by removing its ineffective tagging of stacks and page tables - "mm/damon/reclaim,lru_sort: monitor all system rams by default" (SeongJae Park) Simplify deployment on diverse hardware like NUMA systems by updating DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the physical address range covering all System RAM areas by default, replacing the overly restrictive behavior that only targeted the single largest memory block to save on negligible overhead - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae Park) Update some DAMON docs - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin) Switch zone->lock handling over to using the guard() mechanisms - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie) Fix a flaw where the mmap_miss counter over-credited page cache hits during fault-arounds and page-fault retries. This results in significant reduction of redundant synchronous mmap readahead I/O, drastically cutting down execution time and gigabytes read for sparse random or strided memory access workloads - "selftests/cgroup: Fix false positive failures in test_percpu_basic" (Li Wang) Fix a couple of false-positives in the cgroup kmem selftests - "mm/damon/reclaim: support monitoring intervals auto-tuning" (SeongJae Park) Add a new parameter to DAMON permitting DAMON_RECLAIM to automatically tune DAMON's sampling and aggregation intervals - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park) Change DAMON_STAT to provide the pid of its kdamond - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao) Remove large amounts of duplicated backtraces from the verbose-mode kmemleak output - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David Hildenbrand) Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to removing it entirely in a later series - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan) Prevent users from passing a non-power-of-2 value of `addr_unit', as this later results in undesirable behavior - "mm: document read_pages and simplify usage" (Frederick Mayle) - "tools/mm/page-types: Fix misc bugs" (Ye Liu) Fix three issues in tools/mm/page-types.c - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman) Implement several cleanups in the page allocator and related code - "mm, swap: swap table phase IV: unify allocation" (Kairui Song) Unify the allocation and charging of anon and shmem swap in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves performance - "mm/damon: introduce data attributes monitoring" (SeongJae Park( Extend DAMON to monitor general data attributes other than accesses - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra) Implement the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary - "mm/damon: documentation and comment fixes" (niecheng) - "remove mmap_action success, error hooks" (Lorenzo Stoakes) Eliminate custom hooks from mmap_action by removing the problematic success_hook which allowed drivers to improperly access uninitialized VMAs. It replaces the error_hook with a simple error-code field and updates the memory char driver accordingly - "mm/damon: minor improvements for code readability and tests" (SeongJae Park) - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym Shcherba) - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike Rapoport) - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and others) Clean up and slightly improves MGLRU's reclaim loop and dirty writeback handling. Large performance improvements are measured - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren Baghdasaryan) Use per-vma locks when reading /proc/pid/smaps and numa_maps similar to reduce contention on central mmap_lock - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()" (Ran Xiaokai) Some cleanup work in the THP code - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko) Fix a few build glitches in the memfd selftest code. - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel Butt) Resolve a 68% performance regression caused by NUMA-node cache thrashing around struct obj_stock_pcp by shrinking its existing fields and expanding it into a multi-slot array that caches up to five obj_cgroup pointers per CPU, allowing per-node variants of the same memcg to coexist within a single 64-byte cache line. - "zram: writeback fixes" (Sergey Senozhatsky) address a couple of unrelated zram writeback issues - "mm: switch THP shrinker to list_lru" (Johannes Weiner) Resolve NUMA-awareness issues and streamlines callsite interaction by refactoring and extending the list_lru API to completely replace the complex, open-coded deferred split queue for Transparent Huge Pages - "mm: improve large folio readahead for exec memory" (Usama Arif) Improve large-folio readahead on systems like 64K-page arm64 by preventing the mmap_miss check from permanently disabling target-oriented VM_EXEC readahead, and by generalizing the force_thp_readahead gate to support mappings with any usefully large maximum folio order under the cache cap. - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau) Fix a bunch of minor issues in the userfaultfd/pagemap, all of which were flagged by Sashiko review of proposed new material - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and vmemmap_check_pmd()" (Muchun Song) Provide generic versions of these two functions so the four arch-specific implementations can be removed. - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device" (Youngjun Park) Address a uswsusp-vs-swapoff race and reduces the swap device reference taking/releasing frequency. - "mm/hmm: A fix and a selftest" (Dev Jain) * tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries fs/proc/task_mmu: do not warn on seeing non-migration pmd entry lib/test_hmm: check alloc_page_vma() return value and handle OOM mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX mm/swap: remove redundant swap device reference in alloc/free mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device mm/filemap: use folio_next_index() for start vmalloc: fix NULL pointer dereference in is_vm_area_hugepages() sparc/mm: drop vmemmap_check_pmd helper and use generic code loongarch/mm: drop vmemmap_check_pmd helper and use generic code riscv/mm: drop vmemmap_pmd helpers and use generic code arm64/mm: drop vmemmap_pmd helpers and use generic code mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd() rust: page: mark Page::nid as inline userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks userfaultfd: gate must_wait writability check on pte_present() mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole() fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry() fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race ...
12 daystreewide: fix transposed "sign" typos and update spelling.txtShardul Deshpande
Several comments transpose the letters in "assigned" and "unsigned", spelling them with "sing" instead of "sign". Correct all of them. Of these, the misspelling of "assigned" is not yet flagged by checkpatch, so also add it to scripts/spelling.txt. The remaining matches of `grep -ri singed` are RISINGEDGE register and enum names, not typos. Link: https://lore.kernel.org/20260612181633.734458-1-iamsharduld@gmail.com Signed-off-by: Shardul Deshpande <iamsharduld@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 daysMerge tag 'bpf-next-7.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "Major changes: - Recover from BPF arena page faults using a scratch page and add ptep_try_set() for lockless empty-slot installs on x86 and arm64. This allows BPF kfuncs to access arena pointers directly. The 'arena_direct_access' stable branch was created for this work and was pulled into sched-ext and bpf-next trees (Tejun Heo, Kumar Kartikeya Dwivedi) - Lift old restriction and support 6+ arguments in BPF programs and kfuncs on x86 and arm64 (Yonghong Song, Puranjay Mohan) Other features and fixes: - Add 24-bit BTF vlen and reclaim unused bits in the BTF UAPI to ease addition of new BTF kinds (Alan Maguire) - Raise the maximum BPF call chain depth from 8 to 16 frames (Alexei Starovoitov) - Refactor object relationship tracking in the verifier and fix a dynptr use-after-free bug (Amery Hung) - Harden the signed program loader and reject exclusive maps as inner maps (Daniel Borkmann) - Replace the verifier min/max bounds fields with a circular number (cnum) representation and improve 32->64 bit range refinements (Eduard Zingerman) - Introduce the arena library and runtime (libarena) with a buddy allocator, rbtree and SPMC queue data structures, ASAN support and a parallel test harness. Allow subprograms to return arena pointers and switch to a BTF type-tag based __arena annotation (Emil Tsalapatis) - Cache build IDs in the sleepable stackmap path and avoid faultable build ID reads under mm locks (Ihor Solodrai) - Introduce the tracing_multi link to attach a single BPF program to many kernel functions at once. Allow specifying the uprobe_multi target via FD (Jiri Olsa) - Extend the bpf_list family of kfuncs with bpf_list_add/del(), and bpf_list_is_first/is_last/empty() (Kaitao Cheng) - Extend the BPF syscall with common attributes support for prog_load, btf_load and map_create (Leon Hwang) - Wrap rhashtable as BPF map (Mykyta Yatsenko, Herbert Xu) - Add sleepable support for tracepoint programs and fix deadlocks in LRU map due to NMI reentry (Mykyta Yatsenko) - Fix OOB access in bpf_flow_keys, fix nullness analysis of inner arrays, enforce write checks for global subprograms (Nuoqi Gui) - Report the maximum combined stack depth and print a breakdown of instructions processed per subprogram (Paul Chaignon) - Add an XDP load-balancer benchmark and arm64 JIT support for stack arguments (Puranjay Mohan) - Add kfuncs to traverse over wakeup_sources (Samuel Wu) - Allow sleepable BPF programs to use LPM trie maps directly (Vlad Poenaru) - Many more fixes and cleanups across the verifier, BTF, sockmap, devmap, bpffs, security hooks, s390/riscv/loongarch JITs, rqspinlock, libbpf, bpftool, selftests" * tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (336 commits) selftests/bpf: Work around llvm stack overflow in crypto progs selftests/bpf: add test for bpf_msg_pop_data() overflow bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check sockmap: Fix use-after-free in udp_bpf_recvmsg() bpf, sockmap: keep sk_msg copy state in sync bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data() bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data() selftsets/bpf: Retry map update on helper_fill_hashmap() selftests/bpf: Add test for sleepable lsm_cgroup rejection selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket selftests/bpf: Avoid static LLVM linking for cross builds selftests/bpf: Use common CFLAGS for urandom_read selftests/bpf: Initialize operation name before use tools/bpf: build: Append extra cflags libbpf: Initialize CFLAGS before including Makefile.include bpftool: Append extra host flags bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS bpftool: Pass host flags to bootstrap libbpf selftests/bpf: correct CONFIG_PPC64 macro name in comment ...
2026-06-14bpf: Add support to specify uprobe_multi target via file descriptorJiri Olsa
Allow uprobe_multi link to identify the target binary by an already opened file descriptor. Adding new BPF_F_UPROBE_MULTI_PATH_FD flag and the path_fd field for the attr.link_create.uprobe_multi struct. When the flag is set, we resolve the target from path_fd, without the flag, we keep the existing string path behavior. I don't see a use case for supporting O_PATH file descriptors, because we need to read the binary first to get probes offsets, so I'm using the CLASS(fd, f), which fails for O_PATH fds. Assisted-by: Codex:GPT-5.4 Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260611114230.950379-4-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-15Merge tag 'vfs-7.2-rc1.xattr' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull simple_xattr updates from Christian Brauner: "This reworks the simple xattr api to make it more efficient and easier to use for all consumers. The simple_xattr hash table moves from the inode into a per-superblock cache, removing the per-inode overhead for the common case of few or no xattrs. The interface now passes struct simple_xattrs ** so lazy allocation is handled internally instead of by every caller, kernfs xattr operations on kernfs nodes shared between multiple superblocks are properly serialized, and tmpfs constructs "security.foo" xattr names with kasprintf() instead of kmalloc() plus two memcpy()s. A follow-up fix links kernfs nodes to their parent before the LSM init hook runs: with the per-sb cache kernfs_xattr_set() computes the cache via kernfs_root(kn), which faulted on a freshly allocated node when selinux_kernfs_init_security() called into it - reproducible as a NULL pointer dereference on the first cgroup mkdir on SELinux-enabled systems. On top of this bpffs gains support for trusted.* and security.* xattrs so that user space and BPF LSM programs can attach metadata - for example a content hash or a security label - to pinned objects and directories and inspect it uniformly like on other filesystems. The store is in-memory and non-persistent, living only for the lifetime of the mount like everything else in bpffs" * tag 'vfs-7.2-rc1.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: bpf: Add simple xattr support to bpffs kernfs: link kn to its parent before the LSM init hook simpe_xattr: use per-sb cache simple_xattr: change interface to pass struct simple_xattrs ** tmpfs: simplify constructing "security.foo" xattr names kernfs: fix xattr race condition with multiple superblocks
2026-06-14bpf: Raise maximum call chain depth to 16 framesAlexei Starovoitov
Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains that Rust-BPF requires and update selftests. Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-12bpf: Fix setting retval to -EPERM for cgroup hooks not returning errnoXu Kuohai
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets the hook return value to -EPERM if it is not a valid errno. This is correct for errno-based hooks, which return 0 on success and negative errno on failure, but wrong for boolean and void LSM hooks. Boolean LSM hooks should only return true or false, and void LSM hooks have no return value at all. Fix it by skipping setting -EPERM for hooks not returning errno. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-12bpf: Run generic devmap egress prog on private skbSun Jian
Generic XDP devmap multi redirect uses skb_clone() for intermediate destinations and sends the last destination with the original skb. This can leave multiple destinations sharing the same packet data. This becomes visible after generic devmap egress-program support was added: a devmap egress program may mutate packet data, and another destination sharing the same data can observe that mutation. Native XDP broadcast redirect does not have this issue because xdpf_clone() copies the frame data for each destination. Generic XDP should provide the same per-destination isolation before running a devmap egress program. Fix this by making cloned skbs private before running the generic devmap egress program. Use skb_copy() instead of skb_unshare() so allocation failure does not consume the skb and the existing caller error paths keep their ownership semantics. Fixes: 2ea5eabaf04a ("bpf: devmap: Implement devmap prog execution for generic XDP") Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev> Suggested-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com> Link: https://lore.kernel.org/r/20260612114032.244616-2-sun.jian.kdev@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-10bpf: Tighten cgroup storage cookie checks for prog arraysDaniel Borkmann
The fix in commit abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") is still incomplete. The prog-array compatibility check treats a program with no cgroup storage as compatible with any stored storage cookie. This allows a storage-less program to bridge a tail call chain between an entry program and a storage-using callee even though cgroup local storage at runtime still follows the caller's context, that is, A -> B(no storage) -> C(storage) path. Requiring exact cookie equality would break the legitimate case of a storage-less leaf program being tail called from a storage-using one. Instead, only accept a zero storage cookie if the program cannot perform tail calls itself. This keeps A -> B(no storage) working while rejecting the A -> B(no storage) -> C(storage) bridge. Fixes: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage") Reported-by: Lin Ma <malin89@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260610105539.705887-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Cancel special fields on map value recycleJustin Suess
Map update and delete paths currently call bpf_obj_free_fields() when a value is being replaced or recycled. That makes field destruction depend on the context of the update/delete operation. For tracing programs this can include NMI context, where referenced kptr destructors, uptr unpinning, and graph root destruction are not generally safe. Introduce bpf_obj_cancel_fields() for the reusable-value path. It only performs NMI-safe cleanup for timer, workqueue, and task_work fields. Fields that need full destruction are left attached to the recycled value and are destroyed by the final cleanup path instead. Switch array and hashtab update/delete/recycle paths to this cancel helper. Keep bpf_obj_free_fields() for final map destruction and for bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator destructors, so teardown continues to walk the normal and extra elements and fully destroy their fields. This deliberately relaxes the eager-free semantics of map update/delete for special fields. Programs that relied on a recycled map slot becoming empty immediately after update/delete were relying on behavior that cannot be implemented safely from every BPF execution context without offloading arbitrary destructors. There is a chance this change breaks programs making assumptions regarding the eager freeing of fields. If so, we can relax semantics to cancellation only when irqs_disabled() is true in the future. However, theoretically, map values that get reused eagerly already have weaker guarantees as parallel users can recreate freed fields before the new element becomes visible again. Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Reject bpf_obj_drop() from tracing progsJustin Suess
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for program-allocated objects. When such an object contains NMI unsafe fields, tracing programs that can run from arbitrary instrumented context can reach that destruction from unsafe contexts, including NMI. NMI is likely one instance of this problem, and other instances would include possible unsafe reentrancy. Deferring bpf_obj_drop() is not appealing either: it would add delayed-free machinery to a release operation that otherwise has straightforward synchronous ownership semantics. Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs that may run from unsafe contexts unless every field in the object's BTF record is explicitly NMI safe. Do not reject sleepable BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI contexts that motivate the restriction. Note that while bpf_rb_root and bpf_list_head would be NMI safe on their own to free, the objects recursively held by them may not be; be conservative and just mark them as not NMI safe for now. Use a whitelist for the NMI-safe field set instead of listing only known NMI unsafe fields. Locks, async fields, unreferenced kptrs, and refcounts are known to be NMI safe because their destruction is either a no-op, simple state reset, or async cancellation. Referenced kptrs, percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future field type are rejected until audited for arbitrary tracing and NMI contexts. This is less susceptible to future changes in fields that were previously safe by exclusion, and to new fields being added without updating this check. Convert the existing recursive local-object drop success case to a syscall program in the same commit, since this verifier change makes the old tracing program form invalid. The test still exercises bpf_obj_drop() releasing a referenced task kptr from a safe program type. Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop") Signed-off-by: Justin Suess <utilityemal77@gmail.com> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Allow sleepable programs to use LPM trie maps directlyVlad Poenaru
The previous change relaxed the rcu_dereference annotations in lpm_trie.c so the trie walks no longer trip lockdep when reached from a sleepable BPF program holding only rcu_read_lock_trace(). By itself that only helps tries reached as the inner map of a map-of-maps, or from the classic-RCU syscall path: a sleepable program that references an LPM trie directly is still rejected at load time by check_map_prog_compatibility(), whose sleepable whitelist omits BPF_MAP_TYPE_LPM_TRIE: Sleepable programs can only use array, hash, ringbuf and local storage maps LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period into a Tasks Trace grace period before the node -- and the value embedded in it that trie_lookup_elem() returns to the program -- is released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies on for sleepable access, so a value handed to a sleepable reader cannot be freed while the program is still running under rcu_read_lock_trace(). The writer paths take trie->lock across the walk and never relied on the RCU read-side lock to keep nodes alive. Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these programs can use LPM tries directly. Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-3-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Allow LPM map access from sleepable BPF programsVlad Poenaru
trie_lookup_elem() annotates its rcu_dereference_check() walks with only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c) resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and classic RCU readers but fails for sleepable BPF programs, which enter via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace(). trie_update_elem() and trie_delete_elem() have the same problem in a different form: they walk the trie with plain rcu_dereference(), which asserts rcu_read_lock_held() unconditionally. Both are reachable from sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem helpers, and from the syscall path under classic rcu_read_lock(). In the writer paths the trie is actually protected by trie->lock (an rqspinlock taken across the walk); we never relied on the RCU read-side lock to keep nodes alive there. A sleepable LSM hook that ends up touching an LPM trie therefore triggers lockdep on debug kernels: ============================= WARNING: suspicious RCU usage 7.1.0-... Tainted: G E ----------------------------- kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage! 1 lock held by net_tests/540: #0: (rcu_tasks_trace_srcu_struct){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x26/0x280 Call Trace: dump_stack_lvl lockdep_rcu_suspicious trie_lookup_elem bpf_prog_..._enforce_security_socket_connect bpf_trampoline_... security_socket_connect __sys_connect do_syscall_64 This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize against the trie's reclaim path -- but it spams the console once per distinct callsite on every debug kernel running a sleepable BPF LSM that touches an LPM trie, which is increasingly common. For the lookup path, switch the rcu_dereference_check() annotation from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all three contexts (classic, BH, Tasks Trace). Other map types already follow this convention. For trie_update_elem() and trie_delete_elem(), annotate the walks as rcu_dereference_protected(*p, 1) -- matching trie_free() in the same file -- since trie->lock is held across the walk. rqspinlock has no lockdep_map, so the predicate degenerates to '1' rather than lockdep_is_held(&trie->lock); the protection is real but not machine-verifiable. trie_get_next_key() also uses bare rcu_dereference() but is reachable only from the BPF syscall, which holds classic rcu_read_lock() before dispatching, so it is left untouched. Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context") Cc: stable@vger.kernel.org Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260609135558.193287-2-vlad.wing@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-09bpf: Enforce write checks for BTF pointer helper accessNuoqi Gui
check_mem_reg() verifies both read and write access for global subprogram memory arguments. When the caller register is PTR_TO_BTF_ID, check_helper_mem_access() currently forwards the access to check_ptr_to_btf_access() as BPF_READ regardless of the requested access type. This lets a BTF-backed kernel object field pointer pass the caller-side writable memory check for a global subprogram argument. The callee is then validated with a generic writable PTR_TO_MEM argument and can store through it, even though an equivalent direct BTF field store is rejected with "only read is supported". Forward the requested access type to check_ptr_to_btf_access(). This enforces existing BTF write restrictions for global subprogram memory arguments as well. Fixes: 3e30be4288b3 ("bpf: Allow helpers access trusted PTR_TO_BTF_ID.") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-1-f449cd970669@mails.tsinghua.edu.cn Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-09bpf: Validate BTF repeated field counts before expansionPaul Moses
btf_parse_struct_metas() walks user-supplied BTF during BPF_BTF_LOAD, and btf_repeat_fields() expands repeatable fields from array elements into the fixed BTF_FIELDS_MAX scratch array used by btf_parse_fields(). The remaining-capacity check performs the expanded field count calculation in u32. A malformed BTF can wrap that calculation, causing the check to pass even when the expanded field count exceeds the scratch array capacity. The following memcpy() can then write past the end of the array. Use checked addition and multiplication before copying repeated fields and reject impossible counts. Fixes: 797d73ee232d ("bpf: Check the remaining info_cnt before repeating btf fields") Cc: stable@vger.kernel.org Signed-off-by: Paul Moses <p@1g4.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260605234301.1109063-1-p@1g4.org Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-08bpf: Fix NULL pointer dereference in bpf_task_from_vpid()Sechang Lim
bpf_task_from_vpid() looks up a task in the pid namespace of the current task, via find_task_by_vpid(): find_task_by_vpid(vpid) find_task_by_pid_ns(vpid, task_active_pid_ns(current)) find_pid_ns(nr, ns) -> idr_find(&ns->idr, nr) cgroup_skb programs run in softirq, which may interrupt a task that is itself in do_exit(). Once that task has passed exit_notify() -> release_task() -> __unhash_process(), its thread_pid is cleared, so task_active_pid_ns(current) returns NULL and find_pid_ns() dereferences &NULL->idr: BUG: kernel NULL pointer dereference, address: 0000000000000050 RIP: 0010:idr_find+0x11/0x30 lib/idr.c:176 Call Trace: <IRQ> find_pid_ns kernel/pid.c:370 [inline] find_task_by_pid_ns+0x3b/0xe0 kernel/pid.c:485 bpf_task_from_vpid+0x5b/0x200 kernel/bpf/helpers.c:2916 bpf_prog_run_array_cg+0x17e/0x530 kernel/bpf/cgroup.c:81 __cgroup_bpf_run_filter_skb+0x12b/0x250 kernel/bpf/cgroup.c:1612 sk_filter_trim_cap+0x1dc/0x4c0 net/core/filter.c:148 tcp_v4_rcv+0x18d1/0x2200 net/ipv4/tcp_ipv4.c:2223 </IRQ> <TASK> do_exit+0xa63/0x1270 kernel/exit.c:1010 get_signal+0x141c/0x1530 kernel/signal.c:3037 Bail out when current has no pid namespace. Fixes: 675c3596ff32 ("bpf: Add bpf_task_from_vpid() kfunc") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/bpf/20260608050001.2545245-1-rhkrqnwk98@gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-08bpf: Keep dynamic inner array lookups nullableNuoqi Gui
An ARRAY_OF_MAPS can use an array created with BPF_F_INNER_MAP as its inner map template. A concrete inner array with a different max_entries value can then replace the template. After a successful outer map lookup, the verifier represents the resulting map pointer using the inner map template. Const-key lookup nullness elision consequently uses the template max_entries even though the runtime helper uses the concrete inner map max_entries. Do not elide lookup result nullness for maps marked with BPF_F_INNER_MAP, because the template max_entries does not prove that the key is in bounds for the concrete runtime map. Fixes: d2102f2f5d75 ("bpf: verifier: Support eliding map lookup nullness") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20260607-f01-v2-v2-1-da48453146e8@mails.tsinghua.edu.cn Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-07bpf: Fix NMI/tracepoint re-entry deadlock on lru locksMykyta Yatsenko
NMI and tracepoint BPF programs can re-enter the per-CPU or global LRU lock that bpf_lru_pop_free()/push_free() already hold on the same CPU, AA-deadlocking. Lockdep reports "inconsistent {INITIAL USE} -> {IN-NMI}" on &l->lock (syzbot c69a0a2c816716f1e0d5) and "possible recursive locking detected" on &loc_l->lock (syzbot 18b26edb69b2e19f3b33). Prior trylock and rqspinlock based fixes (see links) were nacked because compromised on reliability. This patch converts every LRU lock site to rqspinlock_t and adds a recovery path for some failure windows to avoid node leaks. Failure recovery: - *_pop_free top-level: return NULL; prealloc_lru_pop() already treats that as no-free-element (-ENOMEM). - Cross-CPU steal: skip the victim's locked loc_l, try next CPU. - Post-steal local lock fail: publish stolen node to lockless per-CPU free_llist; next pop on this CPU picks it up. - push_free fail: mark node pending_free=1. __local_list_flush(), __local_list_pop_pending() reclaim the node from pending_list. __bpf_lru_list_shrink_inactive() reclaims the node from inactive list. Nodes from active list are reclaimed by __bpf_lru_list_shrink() or after __bpf_lru_list_rotate_active() demotes it to the inactive. Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: syzbot+c69a0a2c816716f1e0d5@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c69a0a2c816716f1e0d5 Reported-by: syzbot+18b26edb69b2e19f3b33@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18b26edb69b2e19f3b33 Link: https://lore.kernel.org/bpf/CAPPBnEYO4R+m+SpVc2gNj_x31R6fo1uJvj2bK2YS1P09GWT6kQ@mail.gmail.com/ Link: https://lore.kernel.org/bpf/CAPPBnEZmFA3ab8Uc=PEm0bdojZy=7T_F5_+eyZSHyZR3MBG4Vw@mail.gmail.com/ Link: https://lore.kernel.org/bpf/20251030030010.95352-1-dongml2@chinatelecom.cn/ Link: https://lore.kernel.org/bpf/20260119142120.28170-1-leon.hwang@linux.dev/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260607-lru_map_spin-v3-1-bcd9332e911b@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing_multi link sessionJiri Olsa
Adding support to use session attachment with tracing_multi link. Adding new BPF_TRACE_FSESSION_MULTI program attach type, that follows the BPF_TRACE_FSESSION behaviour but on the tracing_multi link. Such program is called on entry and exit of the attached function and allows to pass cookie value from entry to exit execution. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-16-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing_multi link cookiesJiri Olsa
Add support to specify cookies for tracing_multi link. Cookies are provided in array where each value is paired with provided BTF ID value with the same array index. Such cookie can be retrieved by bpf program with bpf_get_attach_cookie helper call. We need to sort cookies array together with ids array in check_dup_ids, to keep the id->cookie relation. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-15-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add support for tracing multi linkJiri Olsa
Adding new link to allow to attach program to multiple function BTF IDs. The link is represented by struct bpf_tracing_multi_link. To configure the link, new fields are added to bpf_attr::link_create to pass array of BTF IDs; struct { __aligned_u64 ids; __u32 cnt; } tracing_multi; Each BTF ID represents function (BTF_KIND_FUNC) that the link will attach bpf program to. We use previously added bpf_trampoline_multi_attach/detach functions to attach/detach the link. The linkinfo/fdinfo callbacks will be implemented in following changes. Note this is supported only for archs (x86_64) with ftrace direct and have single ops support. CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS Note using sort_r (instead of plain sort) in check_dup_ids, because we will use the swap callback in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-14-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add bpf_trampoline_multi_attach/detach functionsJiri Olsa
Adding bpf_trampoline_multi_attach/detach functions that allows to attach/detach tracing program to multiple functions/trampolines. The attachment is defined with bpf_program and array of BTF ids of functions to attach the bpf program to. Adding bpf_tracing_multi_link object that holds all the attached trampolines and is initialized in attach and used in detach. The attachment allocates or uses currently existing trampoline for each function to attach and links it with the bpf program. The attach works as follows: - we get all the needed trampolines - lock them and add the bpf program to each (__bpf_trampoline_link_prog) - the trampoline_multi_ops passed in __bpf_trampoline_link_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_add/mod to update needed locations - we unlock all the trampolines The detach works as follows: - we lock all the needed trampolines - remove the program from each (__bpf_trampoline_unlink_prog) - the trampoline_multi_ops passed in __bpf_trampoline_unlink_prog gathers ftrace_hash (ip -> trampoline) objects - we call update_ftrace_direct_del/mod to update needed locations - we unlock and put all the trampolines We store the old image/flags in the trampoline before the update and use it in case we need to rollback the attachment. We keep the ftrace_hash objects allocated during attach in the link so they can be used for detach as well. Adding trampoline_(un)lock_all functions to (un)lock all trampolines to gate the tracing_multi attachment. Note this is supported only for archs (x86_64) with ftrace direct and have single ops support. CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS && CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS It also needs CONFIG_BPF_SYSCALL enabled. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-13-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Move sleepable verification code to btf_id_allow_sleepableJiri Olsa
Move sleepable verification code to btf_id_allow_sleepable function. It will be used in following changes. Adding code to retrieve type's name instead of passing it from bpf_check_attach_target function, because this function will be called from another place in following changes and it's easier to retrieve the name directly in here. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-12-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add multi tracing attach typesJiri Olsa
Adding new program attach types multi tracing attachment: BPF_TRACE_FENTRY_MULTI BPF_TRACE_FEXIT_MULTI and their base support in verifier code. Programs with such attach type will use specific link attachment interface coming in following changes. This was suggested by Andrii some (long) time ago and turned out to be easier than having special program flag for that. Bpf programs with such types have 'bpf_multi_func' function set as their attach_btf_id and keep module reference when it's specified by attach_prog_fd. They are also accepted as sleepable programs during verification, and the real validation for specific BTF_IDs/functions will happen during the multi link attachment in following changes. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-11-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Factor fsession link to use struct bpf_tramp_nodeJiri Olsa
Now that we split trampoline attachment object (bpf_tramp_node) from the link object (bpf_tramp_link) we can use bpf_tramp_node as fsession's fexit attachment object and get rid of the bpf_fsession_link object. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-10-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add struct bpf_tramp_node objectJiri Olsa
Adding struct bpf_tramp_node to decouple the link out of the trampoline attachment info. At the moment the object for attaching bpf program to the trampoline is 'struct bpf_tramp_link': struct bpf_tramp_link { struct bpf_link link; struct hlist_node tramp_hlist; u64 cookie; } The link holds the bpf_prog pointer and forces one link - one program binding logic. In following changes we want to attach program to multiple trampolines but we want to keep just one bpf_link object. Splitting struct bpf_tramp_link into: struct bpf_tramp_link { struct bpf_link link; struct bpf_tramp_node node; }; struct bpf_tramp_node { struct bpf_link *link; struct hlist_node tramp_hlist; u64 cookie; }; The 'struct bpf_tramp_link' defines standard single trampoline link and 'struct bpf_tramp_node' is the attachment trampoline object with pointer to the bpf_link object. This will allow us to define link for multiple trampolines, like: struct bpf_tracing_multi_link { struct bpf_link link; ... int nodes_cnt; struct bpf_tracing_multi_node nodes[] __counted_by(nodes_cnt); }; Cc: Hengqi Chen <hengqi.chen@gmail.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-9-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add bpf_trampoline_add/remove_prog functionsJiri Olsa
Separate bpf_trampoline_add/remove_prog functions from __bpf_trampoline_link/unlink functions to be able to add/remove trampoline programs without the image being updated in following changes. No functional change is intended. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-8-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Move trampoline image setup into bpf_trampoline_ops callbacksJiri Olsa
Moving trampoline image setup into bpf_trampoline_ops callbacks, so we can have different image handling for multi attachment which is coming in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-7-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Add struct bpf_trampoline_ops objectJiri Olsa
In following changes we will need to override ftrace direct attachment behaviour. In order to do that we are adding struct bpf_trampoline_ops object that defines callbacks for ftrace direct attachment: register_fentry unregister_fentry modify_fentry The new struct bpf_trampoline_ops object is passed as an argument to __bpf_trampoline_link/unlink_prog functions. At the moment the default trampoline_ops is set to the current ftrace direct attachment functions, so there's no functional change for the current code. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-6-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Use mutex lock pool for bpf trampolinesJiri Olsa
Adding mutex lock pool that replaces bpf trampolines mutex. For tracing_multi link coming in following changes we need to lock all the involved trampolines during the attachment. This could mean thousands of mutex locks, which is not convenient. As suggested by Andrii we can replace bpf trampolines mutex with mutex pool, where each trampoline is hash-ed to one of the locks from the pool. It's better to lock all the pool mutexes (32 at the moment) than thousands of them. There is 48 (MAX_LOCK_DEPTH) lock limit allowed to be simultaneously held by task, so we need to keep 32 mutexes (5 bits) in the pool, so when we lock them all in following changes the lockdep won't scream. Removing the mutex_is_locked in bpf_trampoline_put, because we removed the mutex from bpf_trampoline. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260606123955.345967-5-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-07bpf: Reject sleepable BPF_LSM_CGROUP programs at load timeDavid Windsor
The cgroup shim runs under rcu_read_lock_dont_migrate(), so we should not attach any sleepable BPF programs there. Add support to the verifier to explicitly reject attempts to load sleepable BPF programs destined for LSM cgroup attachment. Without this, we get the following splat from a BPF_LSM_CGROUP program marked BPF_F_SLEEPABLE attached to file_open when it calls bpf_get_dentry_xattr(): BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1567 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 34317, name: load preempt_count: 0, expected: 0 RCU nest depth: 2, expected: 0 Call Trace: down_read+0x76/0x480 ext4_xattr_get+0x11f/0x700 __vfs_getxattr+0xf0/0x150 bpf_get_dentry_xattr+0xbb/0xf0 bpf_prog_e76a298dac9218c6_test_open+0x6a/0x85 __cgroup_bpf_run_lsm_current+0x326/0x840 bpf_trampoline_6442534646+0x62/0x14d security_file_open+0x34/0x60 do_dentry_open+0x340/0x1260 vfs_open+0x7a/0x440 path_openat+0x1bac/0x30a0 libbpf provides a .s named section variant for every sleepable program type except lsm_cgroup, reflecting that per-cgroup LSM programs are intended to only run in a non-sleepable context. The above splat was obtained by bypassing libbpf by using bpf(2) directly. Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Signed-off-by: David Windsor <dwindsor@gmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/bpf/20260605145707.608579-1-dwindsor@gmail.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
2026-06-06bpf: Fold reg->var_off into PTR_TO_FLOW_KEYS bounds checkNuoqi Gui
Constant pointer arithmetic on a PTR_TO_FLOW_KEYS register lands the constant in reg->var_off (e.g. flow_keys(imm=4096)), but the PTR_TO_FLOW_KEYS path in check_mem_access() passes only insn->off to check_flow_keys_access() and never folds reg->var_off.value. The verifier therefore accepts an access that, at runtime, dereferences past struct bpf_flow_keys -- a verifier/runtime divergence that yields an out-of-bounds read and write of kernel stack memory. Commit 022ac0750883 ("bpf: use reg->var_off instead of reg->off for pointers") removed the generic "off += reg->off" that check_mem_access() applied before the per-type dispatch and replaced it with per-path folding of reg->var_off.value (for example the ctx path now folds the register offset via check_ctx_access()). The PTR_TO_FLOW_KEYS path was not given the equivalent fold, so a constant offset that used to be folded and rejected is now silently accepted: before 022ac0750883: the offset stays in reg->off and is folded generically, so the access is checked with off=4096 and rejected. after 022ac0750883: the offset lands in reg->var_off, the flow_keys path checks off=0 and accepts; at runtime the access dereferences base + 0x1000. For a BPF_PROG_TYPE_FLOW_DISSECTOR program the following is accepted: r2 = *(u64 *)(r1 + 144) ; R2=flow_keys (PTR_TO_FLOW_KEYS) r2 += 0x1000 ; R2=flow_keys(imm=4096), accepted r0 = *(u64 *)(r2 + 0) ; accepted, var_off.value=0x1000 ignored while the equivalent insn->off form r0 = *(u64 *)(r2 + 0x1000) has the same effective offset but is correctly rejected with "invalid access to flow keys off=4096 size=8", which isolates the defect to the missing var_off fold. Once attached as a flow dissector, the accepted program reads kernel stack past struct bpf_flow_keys (a kernel-stack / KASLR information leak) and can likewise write past it, corrupting kernel memory. Fix it by folding reg->var_off.value into the offset before the bounds check and rejecting non-constant offsets, mirroring the other pointer types (e.g. check_ctx_access()). Fixes: 022ac0750883 ("bpf: use reg->var_off instead of reg->off for pointers") Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260606-c3-01-v3-v3-1-97c51f592f15@mails.tsinghua.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-06bpf: Add simple xattr support to bpffsrefs/merge-window/90272c66977cd3593c735fe51cb0a52cc0e89077Daniel Borkmann
Add support for extended attributes on bpffs inodes so that user space and BPF LSM programs can attach metadata, for example, a content hash or a security label - to a pinned object or directory. BPF LSM or user space tooling can then uniformly look at this (e.g. security.bpf.*) in similar way to other fs'es. The store is in-memory and non-persistent: it lives only for the lifetime of the mount, like everything else in bpffs. The modelling is similar to tmpfs. bpffs serves the trusted.* and security.* namespaces; user.* is left unsupported. As bpffs is FS_USERNS_MOUNT, security.* is reachable by the unprivileged mounter in a user namespace, and thus we are using the simple_xattr_set_limited infra there (trusted.* needs global CAP_SYS_ADMIN). bpf_fill_super() is open-coded instead of using simple_fill_super(), because the root inode must now be allocated through bpf_fs_alloc_inode() i.e. carry the bpf_fs_inode wrapper and come from the right cache - which requires s_op (and s_xattr) to be installed before the first inode is created. While at it, also harden s_iflags with SB_I_NOEXEC and SB_I_NODEV. bpf_fs_listxattr() is only reachable through the filesystem via i_op->listxattr, so the BPF token inode is left untouched. Name-based fsetxattr()/fgetxattr() on a token fd still work since the get/set handlers are installed at the superblock. For security.* namespace, we use simple_xattr_set_limited() but there was no simple_xattr_add_limited() API yet which was needed in bpf_fs_initxattrs() to avoid underflows in the accounting. The symlink target is freed in bpf_free_inode() rather than in bpf_destroy_inode() so that it is released only after an RCU grace period, as an RCU path walk following the symlink may still dereference inode->i_link in security_inode_follow_link(). Lastly, the bpf_symlink() allocated the symlink target is switched to GFP_KERNEL_ACCOUNT, so the string is charged to the caller's memcg. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20260602074012.416289-1-daniel@iogearbox.net Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05bpf: Expose signature verdict via bpf_prog_auxKP Singh
BPF_PROG_LOAD verifies the loader signature but does not record the outcome on the BPF program. [BPF] LSMs and audit can read attr->signature and attr->keyring_id to infer "was this signed, and if so, against which keyring". Add prog->aux->sig (verdict + keyring_{type,serial}), populated by bpf_prog_load before the LSM hook. keyring_type classifies the keyring the load referenced (builtin, secondary, platform or user), while keyring_serial records the serial of the keyring the signature was actually validated against. System keyrings carry a pseudo key pointer with no user-visible serial and are reported as 0, as are unsigned loads. Failed verifications reject the load before the hook runs, so it observes only either UNSIGNED or VERIFIED. Signed-off-by: KP Singh <kpsingh@kernel.org> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260605213518.544262-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: Add validation for bpf_set_retval argumentXu Kuohai
The bpf_set_retval() helper is used by cgroup BPF programs to set the return value of the target hook. The argument type for this helper is ARG_ANYTHING. This allows setting a positive value, which no cgroup hook expects and can cause issues, such as: - BPF_LSM_CGROUP: a positive value from bpf_lsm_socket_create bypasses the err < 0 check in __sock_create(), leaving the socket object unallocated. The positive return value is then propagated to the syscall entry __sys_socket(), which also bypasses the IS_ERR() guard and ultimately causes a NULL pointer dereference. - BPF_CGROUP_DEVICE: a positive value can be returned through cgroup device bpf prog -> devcgroup_check_permission() -> bdev_permission() -> bdev_file_open_by_dev(), where ERR_PTR(positive) produces a pointer that IS_ERR() does not catch, leading to a wild pointer dereference. - BPF_CGROUP_SOCK: a positive value can be returned through cgroup sock bpf prog -> __cgroup_bpf_run_filter_sk() -> inet_create() -> __sock_create(), where inet_create() frees the newly allocated sk via sk_common_release() and sets sock->sk = NULL on the non-zero return, but __sock_create() only checks err < 0 for cleanup, so a positive retval bypasses cleanup and returns a socket with NULL sk to userspace, triggering a NULL pointer dereference on subsequent socket operations. - BPF_CGROUP_SYSCTL: a positive value can be returned through the cgroup bpf prog -> __cgroup_bpf_run_filter_sysctl() -> proc_sys_call_handler(), where a non-zero return bypasses the normal sysctl proc_handler and is returned directly to userspace as return value of read() or write() syscall. So add validation for the argument of the bpf_set_retval() helper. For BPF_LSM_CGROUP, enforce the LSM hook specific range returned by bpf_lsm_get_retval_range(). For all other cgroup program types, restrict the argument to [-MAX_ERRNO, 0], which matches the kernel convention of 0 for success and negative errno for error. BPF_CGROUP_GETSOCKOPT is an exception, since valid getsockopt implementations may return positive values, as allowed by commit c4dcfdd406aa ("bpf: Move getsockopt retval to struct bpf_cg_run_ctx"). Also refine the return value range of bpf_get_retval() so that values returned by bpf_get_retval() can be passed directly to bpf_set_retval() without extra manual bounds checking. Fixes: b44123b4a3dc ("bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return value") Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Closes: https://lore.kernel.org/all/567d3206-74a5-44e5-99c6-779c425f399e@std.uestc.edu.cn Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260605140243.664590-3-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: Restore sysctl new-value from 1 to 0Dawei Feng
Commit 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers") changed the success return value to 0, but failed to update the corresponding check in __cgroup_bpf_run_filter_sysctl(). Since bpf_prog_run_array_cg() now returns 0 on success, the legacy ret == 1 condition is never satisfied. As a result, the modified value is ignored, and bpf_sysctl_set_new_value() fails to replace the write buffer. Fix this by checking for a return value of 0 instead, so cgroup/sysctl programs can correctly replace the pending sysctl buffer. This bug was discovered during a manual code review. Tested via a cgroup/sysctl BPF reproducer overriding writes to a target sysctl. Pre-fix, bpf_sysctl_set_new_value("foo") was silently ignored: the write returned 8192 and the value remained "600". Post-fix, the BPF replacement buffer properly propagates: the write returns 3 and the value updates to "foo". Fixes: f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean") Cc: stable@vger.kernel.org Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20260603105317.944304-4-dawei.feng@seu.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: use kvfree() for replaced sysctl write bufferDawei Feng
proc_sys_call_handler() allocates its temporary sysctl buffer with kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since kvzalloc() may fall back to vmalloc() for large allocations, freeing that buffer with kfree() is wrong and can corrupt memory. Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc allocations. The bug was first flagged by an experimental analysis tool we are developing for kernel memory-management bugs while analyzing v6.13-rc1. The tool is still under development and is not yet publicly available. Manual inspection confirms that the bug is still present in v7.1-rc5. Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the test tree also included the accompanying fix for the stale ret == 1 check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines failslab injections to the proc_sys_call_handler() range, uses stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to /proc/sys/kernel/domainname from a task in the target cgroup. Under that setup, fail-nth=1 triggered the fault: BUG: unable to handle page fault for address: ffffeb0200024d48 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 SMP KASAN NOPTI CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4 PREEMPT(lazy) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:kfree+0x6e/0x510 ... Call Trace: <TASK> ? __cgroup_bpf_run_filter_sysctl+0x626/0xc30 __cgroup_bpf_run_filter_sysctl+0x74d/0xc30 ? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? __kvmalloc_node_noprof+0x345/0x870 ? proc_sys_call_handler+0x250/0x480 ? srso_return_thunk+0x5/0x5f proc_sys_call_handler+0x3a2/0x480 ? __pfx_proc_sys_call_handler+0x10/0x10 ? srso_return_thunk+0x5/0x5f ? selinux_file_permission+0x39f/0x500 ? srso_return_thunk+0x5/0x5f ? lock_is_held_type+0x9e/0x120 vfs_write+0x98e/0x1000 ... </TASK> With this fix applied on top of the same test setup, rerunning the reproducer with fail-nth=1 yields no corresponding Oops reports. Fixes: 4508943794ef ("proc: use kvzalloc for our kernel buffer") Cc: stable@vger.kernel.org Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Link: https://lore.kernel.org/r/20260603105317.944304-3-dawei.feng@seu.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: NUL-terminate replaced sysctl valueDawei Feng
When writing to sysctls, proc_sys_call_handler() guarantees that the buffer passed to proc handlers is NUL-terminated. If bpf_sysctl_set_new_value() replaces the pending sysctl value, it can hand a replacement buffer directly to proc handlers. However, the helper currently copies only buf_len bytes into that buffer without appending a NUL terminator, leaving downstream parsers vulnerable to out-of-bounds access. Fix this by appending a '\0' after the replaced value to restore the expected sysctl semantics. Since the helper already rejects buf_len greater than PAGE_SIZE - 1, there is always room for the extra byte. Reproduced in a QEMU x86_64 guest booted with KASAN while exercising the sysctl replacement path with a cgroup/sysctl BPF program. The reproducer targets `/proc/sys/net/core/flow_limit_cpu_bitmap`, fills the original user write buffer with non-zero bytes, and overrides the sysctl value so the replacement buffer lacks a terminating NUL. Under that setup, the pre-fix kernel reported: BUG: KASAN: slab-out-of-bounds in strnchrnul+0x72/0x90 Read of size 1 at addr ffff88800de57000 by task repro_patch3/66 CPU: 0 UID: 0 PID: 66 Comm: repro_patch3 Not tainted 7.1.0-rc3-00269-g8370ca1f87cc #6 PREEMPT(lazy) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x68/0xa0 print_report+0xcb/0x5e0 ? __virt_addr_valid+0x21d/0x3f0 ? strnchrnul+0x72/0x90 ? strnchrnul+0x72/0x90 kasan_report+0xca/0x100 ? strnchrnul+0x72/0x90 strnchrnul+0x72/0x90 bitmap_parse+0x37/0x2e0 flow_limit_cpu_sysctl+0xc6/0x840 ? __pfx_flow_limit_cpu_sysctl+0x10/0x10 ? __kvmalloc_node_noprof+0x5ba/0x870 proc_sys_call_handler+0x31d/0x480 ? __pfx_proc_sys_call_handler+0x10/0x10 ? selinux_file_permission+0x39f/0x500 ? lock_is_held_type+0x9e/0x120 vfs_write+0x98e/0x1000 ... </TASK> The buggy address is located 0 bytes to the right of allocated 4096-byte region [ffff88800de56000, ffff88800de57000) With this fix applied, rerunning the same sysctl-targeted path yields no corresponding KASAN reports. Signed-off-by: Zilin Guan <zilin@seu.edu.cn> Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20260603105317.944304-2-dawei.feng@seu.edu.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: Check tail zero of bpf_prog_infoLeon Hwang
Since there're 4 bytes padding at the end of struct bpf_prog_info, they won't be checked by bpf_check_uarg_tail_zero(). pahole -C bpf_prog_info ./vmlinux struct bpf_prog_info { ... __u32 attach_btf_obj_id; /* 220 4 */ __u32 attach_btf_id; /* 224 4 */ /* size: 232, cachelines: 4, members: 38 */ /* sum members: 224 */ /* sum bitfield members: 1 bits, bit holes: 1, sum bit holes: 31 bits */ /* padding: 4 */ /* forced alignments: 9 */ /* last cacheline: 40 bytes */ } __attribute__((__aligned__(8))); If a future kernel extension adds a new 4-byte field, older userspace programs allocating this structure on the stack might inadvertently pass uninitialized stack garbage into the new field, permanently breaking backward compatibility. -- sashiko [1] Fix it by changing sizeof(info) to offsetofend(struct bpf_prog_info, attach_btf_id). And, add "__u32 :32" to the tail of struct bpf_prog_info. [1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/ Fixes: aba64c7da983 ("bpf: Add verified_insns to bpf_prog_info and fdinfo") Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260605155249.20772-3-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-05bpf: Check tail zero of bpf_map_infoLeon Hwang
Since there're 4 bytes padding at the end of struct bpf_map_info, they won't be checked by bpf_check_uarg_tail_zero(). pahole -C bpf_map_info ./vmlinux struct bpf_map_info { ... __u64 hash __attribute__((__aligned__(8))); /* 88 8 */ __u32 hash_size; /* 96 4 */ /* size: 104, cachelines: 2, members: 18 */ /* padding: 4 */ /* forced alignments: 1 */ /* last cacheline: 40 bytes */ } __attribute__((__aligned__(8))); If a future kernel extension adds a new 4-byte field, older userspace programs allocating this structure on the stack might inadvertently pass uninitialized stack garbage into the new field, permanently breaking backward compatibility. -- sashiko [1] Fix it by changing sizeof(info) to offsetofend(struct bpf_map_info, hash_size). And, add "__u32 :32" to the tail of struct bpf_map_info. [1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/ Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD") Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260605155249.20772-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>