linux.git/kernel/bpf/core.c, branch v7.2-rc1

bpf: Tighten cgroup storage cookie checks for prog arrays

2026-06-10T23:16:46+00:00

The fix in commit abad3d0bad72 ("bpf: Fix oob access in cgroup local
storage") is still incomplete. The prog-array compatibility check
treats a program with no cgroup storage as compatible with any stored
storage cookie. This allows a storage-less program to bridge a tail
call chain between an entry program and a storage-using callee even
though cgroup local storage at runtime still follows the caller's
context, that is, A -> B(no storage) -> C(storage) path.

Requiring exact cookie equality would break the legitimate case of a
storage-less leaf program being tail called from a storage-using one.
Instead, only accept a zero storage cookie if the program cannot
perform tail calls itself. This keeps A -> B(no storage) working
while rejecting the A -> B(no storage) -> C(storage) bridge.

Fixes: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage")
Reported-by: Lin Ma 
Signed-off-by: Daniel Borkmann 
Acked-by: Yonghong Song 
Link: https://lore.kernel.org/r/20260610105539.705887-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov

bpf: Fix bpf_arena_handle_page_fault() redefinition without CONFIG_BPF_SYSCALL

2026-05-27T20:50:35+00:00

On configs with CONFIG_BPF=y but CONFIG_BPF_SYSCALL=n (e.g. arm
multi_v7_defconfig), kernel/bpf/core.c defines a __weak
bpf_arena_handle_page_fault() while bpf_defs.h already supplies a static
inline stub for it, causing a redefinition error. Build the __weak
definition only under CONFIG_BPF_SYSCALL, matching the bpf_defs.h
declaration and the CONFIG_BPF_SYSCALL-gated strong definition in arena.c.

Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
Reported-by: Mark Brown 
Signed-off-by: Tejun Heo 
Acked-by: Song Liu 
Link: https://lore.kernel.org/r/20260527192632.2109419-1-tj@kernel.org
Signed-off-by: Alexei Starovoitov

Merge branch 'arena_direct_access'

2026-05-25T15:35:48+00:00

Tejun Heo says:

====================
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.

v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
  inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
  new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
  pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.

v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org
v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org

Motivation
----------

sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.

The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.

Approach
--------

Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.

The mechanism is default behavior - no UAPI flag.

What this preserves
-------------------

All the debugging properties of today's sparse-PTE design are preserved:

* BPF programs still fault on unmapped arena accesses. The fault semantics
  (instruction retry with rdst = 0) and the violation report through
  bpf_streams are unchanged for prog-side accesses.

* The first kernel-side touch of an unmapped address is reported via
  bpf_streams the same way as a prog-side fault, with the stack walk
  attributing it to the originating prog.

* User-side fault on a never-scratched address still lazy-allocates a real
  page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
  scratched address SIGSEGVs.

What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.

Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------

  mm: Add ptep_try_set() for lockless empty-slot installs
  bpf: Recover arena kernel faults with scratch page

Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------

  bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
  bpf: Add bpf_struct_ops_for_each_prog()
  bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
====================

Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/
Signed-off-by: Alexei Starovoitov

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc5

2026-05-25T13:33:30+00:00

Cross-merge BPF and other fixes after downstream PR.

Signed-off-by: Alexei Starovoitov

bpf: Recover arena kernel faults with scratch page

2026-05-23T08:50:33+00:00

BPF arena usage is becoming more prevalent, but kernel <-> BPF communication
over arena memory is awkward today. Data has to be staged through a trusted
kernel pointer with extra code and copying on the BPF side. While reads
through arena pointers can use a fault-safe helper, writes don't have a good
solution. The in-line alternative would need instruction emulation or asm
fixup labels.

Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any
handed-in arena pointer, without bounds checking. A per-arena scratch page
is installed by the arch fault path into empty arena kernel PTEs - x86 from
page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for
translation faults, both after the existing exception-table and KFENCE
handling. The faulting instruction retries and the access is also reported
through the program's BPF stream, preserving error reporting.

bpf_prog_find_from_stack() resolves the current BPF program (and its arena)
from the kernel stack - no new bpf_run_ctx state is added. Recovery covers
the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower
half-guard is excluded because well-behaved kfuncs only access forward from
arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past
a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst.

The install is lock-free via ptep_try_set(). On race-loss the winning
installer's PTE is already valid, so the access retry succeeds. The arena
clear path uses ptep_get_and_clear() so installer and clearer race through
atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped"
entries just cause one extra re-fault, cheaper than a global IPI on every
install.

Scratch exists only to keep the kernel from oopsing on an in-line arena
access. Its presence at a PTE means the BPF program has already
malfunctioned, and the violation is reported through the program's BPF
stream. The only requirement for behavior on a scratched PTE is that the
kernel doesn't crash. In particular, any user-side access through such a PTE
may segfault. The shared scratch page is freed once during map destruction.

BPF instruction faults continue to use the existing JIT exception-table
path. This patch changes only the kernel-text fault path. No UAPI flag is
added. The new behavior is the default.

v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David)
v3: Stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL. (lkp)

Suggested-by: Alexei Starovoitov 
Signed-off-by: Kumar Kartikeya Dwivedi 
Signed-off-by: Tejun Heo 
Reviewed-by: Emil Tsalapatis 
Cc: David Hildenbrand 
Link: https://lore.kernel.org/r/20260522172219.1423324-3-tj@kernel.org
Signed-off-by: Alexei Starovoitov

bpf: Clean up redundant stack arg checks for non-JITed programs

2026-05-17T00:46:16+00:00

Remove a redundant stack_arg_cnt check in __bpf_prog_select_runtime()
and start the stack arg loop from index 0 in bpf_fixup_call_args().
Both changes are no-ops that simplify the code:

In __bpf_prog_select_runtime(), the subprog_info[0].stack_arg_cnt
check is unreachable:
  - when there is only a main program (no bpf-to-bpf calls),
    subprog_info[0].stack_arg_cnt is always 0 because the main
    program's arg_cnt is forced to 1
  - when bpf-to-bpf calls use stack args and JIT succeeds,
    fp->bpf_func is set and this code is skipped
  - when JIT fails, bpf_fixup_call_args() rejects the program
    before we get to __bpf_prog_select_runtime().

In bpf_fixup_call_args(), starting the loop at i=1 skipped subprog 0,
which is safe since the main program always has arg_cnt=1 and thus
bpf_in_stack_arg_cnt() returns 0. Starting at i=0 removes the need
to reason about this invariant.

Signed-off-by: Yonghong Song 
Link: https://lore.kernel.org/r/20260515225101.824054-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov

bpf,x86: Implement JIT support for stack arguments

2026-05-13T16:27:31+00:00

Add x86_64 JIT support for BPF functions and kfuncs with more than
5 arguments. The extra arguments are passed through a stack area
addressed by register r11 (BPF_REG_PARAMS) in BPF bytecode,
which the JIT translates to native code.

The JIT follows the x86-64 calling convention for both BPF-to-BPF
and kfunc calls:
  - Arg 6 is passed in the R9 register
  - Args 7+ are passed on the stack

Incoming arg 6 (BPF r11+8) is translated to a MOV from R9 rather
than a memory load. Incoming args 7+ (BPF r11+16, r11+24, ...) map
directly to [rbp + 16], [rbp + 24], ..., matching the x86-64 stack
layout after CALL + PUSH RBP, so no offset adjustment is needed.

tail_call_reachable is rejected by the verifier and priv_stack is
disabled by the JIT when stack args exist, so R9 is always
available. When BPF bytecode writes to the arg-6 stack slot
(offset -8), the JIT emits a MOV into R9 instead of a memory store.
Outgoing args 7+ are placed at [rsp] in a pre-allocated area below
callee-saved registers, using:
  native_off = outgoing_arg_base - outgoing_rsp - bpf_off - 16

The native x86_64 stack layout with stack arguments:

  high address
  +-------------------------+
  | incoming stack arg N    |  [rbp + 16 + (N-7)*8]  (from caller)
  | ...                     |
  | incoming stack arg 7    |  [rbp + 16]
  +-------------------------+
  | return address          |  [rbp + 8]
  | saved rbp               |  [rbp]
  +-------------------------+
  | BPF program stack       |  (round_up(stack_depth, 8) bytes)
  +-------------------------+
  | callee-saved regs       |  (r12, rbx, r13, r14, r15 as needed)
  +-------------------------+
  | outgoing arg M          |  [rsp + (M-7)*8]
  | ...                     |
  | outgoing arg 7          |  [rsp]
  +-------------------------+  rsp
  low address

Acked-by: Puranjay Mohan 
Signed-off-by: Yonghong Song 
Link: https://lore.kernel.org/r/20260513045122.2393118-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov

bpf: Prepare architecture JIT support for stack arguments

2026-05-13T16:27:31+00:00

Add bpf_jit_supports_stack_args() as a weak function defaulting to
false. Architectures that implement JIT support for stack arguments
override it to return true.

Reject BPF functions with more than 5 parameters at verification
time if the architecture does not support stack arguments.

Acked-by: Puranjay Mohan 
Signed-off-by: Yonghong Song 
Link: https://lore.kernel.org/r/20260513045054.2390945-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov

bpf: Reject stack arguments in non-JITed programs

2026-05-13T16:27:31+00:00

The interpreter does not understand the bpf register r11
(BPF_REG_PARAMS) used for stack arguments. So reject interpreter
usage if stack arguments are used either in the main program or
any subprogram.

Signed-off-by: Yonghong Song 
Link: https://lore.kernel.org/r/20260513045049.2390444-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov

bpf: Fix s16 truncation for large bpf-to-bpf call offsets

2026-05-11T15:27:02+00:00

Currently, the BPF instruction set allows bpf-to-bpf calls (or internal
calls, pseudo calls) to use a 32-bit imm field to represent the relative
jump offset.

However, when JIT is disabled or falls back to the interpreter, the
verifier invokes bpf_patch_call_args() to rewrite the call instruction.
In this function, the 32-bit imm is downcast to s16 and stored in the off
field.

    void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
    {
        stack_depth = max_t(u32, stack_depth, 1);
        insn->off = (s16) insn->imm;
        insn->imm = interpreters_args[(round_up(stack_depth, 32) / 32) - 1] -
            __bpf_call_base_args;
        insn->code = BPF_JMP | BPF_CALL_ARGS;
    }

If the original imm exceeds the s16 range (i.e., a jump offset greater
than 32767 instructions), this downcast silently truncates the offset,
resulting in an incorrect call target.

Fix this by:
1. In bpf_patch_call_args(), keeping the imm field unchanged and using the
   off field to store the index of the interpreter function.
2. In ___bpf_prog_run() for the JMP_CALL_ARGS case, retrieving the
   interpreter function pointer from the interpreters_args array using the
   off field as the index, and passing the original imm to calculate the
   last argument of the interpreter function.

After these changes, the truncation issue is resolved, and __bpf_call_base_args
is also no longer needed and can be removed, which makes the code cleaner.

Performance: In ___bpf_prog_run() for the JMP_CALL_ARGS case, changing the
retrieval of the interpreter function pointer from pointer addition to
direct array indexing improves performance. The possible reason is that the
latter has better instruction-level parallelism. See the v5 discussion [1]
for more details.

[1] https://lore.kernel.org/bpf/f120c3c4-6999-414a-b514-518bb64b4758@zju.edu.cn/

To avoid requiring bpftool changes, keep the new imm/off encoding internal
and restore the legacy xlated dump layout in bpf_insn_prepare_dump().
For bpf-to-bpf call offsets that do not fit in s16, export off as 0 instead
of a truncated and misleading value.

Fixes: 1ea47e01ad6e ("bpf: add support for bpf_call to interpreter")
Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump")
Suggested-by: Xu Kuohai 
Suggested-by: Puranjay Mohan 
Co-developed-by: Tianci Cao 
Signed-off-by: Tianci Cao 
Co-developed-by: Shenghao Yuan 
Signed-off-by: Shenghao Yuan 
Signed-off-by: Yazhou Tang 
Link: https://lore.kernel.org/r/20260506094714.419842-3-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov