linux-stable.git/kernel/bpf/helpers.c, branch linux-6.16.y

bpf: Tell memcg to use allow_spinning=false path in bpf_timer_init()

2025-09-19T14:37:29+00:00

[ Upstream commit 6d78b4473cdb08b74662355a9e8510bde09c511e ]

Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
cause various locking issues; see the following stack trace (edited for
style) as one example:

...
 [10.011566]  do_raw_spin_lock.cold
 [10.011570]  try_to_wake_up             (5) double-acquiring the same
 [10.011575]  kick_pool                      rq_lock, causing a hardlockup
 [10.011579]  __queue_work
 [10.011582]  queue_work_on
 [10.011585]  kernfs_notify
 [10.011589]  cgroup_file_notify
 [10.011593]  try_charge_memcg           (4) memcg accounting raises an
 [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
 [10.011599]  obj_cgroup_charge_account
 [10.011600]  __memcg_slab_post_alloc_hook
 [10.011603]  __kmalloc_node_noprof
...
 [10.011611]  bpf_map_kmalloc_node
 [10.011612]  __bpf_async_init
 [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
 [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
 [10.011619]  bpf__sched_ext_ops_runnable
 [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
 [10.011622]  enqueue_task
 [10.011626]  ttwu_do_activate
 [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
...

The above was reproduced on bpf-next (b338cf849ec8) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking the memcg accounting code a bit to make
a bpf_timer_init() call more likely to raise an MEMCG_MAX event.

We have also run into other similar variants (both internally and on
bpf-next), including double-acquiring cgroup_file_kn_lock, the same
worker_pool::lock, etc.

As suggested by Shakeel, fix this by using __GFP_HIGH instead of
GFP_ATOMIC in __bpf_async_init(), so that e.g. if try_charge_memcg()
raises an MEMCG_MAX event, we call __memcg_memory_event() with
@allow_spinning=false and avoid calling cgroup_file_notify() there.

Depends on mm patch
"memcg: skip cgroup_file_notify if spinning is not allowed":
https://lore.kernel.org/bpf/20250905201606.66198-1-shakeel.butt@linux.dev/

v0 approach s/bpf_map_kmalloc_node/bpf_mem_alloc/
https://lore.kernel.org/bpf/20250905061919.439648-1-yepeilin@google.com/
v1 approach:
https://lore.kernel.org/bpf/20250905234547.862249-1-yepeilin@google.com/

Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Suggested-by: Shakeel Butt 
Signed-off-by: Peilin Ye 
Link: https://lore.kernel.org/r/20250909095222.2121438-1-yepeilin@google.com
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Sasha Levin

bpf: Ensure RCU lock is held around bpf_prog_ksym_find

2025-08-15T14:38:42+00:00

[ Upstream commit d090326860096df9dac6f27cff76d3f8df44d4f1 ]

Add a warning to ensure RCU lock is held around tree lookup, and then
fix one of the invocations in bpf_stack_walker. The program has an
active stack frame and won't disappear. Use the opportunity to remove
unneeded invocation of is_bpf_text_address.

Fixes: f18b03fabaa9 ("bpf: Implement BPF exceptions")
Reviewed-by: Emil Tsalapatis 
Signed-off-by: Kumar Kartikeya Dwivedi 
Link: https://lore.kernel.org/r/20250703204818.925464-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov 
Signed-off-by: Sasha Levin

bpf: Reject %p% format string in bprintf-like helpers

2025-07-01T22:22:46+00:00

static const char fmt[] = "%p%";
    bpf_trace_printk(fmt, sizeof(fmt));

The above BPF program isn't rejected and causes a kernel warning at
runtime:

    Please remove unsupported %\x00 in format string
    WARNING: CPU: 1 PID: 7244 at lib/vsprintf.c:2680 format_decode+0x49c/0x5d0

This happens because bpf_bprintf_prepare skips over the second %,
detected as punctuation, while processing %p. This patch fixes it by
not skipping over punctuation. %\x00 is then processed in the next
iteration and rejected.

Reported-by: syzbot+e2c932aec5c8a6e1d31c@syzkaller.appspotmail.com
Fixes: 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf")
Acked-by: Yonghong Song 
Signed-off-by: Paul Chaignon 
Link: https://lore.kernel.org/r/a0e06cc479faec9e802ae51ba5d66420523251ee.1751395489.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov

bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()

2025-05-27T17:45:59+00:00

bpf_map_lookup_percpu_elem() helper is also available for sleepable bpf
program. When BPF JIT is disabled or under 32-bit host,
bpf_map_lookup_percpu_elem() will not be inlined. Using it in a
sleepable bpf program will trigger the warning in
bpf_map_lookup_percpu_elem(), because the bpf program only holds
rcu_read_lock_trace lock. Therefore, add the missed check.

Reported-by: syzbot+dce5aae19ae4d6399986@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/bpf/000000000000176a130617420310@google.com/
Signed-off-by: Hou Tao 
Link: https://lore.kernel.org/r/20250526062534.1105938-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov

bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable

2025-05-27T16:54:20+00:00

Marc Suñé (Isovalent, part of Cisco) reported an issue where an
uninitialized variable caused generating bpf prog binary code not
working as expected. The reproducer is in [1] where the flags
“-Wall -Werror” are enabled, but there is no warning as the compiler
takes advantage of uninitialized variable to do aggressive optimization.
The optimized code looks like below:

      ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
      ;       bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
      ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
      ;       bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
      

The verifier will report the following failure:
  9: (85) call bpf_trace_printk#6
  last insn is not an exit or jmp

The above verifier log does not give a clear hint about how to fix
the problem and user may take quite some time to figure out that
the issue is due to compiler taking advantage of uninitialized variable.

In llvm internals, uninitialized variable usage may generate
'unreachable' IR insn and these 'unreachable' IR insns may indicate
uninitialized variable impact on code optimization. So far, llvm
BPF backend ignores 'unreachable' IR hence the above code is generated.
With clang21 patch [2], those 'unreachable' IR insn are converted
to func __bpf_trap(). In order to maintain proper control flow
graph for bpf progs, [2] also adds an 'exit' insn after bpf_trap()
if __bpf_trap() is the last insn in the function. The new code looks like:

      ; {
           0:       bf 16 00 00 00 00 00 00 r6 = r1
      ;       bpf_printk("Start");
           1:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x0 ll
                    0000000000000008:  R_BPF_64_64  .rodata
           3:       b4 02 00 00 06 00 00 00 w2 = 0x6
           4:       85 00 00 00 06 00 00 00 call 0x6
      ; DEFINE_FUNC_CTX_POINTER(data)
           5:       61 61 4c 00 00 00 00 00 w1 = *(u32 *)(r6 + 0x4c)
      ;       bpf_printk("pre ipv6_hdrlen_offset");
           6:       18 01 00 00 06 00 00 00 00 00 00 00 00 00 00 00 r1 = 0x6 ll
                    0000000000000030:  R_BPF_64_64  .rodata
           8:       b4 02 00 00 17 00 00 00 w2 = 0x17
           9:       85 00 00 00 06 00 00 00 call 0x6
          10:       85 10 00 00 ff ff ff ff call -0x1
                    0000000000000050:  R_BPF_64_32  __bpf_trap
          11:       95 00 00 00 00 00 00 00 exit
      

In kernel, a new kfunc __bpf_trap() is added. During insn
verification, any hit with __bpf_trap() will result in
verification failure. The kernel is able to provide better
log message for debugging.

With llvm patch [2] and without this patch (no __bpf_trap()
kfunc for existing kernel), e.g., for old kernels, the verifier
outputs
  10: 
  kfunc '__bpf_trap' is referenced but wasn't resolved
Basically, kernel does not support __bpf_trap() kfunc.
This still didn't give clear signals about possible reason.

With llvm patch [2] and with this patch, the verifier outputs
  10: (85) call __bpf_trap#74479
  unexpected __bpf_trap() due to uninitialized variable?
It gives much better hints for verification failure.

  [1] https://github.com/msune/clang_bpf/blob/main/Makefile#L3
  [2] https://github.com/llvm/llvm-project/pull/131731

Signed-off-by: Yonghong Song 
Link: https://lore.kernel.org/r/20250523205326.1291640-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov

bpf: Add open coded dmabuf iterator

2025-05-27T16:51:25+00:00

This open coded iterator allows for more flexibility when creating BPF
programs. It can support output in formats other than text. With an open
coded iterator, a single BPF program can traverse multiple kernel data
structures (now including dmabufs), allowing for more efficient analysis
of kernel data compared to multiple reads from procfs, sysfs, or
multiple traditional BPF iterator invocations.

Signed-off-by: T.J. Mercier 
Acked-by: Christian König 
Acked-by: Song Liu 
Link: https://lore.kernel.org/r/20250522230429.941193-4-tjmercier@google.com
Signed-off-by: Alexei Starovoitov

bpf: Add support for __prog argument suffix to pass in prog->aux

2025-05-14T01:47:54+00:00

Instead of hardcoding the list of kfuncs that need prog->aux passed to
them with a combination of fixup_kfunc_call adjustment + __ign suffix,
combine both in __prog suffix, which ignores the argument passed in, and
fixes it up to the prog->aux. This allows kfuncs to have the prog->aux
passed into them without having to touch the verifier.

Cc: Tejun Heo 
Signed-off-by: Kumar Kartikeya Dwivedi 
Link: https://lore.kernel.org/r/20250513142812.1021591-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov

bpf: Implement dynptr copy kfuncs

2025-05-13T01:31:51+00:00

This patch introduces a new set of kfuncs for working with dynptrs in
BPF programs, enabling reading variable-length user or kernel data
into dynptr directly. To enable memory-safety, verifier allows only
constant-sized reads via existing bpf_probe_read_{user|kernel} etc.
kfuncs, dynptr-based kfuncs allow dynamically-sized reads without memory
safety shortcomings.

The following kfuncs are introduced:
* `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
* `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
* `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into
a dynptr
* `bpf_probe_read_user_str_dynptr()`: probes user-space string into a
dynptr
* `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into
a dynptr for the current task
* `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string
into a dynptr for the current task
* `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data
of the task into a dynptr
* `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space
string of the task into a dynptr

The implementation is built on two generic functions:
 * __bpf_dynptr_copy
 * __bpf_dynptr_copy_str
These functions take function pointers as arguments, enabling the
copying of data from various sources, including both kernel and user
space.
Use __always_inline for generic functions and callbacks to make sure the
compiler doesn't generate indirect calls into callbacks, which is more
expensive, especially on some kernel configurations. Inlining allows
compiler to put direct calls into all the specific callback implementations
(copy_user_data_sleepable, copy_user_data_nofault, and so on).

Reviewed-by: Andrii Nakryiko 
Signed-off-by: Mykyta Yatsenko 
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov

helpers: make few bpf helpers public

2025-05-13T01:29:03+00:00

Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and
__bpf_dynptr_write available outside of the helpers.c by
adding their prototypes into linux/include/bpf.h.
bpf_dynptr_check_off_len() implementation is moved to header and made
inline explicitly, as small function should typically be inlined.

These functions are going to be used from bpf_trace.c in the next
patch of this series.

Acked-by: Andrii Nakryiko 
Signed-off-by: Mykyta Yatsenko 
Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov

bpf: Allow some trace helpers for all prog types

2025-05-09T17:37:10+00:00

if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/

Suggested-by: Andrii Nakryiko 
Signed-off-by: Feng Yang 
Signed-off-by: Andrii Nakryiko 
Acked-by: Tejun Heo 
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com