| Age | Commit message (Collapse) | Author |
|
The sleepable context check for global function calls in
check_func_call() open-codes the same checks that in_sleepable_context()
already performs. Replace the open-coded check with a call to
in_sleepable_context() and use non_sleepable_context_description() for
the error message, consistent with check_helper_call() and
check_kfunc_call().
Note that in_sleepable_context() also checks active_locks, which
overlaps with the existing active_locks check above it. However, the two
checks serve different purposes: the active_locks check rejects all
global function calls while holding a lock (not just sleepable ones), so
it must remain as a separate guard.
Update the expected error messages in the irq and preempt_lock selftests
to match.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
check_kfunc_call() has multiple scattered checks that reject sleepable
kfuncs in various non-sleepable contexts (RCU, preempt-disabled, IRQ-
disabled). These are the same conditions already checked by
in_sleepable_context(), so replace them with a single consolidated
check.
This also simplifies the preempt lock tracking by flattening the nested
if/else structure into a linear chain: preempt_disable increments,
preempt_enable checks for underflow and decrements. The sleepable check
is kept as a separate block since it is logically distinct from the lock
accounting.
No functional change since in_sleepable_context() checks all the same
state (active_rcu_locks, active_preempt_locks, active_locks,
active_irq_id, in_sleepable).
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-3-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
check_helper_call() prints the error message for every
env->cur_state->active* element when calling a sleepable helper.
Consolidate all of them into a single print statement.
The check for env->cur_state->active_locks was not part of the removed
print statements and will not be triggered with the consolidated print
as well because it is checked in do_check() before check_helper_call()
is even reached.
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260318174327.3151925-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
process_bpf_exit_full() passes check_lock = !curframe to
check_resource_leak(), which is false in cases when bpf_throw() is
called from a static subprog. This makes check_resource_leak() to skip
validation of active_rcu_locks, active_preempt_locks, and
active_irq_id on exception exits from subprogs.
At runtime bpf_throw() unwinds the stack via ORC without releasing any
user-acquired locks, which may cause various issues as the result.
Fix by setting check_lock = true for exception exits regardless of
curframe, since exceptions bypass all intermediate frame
cleanup. Update the error message prefix to "bpf_throw" for exception
exits to distinguish them from normal BPF_EXIT.
Fix reject_subprog_with_rcu_read_lock test which was previously
passing for the wrong reason. Test program returned directly from the
subprog call without closing the RCU section, so the error was
triggered by the unclosed RCU lock on normal exit, not by
bpf_throw. Update __msg annotations for affected tests to match the
new "bpf_throw" error prefix.
The spin_lock case is not affected because they are already checked [1]
at the call site in do_check_insn() before bpf_throw can run.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/verifier.c?h=v7.0-rc4#n21098
Assisted-by: Claude:claude-opus-4-6
Fixes: f18b03fabaa9 ("bpf: Implement BPF exceptions")
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260320000809.643798-1-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
While very unlikely, local storage theoretically may leak memory of the
size of "struct bpf_local_storage" when destroy() fails to grab
local_storage->lock and initializes selem->local_storage before other
racing map_free() see it. Warn the user to allow debugging the issue
instead of leaking the memory silently.
Note that test_maps in bpf selftests already stress tested
bpf_selem_unlink_nofail() by creating 4096 sockets and then immediately
destroying them in multiple threads. With 64 threads, 64 x 4096 socket
local storages were created and destroyed during the test and no warning
in the function were triggered.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260318224219.615105-1-ameryhung@gmail.com
|
|
Currently, local storage may deadlock when deferring freeing selem or
local storage through kfree_rcu(), call_rcu() or call_rcu_tasks_trace()
in NMI or reentrant. Since deleting selem in NMI is an unlikely use
case, partially mitigate it by returning error when calling from
bpf_xxx_storage_delete() helpers in NMI. Note that, it is still possible
to deadlock through reentrant. A full mitigation requires returning
error when irqs_disabled() is true, which, however is too heavy-handed
for bpf_xxx_storage_delete().
The long-term solution requires _nolock versions of call_rcu. Another
possible solution is to defer the free through irq_work [0], but it
would grow the size of selem, which is non-ideal.
The check is only needed in bpf_selem_unlink(), which is used by helpers
and syscalls. bpf_selem_unlink_nofail() is fine as it is called during
map and owner tear down that never run in NMI or reentrant.
[0] https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com/
Fixes: a10787e6d58c ("bpf: Enable task local storage for tracing programs")
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://patch.msgid.link/20260319025716.2361065-1-ameryhung@gmail.com
|
|
Gregory reported in [0] that the global_map_resize test when run in
repeatedly ends up failing during program load. This stems from the fact
that BTF reference has not dropped to zero after the previous run's
module is unloaded, and the older module's BTF is still discoverable and
visible. Later, in libbpf, load_module_btfs() will find the ID for this
stale BTF, open its fd, and then it will be used during program load
where later steps taking module reference using btf_try_get_module()
fail since the underlying module for the BTF is gone.
Logically, once a module is unloaded, it's associated BTF artifacts
should become hidden. The BTF object inside the kernel may still remain
alive as long its reference counts are alive, but it should no longer be
discoverable.
To fix this, let us call btf_free_id() from the MODULE_STATE_GOING case
for the module unload to free the BTF associated IDR entry, and disable
its discovery once module unload returns to user space. If a race
happens during unload, the outcome is non-deterministic anyway. However,
user space should be able to rely on the guarantee that once it has
synchronously established a successful module unload, no more stale
artifacts associated with this module can be obtained subsequently.
Note that we must be careful to not invoke btf_free_id() in btf_put()
when btf_is_module() is true now. There could be a window where the
module unload drops a non-terminal reference, frees the IDR, but the
same ID gets reused and the second unconditional btf_free_id() ends up
releasing an unrelated entry.
To avoid a special case for btf_is_module() case, set btf->id to zero to
make btf_free_id() idempotent, such that we can unconditionally invoke it
from btf_put(), and also from the MODULE_STATE_GOING case. Since zero is
an invalid IDR, the idr_remove() should be a noop.
Note that we can be sure that by the time we reach final btf_put() for
btf_is_module() case, the btf_free_id() is already done, since the
module itself holds the BTF reference, and it will call this function
for the BTF before dropping its own reference.
[0]: https://lore.kernel.org/bpf/cover.1773170190.git.grbell@redhat.com
Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs")
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Reported-by: Gregory Bell <grbell@redhat.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260312205307.1346991-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The BPF verifier currently enforces a call stack depth of 8 frames,
regardless of the actual stack space consumption of those frames. The
limit is necessary for static call stacks, because the bookkeeping data
structures used by the verifier when stepping into static functions
during verification only support 8 stack frames. However, this
limitation only matters for static stack frames: Global subprogs are
verified by themselves and do not require limiting the call depth.
Relax this limitation to only apply to static stack frames. Verification
now only fails when there is a sequence of 8 calls to non-global
subprogs. Calling into a global subprog resets the counter. This allows
deeper call stacks, provided all frames still fit in the stack.
The change does not increase the maximum size of the call stack, only
the maximum number of frames we can place in it.
Also change the progs/test_global_func3.c selftest to use static
functions, since with the new patch it would otherwise unexpectedly
pass verification.
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260316161225.128011-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In commit 5dbb19b16ac49 ("bpf: Add third round of bounds deduction"), I
added a new round of bounds deduction because two rounds were not enough
to converge to a fixed point. This commit slightly refactor the bounds
deduction logic such that two rounds are enough.
In [1], Eduard noticed that after we improved the refinement logic, a
third call to the bounds deduction (__reg_deduce_bounds) was needed to
converge to a fixed point. More specifically, we needed this third call
to improve the s64 range using the s32 range. We added the third call
and postponed a more detailed analysis of the refinement logic.
I've been looking into this more recently. The register refinement
consists of the following calls.
__update_reg_bounds();
3 x __reg_deduce_bounds() {
deduce_bounds_32_from_64();
deduce_bounds_32_from_32();
deduce_bounds_64_from_64();
deduce_bounds_64_from_32();
};
__reg_bound_offset();
__update_reg_bounds();
From this, we can observe that we first improve the 32bit ranges from
the 64bit ranges in deduce_bounds_32_from_64, then improve the 64bit
ranges on their own in deduce_bounds_64_from_64. Intuitively, if we
were to improve the 64bit ranges on their own *before* we use them to
improve the 32bit ranges, we may reach a fixed point earlier.
In a similar manner, using CBMC, Eduard found that it's best to improve
the 32bit ranges on their own *after* we've improve them using the 64bit
ranges. That is, running deduce_bounds_32_from_32 after
deduce_bounds_32_from_64.
These changes allow us to lose one call to __reg_deduce_bounds. Without
this reordering, the test "verifier_bounds/bounds deduction cross sign
boundary, negative overlap" fails when removing one call to
__reg_deduce_bounds. In some cases, this change can even improve
precision a little bit, as illustrated in the new selftest in the next
patch.
As expected, this change didn't have any impact on the number of
instructions processed when running it through the Cilium complexity
test suite [2].
Link: https://lore.kernel.org/bpf/aIKtSK9LjQXB8FLY@mail.gmail.com/ [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Co-developed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/1b00d2749ec4c774c3ada84e265ac7fda72cfe56.1773401138.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This renaming will also help reshuffle the different parts in the
subsequent patch.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/a988ecf2c57e265b97917136b14b421038534e8c.1773401138.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
BPF_ST | BPF_PROBE_MEM32 immediate stores are not handled by
bpf_jit_blind_insn(), allowing user-controlled 32-bit immediates to
survive unblinded into JIT-compiled native code when bpf_jit_harden >= 1.
The root cause is that convert_ctx_accesses() rewrites BPF_ST|BPF_MEM
to BPF_ST|BPF_PROBE_MEM32 for arena pointer stores during verification,
before bpf_jit_blind_constants() runs during JIT compilation. The
blinding switch only matches BPF_ST|BPF_MEM (mode 0x60), not
BPF_ST|BPF_PROBE_MEM32 (mode 0xa0). The instruction falls through
unblinded.
Add BPF_ST|BPF_PROBE_MEM32 cases to bpf_jit_blind_insn() alongside the
existing BPF_ST|BPF_MEM cases. The blinding transformation is identical:
load the blinded immediate into BPF_REG_AX via mov+xor, then convert
the immediate store to a register store (BPF_STX).
The rewritten STX instruction must preserve the BPF_PROBE_MEM32 mode so
the architecture JIT emits the correct arena addressing (R12-based on
x86-64). Cannot use the BPF_STX_MEM() macro here because it hardcodes
BPF_MEM mode; construct the instruction directly instead.
Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.")
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Sachin Kumar <xcyfun@protonmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/Y6IT5VvNRchPBLI5D7JZHBzZrU9rb0ycRJPJzJSXGj7kJlX8RJwZFSM2YZjcDxoQKABkxt1T8Os2gi23PYyFuQe6KkZGWVyfz8K5afdy9ak=@protonmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch adds support to validate a pointer as not null when its
value is compared to a register whose value the verifier knows to be
null.
Initial pattern only verifies against an immediate operand.
Signed-off-by: Cupertino Miranda <cupertino.miranda@oracle.com>
Cc: David Faust <david.faust@oracle.com>
Cc: Jose Marchesi <jose.marchesi@oracle.com>
Cc: Elena Zannoni <elena.zannoni@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304195018.181396-3-cupertino.miranda@oracle.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a register undergoes a BPF_END (byte swap) operation, its scalar
value is mutated in-place. If this register previously shared a scalar ID
with another register (e.g., after an `r1 = r0` assignment), this tie must
be broken.
Currently, the verifier misses resetting `dst_reg->id` to 0 for BPF_END.
Consequently, if a conditional jump checks the swapped register, the
verifier incorrectly propagates the learned bounds to the linked register,
leading to false confidence in the linked register's value and potentially
allowing out-of-bounds memory accesses.
Fix this by explicitly resetting `dst_reg->id` to 0 in the BPF_END case
to break the scalar tie, similar to how BPF_NEG handles it via
`__mark_reg_known`.
Fixes: 9d2119984224 ("bpf: Add bitwise tracking for BPF_END")
Closes: https://lore.kernel.org/bpf/AMBPR06MB108683CFEB1CB8D9E02FC95ECF17EA@AMBPR06MB10868.eurprd06.prod.outlook.com/
Link: https://lore.kernel.org/bpf/4be25f7442a52244d0dd1abb47bc6750e57984c9.camel@gmail.com/
Reported-by: Guillaume Laporte <glapt.pro@outlook.com>
Co-developed-by: Tianci Cao <ziye@zju.edu.cn>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260304083228.142016-2-tangyazhou@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
fmod_ret BPF programs can only be attached to selected functions. For
convenience, the error injection list was originally used (along with
functions prefixed with "security_"), which contains syscalls and
several other functions.
When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n),
that list is empty and fmod_ret programs are effectively unavailable for
most of the functions. In such a case, at least enable fmod_ret programs
on syscalls.
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/472310f9a5f4944ad03214e4d943a4830fd8eb76.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Sleepable BPF programs can only be attached to selected functions. For
convenience, the error injection list was originally used, which
contains syscalls and several other functions.
When error injection is disabled (CONFIG_FUNCTION_ERROR_INJECTION=n),
that list is empty and sleepable tracing programs are effectively
unavailable. In such a case, at least enable sleepable programs on
syscalls. For discussion why syscalls were chosen, see [1].
To detect that a function is a syscall handler, we check for
arch-specific prefixes for the most common architectures. Unfortunately,
the prefixes are hard-coded in arch syscall code so we need to hard-code
them, too.
[1] https://lore.kernel.org/bpf/CAADnVQK6qP8izg+k9yV0vdcT-+=axtFQ2fKw7D-2Ei-V6WS5Dw@mail.gmail.com/
Signed-off-by: Viktor Malik <vmalik@redhat.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/2704a8512746655037e3c02b471b31bd0d76c8db.1773055375.git.vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix u32/s32 bounds when ranges cross min/max boundary (Eduard
Zingerman)
- Fix precision backtracking with linked registers (Eduard Zingerman)
- Fix linker flags detection for resolve_btfids (Ihor Solodrai)
- Fix race in update_ftrace_direct_add/del (Jiri Olsa)
- Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
resolve_btfids: Fix linker flags detection
selftests/bpf: add reproducer for spurious precision propagation through calls
bpf: collect only live registers in linked regs
Revert "selftests/bpf: Update reg_bound range refinement logic"
selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary
bpf: Fix u32/s32 bounds when ranges cross min/max boundary
bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim
ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
|
|
Fix an inconsistency between func_states_equal() and
collect_linked_regs():
- regsafe() uses check_ids() to verify that cached and current states
have identical register id mapping.
- func_states_equal() calls regsafe() only for registers computed as
live by compute_live_registers().
- clean_live_states() is supposed to remove dead registers from cached
states, but it can skip states belonging to an iterator-based loop.
- collect_linked_regs() collects all registers sharing the same id,
ignoring the marks computed by compute_live_registers().
Linked registers are stored in the state's jump history.
- backtrack_insn() marks all linked registers for an instruction
as precise whenever one of the linked registers is precise.
The above might lead to a scenario:
- There is an instruction I with register rY known to be dead at I.
- Instruction I is reached via two paths: first A, then B.
- On path A:
- There is an id link between registers rX and rY.
- Checkpoint C is created at I.
- Linked register set {rX, rY} is saved to the jump history.
- rX is marked as precise at I, causing both rX and rY
to be marked precise at C.
- On path B:
- There is no id link between registers rX and rY,
otherwise register states are sub-states of those in C.
- Because rY is dead at I, check_ids() returns true.
- Current state is considered equal to checkpoint C,
propagate_precision() propagates spurious precision
mark for register rY along the path B.
- Depending on a program, this might hit verifier_bug()
in the backtrack_insn(), e.g. if rY ∈ [r1..r5]
and backtrack_insn() spots a function call.
The reproducer program is in the next patch.
This was hit by sched_ext scx_lavd scheduler code.
Changes in tests:
- verifier_scalar_ids.c selftests need modification to preserve
some registers as live for __msg() checks.
- exceptions_assert.c adjusted to match changes in the verifier log,
R0 is dead after conditional instruction and thus does not get
range.
- precise.c adjusted to match changes in the verifier log, register r9
is dead after comparison and it's range is not important for test.
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges
in __reg32_deduce_bounds() in the following situations:
- s32 range crosses U32_MAX/0 boundary, positive part of the s32 range
overlaps with u32 range:
0 U32_MAX
| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxx s32 range xxxxxxxxx] [xxxxxxx|
0 S32_MAX S32_MIN -1
- s32 range crosses U32_MAX/0 boundary, negative part of the s32 range
overlaps with u32 range:
0 U32_MAX
| [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] |
|----------------------------|----------------------------|
|xxxxxxxxx] [xxxxxxxxxxxx s32 range |
0 S32_MAX S32_MIN -1
- No refinement if ranges overlap in two intervals.
This helps for e.g. consider the following program:
call %[bpf_get_prandom_u32];
w0 &= 0xffffffff;
if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX]
if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1]
if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1]
r10 = 0;
1: ...;
The reg_bounds.c selftest is updated to incorporate identical logic,
refinement based on non-overflowing range halves:
((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪
((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1]))
Reported-by: Andrea Righi <arighi@nvidia.com>
Reported-by: Emil Tsalapatis <emil@etsalapatis.com>
Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
kthread_exit became a macro to do_exit in commit 28aaa9c39945
("kthread: consolidate kthread exit paths to prevent use-after-free"),
so there is no kthread_exit function BTF ID to resolve. Remove it from
noreturn_deny to avoid resolve_btfids unresolved symbol warnings.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The root cause of this bug is that when 'bpf_link_put' reduces the
refcount of 'shim_link->link.link' to zero, the resource is considered
released but may still be referenced via 'tr->progs_hlist' in
'cgroup_shim_find'. The actual cleanup of 'tr->progs_hlist' in
'bpf_shim_tramp_link_release' is deferred. During this window, another
process can cause a use-after-free via 'bpf_trampoline_link_cgroup_shim'.
Based on Martin KaFai Lau's suggestions, I have created a simple patch.
To fix this:
Add an atomic non-zero check in 'bpf_trampoline_link_cgroup_shim'.
Only increment the refcount if it is not already zero.
Testing:
I verified the fix by adding a delay in
'bpf_shim_tramp_link_release' to make the bug easier to trigger:
static void bpf_shim_tramp_link_release(struct bpf_link *link)
{
/* ... */
if (!shim_link->trampoline)
return;
+ msleep(100);
WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link,
shim_link->trampoline, NULL));
bpf_trampoline_put(shim_link->trampoline);
}
Before the patch, running a PoC easily reproduced the crash(almost 100%)
with a call trace similar to KaiyanM's report.
After the patch, the bug no longer occurs even after millions of
iterations.
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/3c4ebb0b.46ff8.19abab8abe2.Coremail.kaiyanm@hust.edu.cn/
Signed-off-by: Lang Xu <xulang@uniontech.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/279EEE1BA1DDB49D+20260303095217.34436-1-xulang@uniontech.com
|
|
Global subprogs are currently not allowed to return void. Adjust
verifier logic to allow global functions with a void return type.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-5-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
From: Eduard Zingerman <eddyz87@gmail.com>
Both main progs and subprogs use the same function in the verifier,
check_return_code, to verify the type and value range of the register
being returned. However, subprogs only need a subset of the logic in
check_return_code. this also goes the way - check_return_code explicitly
checks whether it is handling a subprogram in multiple places, complicating
the logic. Separate the handling of the two into separate functions.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-4-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
From: Eduard Zingerman <eddyz87@gmail.com>
The check_return_code function has explicit checks on whether
a program type can return void. Factor this logic out to reuse
it later for both main progs and subprogs.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-3-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Factor the return value range calculation logic in check_return_code
out of the function in preparation for separating the return value
validation logic for BPF_EXIT and bpf_throw().
No functional changes. The change made in return_retval_code's handling
of PROG_TRACING program types (not error'ing on the default case) is a
no-op because the match on the program attach type is exhaustive.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260228184759.108145-2-emil@etsalapatis.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The limitation on fixed offsets stems from the fact that certain program
types rewrite the accesses to the context structure and translate them
to accesses to the real underlying type. Technically, in the past, we
could have stashed the register offset in insn_aux and made rewrites
work, but we've never needed it in the past since the offset for such
context structures easily fit in the s16 signed instruction offset.
Regardless, the consequence is that for program types where the program
type's verifier ops doesn't supply a convert_ctx_access callback, we
unnecessarily reject accesses with a modified ctx pointer (i.e., one
whose offset has been shifted) in check_ptr_off_reg. Make an exception
for such program types (like syscall, tracepoint, raw_tp, etc.).
Pass in fixed_off_ok as true to __check_ptr_off_reg for such cases, and
accumulate the register offset into insn->off passed to check_ctx_access.
In particular, the accumulation is critical since we need to correctly
track the max_ctx_offset which is used for bounds checking the buffer
for syscall programs at runtime.
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Schatzberg <dschatzberg@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260227005725.1247305-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
cpu_map_bpf_prog_run_xdp() handles XDP_PASS, XDP_REDIRECT, and
XDP_DROP but is missing an XDP_ABORTED case. Without it, XDP_ABORTED
falls into the default case which logs a misleading "invalid XDP
action" warning instead of tracing the abort via trace_xdp_exception().
Add the missing XDP_ABORTED case with trace_xdp_exception(), matching
the handling already present in the skb path (cpu_map_bpf_prog_run_skb),
devmap (dev_map_bpf_prog_run), and the generic XDP path (do_xdp_generic).
Also pass xdpf->dev_rx instead of NULL to bpf_warn_invalid_xdp_action()
in the default case, so the warning includes the actual device name.
This aligns with the generic XDP path in net/core/dev.c which already
passes the real device.
Signed-off-by: Anand Kumar Shaw <anandkrshawheritage@gmail.com>
Link: https://lore.kernel.org/r/20260218042924.42931-1-anandkrshawheritage@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Implementing BPF version of preempt_count() requires accessing lowcore
from BPF. Since lowcore can be relocated, open-coding
(struct lowcore *)0 does not work, so add a kfunc.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20260217160813.100855-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
We're hitting an invariant violation in Cilium that sometimes leads to
BPF programs being rejected and Cilium failing to start [1]. The
following extract from verifier logs shows what's happening:
from 201 to 236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
236: R1=0 R6=ctx() R7=1 R9=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100)) R10=fp0
; if (magic == MARK_MAGIC_HOST || magic == MARK_MAGIC_OVERLAY || magic == MARK_MAGIC_ENCRYPT) @ bpf_host.c:1337
236: (16) if w9 == 0xe00 goto pc+45 ; R9=scalar(smin=umin=smin32=umin32=3585,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
237: (16) if w9 == 0xf00 goto pc+1
verifier bug: REG INVARIANTS VIOLATION (false_reg1): range bounds violation u64=[0xe01, 0xe00] s64=[0xe01, 0xe00] u32=[0xe01, 0xe00] s32=[0xe01, 0xe00] var_off=(0xe00, 0x0)
We reach instruction 236 with two possible values for R9, 0xe00 and
0xf00. This is perfectly reflected in the tnum, but of course the ranges
are less accurate and cover [0xe00; 0xf00]. Taking the fallthrough path
at instruction 236 allows the verifier to reduce the range to
[0xe01; 0xf00]. The tnum is however not updated.
With these ranges, at instruction 237, the verifier is not able to
deduce that R9 is always equal to 0xf00. Hence the fallthrough pass is
explored first, the verifier refines the bounds using the assumption
that R9 != 0xf00, and ends up with an invariant violation.
This pattern of impossible branch + bounds refinement is common to all
invariant violations seen so far. The long-term solution is likely to
rely on the refinement + invariant violation check to detect dead
branches, as started by Eduard. To fix the current issue, we need
something with less refactoring that we can backport.
This patch uses the tnum_step helper introduced in the previous patch to
detect the above situation. In particular, three cases are now detected
in the bounds refinement:
1. The u64 range and the tnum only overlap in umin.
u64: ---[xxxxxx]-----
tnum: --xx----------x-
2. The u64 range and the tnum only overlap in the maximum value
represented by the tnum, called tmax.
u64: ---[xxxxxx]-----
tnum: xx-----x--------
3. The u64 range and the tnum only overlap in between umin (excluded)
and umax.
u64: ---[xxxxxx]-----
tnum: xx----x-------x-
To detect these three cases, we call tnum_step(tnum, umin), which
returns the smallest member of the tnum greater than umin, called
tnum_next here. We're in case (1) if umin is part of the tnum and
tnum_next is greater than umax. We're in case (2) if umin is not part of
the tnum and tnum_next is equal to tmax. Finally, we're in case (3) if
umin is not part of the tnum, tnum_next is inferior or equal to umax,
and calling tnum_step a second time gives us a value past umax.
This change implements these three cases. With it, the above bytecode
looks as follows:
0: (85) call bpf_get_prandom_u32#7 ; R0=scalar()
1: (47) r0 |= 3584 ; R0=scalar(smin=0x8000000000000e00,umin=umin32=3584,smin32=0x80000e00,var_off=(0xe00; 0xfffffffffffff1ff))
2: (57) r0 &= 3840 ; R0=scalar(smin=umin=smin32=umin32=3584,smax=umax=smax32=umax32=3840,var_off=(0xe00; 0x100))
3: (15) if r0 == 0xe00 goto pc+2 ; R0=3840
4: (15) if r0 == 0xf00 goto pc+1
4: R0=3840
6: (95) exit
In addition to the new selftests, this change was also verified with
Agni [3]. For the record, the raw SMT is available at [4]. The property
it verifies is that: If a concrete value x is contained in all input
abstract values, after __update_reg_bounds, it will continue to be
contained in all output abstract values.
Link: https://github.com/cilium/cilium/issues/44216 [1]
Link: https://pchaigno.github.io/test-verifier-complexity.html [2]
Link: https://github.com/bpfverif/agni [3]
Link: https://pastebin.com/raw/naCfaqNx [4]
Fixes: 0df1a55afa83 ("bpf: Warn on internal verifier errors")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Tested-by: Marco Schirrmeister <mschirrmeister@gmail.com>
Co-developed-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/ef254c4f68be19bd393d450188946821c588565d.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This commit introduces tnum_step(), a function that, when given t, and a
number z returns the smallest member of t larger than z. The number z
must be greater or equal to the smallest member of t and less than the
largest member of t.
The first step is to compute j, a number that keeps all of t's known
bits, and matches all unknown bits to z's bits. Since j is a member of
the t, it is already a candidate for result. However, we want our result
to be (minimally) greater than z.
There are only two possible cases:
(1) Case j <= z. In this case, we want to increase the value of j and
make it > z.
(2) Case j > z. In this case, we want to decrease the value of j while
keeping it > z.
(Case 1) j <= z
t = xx11x0x0
z = 10111101 (189)
j = 10111000 (184)
^
k
(Case 1.1) Let's first consider the case where j < z. We will address j
== z later.
Since z > j, there had to be a bit position that was 1 in z and a 0 in
j, beyond which all positions of higher significance are equal in j and
z. Further, this position could not have been unknown in a, because the
unknown positions of a match z. This position had to be a 1 in z and
known 0 in t.
Let k be position of the most significant 1-to-0 flip. In our example, k
= 3 (starting the count at 1 at the least significant bit). Setting (to
1) the unknown bits of t in positions of significance smaller than
k will not produce a result > z. Hence, we must set/unset the unknown
bits at positions of significance higher than k. Specifically, we look
for the next larger combination of 1s and 0s to place in those
positions, relative to the combination that exists in z. We can achieve
this by concatenating bits at unknown positions of t into an integer,
adding 1, and writing the bits of that result back into the
corresponding bit positions previously extracted from z.
>From our example, considering only positions of significance greater
than k:
t = xx..x
z = 10..1
+ 1
-----
11..0
This is the exact combination 1s and 0s we need at the unknown bits of t
in positions of significance greater than k. Further, our result must
only increase the value minimally above z. Hence, unknown bits in
positions of significance smaller than k should remain 0. We finally
have,
result = 11110000 (240)
(Case 1.2) Now consider the case when j = z, for example
t = 1x1x0xxx
z = 10110100 (180)
j = 10110100 (180)
Matching the unknown bits of the t to the bits of z yielded exactly z.
To produce a number greater than z, we must set/unset the unknown bits
in t, and *all* the unknown bits of t candidates for being set/unset. We
can do this similar to Case 1.1, by adding 1 to the bits extracted from
the masked bit positions of z. Essentially, this case is equivalent to
Case 1.1, with k = 0.
t = 1x1x0xxx
z = .0.1.100
+ 1
---------
.0.1.101
This is the exact combination of bits needed in the unknown positions of
t. After recalling the known positions of t, we get
result = 10110101 (181)
(Case 2) j > z
t = x00010x1
z = 10000010 (130)
j = 10001011 (139)
^
k
Since j > z, there had to be a bit position which was 0 in z, and a 1 in
j, beyond which all positions of higher significance are equal in j and
z. This position had to be a 0 in z and known 1 in t. Let k be the
position of the most significant 0-to-1 flip. In our example, k = 4.
Because of the 0-to-1 flip at position k, a member of t can become
greater than z if the bits in positions greater than k are themselves >=
to z. To make that member *minimally* greater than z, the bits in
positions greater than k must be exactly = z. Hence, we simply match all
of t's unknown bits in positions more significant than k to z's bits. In
positions less significant than k, we set all t's unknown bits to 0
to retain minimality.
In our example, in positions of greater significance than k (=4),
t=x000. These positions are matched with z (1000) to produce 1000. In
positions of lower significance than k, t=10x1. All unknown bits are set
to 0 to produce 1001. The final result is:
result = 10001001 (137)
This concludes the computation for a result > z that is a member of t.
The procedure for tnum_step() in this commit implements the idea
described above. As a proof of correctness, we verified the algorithm
against a logical specification of tnum_step. The specification asserts
the following about the inputs t, z and output res that:
1. res is a member of t, and
2. res is strictly greater than z, and
3. there does not exist another value res2 such that
3a. res2 is also a member of t, and
3b. res2 is greater than z
3c. res2 is smaller than res
We checked the implementation against this logical specification using
an SMT solver. The verification formula in SMTLIB format is available
at [1]. The verification returned an "unsat": indicating that no input
assignment exists for which the implementation and the specification
produce different outputs.
In addition, we also automatically generated the logical encoding of the
C implementation using Agni [2] and verified it against the same
specification. This verification also returned an "unsat", confirming
that the implementation is equivalent to the specification. The formula
for this check is also available at [3].
Link: https://pastebin.com/raw/2eRWbiit [1]
Link: https://github.com/bpfverif/agni [2]
Link: https://pastebin.com/raw/EztVbBJ2 [3]
Co-developed-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Signed-off-by: Srinivas Narayana <srinivas.narayana@rutgers.edu>
Co-developed-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Santosh Nagarakatte <santosh.nagarakatte@rutgers.edu>
Signed-off-by: Harishankar Vishwanathan <harishankar.vishwanathan@gmail.com>
Link: https://lore.kernel.org/r/93fdf71910411c0f19e282ba6d03b4c65f9c5d73.1772225741.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
On PREEMPT_RT kernels, the per-CPU xdp_dev_bulk_queue (bq) can be
accessed concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __dev_flush() run atomically
with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_xmit_all(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double-free / use-after-free on bq->q[]: bq_xmit_all() snapshots
cnt = bq->count, then iterates bq->q[0..cnt-1] to transmit frames.
If preempted after the snapshot, a second task can call bq_enqueue()
-> bq_xmit_all() on the same bq, transmitting (and freeing) the
same frames. When the first task resumes, it operates on stale
pointers in bq->q[], causing use-after-free.
2. bq->count and bq->q[] corruption: concurrent bq_enqueue() modifying
bq->count and bq->q[] while bq_xmit_all() is reading them.
3. dev_rx/xdp_prog teardown race: __dev_flush() clears bq->dev_rx and
bq->xdp_prog after bq_xmit_all(). If preempted between
bq_xmit_all() return and bq->dev_rx = NULL, a preempting
bq_enqueue() sees dev_rx still set (non-NULL), skips adding bq to
the flush_list, and enqueues a frame. When __dev_flush() resumes,
it clears dev_rx and removes bq from the flush_list, orphaning the
newly enqueued frame.
4. __list_del_clearprev() on flush_node: similar to the cpumap race,
both tasks can call __list_del_clearprev() on the same flush_node,
the second dereferences the prev pointer already set to NULL.
The race between task A (__dev_flush -> bq_xmit_all) and task B
(bq_enqueue -> bq_xmit_all) on the same CPU:
Task A (xdp_do_flush) Task B (ndo_xdp_xmit redirect)
---------------------- --------------------------------
__dev_flush(flush_list)
bq_xmit_all(bq)
cnt = bq->count /* e.g. 16 */
/* start iterating bq->q[] */
<-- CFS preempts Task A -->
bq_enqueue(dev, xdpf)
bq->count == DEV_MAP_BULK_SIZE
bq_xmit_all(bq, 0)
cnt = bq->count /* same 16! */
ndo_xdp_xmit(bq->q[])
/* frames freed by driver */
bq->count = 0
<-- Task A resumes -->
ndo_xdp_xmit(bq->q[])
/* use-after-free: frames already freed! */
Fix this by adding a local_lock_t to xdp_dev_bulk_queue and acquiring
it in bq_enqueue() and __dev_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() (when
PREEMPT_RT_NEEDS_BH_LOCK is not set) and does not disable
preemption, which allows CFS scheduling to preempt a task during
bq_flush_to_queue(), enabling another task on the same CPU to enter
bq_enqueue() and operate on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after bq->count is reset in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). These paths already run under
local_bh_disable(), so use local_lock_nested_bh() which on non-RT is
a pure annotation with no overhead, and on PREEMPT_RT provides a
per-CPU sleeping lock that serializes access to the bq.
To reproduce, insert an mdelay(100) between bq->count = 0 and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Fixes: 3253cb49cbad ("softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260225121459.183121-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This assumption will always hold going forward, hence just remove the
various checks and assume it is true with a comment for the uninformed
reader.
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, when use_kmalloc_nolock is false, the freeing of fields for a
local storage selem is done eagerly before waiting for the RCU or RCU
tasks trace grace period to elapse. This opens up a window where the
program which has access to the selem can recreate the fields after the
freeing of fields is done eagerly, causing memory leaks when the element
is finally freed and returned to the kernel.
Make a few changes to address this. First, delay the freeing of fields
until after the grace periods have expired using a __bpf_selem_free_rcu
wrapper which is eventually invoked after transitioning through the
necessary number of grace period waits. Replace usage of the kfree_rcu
with call_rcu to be able to take a custom callback. Finally, care needs
to be taken to extend the rcu barriers for all cases, and not just when
use_kmalloc_nolock is true, as RCU and RCU tasks trace callbacks can be
in flight for either case and access the smap field, which is used to
obtain the BTF record to walk over special fields in the map value.
While we're at it, drop migrate_disable() from bpf_selem_free_rcu, since
migration should be disabled for RCU callbacks already.
Fixes: 9bac675e6368 ("bpf: Postpone bpf_obj_free_fields to the rcu callback")
Reviewed-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
BPF hash map may now use the map_check_btf() callback to decide whether
to set a dtor on its bpf_mem_alloc or not. Unlike C++ where members can
opt out of const-ness using mutable, we must lose the const qualifier on
the callback such that we can avoid the ugly cast. Make the change and
adjust all existing users, and lose the comment in hashtab.c.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
There is a race window where BPF hash map elements can leak special
fields if the program with access to the map value recreates these
special fields between the check_and_free_fields done on the map value
and its eventual return to the memory allocator.
Several ways were explored prior to this patch, most notably [0] tried
to use a poison value to reject attempts to recreate special fields for
map values that have been logically deleted but still accessible to BPF
programs (either while sitting in the free list or when reused). While
this approach works well for task work, timers, wq, etc., it is harder
to apply the idea to kptrs, which have a similar race and failure mode.
Instead, we change bpf_mem_alloc to allow registering destructor for
allocated elements, such that when they are returned to the allocator,
any special fields created while they were accessible to programs in the
mean time will be freed. If these values get reused, we do not free the
fields again before handing the element back. The special fields thus
may remain initialized while the map value sits in a free list.
When bpf_mem_alloc is retired in the future, a similar concept can be
introduced to kmalloc_nolock-backed kmem_cache, paired with the existing
idea of a constructor.
Note that the destructor registration happens in map_check_btf, after
the BTF record is populated and (at that point) avaiable for inspection
and duplication. Duplication is necessary since the freeing of embedded
bpf_mem_alloc can be decoupled from actual map lifetime due to logic
introduced to reduce the cost of rcu_barrier()s in mem alloc free path in
9f2c6e96c65e ("bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.").
As such, once all callbacks are done, we must also free the duplicated
record. To remove dependency on the bpf_map itself, also stash the key
size of the map to obtain value from htab_elem long after the map is
gone.
[0]: https://lore.kernel.org/bpf/20260216131341.1285427-1-mykyta.yatsenko5@gmail.com
Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: syzbot@syzkaller.appspotmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260227224806.646888-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
get_upper_ifindexes() iterates over all upper devices and writes their
indices into an array without checking bounds.
Also the callers assume that the max number of upper devices is
MAX_NEST_DEV and allocate excluded_devices[1+MAX_NEST_DEV] on the stack,
but that assumption is not correct and the number of upper devices could
be larger than MAX_NEST_DEV (e.g., many macvlans), causing a
stack-out-of-bounds write.
Add a max parameter to get_upper_ifindexes() to avoid the issue.
When there are too many upper devices, return -EOVERFLOW and abort the
redirect.
To reproduce, create more than MAX_NEST_DEV(8) macvlans on a device with
an XDP program attached using BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS.
Then send a packet to the device to trigger the XDP redirect path.
Reported-by: syzbot+10cc7f13760b31bd2e61@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/698c4ce3.050a0220.340abe.000b.GAE@google.com/T/
Fixes: aeea1b86f936 ("bpf, devmap: Exclude XDP broadcast to master device")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Kohei Enju <kohei@enjuk.jp>
Link: https://lore.kernel.org/r/20260225053506.4738-1-kohei@enjuk.jp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently, tailcall count is incremented in the interpreter even when
tailcall fails due to non-existent prog. Fix this by holding off on
the tailcall count increment until after NULL check on the prog.
Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Link: https://lore.kernel.org/r/20260220062959.195101-1-hbathini@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For the following scenario:
struct tree_node {
struct bpf_refcount ref;
struct bpf_rb_node node;
struct node_data __kptr * node_data;
u64 key;
};
This means node_data would have the type PTR_TO_BTF_ID | MEM_ALLOC |
NON_OWN_REF | MEM_RCU.
When traversing an rbtree using bpf_rbtree_left/right, if we need to
use bpf_kptr_xchg to read the __kptr pointer, we still need to follow
the remove-read-add sequence.
This patch allows us to use bpf_kptr_xchg to directly read the __kptr
pointer without any prior operations.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Link: https://lore.kernel.org/r/20260214124042.62229-5-pilgrimtao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When traversing an rbtree using bpf_rbtree_left/right, if bpf_kptr_xchg
is used to access the __kptr pointer contained in a node, it currently
requires first removing the node with bpf_rbtree_remove and clearing the
NON_OWN_REF flag, then re-adding the node to the original rbtree with
bpf_rbtree_add after usage. This process significantly degrades rbtree
traversal performance. The patch enables accessing __kptr pointers with
the NON_OWN_REF flag set while holding the lock, eliminating the need
for this remove-read-add sequence.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Link: https://lore.kernel.org/r/20260214124042.62229-3-pilgrimtao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
For the following scenario:
struct tree_node {
struct bpf_rb_node node;
struct request __kptr *req;
u64 key;
};
struct bpf_rb_root tree_root __contains(tree_node, node);
struct bpf_spin_lock tree_lock;
If we need to traverse all nodes in the rbtree, retrieve the __kptr
pointer from each node, and read kernel data from the referenced
object, using bpf_kptr_xchg appears unavoidable.
This patch skips the BPF verifier checks for bpf_kptr_xchg when
called while holding a lock.
Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn>
Link: https://lore.kernel.org/r/20260214124042.62229-2-pilgrimtao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Cross-merge trees after 7.0-rc1.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Conversion performed via this Coccinelle script:
// SPDX-License-Identifier: GPL-2.0-only
// Options: --include-headers-for-types --all-includes --include-headers --keep-comments
virtual patch
@gfp depends on patch && !(file in "tools") && !(file in "samples")@
identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex,
kzalloc_obj,kzalloc_objs,kzalloc_flex,
kvmalloc_obj,kvmalloc_objs,kvmalloc_flex,
kvzalloc_obj,kvzalloc_objs,kvzalloc_flex};
@@
ALLOC(...
- , GFP_KERNEL
)
$ make coccicheck MODE=patch COCCI=gfp.cocci
Build and boot tested x86_64 with Fedora 42's GCC and Clang:
Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Cross-merge BPF and other fixes after downstream PR.
No conflicts.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This field is now used only for linked scalar registers tracking.
Rename it to 'delta' to better describe it's purpose:
constant delta between "linked" scalars with the same ID.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260212-ptrs-off-migration-v2-4-00820e4d3438@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|