| Age | Commit message (Collapse) | Author |
|
Bump MAX_CALL_FRAMES from 8 to 16 to allow deeper call chains
that Rust-BPF requires and update selftests.
Link: https://lore.kernel.org/r/20260613180755.29671-1-alexei.starovoitov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In do_cpu_nanosleep(), posix_cpu_timer_create() takes a pid reference
via get_pid() and stores it in timer.it.cpu.pid. If the subsequent
posix_cpu_timer_set() call fails, the function returns immediately
without calling posix_cpu_timer_del() to release the pid reference,
causing a leak.
Fix it by calling posix_cpu_timer_del() before the unlock-and-return
on the error path, consistent with the other exit paths in the same
function.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: WenTao Liang <vulab@iscas.ac.cn>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260611161738.97043-1-vulab@iscas.ac.cn
|
|
Teddy reported that a XEN HVM has a long boot delay, which was bisected to
the recent enhancements to the negative motion detection. It turned out
that the jiffies clocksource is used in early boot before it is registered,
which leaves the max_delta_raw field at zero. That causes the read out to
be clamped to the max delta of 0, which means time is not making progress.
Cure it by ensuring that it is initialized before its first usage in
timekeeping_init().
Fixes: 76031d9536a0 ("clocksource: Make negative motion detection more robust")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Teddy Astie <teddy.astie@vates.tech>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/87y0gn3fve.ffs@fw13
Closes: https://lore.kernel.org/all/1780914594.8631fc262581453bbf619ec5b2062170.19ea6c8227b000701b@vates.tech
|
|
When a cgroup BPF program exits with 0, bpf_prog_run_array_cg() sets
the hook return value to -EPERM if it is not a valid errno. This is
correct for errno-based hooks, which return 0 on success and negative
errno on failure, but wrong for boolean and void LSM hooks. Boolean
LSM hooks should only return true or false, and void LSM hooks have
no return value at all.
Fix it by skipping setting -EPERM for hooks not returning errno.
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20260610201724.733943-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Generic XDP devmap multi redirect uses skb_clone() for intermediate
destinations and sends the last destination with the original skb. This
can leave multiple destinations sharing the same packet data.
This becomes visible after generic devmap egress-program support was
added: a devmap egress program may mutate packet data, and another
destination sharing the same data can observe that mutation.
Native XDP broadcast redirect does not have this issue because
xdpf_clone() copies the frame data for each destination. Generic XDP
should provide the same per-destination isolation before running a
devmap egress program.
Fix this by making cloned skbs private before running the generic devmap
egress program. Use skb_copy() instead of skb_unshare() so allocation
failure does not consume the skb and the existing caller error paths keep
their ownership semantics.
Fixes: 2ea5eabaf04a ("bpf: devmap: Implement devmap prog execution for generic XDP")
Suggested-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Sun Jian <sun.jian.kdev@gmail.com>
Link: https://lore.kernel.org/r/20260612114032.244616-2-sun.jian.kdev@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
"Three more fixes for the DMA-mapping code, related to PCI P2PDMA, DMA
debug and DMA link ranges API (Li RongQing and Jason Gunthorpe)"
* tag 'dma-mapping-7.1-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
iommu/dma: Do not try to iommu_map a 0 length region in swiotlb
dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device
dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segments
|
|
Merge updates related to system sleep support, two updates of the
intel_rapl power capping driver, and a pm-graph utility fix for
7.2-rc1:
- Add sysctl interface for DPM watchdog timeouts (Tzung-Bi Shih)
- Use complete() instead of complete_all() in device_pm_sleep_init() to
avoid a false-positive warning from lockdep_assert_RT_in_threaded_ctx()
when CONFIG_PROVE_RAW_LOCK_NESTING is enabled (Jiakai Xu)
- Use a flexible array for CRC uncompressed buffers during hibernation
image saving (Rosen Penev)
- Make the LZ4 algorithm available for hibernation compression (l1rox3)
- Move the preallocate_image() call during hibernation after the
"prepare" phase of the "freeze" transition (Matthew Leach)
- Fix a memory leak in rapl_add_package_cpuslocked() in the intel_rapl
power capping driver and use sysfs_emit() in cpumask_show() in that
driver (Sumeet Pawnikar, Yury Norov)
- Fix ValueError when parsing incomplete device properties in the
pm-graph utility (Gongwei Li)
* pm-sleep:
PM: dpm_watchdog: Add sysctl interface for DPM watchdog timeouts
PM: hibernate: Use flexible array for CRC uncompressed buffers
PM: hibernate: make LZ4 available for hibernation compression
PM: sleep: Use complete() in device_pm_sleep_init()
PM: hibernate: call preallocate_image() after freeze prepare
* pm-powercap:
powercap: intel_rapl: Use sysfs_emit() in cpumask_show()
powercap: intel_rapl: Fix memory leak in rapl_add_package_cpuslocked()
* pm-tools:
PM: tools: pm-graph: fix ValueError when parsing incomplete device properties
|
|
Merge cpuidle updates, OPP (operating performance points) updates and a
PM QoS update for 7.2-rc1:
- Allow the intel_idle driver to avoid exposing C-states that are
redundant when PC6 is disabled (Artem Bityutskiy)
- Fix memory leak and a potential race in the OPP core (Abdun Nihaal,
Di Shen)
- Mark Rust OPP methods as inline (Nicolás Antinori)
- Fix misc device registration failure path in the PM QoS core (Yuho
Choi)
* pm-cpuidle:
intel_idle: Drop C-states redundant when PC6 is disabled
intel_idle: Introduce a helper for checking PC6
intel_idle: Add constants for MSR_PKG_CST_CONFIG_CONTROL
* pm-opp:
opp: rust: mark OPP methods as inline
OPP: of: Fix potential memory leak in opp_parse_supplies()
OPP: Fix race between OPP addition and lookup
* pm-qos:
PM: QoS: Fix misc device registration unwind
|
|
Add the contended_release trace event. This tracepoint fires on the
holder side when a contended lock is released, complementing the
existing contention_begin/contention_end tracepoints which fire on the
waiter side.
This enables correlating lock hold time under contention with waiter
events by lock address.
Add trace_contended_release()/trace_call__contended_release() calls to
the slowpath unlock paths of sleepable locks: mutex, rtmutex, semaphore,
rwsem, percpu-rwsem, and RT-specific rwbase locks.
Where possible, trace_contended_release() fires before the lock is
released and before the waiter is woken. For some lock types, the
tracepoint fires after the release but before the wake. Making the
placement consistent across all lock types is not worth the added
complexity.
For reader/writer locks, the tracepoint fires for every reader releasing
while a writer is waiting, not only for the last reader.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Link: https://patch.msgid.link/02f4f6c5ce6761e7f6587cf0ff2289d962ecddd4.1780506267.git.d@ilvokhin.com
|
|
Move the percpu_up_read() slowpath out of the inline function into a new
__percpu_up_read() to avoid binary size increase from adding a
tracepoint to an inlined function.
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Link: https://patch.msgid.link/3dd2a1b9ab4f469e1892766cb63f41d6b0f53d29.1780506267.git.d@ilvokhin.com
|
|
Breno reported significant c2c HITM in a futex hash heavy workload.
It turns out that the hash bucket to private hash table reverse pointer
(futex_hash_bucket::priv) was to blame. Notably when the hash buckets are
heavily contended, the: 'fph = bh->priv;' load in futex_hash() will typically
miss and consequently become quite expensive.
Since this load in particular is quite superfluous, removing it is fairly
straight forward. However, removing it does not in fact achieve anything much.
The pain moves to the next user, notably: futex_hash_put().
Therefore rework the whole private hash refcounting to avoid needing this back
pointer (and removing it). Instead of passing around 'struct futex_hash_bucket
*hb', pass around a new structure that contains it and the related 'struct
futex_private_hash *fph' pointer in tandem.
Funnily this turns out to remove more code than it adds and significantly
improves futex hash performance (as measured by 'perf bench futex hash'):
SKL dual socket 112 threads:
Baseline Patched
shared (16k) 1571857 1641435 + 4.4%
autosize (512) 646390 903371 +39.7%
-b 256 464395 587014 +26.4%
-b 512 715687 995943 +39.2%
-b 1024 995085 1396328 +40.3%
-b 2048 1293114 1668395 +29.0%
-b 4096 2124438 2240228 + 5.5%
Zen3 dual socket 256 threads:
Baseline Patched
shared (16k) 1275840 1381279 + 8.2%
autosize (512) 1252745 1482179 +18.3%
-b 256 856274 955455 +11.5%
-b 512 1267490 1544010 +21.8%
-b 1024 1424013 1625424 +14.1%
-b 2048 1505181 1669342 +10.9%
-b 4096 1465993 1688932 +15.2%
AMD EPYC 9D64 (Zen4, single socket) 176 threads:
Baseline Patched Delta
shared (16k) 1,230,599 1,368,655 +11.2%
autosize (1024) 1,285,440 1,556,946 +21.1%
-b 256 1,341,471 1,520,303 +13.3%
-b 512 1,438,330 1,599,319 +11.2%
-b 1024 1,443,772 1,622,493 +12.4%
-b 2048 1,472,108 1,643,975 +11.7%
-b 4096 1,333,098 1,570,897 +17.8%
Reported-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Breno Leitao <leitao@debian.org>
Tested-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260610135510.GB1430057@noisy.programming.kicks-ass.net
|
|
While testing Prateek's throttle series, I noticed a panic issue when
coresched is enabled and bisected to this patch.
I fed the panic log and this patch to an agent and its analysis looks
correct to me(cpu56 and cpu57 are siblings in a VM):
cpu57 (holds core-wide lock)
pick_next_task() [core scheduling]
for_each_cpu_wrap(i, smt_mask, 57):
i=57: pick_task(rq_57)
pick_task_fair(rq_57)
-> picks task A
rq_57->core_pick = task A
// task_rq(A) == rq_57
i=56: pick_task(rq_56)
pick_task_fair(rq_56)
cfs_rq->nr_queued == 0
goto idle
sched_balance_newidle(rq_56)
raw_spin_rq_unlock(rq_56)
// core-wide lock released
newidle_balance() pulls
task A: rq_57 -> rq_56
// task_rq(A) == rq_56 now
raw_spin_rq_lock(rq_56)
// core-wide lock re-acquired
return > 0
goto again
pick_task_fair(rq_56)
-> picks task A
rq_56->core_pick = task A
// first loop done
// rq_57->core_pick is still task A (set before lock release)
// but task_rq(A) == rq_56 now
next = rq_57->core_pick // = task A
put_prev_set_next_task(rq_57, prev, task A)
__set_next_task_fair(rq_57, task A)
hrtick_start_fair(rq_57, task A)
WARN_ON_ONCE(task_rq(task A) != rq_57)
// task_rq(A) == rq_56
IOW: by allowing pick_task_fair() to do newidle_balance and not returning
RETRY_TASK, it can end up selecting the same task on two CPUs. Restore the
previous state by never doing newidle when core scheduling is enabled.
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: "Aaron Lu" <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260603095108.GA1684319@bytedance.com
|
|
The fix in commit abad3d0bad72 ("bpf: Fix oob access in cgroup local
storage") is still incomplete. The prog-array compatibility check
treats a program with no cgroup storage as compatible with any stored
storage cookie. This allows a storage-less program to bridge a tail
call chain between an entry program and a storage-using callee even
though cgroup local storage at runtime still follows the caller's
context, that is, A -> B(no storage) -> C(storage) path.
Requiring exact cookie equality would break the legitimate case of a
storage-less leaf program being tail called from a storage-using one.
Instead, only accept a zero storage cookie if the program cannot
perform tail calls itself. This keeps A -> B(no storage) working
while rejecting the A -> B(no storage) -> C(storage) bridge.
Fixes: abad3d0bad72 ("bpf: Fix oob access in cgroup local storage")
Reported-by: Lin Ma <malin89@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20260610105539.705887-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The only reason for this janitorial change is that this initialization
adds unnecessary noise to "git grep exit_signal".
args.exit_signal has no effect with CLONE_THREAD, not to mention it is
zero-initialized by the compiler anyway.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <acAIm732QPFZs15C@redhat.com>
|
|
Map update and delete paths currently call bpf_obj_free_fields() when a
value is being replaced or recycled. That makes field destruction depend
on the context of the update/delete operation. For tracing programs this
can include NMI context, where referenced kptr destructors, uptr
unpinning, and graph root destruction are not generally safe.
Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
performs NMI-safe cleanup for timer, workqueue, and task_work fields.
Fields that need full destruction are left attached to the recycled value
and are destroyed by the final cleanup path instead.
Switch array and hashtab update/delete/recycle paths to this cancel
helper. Keep bpf_obj_free_fields() for final map destruction and for
bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
destructors, so teardown continues to walk the normal and extra elements
and fully destroy their fields.
This deliberately relaxes the eager-free semantics of map update/delete
for special fields. Programs that relied on a recycled map slot becoming
empty immediately after update/delete were relying on behavior that
cannot be implemented safely from every BPF execution context without
offloading arbitrary destructors.
There is a chance this change breaks programs making assumptions
regarding the eager freeing of fields. If so, we can relax semantics to
cancellation only when irqs_disabled() is true in the future. However,
theoretically, map values that get reused eagerly already have weaker
guarantees as parallel users can recreate freed fields before the new
element becomes visible again.
Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260609202548.3571690-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
bpf_obj_drop() runs bpf_obj_free_fields() synchronously for
program-allocated objects. When such an object contains NMI unsafe
fields, tracing programs that can run from arbitrary instrumented
context can reach that destruction from unsafe contexts, including NMI.
NMI is likely one instance of this problem, and other instances would
include possible unsafe reentrancy. Deferring bpf_obj_drop() is not
appealing either: it would add delayed-free machinery to a release
operation that otherwise has straightforward synchronous ownership
semantics.
Reject bpf_obj_drop() and bpf_percpu_obj_drop() from tracing programs
that may run from unsafe contexts unless every field in the object's BTF
record is explicitly NMI safe. Do not reject sleepable
BPF_PROG_TYPE_TRACING programs, since they are not the arbitrary/NMI
contexts that motivate the restriction.
Note that while bpf_rb_root and bpf_list_head would be NMI safe on their
own to free, the objects recursively held by them may not be; be
conservative and just mark them as not NMI safe for now.
Use a whitelist for the NMI-safe field set instead of listing only known
NMI unsafe fields. Locks, async fields, unreferenced kptrs, and
refcounts are known to be NMI safe because their destruction is either a
no-op, simple state reset, or async cancellation. Referenced kptrs,
percpu referenced kptrs, uptrs, graph roots, graph nodes, and any future
field type are rejected until audited for arbitrary tracing and NMI
contexts. This is less susceptible to future changes in fields that were
previously safe by exclusion, and to new fields being added without
updating this check.
Convert the existing recursive local-object drop success case to a
syscall program in the same commit, since this verifier change makes the
old tracing program form invalid. The test still exercises
bpf_obj_drop() releasing a referenced task kptr from a safe program
type.
Fixes: ac9f06050a35 ("bpf: Introduce bpf_obj_drop")
Signed-off-by: Justin Suess <utilityemal77@gmail.com>
Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260609202548.3571690-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bound memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitors are not initialized relying on potentially corrupted
state like the monitoring flag, that is not reset by all monitors
type and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix suffix strip in dot2k
tools/rv: Fix cleanup after failed trace setup
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix substring match bug in monitor name search
tools/rv: Ensure monitor name and desc are NUL-terminated
rv: Use 0 to check preemption enabled in opid
rv: Prevent task migration while handling per-CPU events
rv: Ensure synchronous cleanup for HA monitors
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Fix monitor start ordering and memory ordering for monitoring flag
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Reset per-task DA monitors before releasing the slot
rv: Fix __user specifier usage in extract_params()
|
|
The previous change relaxed the rcu_dereference annotations in
lpm_trie.c so the trie walks no longer trip lockdep when reached from a
sleepable BPF program holding only rcu_read_lock_trace(). By itself
that only helps tries reached as the inner map of a map-of-maps, or
from the classic-RCU syscall path: a sleepable program that references
an LPM trie directly is still rejected at load time by
check_map_prog_compatibility(), whose sleepable whitelist omits
BPF_MAP_TYPE_LPM_TRIE:
Sleepable programs can only use array, hash, ringbuf and local storage maps
LPM trie nodes are allocated from a bpf_mem_alloc (trie->ma) and freed
with bpf_mem_cache_free_rcu(), which chains a regular RCU grace period
into a Tasks Trace grace period before the node -- and the value
embedded in it that trie_lookup_elem() returns to the program -- is
released. That is the same reclaim discipline BPF_MAP_TYPE_HASH relies
on for sleepable access, so a value handed to a sleepable reader cannot
be freed while the program is still running under rcu_read_lock_trace().
The writer paths take trie->lock across the walk and never relied on the
RCU read-side lock to keep nodes alive.
Add BPF_MAP_TYPE_LPM_TRIE to the sleepable map whitelist so these
programs can use LPM tries directly.
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260609135558.193287-3-vlad.wing@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
trie_lookup_elem() annotates its rcu_dereference_check() walks with
only rcu_read_lock_bh_held(). Because rcu_dereference_check(p, c)
resolves to "c || rcu_read_lock_held()", this passes for XDP/NAPI and
classic RCU readers but fails for sleepable BPF programs, which enter
via __bpf_prog_enter_sleepable() and hold only rcu_read_lock_trace().
trie_update_elem() and trie_delete_elem() have the same problem in a
different form: they walk the trie with plain rcu_dereference(), which
asserts rcu_read_lock_held() unconditionally. Both are reachable from
sleepable BPF programs via the bpf_map_update_elem / bpf_map_delete_elem
helpers, and from the syscall path under classic rcu_read_lock(). In
the writer paths the trie is actually protected by trie->lock (an
rqspinlock taken across the walk); we never relied on the RCU read-side
lock to keep nodes alive there.
A sleepable LSM hook that ends up touching an LPM trie therefore
triggers lockdep on debug kernels:
=============================
WARNING: suspicious RCU usage
7.1.0-... Tainted: G E
-----------------------------
kernel/bpf/lpm_trie.c:249 suspicious rcu_dereference_check() usage!
1 lock held by net_tests/540:
#0: (rcu_tasks_trace_srcu_struct){....}-{0:0},
at: __bpf_prog_enter_sleepable+0x26/0x280
Call Trace:
dump_stack_lvl
lockdep_rcu_suspicious
trie_lookup_elem
bpf_prog_..._enforce_security_socket_connect
bpf_trampoline_...
security_socket_connect
__sys_connect
do_syscall_64
This is lockdep-only -- no UAF, since Tasks Trace RCU does serialize
against the trie's reclaim path -- but it spams the console once per
distinct callsite on every debug kernel running a sleepable BPF LSM
that touches an LPM trie, which is increasingly common.
For the lookup path, switch the rcu_dereference_check() annotation
from rcu_read_lock_bh_held() to bpf_rcu_lock_held(), which accepts all
three contexts (classic, BH, Tasks Trace). Other map types already
follow this convention.
For trie_update_elem() and trie_delete_elem(), annotate the walks as
rcu_dereference_protected(*p, 1) -- matching trie_free() in the same
file -- since trie->lock is held across the walk. rqspinlock has no
lockdep_map, so the predicate degenerates to '1' rather than
lockdep_is_held(&trie->lock); the protection is real but not
machine-verifiable. trie_get_next_key() also uses bare
rcu_dereference() but is reachable only from the BPF syscall, which
holds classic rcu_read_lock() before dispatching, so it is left
untouched.
Fixes: 694cea395fde ("bpf: Allow RCU-protected lookups to happen from bh context")
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260609135558.193287-2-vlad.wing@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
cpu_latency_qos_init() registers cpu_dma_latency first and, when
CONFIG_PM_QOS_CPU_SYSTEM_WAKEUP is enabled, registers cpu_wakeup_latency
afterwards. The second registration overwrites the first return value.
As a result, a failure to register cpu_dma_latency can be masked if the
second registration succeeds. Conversely, if cpu_dma_latency succeeds and
cpu_wakeup_latency fails, the function returns an error while leaving the
first misc device registered.
Return immediately on the first registration failure and deregister
cpu_dma_latency if the second registration fails.
Fixes: a4e6512a79d8 ("PM: QoS: Introduce a CPU system wakeup QoS limit")
Signed-off-by: Yuho Choi <dbgh9129@gmail.com>
Link: https://patch.msgid.link/20260608170748.82273-1-dbgh9129@gmail.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
check_mem_reg() verifies both read and write access for global subprogram
memory arguments. When the caller register is PTR_TO_BTF_ID,
check_helper_mem_access() currently forwards the access to
check_ptr_to_btf_access() as BPF_READ regardless of the requested access
type.
This lets a BTF-backed kernel object field pointer pass the caller-side
writable memory check for a global subprogram argument. The callee is then
validated with a generic writable PTR_TO_MEM argument and can store through
it, even though an equivalent direct BTF field store is rejected with "only
read is supported".
Forward the requested access type to check_ptr_to_btf_access().
This enforces existing BTF write restrictions for global subprogram memory
arguments as well.
Fixes: 3e30be4288b3 ("bpf: Allow helpers access trusted PTR_TO_BTF_ID.")
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
Link: https://lore.kernel.org/bpf/20260609-f01-04-btf-writable-arg-v1-1-f449cd970669@mails.tsinghua.edu.cn
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
|
Some workloads with different CPU capacities consume more power with
timer migration than before. The recently introduced per capacity
hierarchies were supposed to alleviate this problem. However it appears
to also regress other types of workloads, especially when plenty of
capacities live together in the same machine.
Disable the feature until a reasonable solution is found.
Fixes: 098cbaad8e57 ("timers/migration: Split per-capacity hierarchies")
Reported-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260609123356.28449-1-frederic@kernel.org
Closes: https://lore.kernel.org/all/3b79338f-6cfc-4722-8062-9103db2c8ad1@arm.com
|
|
btf_parse_struct_metas() walks user-supplied BTF during BPF_BTF_LOAD,
and btf_repeat_fields() expands repeatable fields from array elements
into the fixed BTF_FIELDS_MAX scratch array used by btf_parse_fields().
The remaining-capacity check performs the expanded field count calculation
in u32. A malformed BTF can wrap that calculation, causing the check to
pass even when the expanded field count exceeds the scratch array
capacity. The following memcpy() can then write past the end of the
array.
Use checked addition and multiplication before copying repeated fields
and reject impossible counts.
Fixes: 797d73ee232d ("bpf: Check the remaining info_cnt before repeating btf fields")
Cc: stable@vger.kernel.org
Signed-off-by: Paul Moses <p@1g4.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20260605234301.1109063-1-p@1g4.org
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
|
Replace the open-coded "p->on_rq == TASK_ON_RQ_MIGRATING" comparisons
in enqueue_task_dl() and dequeue_task_dl() with the existing
task_on_rq_migrating() helper, consistent with the rest of the
scheduler code.
No functional change.
Signed-off-by: Liang Luo <luoliang@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260608075500.387271-1-luoliang@kylinos.cn
|
|
The kernel coding style recommends using 'else if' instead of
placing 'if' on a separate line after 'else'. This change makes
the code consistent with the rest of the kernel codebase.
Signed-off-by: Liang Luo <luoliang@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260608071842.325159-1-luoliang@kylinos.cn
|
|
If we take runnable_avg in max(runnable_avg, util_avg) in cpu_util(), we
should then add or subtract task runnable_avg, but the arithmetic below
is still with task util_avg. This mixes runnable_avg with util_avg which
is incorrect.
Fix by always doing arithmetic with runnable_avg and only take
max(runnable_avg, util_avg) at the last step.
Fixes: 7d0583cf9ec7 ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Hongyan Xia <hongyan.xia@transsion.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://patch.msgid.link/20260605094318.37931-1-hongyan.xia@transsion.com
|
|
Patch series "mm/swap, PM: hibernate: fix swapoff race in uswsusp by
pinning swap device", v8.
Currently, in the uswsusp path, only the swap type value is retrieved at
lookup time without holding a reference. If swapoff races after the type
is acquired, subsequent slot allocations operate on a stale swap device.
Additionally, grabbing and releasing the swap device reference on every
slot allocation is inefficient across the entire hibernation swap path.
This patch series addresses these issues:
- Patch 1: Fixes the swapoff race in uswsusp by pinning the swap device
from the point it is looked up until the session completes.
- Patch 2: Removes the overhead of per-slot reference counting in alloc/free
paths and cleans up the redundant SWP_WRITEOK check.
This patch (of 2):
Hibernation via uswsusp (/dev/snapshot ioctls) has a race window: after
selecting the resume swap area but before user space is frozen, swapoff
may run and invalidate the selected swap device.
Fix this by pinning the swap device with SWP_HIBERNATION while it is in
use. The pin is exclusive, which is sufficient since hibernate_acquire()
already prevents concurrent hibernation sessions.
The kernel swsusp path (sysfs-based hibernate/resume) uses
find_hibernation_swap_type() which is not affected by the pin. It freezes
user space before touching swap, so swapoff cannot race.
Introduce dedicated helpers:
- pin_hibernation_swap_type(): Look up and pin the swap device.
Used by the uswsusp path.
- find_hibernation_swap_type(): Lookup without pinning.
Used by the kernel swsusp path.
- unpin_hibernation_swap_type(): Clear the hibernation pin.
While a swap device is pinned, swapoff is prevented from proceeding.
Link: https://lore.kernel.org/20260323160822.1409904-1-youngjun.park@lge.com
Link: https://lore.kernel.org/20260323160822.1409904-2-youngjun.park@lge.com
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: "Rafael J . Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit "driver core: platform: set mod_name in driver registration" will
set struct device_driver's mod_name member for platform driver
registration. For a driver to be registered with its mod_name set,
module_kset needs to be initialized, which currently happens in a
subsys_initcall in param_sysfs_init(). The tegra cbb drivers register
themselves before module_kset init, in a core_initcall. This works
currently because lookup_or_create_module_kobject(), which dereferences
module_kset via kset_find_obj(), is not called if mod_name is not set,
which is the case now.
So in preparation for the commit "driver core: platform: set mod_name in
driver registration", move module_kset init to pure_initcall level,
ensuring it happens before tegra cbb driver registration.
Suggested-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Gary Guo <gary@garyguo.net>
Co-developed-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Rahul Bukte <rahul.bukte@sony.com>
Signed-off-by: Shashank Balaji <shashank.mahadasyam@sony.com>
Link: https://patch.msgid.link/20260601101942.4002661-1-shashank.mahadasyam@sony.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
bpf_task_from_vpid() looks up a task in the pid namespace of the
current task, via find_task_by_vpid():
find_task_by_vpid(vpid)
find_task_by_pid_ns(vpid, task_active_pid_ns(current))
find_pid_ns(nr, ns) -> idr_find(&ns->idr, nr)
cgroup_skb programs run in softirq, which may interrupt a task that is
itself in do_exit(). Once that task has passed
exit_notify() -> release_task() -> __unhash_process(), its thread_pid is
cleared, so task_active_pid_ns(current) returns NULL and find_pid_ns()
dereferences &NULL->idr:
BUG: kernel NULL pointer dereference, address: 0000000000000050
RIP: 0010:idr_find+0x11/0x30 lib/idr.c:176
Call Trace:
<IRQ>
find_pid_ns kernel/pid.c:370 [inline]
find_task_by_pid_ns+0x3b/0xe0 kernel/pid.c:485
bpf_task_from_vpid+0x5b/0x200 kernel/bpf/helpers.c:2916
bpf_prog_run_array_cg+0x17e/0x530 kernel/bpf/cgroup.c:81
__cgroup_bpf_run_filter_skb+0x12b/0x250 kernel/bpf/cgroup.c:1612
sk_filter_trim_cap+0x1dc/0x4c0 net/core/filter.c:148
tcp_v4_rcv+0x18d1/0x2200 net/ipv4/tcp_ipv4.c:2223
</IRQ>
<TASK>
do_exit+0xa63/0x1270 kernel/exit.c:1010
get_signal+0x141c/0x1530 kernel/signal.c:3037
Bail out when current has no pid namespace.
Fixes: 675c3596ff32 ("bpf: Add bpf_task_from_vpid() kfunc")
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Acked-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/bpf/20260608050001.2545245-1-rhkrqnwk98@gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
|
An ARRAY_OF_MAPS can use an array created with BPF_F_INNER_MAP as its
inner map template. A concrete inner array with a different max_entries
value can then replace the template.
After a successful outer map lookup, the verifier represents the
resulting map pointer using the inner map template. Const-key lookup
nullness elision consequently uses the template max_entries even though
the runtime helper uses the concrete inner map max_entries.
Do not elide lookup result nullness for maps marked with BPF_F_INNER_MAP,
because the template max_entries does not prove that the key is in bounds
for the concrete runtime map.
Fixes: d2102f2f5d75 ("bpf: verifier: Support eliding map lookup nullness")
Signed-off-by: Nuoqi Gui <gnq25@mails.tsinghua.edu.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20260607-f01-v2-v2-1-da48453146e8@mails.tsinghua.edu.cn
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|
|
NMI and tracepoint BPF programs can re-enter the per-CPU or global
LRU lock that bpf_lru_pop_free()/push_free() already hold on the
same CPU, AA-deadlocking. Lockdep reports "inconsistent
{INITIAL USE} -> {IN-NMI}" on &l->lock (syzbot c69a0a2c816716f1e0d5)
and "possible recursive locking detected" on &loc_l->lock (syzbot
18b26edb69b2e19f3b33).
Prior trylock and rqspinlock based fixes (see links) were nacked
because compromised on reliability.
This patch converts every LRU lock site to rqspinlock_t and adds a
recovery path for some failure windows to avoid node leaks.
Failure recovery:
- *_pop_free top-level: return NULL; prealloc_lru_pop() already
treats that as no-free-element (-ENOMEM).
- Cross-CPU steal: skip the victim's locked loc_l, try next CPU.
- Post-steal local lock fail: publish stolen node to lockless
per-CPU free_llist; next pop on this CPU picks it up.
- push_free fail: mark node pending_free=1. __local_list_flush(),
__local_list_pop_pending() reclaim the node from pending_list.
__bpf_lru_list_shrink_inactive() reclaims the node from inactive
list. Nodes from active list are reclaimed by __bpf_lru_list_shrink()
or after __bpf_lru_list_rotate_active() demotes it to the inactive.
Fixes: 3a08c2fd7634 ("bpf: LRU List")
Reported-by: syzbot+c69a0a2c816716f1e0d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c69a0a2c816716f1e0d5
Reported-by: syzbot+18b26edb69b2e19f3b33@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=18b26edb69b2e19f3b33
Link: https://lore.kernel.org/bpf/CAPPBnEYO4R+m+SpVc2gNj_x31R6fo1uJvj2bK2YS1P09GWT6kQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/CAPPBnEZmFA3ab8Uc=PEm0bdojZy=7T_F5_+eyZSHyZR3MBG4Vw@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20251030030010.95352-1-dongml2@chinatelecom.cn/
Link: https://lore.kernel.org/bpf/20260119142120.28170-1-leon.hwang@linux.dev/
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260607-lru_map_spin-v3-1-bcd9332e911b@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Fix the arch_inlined_clockevent_set_next_coupled() prototype in the
!CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST case (Naveen Kumar Chaudhary)
- Fix an off-by-1 bug in the sys_settimeofday() usecs validation code
(Naveen Kumar Chaudhary)
- Mark vdso_k_*_data pointers as __ro_after_init (Thomas Weißschuh)
- Fix livelock race in tmigr_handle_remote_up() (Amit Matityahu)
* tag 'timers-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix livelock in tmigr_handle_remote_up()
vdso/datastore: Mark vdso_k_*_data pointers as __ro_after_init
time: Fix off-by-one in settimeofday() usec validation
clockevents: Fix duplicate type specifier in stub function parameter
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
- Fix a NULL pointer dereference bug in the FUTEX_CMP_REQUEUE_PI
code (Ji'an Zhou)
- Fix a NULL pointer dereference bug in the rtmutex code (Davidlohr
Bueso)
* tag 'locking-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Skip remove_waiter() when waiter is not enqueued
futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock
|
|
Adding tracing_multi link fdinfo support with following output:
pos: 0
flags: 02000000
mnt_id: 19
ino: 3087
link_type: tracing_multi
link_id: 9
prog_tag: 599ba0e317244f86
prog_id: 94
attach_type: 59
cnt: 10
obj-id btf-id cookie func
1 91593 8 bpf_fentry_test1+0x4/0x10
1 91595 9 bpf_fentry_test2+0x4/0x10
1 91596 7 bpf_fentry_test3+0x4/0x20
1 91597 5 bpf_fentry_test4+0x4/0x20
1 91598 4 bpf_fentry_test5+0x4/0x20
1 91599 2 bpf_fentry_test6+0x4/0x20
1 91600 3 bpf_fentry_test7+0x4/0x10
1 91601 1 bpf_fentry_test8+0x4/0x10
1 91602 10 bpf_fentry_test9+0x4/0x10
1 91594 6 bpf_fentry_test10+0x4/0x10
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-17-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding support to use session attachment with tracing_multi link.
Adding new BPF_TRACE_FSESSION_MULTI program attach type, that follows
the BPF_TRACE_FSESSION behaviour but on the tracing_multi link.
Such program is called on entry and exit of the attached function
and allows to pass cookie value from entry to exit execution.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-16-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add support to specify cookies for tracing_multi link.
Cookies are provided in array where each value is paired with provided
BTF ID value with the same array index.
Such cookie can be retrieved by bpf program with bpf_get_attach_cookie
helper call.
We need to sort cookies array together with ids array in check_dup_ids,
to keep the id->cookie relation.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-15-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding new link to allow to attach program to multiple function
BTF IDs. The link is represented by struct bpf_tracing_multi_link.
To configure the link, new fields are added to bpf_attr::link_create
to pass array of BTF IDs;
struct {
__aligned_u64 ids;
__u32 cnt;
} tracing_multi;
Each BTF ID represents function (BTF_KIND_FUNC) that the link will
attach bpf program to.
We use previously added bpf_trampoline_multi_attach/detach functions
to attach/detach the link.
The linkinfo/fdinfo callbacks will be implemented in following changes.
Note this is supported only for archs (x86_64) with ftrace direct and
have single ops support.
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS &&
CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS
Note using sort_r (instead of plain sort) in check_dup_ids, because we
will use the swap callback in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-14-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding bpf_trampoline_multi_attach/detach functions that allows to
attach/detach tracing program to multiple functions/trampolines.
The attachment is defined with bpf_program and array of BTF ids of
functions to attach the bpf program to.
Adding bpf_tracing_multi_link object that holds all the attached
trampolines and is initialized in attach and used in detach.
The attachment allocates or uses currently existing trampoline
for each function to attach and links it with the bpf program.
The attach works as follows:
- we get all the needed trampolines
- lock them and add the bpf program to each (__bpf_trampoline_link_prog)
- the trampoline_multi_ops passed in __bpf_trampoline_link_prog gathers
ftrace_hash (ip -> trampoline) objects
- we call update_ftrace_direct_add/mod to update needed locations
- we unlock all the trampolines
The detach works as follows:
- we lock all the needed trampolines
- remove the program from each (__bpf_trampoline_unlink_prog)
- the trampoline_multi_ops passed in __bpf_trampoline_unlink_prog gathers
ftrace_hash (ip -> trampoline) objects
- we call update_ftrace_direct_del/mod to update needed locations
- we unlock and put all the trampolines
We store the old image/flags in the trampoline before the update
and use it in case we need to rollback the attachment.
We keep the ftrace_hash objects allocated during attach in the link
so they can be used for detach as well.
Adding trampoline_(un)lock_all functions to (un)lock all trampolines
to gate the tracing_multi attachment.
Note this is supported only for archs (x86_64) with ftrace direct and
have single ops support.
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS &&
CONFIG_HAVE_SINGLE_FTRACE_DIRECT_OPS
It also needs CONFIG_BPF_SYSCALL enabled.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-13-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move sleepable verification code to btf_id_allow_sleepable function.
It will be used in following changes.
Adding code to retrieve type's name instead of passing it from
bpf_check_attach_target function, because this function will be
called from another place in following changes and it's easier
to retrieve the name directly in here.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-12-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding new program attach types multi tracing attachment:
BPF_TRACE_FENTRY_MULTI
BPF_TRACE_FEXIT_MULTI
and their base support in verifier code.
Programs with such attach type will use specific link attachment
interface coming in following changes.
This was suggested by Andrii some (long) time ago and turned out
to be easier than having special program flag for that.
Bpf programs with such types have 'bpf_multi_func' function set as
their attach_btf_id and keep module reference when it's specified
by attach_prog_fd.
They are also accepted as sleepable programs during verification,
and the real validation for specific BTF_IDs/functions will happen
during the multi link attachment in following changes.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-11-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Now that we split trampoline attachment object (bpf_tramp_node) from
the link object (bpf_tramp_link) we can use bpf_tramp_node as fsession's
fexit attachment object and get rid of the bpf_fsession_link object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-10-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding struct bpf_tramp_node to decouple the link out of the trampoline
attachment info.
At the moment the object for attaching bpf program to the trampoline is
'struct bpf_tramp_link':
struct bpf_tramp_link {
struct bpf_link link;
struct hlist_node tramp_hlist;
u64 cookie;
}
The link holds the bpf_prog pointer and forces one link - one program
binding logic. In following changes we want to attach program to multiple
trampolines but we want to keep just one bpf_link object.
Splitting struct bpf_tramp_link into:
struct bpf_tramp_link {
struct bpf_link link;
struct bpf_tramp_node node;
};
struct bpf_tramp_node {
struct bpf_link *link;
struct hlist_node tramp_hlist;
u64 cookie;
};
The 'struct bpf_tramp_link' defines standard single trampoline link
and 'struct bpf_tramp_node' is the attachment trampoline object with
pointer to the bpf_link object.
This will allow us to define link for multiple trampolines, like:
struct bpf_tracing_multi_link {
struct bpf_link link;
...
int nodes_cnt;
struct bpf_tracing_multi_node nodes[] __counted_by(nodes_cnt);
};
Cc: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-9-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Separate bpf_trampoline_add/remove_prog functions from
__bpf_trampoline_link/unlink functions to be able to add/remove
trampoline programs without the image being updated in following
changes. No functional change is intended.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-8-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Moving trampoline image setup into bpf_trampoline_ops callbacks,
so we can have different image handling for multi attachment which
is coming in following changes.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In following changes we will need to override ftrace direct attachment
behaviour. In order to do that we are adding struct bpf_trampoline_ops
object that defines callbacks for ftrace direct attachment:
register_fentry
unregister_fentry
modify_fentry
The new struct bpf_trampoline_ops object is passed as an argument to
__bpf_trampoline_link/unlink_prog functions.
At the moment the default trampoline_ops is set to the current ftrace
direct attachment functions, so there's no functional change for the
current code.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding mutex lock pool that replaces bpf trampolines mutex.
For tracing_multi link coming in following changes we need to lock all
the involved trampolines during the attachment. This could mean thousands
of mutex locks, which is not convenient.
As suggested by Andrii we can replace bpf trampolines mutex with mutex
pool, where each trampoline is hash-ed to one of the locks from the pool.
It's better to lock all the pool mutexes (32 at the moment) than
thousands of them.
There is 48 (MAX_LOCK_DEPTH) lock limit allowed to be simultaneously
held by task, so we need to keep 32 mutexes (5 bits) in the pool, so
when we lock them all in following changes the lockdep won't scream.
Removing the mutex_is_locked in bpf_trampoline_put, because we removed
the mutex from bpf_trampoline.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Renaming __add_hash_entry to add_ftrace_hash_entry and making it global,
it will be used in following changes outside ftrace.c object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding ftrace_hash_remove function that removes all entries
from struct ftrace_hash object without freeing them.
It will be used in following changes where entries are allocated
as part of another structure and are free-ed separately.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Adding external ftrace_hash_count function so we could get hash
count outside of ftrace object.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20260606123955.345967-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The cgroup shim runs under rcu_read_lock_dont_migrate(), so we should
not attach any sleepable BPF programs there. Add support to the verifier
to explicitly reject attempts to load sleepable BPF programs destined
for LSM cgroup attachment.
Without this, we get the following splat from a BPF_LSM_CGROUP
program marked BPF_F_SLEEPABLE attached to file_open when it calls
bpf_get_dentry_xattr():
BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1567
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 34317, name: load
preempt_count: 0, expected: 0
RCU nest depth: 2, expected: 0
Call Trace:
down_read+0x76/0x480
ext4_xattr_get+0x11f/0x700
__vfs_getxattr+0xf0/0x150
bpf_get_dentry_xattr+0xbb/0xf0
bpf_prog_e76a298dac9218c6_test_open+0x6a/0x85
__cgroup_bpf_run_lsm_current+0x326/0x840
bpf_trampoline_6442534646+0x62/0x14d
security_file_open+0x34/0x60
do_dentry_open+0x340/0x1260
vfs_open+0x7a/0x440
path_openat+0x1bac/0x30a0
libbpf provides a .s named section variant for every sleepable
program type except lsm_cgroup, reflecting that per-cgroup LSM programs
are intended to only run in a non-sleepable context.
The above splat was obtained by bypassing libbpf by using bpf(2)
directly.
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")
Signed-off-by: David Windsor <dwindsor@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20260605145707.608579-1-dwindsor@gmail.com
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
|