| Age | Commit message | Author |
|
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make ftrace_trace_stack() into a static inline that tests if stack tracing
is enabled and if so to call __ftrace_trace_stack() to do the stack trace.
This keeps the test inlined in the fast paths and only does the function
call if stack tracing is enabled.
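
The inline-test pattern described above can be sketched in plain userspace C. This is only an illustration: the flag `stack_trace_enabled`, the counter, and both function names are hypothetical stand-ins, not the kernel's actual symbols.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's "stack tracing enabled" state. */
static bool stack_trace_enabled;
static unsigned long slow_path_calls;

/* Out-of-line slow path: only reached when stack tracing is enabled. */
static void demo_trace_stack_slow(int skip)
{
	(void)skip;
	slow_path_calls++;
}

/* Inlined fast-path test: costs one branch when stack tracing is off. */
static inline void demo_trace_stack(int skip)
{
	if (!stack_trace_enabled)
		return;
	demo_trace_stack_slow(skip);
}
```

With the test inlined, callers in hot paths pay for the function call only when the feature is switched on.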
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.974218132@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Move the __always_inline functions __trace_buffer_lock_reserve(),
__trace_buffer_unlock_commit() and trace_event_setup() into trace.h.
The trace.c file will be split up and these functions will be used in more
than one of these files. As they are already __always_inline they can
easily be moved into the trace.h header file.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.813550600@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The file trace.c has become a catchall for most things tracing. Start
making it smaller by breaking out various aspects into their own files.
Make the variable tracing_selftest_running global so that it can be used
by other files in the tracing subsystem and trace.c can be split up.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.648932796@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The tracing_disabled variable is set to one on boot up to prevent some
parts of tracing from accessing the tracing infrastructure before it is set up.
It also can be set after boot if an anomaly is discovered.
It is currently a static variable in trace.c and can be accessed via a
function call tracing_is_disabled(). There's really no reason to use a
function call as the tracing subsystem should be able to access it
directly.
With the variable accessed directly, code can be moved out of trace.c
without adding the overhead of a function call to see if tracing is
disabled or not.
Make tracing_disabled global and remove the tracing_is_disabled() helper
function. Also add some "unlikely()"s around tracing_disabled where it's
checked in hot paths.
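
A minimal userspace sketch of reading such a flag directly in a hot path, with a branch hint. The `unlikely()` macro here is a local stand-in built on the GCC/Clang `__builtin_expect` builtin, and `demo_tracing_disabled` and `demo_record_event()` are hypothetical names used only for illustration.

```c
/* Userspace stand-in for the kernel's unlikely() branch hint (GCC/Clang). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical mirror of the global flag: set at boot or on an anomaly. */
int demo_tracing_disabled = 1;

/* Hot-path caller reads the variable directly instead of calling a helper. */
static int demo_record_event(int value, int *slot)
{
	if (unlikely(demo_tracing_disabled))
		return -1;	/* tracing is not available */
	*slot = value;
	return 0;
}
```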
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260208032449.483690153@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
In trace.c, the function trace_create_maxlat_file() is defined behind the
#ifdef CONFIG_TRACER_MAX_TRACE block. The #else part defines it as:
#define trace_create_maxlat_file(tr, d_tracer) \
trace_create_file("tracing_max_latency", TRACE_MODE_WRITE, \
d_tracer, tr, &tracing_max_lat_fops)
But the one place that it is used has:
#ifdef CONFIG_TRACER_MAX_TRACE
trace_create_maxlat_file(tr, d_tracer);
#endif
Which is pointless and also wrong!
It only gets created when both CONFIG_TRACER_MAX_TRACE and CONFIG_FS_NOTIFY
are defined, but the file itself should not be dependent on
CONFIG_FS_NOTIFY. Always create that file when TRACER_MAX_TRACE is defined
regardless of whether FS_NOTIFY is defined or not.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Link: https://patch.msgid.link/20260207191101.0e014abd@robin
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The function tracing_set_filter_buffering() is only used in
trace_events_hist.c. Move it to that file and make it static.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20260206195936.617080218@kernel.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When the triggers were first created, they may not have had a file
parameter passed to them and things needed to be done generically.
But today, all triggers have a file parameter passed to them. Remove the
generic code and add an "if (WARN_ON_ONCE(!file))" to each trigger.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Link: https://patch.msgid.link/20260206101351.609d8906@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Simplify the hardlockup detector's probe path and remove its implicit
dependency on pinned per-cpu execution.
Refactor hardlockup_detector_event_create() to be stateless. Return the
created perf_event pointer to the caller instead of directly modifying the
per-cpu 'watchdog_ev' variable. This allows the probe path to safely
manage a temporary event without the risk of leaving stale pointers should
task migration occur.
Link: https://lkml.kernel.org/r/20260129022629.2201331-1-realwujing@gmail.com
Signed-off-by: Shouxin Sun <sunshx@chinatelecom.cn>
Signed-off-by: Junnan Zhang <zhangjn11@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <yuanql9@chinatelecom.cn>
Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Cc: Jinchao Wang <wangjinchao600@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: Song Liu <song@kernel.org>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
cpustat_tail indexes cpustat_util[], which is a NUM_SAMPLE_PERIODS-sized
ring buffer. need_counting_irqs() currently wraps the index using
NUM_HARDIRQ_REPORT, which only happens to match NUM_SAMPLE_PERIODS.
Use NUM_SAMPLE_PERIODS for the wrap to keep the ring math correct even if
NUM_HARDIRQ_REPORT or NUM_SAMPLE_PERIODS changes.
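
To see why the wrap constant matters, here is a small userspace sketch. The two constants are deliberately given different, made-up values so the mismatch becomes visible; the function names are hypothetical.

```c
#define NUM_SAMPLE_PERIODS 5	/* ring size (assumed value for this sketch) */
#define NUM_HARDIRQ_REPORT 3	/* deliberately different to expose the bug */

/* Slot written most recently, given the tail (the next slot to write). */
static int demo_last_slot(int tail)
{
	return (tail + NUM_SAMPLE_PERIODS - 1) % NUM_SAMPLE_PERIODS;
}

/* Same computation wrapped with the unrelated constant. */
static int demo_last_slot_buggy(int tail)
{
	return (tail + NUM_HARDIRQ_REPORT - 1) % NUM_HARDIRQ_REPORT;
}
```

As long as both constants are equal, the two functions agree; once they diverge, only the first one still walks the ring correctly.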
Link: https://lkml.kernel.org/r/tencent_7068189CB6D6689EB353F3D17BF5A5311A07@qq.com
Fixes: e9a9292e2368 ("watchdog/softlockup: Report the most frequent interrupts")
Signed-off-by: Shengming Hu <hu.shengming@zte.com.cn>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhang Run <zhang.run@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This function returns NULL if kho_restore_page() returns NULL, which
happens in a couple of corner cases. It never returns an error code.
Link: https://lkml.kernel.org/r/20260123190506.1058669-1-tycho@kernel.org
Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Alexander Graf <graf@amazon.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce an in-kernel test module to validate the core logic of the Live
Update Orchestrator's File-Lifecycle-Bound feature. This provides a
low-level, controlled environment to test FLB registration and callback
invocation without requiring userspace interaction or actual kexec
reboots.
The test is enabled by the CONFIG_LIVEUPDATE_TEST Kconfig option.
Link: https://lkml.kernel.org/r/20251218155752.3045808-6-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a mechanism for managing global kernel state whose lifecycle is
tied to the preservation of one or more files. This is necessary for
subsystems where multiple preserved file descriptors depend on a single,
shared underlying resource.
An example is HugeTLB, where multiple file descriptors such as memfd and
guest_memfd may rely on the state of a single HugeTLB subsystem.
Preserving this state for each individual file would be redundant and
incorrect. The state should be preserved only once when the first file is
preserved, and restored/finished only once the last file is handled.
This patch introduces File-Lifecycle-Bound (FLB) objects to solve this
problem. An FLB is a global, reference-counted object with a defined set
of operations:
- A file handler (struct liveupdate_file_handler) declares a dependency
on one or more FLBs via a new registration function,
liveupdate_register_flb().
- When the first file depending on an FLB is preserved, the FLB's
.preserve() callback is invoked to save the shared global state. The
reference count is then incremented for each subsequent file.
- Conversely, when the last file is unpreserved (before reboot) or
finished (after reboot), the FLB's .unpreserve() or .finish() callback
is invoked to clean up the global resource.
The implementation includes:
- A new set of ABI definitions (luo_flb_ser, luo_flb_head_ser) and a
corresponding FDT node (luo-flb) to serialize the state of all active
FLBs and pass them via Kexec Handover.
- Core logic in luo_flb.c to manage FLB registration, reference
counting, and the invocation of lifecycle callbacks.
- An API (liveupdate_flb_get/_incoming/_outgoing) for other kernel
subsystems to safely access the live object managed by an FLB, both
before and after the live update.
This framework provides the necessary infrastructure for more complex
subsystems like IOMMU, VFIO, and KVM to integrate with the Live Update
Orchestrator.
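
The first-file/last-file behaviour described above can be sketched as a reference-counted object. This is a simplified userspace illustration without locking or serialization; all `demo_*` names are hypothetical and do not reflect the actual luo_flb.c API.

```c
struct demo_flb;

/* Hypothetical callback table; names loosely follow the commit text. */
struct demo_flb_ops {
	int  (*preserve)(struct demo_flb *flb);
	void (*unpreserve)(struct demo_flb *flb);
};

struct demo_flb {
	const struct demo_flb_ops *ops;
	unsigned int count;	/* files currently depending on this FLB */
	int preserve_calls;
	int unpreserve_calls;
};

/* Called once per preserved file; saves shared state on the first one only. */
static int demo_flb_file_preserve(struct demo_flb *flb)
{
	if (flb->count == 0) {
		int err = flb->ops->preserve(flb);

		if (err)
			return err;
	}
	flb->count++;
	return 0;
}

/* Called once per unpreserved file; tears down state after the last one. */
static void demo_flb_file_unpreserve(struct demo_flb *flb)
{
	if (flb->count == 0)
		return;		/* unbalanced call; ignored in this sketch */
	if (--flb->count == 0)
		flb->ops->unpreserve(flb);
}

/* Trivial callbacks that only count how often they were invoked. */
static int demo_preserve(struct demo_flb *flb)
{
	flb->preserve_calls++;
	return 0;
}

static void demo_unpreserve(struct demo_flb *flb)
{
	flb->unpreserve_calls++;
}

static const struct demo_flb_ops demo_ops = {
	.preserve = demo_preserve,
	.unpreserve = demo_unpreserve,
};
```

Preserving two files and then unpreserving both invokes each callback exactly once, which is the property the shared-state case needs.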
Link: https://lkml.kernel.org/r/20251218155752.3045808-5-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Switch LUO to use the private list iterators.
Link: https://lkml.kernel.org/r/20251218155752.3045808-4-pasha.tatashin@soleen.com
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Alexander Graf <graf@amazon.com>
Cc: David Gow <davidgow@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <kees@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Pratyush Yadav <pratyush@kernel.org>
Cc: Samiullah Khawaja <skhawaja@google.com>
Cc: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The custom definition of 'struct timespec64' is incompatible with both the
kernel's internal definition and the glibc type, at least on big-endian
targets that have the tv_nsec field in a different place, and the
definition clashes with any userspace that also defines a timespec64
structure.
Running the header check with -Wpadded enabled produces this output that
warns about the incorrect padding:
usr/include/linux/taskstats.h:25:1: error: padding struct size to alignment boundary with 4 bytes [-Werror=padded]
Remove the hack and instead use the regular __kernel_timespec type that is
meant to be used in uapi definitions.
Link: https://lkml.kernel.org/r/20260202095906.1344100-1-arnd@kernel.org
Fixes: 29b63f6eff0e ("delayacct: add timestamp of delay max")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: xu xin <xu.xin16@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Jiang Kun <jiang.kun2@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Miscellaneous MMCID fixes to address bugs and performance regressions
in the recent rewrite of the SCHED_MM_CID management code:
- Fix livelock triggered by BPF CI testing
- Fix hard lockup on weakly ordered systems
- Simplify the dropping of CIDs in the exit path by removing an
unintended transition phase
- Fix performance/scalability regression on a thread-pool benchmark
by optimizing transitional CIDs when scheduling out"
* tag 'sched-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/mmcid: Optimize transitional CIDs when scheduling out
sched/mmcid: Drop per CPU CID immediately when switching to per task mode
sched/mmcid: Protect transition on weakly ordered systems
sched/mmcid: Prevent live lock on task to CPU mode transition
|
|
Replace BUG_ON() with panic() in panic_on_wq_watchdog(). This is not
a bug condition but a deliberate forced panic requested by the user
via module parameters to crash the system for debugging purposes.
Using panic() instead of BUG_ON() makes this intent clearer and provides
more informative output about which threshold was exceeded and the actual
values, making it easier to diagnose the stall condition from crash dumps.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add a new module parameter 'panic_on_stall_time' that triggers a panic
when a workqueue stall persists for longer than the specified duration
in seconds.
Unlike 'panic_on_stall' which counts accumulated stall events, this
parameter triggers based on the duration of a single continuous stall.
This is useful for catching truly stuck workqueues rather than
accumulating transient stalls.
Usage:
workqueue.panic_on_stall_time=120
This would panic if any workqueue pool has been stalled for 120 seconds
or more.
The stall duration is measured from the workqueue's last progress
(poll_ts) which accounts for legitimate system stalls.
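
A duration-based check of this kind might look roughly like the following userspace sketch. The helper name, the use of plain seconds, and the convention that 0 disables the check are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: all times in seconds; a threshold of 0 disables it. */
static bool demo_should_panic_on_stall(unsigned long now,
				       unsigned long last_progress,
				       unsigned long panic_on_stall_time)
{
	if (!panic_on_stall_time)
		return false;
	/* One continuous stall, measured from the last observed progress. */
	return now - last_progress >= panic_on_stall_time;
}
```

Unlike a counter of stall events, this only fires when a single stall has lasted at least the configured time.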
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
- Bump up the Clang minimum version requirements for livepatch
builds, due to Clang assembler section handling bugs causing
silent miscompilations
- Strip livepatching symbol artifacts from non-livepatch modules
- Fix livepatch build warnings when certain Clang LTO options
are enabled
- Fix livepatch build error when CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
* tag 'objtool-urgent-2026-02-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool/klp: Fix unexported static call key access for manually built livepatch modules
objtool/klp: Fix symbol correlation for orphaned local symbols
livepatch: Free klp_{object,func}_ext data after initialization
livepatch: Fix having __klp_objects relics in non-livepatch modules
livepatch/klp-build: Require Clang assembler >= 20
|
|
- Export irq_domain_free_irqs() to allow PCI/MSI drivers that tear down
MSI domains to be built as modules (Aaron Kling)
- Export tegra_cpuidle_pcie_irqs_in_use(), which disables Tegra CC6 while
PCI IRQs are in use, so pci-tegra can be built as a module (Aaron Kling)
- Allow pci-tegra to be built as a module (Aaron Kling)
* pci/controller/tegra:
PCI: tegra: Allow building as a module
cpuidle: tegra: Export tegra_cpuidle_pcie_irqs_in_use()
irqdomain: Export irq_domain_free_irqs()
|
|
Take care of rqspinlock error in bpf_local_storage_{map_free, destroy}()
properly by switching to bpf_selem_unlink_nofail().
Both functions iterate their own RCU-protected list of selems and call
bpf_selem_unlink_nofail(). In map_free(), to prevent infinite loop when
both map_free() and destroy() fail to remove a selem from b->list
(extremely unlikely), switch to hlist_for_each_entry_rcu(). In destroy(),
also switch to hlist_for_each_entry_rcu() since we no longer iterate
local_storage->list under local_storage->lock.
bpf_selem_unlink() now becomes dedicated to helpers and syscalls paths
so reuse_now should always be false. Remove it from the argument and
hardcode it.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-12-ameryhung@gmail.com
|
|
Introduce bpf_selem_unlink_nofail() to properly handle errors returned
from rqspinlock in bpf_local_storage_map_free() and
bpf_local_storage_destroy() where the operation must succeed.
The idea of bpf_selem_unlink_nofail() is to allow an selem to be
partially linked and use atomic operation on a bit field, selem->state,
to determine when and who can free the selem if any unlink under lock
fails. An selem initially is fully linked to a map and a local storage.
Under normal circumstances, bpf_selem_unlink_nofail() will be able to
grab locks and unlink a selem from map and local storage in sequence,
just like bpf_selem_unlink(), and then free it after an RCU grace period.
However, if any of the lock attempts fails, it will only clear
SDATA(selem)->smap or selem->local_storage depending on the caller and
set SELEM_MAP_UNLINKED or SELEM_STORAGE_UNLINKED according to the
caller. Then, after both map_free() and destroy() see the selem and the
state becomes SELEM_UNLINKED, one of two racing callers can succeed in
cmpxchg the state from SELEM_UNLINKED to SELEM_TOFREE, ensuring no
double free or memory leak.
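
The state handoff described above can be sketched with C11 atomics in userspace. Only the state names come from the commit text; the bit values, the struct, and `demo_unlink_side()` are assumptions for illustration and omit the locking and RCU parts entirely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Bit values are illustrative; only the names come from the commit text. */
#define SELEM_MAP_UNLINKED	(1u << 0)
#define SELEM_STORAGE_UNLINKED	(1u << 1)
#define SELEM_UNLINKED		(SELEM_MAP_UNLINKED | SELEM_STORAGE_UNLINKED)
#define SELEM_TOFREE		(1u << 2)

struct demo_state_selem {
	atomic_uint state;
};

/*
 * Record that one side has dropped its link, then try to claim the right
 * to free. Returns true for exactly one caller once both bits are set.
 */
static bool demo_unlink_side(struct demo_state_selem *selem,
			     unsigned int side_bit)
{
	unsigned int expected = SELEM_UNLINKED;

	atomic_fetch_or(&selem->state, side_bit);
	return atomic_compare_exchange_strong(&selem->state, &expected,
					      SELEM_TOFREE);
}
```

Because the compare-and-exchange succeeds for only one caller, the element is freed once, and because it cannot succeed before both bits are set, it is not freed while one side still references it.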
To make sure bpf_obj_free_fields() is done only once and when map is
still present, it is called when unlinking an selem from b->list under
b->lock.
To make sure uncharging memory is done only when the owner is still
present in map_free(), block destroy() from returning until there is no
pending map_free().
Since smap may not be valid in destroy(), bpf_selem_unlink_nofail()
skips bpf_selem_unlink_storage_nolock_misc() when called from destroy().
This is okay as bpf_local_storage_destroy() will return the remaining
amount of memory charge tracked by mem_charge to the owner to uncharge.
It is also safe to skip clearing local_storage->owner and owner_storage
as the owner is being freed and no users or bpf programs should be able
to reference the owner and use local_storage.
Finally, access of selem, SDATA(selem)->smap and selem->local_storage
are racy. Callers will protect these fields with RCU.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-11-ameryhung@gmail.com
|
|
The next patch will introduce bpf_selem_unlink_nofail() to handle
rqspinlock errors. bpf_selem_unlink_nofail() will allow an selem to be
partially unlinked from map or local storage. Save memory allocation
method in selem so that later an selem can be correctly freed even when
SDATA(selem)->smap is init to NULL.
In addition, keep track of memory charge to the owner in local storage
so that later bpf_selem_unlink_nofail() can return the correct memory
charge to the owner. Updating local_storage->mem_charge is protected by
local_storage->lock.
Finally, extract miscellaneous tasks performed when unlinking an selem
from local_storage into bpf_selem_unlink_storage_nolock_misc(). It will
be reused by bpf_selem_unlink_nofail().
This patch also takes the chance to remove local_storage->smap, which
is no longer used since commit f484f4a3e058 ("bpf: Replace bpf memory
allocator with kmalloc_nolock() in local storage").
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-10-ameryhung@gmail.com
|
|
Percpu locks have been removed from cgroup and task local storage. Now
that all local storage no longer use percpu variables as locks preventing
recursion, there is no need to pass them to bpf_local_storage_map_free().
Remove the argument from the function.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-9-ameryhung@gmail.com
|
|
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-8-ameryhung@gmail.com
|
|
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge back bpf_task_storage_get()
and bpf_task_storage_get_recur(). This will allow the bpf syscalls and
helpers to run concurrently on the same CPU, removing the spurious
-EBUSY error. bpf_task_storage_get(..., F_CREATE) will now always
succeed with enough free memory unless being called recursively.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-7-ameryhung@gmail.com
|
|
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to syscall
return or BPF helper return.
In bpf_local_storage_destroy(), ignore return from
raw_res_spin_lock_irqsave() for now. A later patch will handle
errors correctly in bpf_local_storage_destroy() so that it can
unlink selems even when failing to acquire locks.
For __bpf_local_storage_map_cache(), instead of handling the error,
skip updating the cache.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-6-ameryhung@gmail.com
|
|
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
Open code bpf_selem_unlink_storage() in the only caller,
bpf_selem_unlink(), since unlink_map and unlink_storage must be done
together after all the necessary locks are acquired.
For bpf_local_storage_map_free(), ignore the return from
bpf_selem_unlink() for now. A later patch will allow it to unlink selems
even when failing to acquire locks.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-5-ameryhung@gmail.com
|
|
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-4-ameryhung@gmail.com
|
|
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take the b->lock before the
operation. There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from smap if cmpxchg owner_storage_ptr
fails, which should not fail. Therefore, hold b->lock when linking
until allocation completes. Helpers that assume b->lock is held by
callers are introduced: bpf_selem_link_map_nolock() and
bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should either all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() failed.
Remove the selem_linked_to_map_lockless() check as an selem in the
common paths (not bpf_local_storage_map_free() or
bpf_local_storage_destroy()), will be unlinked under b->lock and
local_storage->lock and therefore no other threads can unlink the selem
from map at the same time.
In bpf_local_storage_destroy(), ignore the return of
bpf_selem_unlink_map() for now. A later patch will allow
bpf_local_storage_destroy() to unlink selems even when failing to
acquire locks.
Note that while this patch removes all callers of selem_linked_to_map(),
a later patch that introduces bpf_selem_unlink_nofail() will use it
again.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-3-ameryhung@gmail.com
|
|
A later bpf_local_storage refactor will acquire all locks before
performing any update. To reduce the number of locks that need to be taken
in bpf_local_storage_map_update(), determine the bucket based on the
local_storage an selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_map_update(), locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selems
belong to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing the selem, hash the local_storage pointer
the selem belongs to.
Performance wise, this is slightly better as update now requires locking
one bucket. It should not change the level of contention on one bucket
as the pointers to local storages of selems in a map are just as unique
as pointers to selems.
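
A small userspace sketch of the bucket choice: hashing the owning local_storage pointer means an old selem and its replacement always land in the same bucket. The bucket count, the hash function, and all `demo_*` names are hypothetical; the kernel uses its own hash helpers.

```c
#include <stdint.h>

#define DEMO_BUCKETS 16u	/* assumed power-of-two bucket count */

struct demo_storage {
	int unused;
};

struct demo_bucket_selem {
	struct demo_storage *local_storage;
};

/* Simple multiplicative pointer hash, for illustration only. */
static unsigned int demo_hash_ptr(const void *p)
{
	uint64_t v = (uint64_t)(uintptr_t)p;

	v *= 0x9E3779B97F4A7C15ull;	/* 64-bit golden-ratio multiplier */
	return (unsigned int)(v >> 60) & (DEMO_BUCKETS - 1);
}

/* Bucket chosen from the owning local_storage, not the selem address. */
static unsigned int demo_select_bucket(const struct demo_bucket_selem *selem)
{
	return demo_hash_ptr(selem->local_storage);
}
```

Two selems that share a local_storage therefore need only one bucket lock during an update.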
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20260205222916.1788211-2-ameryhung@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fix from Steven Rostedt:
- Fix event format field alignments for 32 bit architectures
The fields in the event format files are used to parse the raw binary
buffer data by applications. If they are incorrect, then the
application produces garbage.
On 32 bit architectures, the function graph 64bit calltime and
rettime were off by 4bytes. That's because the actual fields are in a
packed structure but the macros used by the ftrace events did not
mark them as packed, and instead, gave them their natural alignment
which made their offsets off by 4 bytes.
There are macros to have a packed field within an embedded structure
of an event, but there's no macro for normal fields within a packed
structure of the event. The macro __field_packed() was used for the
packed embedded structure field. Rename that to __field_desc_packed()
(to match the non-packed embedded field macro __field_desc()), and
make __field_packed() for fields that are in a packed event structure
(which matches the unpacked __field() macro).
Switch the calltime and rettime fields of the function graph event to
use the new __field_packed() and this makes the offsets correct.
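
The offset mismatch can be illustrated in userspace with `offsetof()`. The field layout below is hypothetical and only stands in for a packed event structure; it relies on the GCC/Clang `packed` attribute.

```c
#include <stddef.h>
#include <stdint.h>

/* Packed layout, standing in for the structure written to the buffer. */
struct demo_packed {
	uint32_t depth;
	uint64_t calltime;
	uint64_t rettime;
} __attribute__((packed));

/* Same fields with natural alignment, as an unpacked description assumes. */
struct demo_natural {
	uint32_t depth;
	uint64_t calltime;
	uint64_t rettime;
};

static size_t demo_packed_calltime_offset(void)
{
	return offsetof(struct demo_packed, calltime);
}

static size_t demo_natural_calltime_offset(void)
{
	return offsetof(struct demo_natural, calltime);
}
```

In the packed structure the 64-bit field sits directly after the 32-bit one. Where the ABI aligns 64-bit integers to 8 bytes, the natural layout inserts 4 bytes of padding, so a format description based on natural alignment points 4 bytes past the real data.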
* tag 'trace-v6.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Fix ftrace event field alignments
|
|
When the 'kprobe_event=' kernel command-line parameter is not provided,
there is no need to execute setup_boot_kprobe_events().
This change optimizes the initialization function init_kprobe_trace()
by skipping unnecessary work and effectively prevents potential blocking
that could arise from contention on the event_mutex lock in subsequent
operations.
Link: https://patch.msgid.link/20260204015401.163748-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The init_blk_tracer() function causes significant boot delay as it
waits for the trace_event_sem lock held by trace_event_update_all().
Specifically, its child function register_trace_event() requires
this lock, which is occupied for an extended period during boot.
To resolve this, the execution of primary init_blk_tracer() is moved
to the trace_init_wq workqueue, allowing it to run asynchronously
without blocking the main boot thread.
Link: https://patch.msgid.link/20260204015353.163331-1-tianyaxiong@kylinos.cn
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
The eval_map_work_func() function, though queued in eval_map_wq,
holds the trace_event_sem read-write lock for a long time during
kernel boot. This causes blocking issues for other functions.
Rename eval_map_wq to trace_init_wq and make it global, thereby
allowing other parts of tracing to schedule work on this queue
asynchronously and avoiding blockage of the main boot thread.
Link: https://patch.msgid.link/20260204015344.162818-1-tianyaxiong@kylinos.cn
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor fixes for the DMA-mapping subsystem:
- check for the rare case of the allocation failure of the global CMA
pool (Shanker Donthineni)
- avoid perf buffer overflow when tracing large scatter-gather lists
(Deepanshu Kartikey)"
* tag 'dma-mapping-6.19-2026-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma: contiguous: Check return value of dma_contiguous_reserve_area()
tracing/dma: Cap dma_map_sg tracepoint arrays to prevent buffer overflow
|
|
Called when copy_process() is called to copy state to a new child.
Right now this is just a stub, but will be used shortly to properly
handle fork'ing of task based io_uring restrictions.
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
call_rcu_tasks_trace() is not safe from in_nmi() and not reentrant.
To prevent deadlock on raw_spin_lock_rcu_node(rtpcp) or memory corruption
defer to irq_work when IRQs are disabled. call_rcu_tasks_generic()
protects itself with local_irq_save().
Note when bpf_async_cb->refcnt drops to zero it's safe to reuse
bpf_async_cb->worker for a different irq_work callback, since
bpf_async_schedule_op() -> irq_work_queue(&cb->worker);
is only called when refcnt >= 1.
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260205190233.912-1-alexei.starovoitov@gmail.com
|
|
Currently, bpf_map_get_info_by_fd calculates and caches the hash of the
map regardless of the map's frozen state.
This leads to a TOCTOU bug where userspace can call
BPF_OBJ_GET_INFO_BY_FD to cache the hash and then modify the map
contents before freezing.
Therefore, a trusted loader can be tricked into verifying the stale hash
while loading the modified contents.
Fix this by returning -EPERM if the map is not frozen when the hash is
requested. This ensures the hash is only generated for the final,
immutable state of the map.
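The frozen-only rule can be sketched in userspace C as below. This is a toy model under stated assumptions, not the kernel code: struct toy_map, toy_hash() and toy_map_get_hash() are hypothetical names, and the FNV-1a checksum only stands in for whatever hash the kernel actually computes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BPF map; not the kernel's struct bpf_map. */
struct toy_map {
	bool frozen;      /* true once the contents can no longer change */
	uint8_t data[8];  /* toy map contents */
};

/* Placeholder checksum (FNV-1a); the real hash algorithm is not modeled. */
static uint32_t toy_hash(const uint8_t *p, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Refuse to hash a map whose contents may still be modified. */
static int toy_map_get_hash(const struct toy_map *m, uint32_t *out)
{
	assert(m != NULL && out != NULL);

	if (!m->frozen)
		return -EPERM;

	*out = toy_hash(m->data, sizeof(m->data));
	return 0;
}
```

In this model a caller that asks for the hash before freezing gets -EPERM, so there is no window in which a hash of soon-to-be-modified contents can be handed out.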
Fixes: ea2e6467ac36 ("bpf: Return hashes of maps in BPF_OBJ_GET_INFO_BY_FD")
Reported-by: Toshi Piazza <toshi.piazza@microsoft.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205070755.695776-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Practical BPF signatures are significantly smaller than
KMALLOC_MAX_CACHE_SIZE.
Allowing larger sizes opens the door for abuse by passing excessive
size values and forcing the kernel into expensive allocation paths (via
kmalloc_large or vmalloc).
Fixes: 349271568303 ("bpf: Implement signature verification for BPF programs")
Reported-by: Chris Mason <clm@meta.com>
Signed-off-by: KP Singh <kpsingh@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260205063807.690823-1-kpsingh@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The linker script scripts/module.lds.S specifies that all input
__klp_objects sections should be consolidated into an output section of
the same name, and start/stop symbols should be created to enable
scripts/livepatch/init.c to locate this data.
This start/stop pattern is not ideal for modules because the symbols are
created even if no __klp_objects input sections are present.
Consequently, a dummy __klp_objects section also appears in the
resulting module. This unnecessarily pollutes non-livepatch modules.
Instead, since modules are relocatable files, the usual method for
locating consolidated data in a module is to read its section table.
This approach avoids the aforementioned problem.
The klp_modinfo already stores a copy of the entire section table with
the final addresses. Introduce a helper function that
scripts/livepatch/init.c can call to obtain the location of the
__klp_objects section from this data.
Fixes: dd590d4d57eb ("objtool/klp: Introduce klp diff subcommand for diffing object files")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Acked-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Link: https://patch.msgid.link/20260123102825.3521961-2-petr.pavlu@suse.com
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
|
The fields of ftrace specific events (events used to save ftrace internal
events like function traces and trace_printk) are generated similarly to
how normal trace event fields are generated. That is, the fields are added
to a trace_events_fields array that saves the name, offset, size,
alignment and signedness of the field. It is used to produce the output in
the format file in tracefs so that tooling knows how to parse the binary
data of the trace events.
The issue is that some of the ftrace event structures are packed. The
function graph exit event structure is one of them. The 64 bit calltime
and rettime fields end up 4 byte aligned, but the algorithm that generates
the format for userspace shows them as 8 byte aligned.
The macros that create the ftrace events include one for embedded structure
fields. There are two macros for these fields:
__field_desc() and __field_packed()
The difference is that the latter macro treats the field as packed.
Rename that macro to __field_desc_packed() and repurpose
__field_packed() to be a normal field that is packed, and have the calltime
and rettime fields use it.
This showed up on 32bit architectures for function graph time fields. It
had:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:24; size:8; signed:0;
field:unsigned long long rettime; offset:32; size:8; signed:0;
Notice that overrun is at offset 16 with size 4, where in the structure
calltime is at offset 20 (16 + 4), but it shows the offset at 24. That's
because it used the alignment of unsigned long long when used as a
declaration and not as a member of a structure where it would be aligned
by word size (in this case 4).
By using the proper structure alignment, the format has it at the correct
offset:
~# cat /sys/kernel/tracing/events/ftrace/funcgraph_exit/format
[..]
field:unsigned long func; offset:8; size:4; signed:0;
field:unsigned int depth; offset:12; size:4; signed:0;
field:unsigned int overrun; offset:16; size:4; signed:0;
field:unsigned long long calltime; offset:20; size:8; signed:0;
field:unsigned long long rettime; offset:28; size:8; signed:0;
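The offsets above can be reproduced with a small userspace sketch. It is illustrative only: the struct names are hypothetical, the 8-byte toy_hdr stands in for the common trace entry header (its size is inferred from func sitting at offset 8), uint32_t models the 32-bit unsigned long, and __attribute__((packed)) assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 8-byte placeholder for the common trace entry header (assumed size). */
struct toy_hdr {
	uint8_t bytes[8];
};

/* Naturally aligned layout: the compiler may pad before calltime. */
struct exit_natural {
	struct toy_hdr ent;
	uint32_t func;      /* unsigned long on a 32-bit target */
	uint32_t depth;
	uint32_t overrun;
	uint64_t calltime;
	uint64_t rettime;
};

/* Packed layout: no padding, so calltime directly follows overrun. */
struct exit_packed {
	struct toy_hdr ent;
	uint32_t func;
	uint32_t depth;
	uint32_t overrun;
	uint64_t calltime;
	uint64_t rettime;
} __attribute__((packed));

/* The packed offsets match the corrected format file: 16 + 4 = 20, 20 + 8 = 28. */
static_assert(offsetof(struct exit_packed, overrun) == 16, "overrun offset");
static_assert(offsetof(struct exit_packed, calltime) == 20, "calltime offset");
static_assert(offsetof(struct exit_packed, rettime) == 28, "rettime offset");
```

On an ABI that aligns uint64_t members to 8 bytes, offsetof(struct exit_natural, calltime) is padded up to 24, which is the kind of mismatch the old format output showed.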
Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reported-by: "jempty.liang" <imntjempty@163.com>
Link: https://patch.msgid.link/20260204113628.53faec78@gandalf.local.home
Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()")
Closes: https://lore.kernel.org/all/20260130015740.212343-1-imntjempty@163.com/
Closes: https://lore.kernel.org/all/20260202123342.2544795-1-imntjempty@163.com/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Replace prog and callback in bpf_async_cb after removing visibility of
bpf_async_cb in bpf_async_cancel_and_free() to increase the chances that
scheduled async callbacks short-circuit and exit early, without
starting an RCU tasks trace section. This reduces the overall time
spent running the wq selftest.
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When freeing a bpf_async_cb in bpf_async_cb_rcu_tasks_trace_free(), in
case the wq callback is not scheduled, doing cancel_work() currently
returns false and leads to retry of RCU tasks trace grace period. If the
callback is never scheduled, we keep retrying indefinitely and don't put
the prog reference.
Since the only race we care about here is against a potentially running
wq callback in the first grace period, it should finish by the second
grace period, hence check work_busy() result to detect presence of
running wq callback if it's not pending, otherwise free the object
immediately without retrying.
Reasoning behind the check and its correctness with a racing wq callback
invocation: cancel_work() is supposed to be synchronized, so calling it
first and getting false means that the work is definitely not pending.
At this point, either the work is not scheduled at all, or it is already
running, or we raced and it already finished by the time we checked for
it using work_busy(). In case it is running, we synchronize using
pool->lock to check the current work running there. If it matches, we
extend the wait by another grace period using retry = true. Otherwise,
either the work already finished running or was never scheduled, so we
can free the bpf_async_cb right away.
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260205003853.527571-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Five hotfixes. Two are cc:stable, two are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-02-04-15-55' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Documentation: document liveupdate cmdline parameter
mm, shmem: prevent infinite loop on truncate race
mailmap: update Alexander Mikhalitsyn's emails
liveupdate: luo_file: do not clear serialized_data on unfreeze
x86/kfence: fix booting on 32bit non-PAE systems
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fix from Tejun Heo:
- Fix race where sched_class operations (sched_setscheduler() and
friends) could be invoked on dead tasks after sched_ext_dead()
already ran, causing invalid SCX task state transitions and NULL
pointer dereferences.
This was a regression from the cgroup exit ordering fix which
moved sched_ext_free() to finish_task_switch().
* tag 'sched_ext-for-6.19-rc8-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Short-circuit sched_class operations on dead tasks
|
|
7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free()
to finish_task_switch()") moved sched_ext_free() to finish_task_switch() and
renamed it to sched_ext_dead() to fix cgroup exit ordering issues. However,
this created a race window where certain sched_class ops may be invoked on
dead tasks leading to failures - e.g. sched_setscheduler() may try to switch a
task which finished sched_ext_dead() back into SCX triggering invalid SCX task
state transitions.
Add task_dead_and_done() which tests whether a task is TASK_DEAD and has
completed its final context switch, and use it to short-circuit sched_class
operations which may be called on dead tasks.
Fixes: 7900aa699c34 ("sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()")
Reported-by: Andrea Righi <arighi@nvidia.com>
Link: http://lkml.kernel.org/r/20260202151341.796959-1-arighi@nvidia.com
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Previously, the verifier only tracked positive constant deltas between
linked registers using BPF_ADD. This limitation meant that bounds could not
be propagated in patterns like:
r1 = r0;
r1 += -4;
if r1 s>= 0 goto l0_%=; // r1 >= 0 implies r0 >= 4
// verifier couldn't propagate bounds back to r0
if r0 != 0 goto l0_%=;
r0 /= 0; // Verifier thinks this is reachable
l0_%=:
A similar limitation exists for 32-bit registers.
With this change, the verifier can now track negative deltas in reg->off
enabling bound propagation for the above pattern.
For alu32, we make sure the destination register has the upper 32 bits
as 0s before creating the link. BPF_ADD_CONST is split into
BPF_ADD_CONST64 and BPF_ADD_CONST32, the latter is used in case of alu32
and sync_linked_regs uses this to zext the result if known_reg has this
flag.
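The idea of propagating bounds across a constant delta can be sketched as follows. This is a simplified userspace model, not the verifier's sync_linked_regs(): struct toy_range and toy_sync_linked() are hypothetical, only signed 64-bit bounds are modeled, the sketch assumes r1 = r0 + delta did not wrap, and __builtin_sub_overflow() assumes GCC or Clang.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy signed range; a stand-in for smin/smax, not struct bpf_reg_state. */
struct toy_range {
	int64_t smin;
	int64_t smax;
};

/*
 * Given r1 = r0 + delta and a refined range for r1, tighten r0 to the
 * intersection of its old range and (r1 - delta). Returns false and
 * leaves r0 untouched if the subtraction would overflow.
 */
static bool toy_sync_linked(struct toy_range *r0,
			    const struct toy_range *r1, int64_t delta)
{
	int64_t lo, hi;

	assert(r0->smin <= r0->smax && r1->smin <= r1->smax);

	if (__builtin_sub_overflow(r1->smin, delta, &lo) ||
	    __builtin_sub_overflow(r1->smax, delta, &hi))
		return false;

	if (lo > r0->smin)
		r0->smin = lo;
	if (hi < r0->smax)
		r0->smax = hi;
	return true;
}
```

For example, with r0 in [0, 10], delta = -4 and r1 refined to [0, 6] by the branch, the call tightens r0 to [4, 10], so a later r0 != 0 check is known to be true.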
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204151741.2678118-2-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This patch implements bitwise tracking (tnum analysis) for the BPF_END
(byte swap) operation.
Currently, the BPF verifier does not track the value for the BPF_END operation,
treating the result as completely unknown. This limits the verifier's
ability to prove safety of programs that perform endianness conversions,
which are common in networking code.
For example, the following code pattern for port number validation:
int test(struct pt_regs *ctx) {
__u64 x = bpf_get_prandom_u32();
x &= 0x3f00; // Range: [0, 0x3f00], var_off: (0x0; 0x3f00)
x = bswap16(x); // Should swap to range [0, 0x3f], var_off: (0x0; 0x3f)
if (x > 0x3f) goto trap;
return 0;
trap:
return *(u64 *)NULL; // Should be unreachable
}
Currently generates verifier output:
1: (54) w0 &= 16128 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=16128,var_off=(0x0; 0x3f00))
2: (d7) r0 = bswap16 r0 ; R0=scalar()
3: (25) if r0 > 0x3f goto pc+2 ; R0=scalar(smin=smin32=0,smax=umax=smax32=umax32=63,var_off=(0x0; 0x3f))
Without this patch, even though the verifier knows which bits of `x` may be
set, after bswap16, it loses all tracking information and treats `x`
as having a completely unknown value in [0, 65535].
According to the BPF instruction set[1], there are 3 kinds of BPF_END:
1. `bswap(16|32|64)`: opcode=0xd7 (BPF_END | BPF_ALU64 | BPF_TO_LE)
- do unconditional swap
2. `le(16|32|64)`: opcode=0xd4 (BPF_END | BPF_ALU | BPF_TO_LE)
- on big-endian: do swap
- on little-endian: truncation (16/32-bit) or no-op (64-bit)
3. `be(16|32|64)`: opcode=0xdc (BPF_END | BPF_ALU | BPF_TO_BE)
- on little-endian: do swap
- on big-endian: truncation (16/32-bit) or no-op (64-bit)
Since BPF_END operations are inherently bit-wise permutations, tnum
(bitwise tracking) offers the most efficient and precise mechanism
for value analysis. By implementing `tnum_bswap16`, `tnum_bswap32`,
and `tnum_bswap64`, we can derive exact `var_off` values concisely,
directly reflecting the bit-level changes.
Here is the overview of changes:
1. In `tnum_bswap(16|32|64)` (kernel/bpf/tnum.c):
Call `swab(16|32|64)` function on the value and mask of `var_off`, and
do truncation for 16/32-bit cases.
2. In `adjust_scalar_min_max_vals` (kernel/bpf/verifier.c):
Call helper function `scalar_byte_swap`.
- Only do byte swap when
* alu64 (unconditional swap) OR
* switching between big-endian and little-endian machines.
- If a byte swap is needed:
* First call `tnum_bswap(16|32|64)` to update `var_off`.
* Then reset the bounds since byte swap scrambles the range.
- For 16/32-bit cases, truncate dst register to match the swapped size.
This enables better verification of networking code that frequently uses
byte swaps for protocol processing, reducing false positive rejections.
[1] https://www.kernel.org/doc/Documentation/bpf/standardization/instruction-set.rst
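A minimal userspace sketch of the 16-bit case described above, assuming a struct tnum with the same value/mask shape as the kernel's (a bit is unknown when its mask bit is 1). The helper names toy_swab16(), toy_tnum_bswap16() and toy_tnum_umax() are hypothetical and this is not the code in kernel/bpf/tnum.c.

```c
#include <assert.h>
#include <stdint.h>

/* Tristate number: mask bit 1 = unknown bit, mask bit 0 = bit equals value. */
struct tnum {
	uint64_t value;
	uint64_t mask;
};

/* Swap the two bytes of a 16-bit quantity. */
static uint16_t toy_swab16(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

/*
 * Byte-swap value and mask independently and truncate to 16 bits.
 * A byte swap only permutes bits, so known bits stay known.
 */
static struct tnum toy_tnum_bswap16(struct tnum a)
{
	struct tnum r;

	assert((a.value & a.mask) == 0); /* well-formed tnum invariant */

	r.value = toy_swab16((uint16_t)a.value);
	r.mask = toy_swab16((uint16_t)a.mask);
	return r;
}

/* Largest value the tnum can take: every unknown bit set. */
static uint64_t toy_tnum_umax(struct tnum a)
{
	return a.value | a.mask;
}
```

Applied to the example's var_off of (0x0; 0x3f00), toy_tnum_bswap16() yields (0x0; 0x3f), whose maximum is 0x3f, so the `x > 0x3f` branch can be proven dead.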
Co-developed-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Signed-off-by: Shenghao Yuan <shenghaoyuan0928@163.com>
Co-developed-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Yazhou Tang <tangyazhou518@outlook.com>
Signed-off-by: Tianci Cao <ziye@zju.edu.cn>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20260204111503.77871-2-ziye@zju.edu.cn
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Do not schedule a timer/wq operation on a CPU that is in the irq_work
callback processing the async_cmds queue.
Otherwise the following loop is possible:
bpf_timer_start() -> bpf_async_schedule_op() -> irq_work_queue().
irqrestore -> bpf_async_irq_worker() -> tracepoint -> bpf_timer_start().
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260204055147.54960-4-alexei.starovoitov@gmail.com
|
|
Though hrtimer_start/cancel() inlines all of the smaller helpers in
hrtimer.c and only calls timerqueue_add/del() from lib/timerqueue.c where
everything is not traceable and not kprobe-able (because all files in
lib/ are not traceable), there are tracepoints within hrtimer that are
called with locks held. Therefore prevent the deadlock by tightening
conditions when timer/wq can be called synchronously.
hrtimer/wq are using raw_spin_lock_irqsave(), so irqs_disabled() is enough.
Fixes: 1bfbc267ec91 ("bpf: Enable bpf_timer and bpf_wq in any context")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20260204055147.54960-2-alexei.starovoitov@gmail.com
|