summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2026-01-01rcu: Add noinstr-fast rcu_read_{,un}lock_tasks_trace() APIsPaul E. McKenney
When expressing RCU Tasks Trace in terms of SRCU-fast, it was necessary to keep a nesting count and per-CPU srcu_ctr structure pointer in the task_struct structure, which is slow to access. But an alternative is to instead make rcu_read_lock_tasks_trace() and rcu_read_unlock_tasks_trace(), which match the underlying SRCU-fast semantics, avoiding the task_struct accesses. When all callers have switched to the new API, the previous rcu_read_lock_trace() and rcu_read_unlock_trace() APIs will be removed. The rcu_read_{,un}lock_{,tasks_}trace() functions need to use smp_mb() only if invoked where RCU is not watching, that is, from locations where a call to rcu_is_watching() would return false. In architectures that define the ARCH_WANTS_NO_INSTR Kconfig option, use of noinstr and friends ensures that tracing happens only where RCU is watching, so those architectures can dispense entirely with the read-side calls to smp_mb(). Other architectures include these read-side calls by default, but in many installations there might be either larger than average tolerance for risk, prohibition of removing tracing on a running system, or careful review and approval of removing of tracing. Such installations can build their kernels with CONFIG_TASKS_TRACE_RCU_NO_MB=y to avoid those read-side calls to smp_mb(), thus accepting responsibility for run-time removal of tracing from code regions that RCU is not watching. Those wishing to disable read-side memory barriers for an entire architecture can select this TASKS_TRACE_RCU_NO_MB Kconfig option, hence the polarity. [ paulmck: Apply Peter Zijlstra feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01rcu: Move rcu_tasks_trace_srcu_struct out of #ifdef CONFIG_TASKS_RCU_GENERICPaul E. McKenney
Moving the rcu_tasks_trace_srcu_struct structure instance out from under the CONFIG_TASKS_RCU_GENERIC Kconfig option permits the CONFIG_TASKS_TRACE_RCU Kconfig option to stop enabling this CONFIG_TASKS_RCU_GENERIC Kconfig option. This commit also therefore makes it so. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01rcu: Clean up after the SRCU-fastification of RCU Tasks TracePaul E. McKenney
Now that RCU Tasks Trace has been re-implemented in terms of SRCU-fast, the ->trc_ipi_to_cpu, ->trc_blkd_cpu, ->trc_blkd_node, ->trc_holdout_list, and ->trc_reader_special task_struct fields are no longer used. In addition, the rcu_tasks_trace_qs(), rcu_tasks_trace_qs_blkd(), exit_tasks_rcu_finish_trace(), and rcu_spawn_tasks_trace_kthread(), show_rcu_tasks_trace_gp_kthread(), rcu_tasks_trace_get_gp_data(), rcu_tasks_trace_torture_stats_print(), and get_rcu_tasks_trace_gp_kthread() functions and all the other functions that they invoke are no longer used. Also, the TRC_NEED_QS and TRC_NEED_QS_CHECKED CPP macros are no longer used. Neither are the rcu_tasks_trace_lazy_ms and rcu_task_ipi_delay rcupdate module parameters and the TASKS_TRACE_RCU_READ_MB Kconfig option. This commit therefore removes all of them. [ paulmck: Apply Alexei Starovoitov feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01context_tracking: Remove rcu_task_trace_heavyweight_{enter,exit}()Paul E. McKenney
Because SRCU-fast does not use IPIs for its grace periods, there is no need for real-time workloads to switch to an IPI-free mode, and there is in turn no need for either rcu_task_trace_heavyweight_enter() or rcu_task_trace_heavyweight_exit(). This commit therefore removes them. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2026-01-01rcu: Re-implement RCU Tasks Trace in terms of SRCU-fastPaul E. McKenney
This commit saves more than 500 lines of RCU code by re-implementing RCU Tasks Trace in terms of SRCU-fast. Follow-up work will remove more code that does not cause problems by its presence, but that is no longer required. This variant places smp_mb() in rcu_read_{,un}lock_trace(), and in the same place that srcu_read_{,un}lock() would put them. These smp_mb() calls will be removed on common-case architectures in a later commit. In the meantime, it serves to enforce ordering between the underlying srcu_read_{,un}lock_fast() markers and the intervening critical section, even on architectures that permit attaching tracepoints on regions of code not watched by RCU. Such architectures defeat SRCU-fast's use of implicit single-instruction, interrupts-disabled, and atomic-operation RCU read-side critical sections, which have no effect when RCU is not watching. The aforementioned later commit will insert these smp_mb() calls only on architectures that have not used noinstr to prevent attaching tracepoints to code where RCU is not watching. [ paulmck: Apply kernel test robot, Boqun Feng, and Zqiang feedback. ] [ paulmck: Split out Tiny SRCU fixes per Andrii Nakryiko feedback. ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: kernel test robot <oliver.sang@intel.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bpf@vger.kernel.org Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-12-31dma-mapping: add DMA_ATTR_CPU_CACHE_CLEANMichael S. Tsirkin
When multiple small DMA_FROM_DEVICE or DMA_BIDIRECTIONAL buffers share a cacheline, and DMA_API_DEBUG is enabled, we get this warning: cacheline tracking EEXIST, overlapping mappings aren't supported. This is because when one of the mappings is removed, while another one is active, CPU might write into the buffer. Add an attribute for the driver to promise not to do this, making the overlapping safe, and suppressing the warning. Message-ID: <2d5d091f9d84b68ea96abd545b365dd1d00bbf48.1767601130.git.mst@redhat.com> Reviewed-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-12-31bpf: allow states pruning for misc/invalid slots in iterator loopsEduard Zingerman
Within an iterator or callback based loop, it should be safe to prune the current state if the old state stack slot is marked as STACK_INVALID or STACK_MISC: - either all branches of the old state lead to a program exit; - or some branch of the old state leads the current state. This is the same logic as applied in non-loop cases when states_equal() is called in NOT_EXACT mode. The test case that exercises stacksafe() and demonstrates the difference in verification performance is included in the next patch. I'm not sure if it is possible to prepare a test case that exercises regsafe(); it appears that the compute_live_registers() pass makes this impossible. Nevertheless, for code readability reasons, I think that stacksafe() and regsafe() should handle STACK_INVALID / NOT_INIT symmetrically. Hence, this commit changes both functions. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251230-loop-stack-misc-pruning-v1-1-585cfd6cec51@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-30bpf: bpf_scc_visit instance and backedges accumulation for bpf_loop()Eduard Zingerman
Calls like bpf_loop() or bpf_for_each_map_elem() introduce loops that are not explicitly present in the control-flow graph. The verifier processes such calls by repeatedly interpreting the callback function body within the same verification path (until the current state converges with a previous state). Such loops require a bpf_scc_visit instance in order to allow the accumulation of the state graph backedges. Otherwise, certain checkpoint states created within the bodies of such loops will have incomplete precision marks. See the next patch for an example of a program that leads to the verifier accepting an unsafe program. Fixes: 96c6aa4c63af ("bpf: compute SCCs in program control flow graph") Fixes: c9e31900b54c ("bpf: propagate read/precision marks over state graph backedges") Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20251229-scc-for-callbacks-v1-1-ceadfe679900@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-29Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "27 hotfixes. 12 are cc:stable, 18 are MM. There's a patch series from Jiayuan Chen which fixes some issues with KASAN and vmalloc. Apart from that it's the usual shower of singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits) mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry mm/page_owner: fix memory leak in page_owner_stack_fops->release() mm/memremap: fix spurious large folio warning for FS-DAX MAINTAINERS: notify the "Device Memory" community of memory hotplug changes sparse: update MAINTAINERS info mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU mm: consider non-anon swap cache folios in folio_expected_ref_count() rust: maple_tree: rcu_read_lock() in destructor to silence lockdep mm: memcg: fix unit conversion for K() macro in OOM log mm: fixup pfnmap memory failure handling to use pgoff tools/mm/page_owner_sort: fix timestamp comparison for stable sorting selftests/mm: fix thread state check in uffd-unit-tests kernel/kexec: fix IMA when allocation happens in CMA area kernel/kexec: change the prototype of kimage_map_segment() MAINTAINERS: add ABI headers to KHO and LIVE UPDATE .mailmap: remove one of the entries for WangYuli mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry() MAINTAINERS: update one straggling entry for Bartosz Golaszewski mm/page_alloc: change all pageblocks migrate type on coalescing mm: leafops.h: correct kernel-doc function param. names ...
2025-12-28Merge tag 'sched_ext-for-6.19-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Fix uninitialized @ret on alloc_percpu() failure leading to ERR_PTR(0) - Fix PREEMPT_RT warning when bypass load balancer sends IPI to offline CPU by using resched_cpu() instead of resched_curr() - Fix comment referring to renamed function - Update scx_show_state.py for scx_root and scx_aborting changes * tag 'sched_ext-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: tools/sched_ext: update scx_show_state.py for scx_aborting change tools/sched_ext: fix scx_show_state.py for scx_root change sched_ext: Use the resched_cpu() to replace resched_curr() in the bypass_lb_node() sched_ext: Fix some comments in ext.c sched_ext: fix uninitialized ret on alloc_percpu() failure
2025-12-28Merge tag 'cgroup-for-6.19-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: - Fix a spurious cpuset warning when disabling remote partition after CPU hotplug leaves subpartitions_cpus empty. Guard the warning and invalidate affected partitions. * tag 'cgroup-for-6.19-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: fix warning when disabling remote partition
2025-12-28PM: sleep: Fix suspend_test() at the TEST_CORE levelRafael J. Wysocki
Commit a10ad1b10402 ("PM: suspend: Make pm_test delay interruptible by wakeup events") replaced mdelay() in suspend_test() with msleep() which does not work at the TEST_CORE test level that calls suspend_test() while running on one CPU with interrupts off. Address this by making suspend_test() check if the test level is suitable for using msleep() and use mdelay() otherwise. Fixes: a10ad1b10402 ("PM: suspend: Make pm_test delay interruptible by wakeup events") Reported-by: Sebastian Reichel <sebastian.reichel@collabora.com> Closes: https://lore.kernel.org/linux-pm/aUsAk0k1N9hw8IkY@venus/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Sebastian Reichel <sebastian.reichel@collabora.com> Link: https://patch.msgid.link/6251576.lOV4Wx5bFT@rafael.j.wysocki
2025-12-26Merge tag 'efi-fixes-for-v6.19-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: "A couple of fixes for EFI regressions introduced this cycle: - Make EDID handling in the EFI stub mixed mode safe - Ensure that efi_mm.user_ns has a sane value - this is needed now that EFI runtime calls are preemptible on arm64" * tag 'efi-fixes-for-v6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: kthread: Warn if mm_struct lacks user_ns in kthread_use_mm() arm64: efi: Fix NULL pointer dereference by initializing user_ns efi/libstub: gop: Fix EDID support in mixed-mode
2025-12-26irqdomain: Export irq_domain_free_irqs()Aaron Kling
Export irq_domain_free_irqs() to allow PCI/MSI drivers like pci-tegra to be built as a module. Signed-off-by: Aaron Kling <webgeek1234@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250731-pci-tegra-module-v7-1-cad4b088b8fb@gmail.com
2025-12-24kthread: Warn if mm_struct lacks user_ns in kthread_use_mm()Breno Leitao
Add a WARN_ON_ONCE() check to detect mm_struct instances that are missing user_ns initialization when passed to kthread_use_mm(). When a kthread adopts an mm via kthread_use_mm(), LSM hooks and capability checks may access current->mm->user_ns for credential validation. If user_ns is NULL, this leads to a NULL pointer dereference crash. This was observed with efi_mm on arm64, where commit a5baf582f4c0 ("arm64/efi: Call EFI runtime services without disabling preemption") introduced kthread_use_mm(&efi_mm), but efi_mm lacked user_ns initialization, causing crashes during /proc access. Adding this warning helps catch similar bugs early during development rather than waiting for hard-to-debug NULL pointer crashes in production. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Rik van Riel <riel@surriel.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2025-12-23bpf: arena: make arena kfuncs any context safePuranjay Mohan
Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_spans and raises an irq_work. The irq_work handler calls schedules_work() as it is safe to be called from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23bpf: arena: use kmalloc_nolock() in place of kvcalloc()Puranjay Mohan
To make arena_alloc_pages() safe to be called from any context, replace kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any locks. kmalloc_nolock() returns NULL for allocations larger than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024 * 8 and reuse the array in a loop. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23bpf: arena: populate vm_area without allocating memoryPuranjay Mohan
vm_area_map_pages() may allocate memory while inserting pages into bpf arena's vm_area. In order to make bpf_arena_alloc_pages() kfunc non-sleepable change bpf arena to populate pages without allocating memory: - at arena creation time populate all page table levels except the last level - when new pages need to be inserted call apply_to_page_range() again with apply_range_set_cb() which will only set_pte_at() those pages and will not allocate memory. - when freeing pages call apply_to_existing_page_range with apply_range_clear_cb() to clear the pte for the page to be removed. This doesn't free intermediate page table levels. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-23kernel/kexec: fix IMA when allocation happens in CMA areaPingfan Liu
*** Bug description *** When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- *** How to reproduce *** This bug is only triggered when the kexec target address is allocated in the CMA area. If no CMA area is reserved in the kernel, use the "cma=" option in the kernel command line to reserve one. *** Root cause *** The commit 07d24902977e ("kexec: enable CMA based contiguous allocation") allocates the kexec target address directly on the CMA area to avoid copying during the jump. In this case, there is no IND_SOURCE for the kexec segment. But the current implementation of kimage_map_segment() assumes that IND_SOURCE pages exist and map them into a contiguous virtual address by vmap(). *** Solution *** If IMA segment is allocated in the CMA area, use its page_address() directly. Link: https://lkml.kernel.org/r/20251216014852.8737-2-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23kernel/kexec: change the prototype of kimage_map_segment()Pingfan Liu
The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-22bpf: crypto: replace -EEXIST with -EBUSYDaniel Gomez
The -EEXIST error code is reserved by the module loading infrastructure to indicate that a module is already loaded. When a module's init function returns -EEXIST, userspace tools like kmod interpret this as "module already loaded" and treat the operation as successful, returning 0 to the user even though the module initialization actually failed. This follows the precedent set by commit 54416fd76770 ("netfilter: conntrack: helper: Replace -EEXIST by -EBUSY") which fixed the same issue in nf_conntrack_helper_register(). This affects bpf_crypto_skcipher module. While the configuration required to build it as a module is unlikely in practice, it is technically possible, so fix it for correctness. Signed-off-by: Daniel Gomez <da.gomez@samsung.com> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/r/20251220-dev-module-init-eexists-bpf-v1-1-7f186663dbe7@samsung.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22bpf: allow calling kfuncs from raw_tp programsPuranjay Mohan
Associate raw tracepoint program type with the kfunc tracing hook. This allows calling kfuncs from raw_tp programs. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251222133250.1890587-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()Zqiang
llist_add() returns true only when adding to an empty list, which indicates that no IRQ work is currently queued or running. Therefore, we only need to call irq_work_queue() when llist_add() returns true, to avoid unnecessarily re-queueing IRQ work that is already pending or executing. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22sched_ext: Use the resched_cpu() to replace resched_curr() in the ↵Zqiang
bypass_lb_node() For the PREEMPT_RT kernels, the scx_bypass_lb_timerfn() running in the preemptible per-CPU ktimer kthread context, this means that the following scenarios will occur(for x86 platform): cpu1 cpu2 ktimer kthread: ->scx_bypass_lb_timerfn ->bypass_lb_node ->for_each_cpu(cpu, resched_mask) migration/1: by preempt by migration/2: multi_cpu_stop() multi_cpu_stop() ->take_cpu_down() ->__cpu_disable() ->set cpu1 offline ->rq1 = cpu_rq(cpu1) ->resched_curr(rq1) ->smp_send_reschedule(cpu1) ->native_smp_send_reschedule(cpu1) ->if(unlikely(cpu_is_offline(cpu))) { WARN(1, "sched: Unexpected reschedule of offline CPU#%d!\n", cpu); return; } This commit therefore use the resched_cpu() to replace resched_curr() in the bypass_lb_node() to avoid send-ipi to offline CPUs. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22cpuset: remove dead code in cpuset-v1.cChen Ridong
The commit 6e1d31ce495c ("cpuset: separate generate_sched_domains for v1 and v2") introduced dead code that was originally added for cpuset-v2 partition domain generation. Remove the redundant root_load_balance check. Fixes: 6e1d31ce495c ("cpuset: separate generate_sched_domains for v1 and v2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/cgroups/9a442808-ed53-4657-988b-882cc0014c0d@huaweicloud.com/T/ Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-22module/decompress: Avoid open-coded kvrealloc()Kees Cook
Replace open-coded allocate/copy with kvrealloc(). Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22module: Remove SHA-1 support for module signingPetr Pavlu
SHA-1 is considered deprecated and insecure due to vulnerabilities that can lead to hash collisions. Most distributions have already been using SHA-2 for module signing because of this. The default was also changed last year from SHA-1 to SHA-512 in commit f3b93547b91a ("module: sign with sha512 instead of sha1 by default"). This was not reported to cause any issues. Therefore, it now seems to be a good time to remove SHA-1 support for module signing. Commit 16ab7cb5825f ("crypto: pkcs7 - remove sha1 support") previously removed support for reading PKCS#7/CMS signed with SHA-1, along with the ability to use SHA-1 for module signing. This change broke iwd and was subsequently completely reverted in commit 203a6763ab69 ("Revert "crypto: pkcs7 - remove sha1 support""). However, dropping only the support for using SHA-1 for module signing is unrelated and can still be done separately. Note that this change only removes support for new modules to be SHA-1 signed, but already signed modules can still be loaded. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22module: replace use of system_wq with system_dfl_wqMarco Crivellari
Currently if a user enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. This continues the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") Switch to using system_dfl_wq, the new unbound workqueue, because the users do not benefit from a per-cpu workqueue. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-22params: Replace __modinit with __init_or_modulePetr Pavlu
Remove the custom __modinit macro from kernel/params.c and instead use the common __init_or_module macro from include/linux/module.h. Both provide the same functionality. Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Reviewed-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
2025-12-21Merge tag 'irq-urgent-2025-12-21' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Ingo Molnar: "Fix IRQ thread affinity flags setup regression" * tag 'irq-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Don't overwrite interrupt thread flags on setup
2025-12-21bpf: annotate file argument as __nullable in bpf_lsm_mmap_fileMatt Bobrowski
As reported in [0], anonymous memory mappings are not backed by a struct file instance. Consequently, the struct file pointer passed to the security_mmap_file() LSM hook is NULL in such cases. The BPF verifier is currently unaware of this, allowing BPF LSM programs to dereference this struct file pointer without needing to perform an explicit NULL check. This leads to potential NULL pointer dereference and a kernel crash. Add a strong override for bpf_lsm_mmap_file() which annotates the struct file pointer parameter with the __nullable suffix. This explicitly informs the BPF verifier that this pointer (PTR_MAYBE_NULL) can be NULL, forcing BPF LSM programs to perform a check on it before dereferencing it. [0] https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/5e460d3c.4c3e9.19adde547d8.Coremail.kaiyanm@hust.edu.cn/ Signed-off-by: Matt Bobrowski <mattbobrowski@google.com> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251216133000.3690723-1-mattbobrowski@google.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21bpf: arm64: Optimize recursion detection by not using atomicsPuranjay Mohan
BPF programs detect recursion using a per-CPU 'active' flag in struct bpf_prog. The trampoline currently sets/clears this flag with atomic operations. On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic operations are relatively slow. Unlike x86_64 - where per-CPU updates can avoid cross-core atomicity, arm64 LSE atomics are always atomic across all cores, which is unnecessary overhead for strictly per-CPU state. This patch removes atomics from the recursion detection path on arm64 by changing 'active' to a per-CPU array of four u8 counters, one per context: {NMI, hard-irq, soft-irq, normal}. The running context uses a non-atomic increment/decrement on its element. After increment, recursion is detected by reading the array as a u32 and verifying that only the expected element changed; any change in another element indicates inter-context recursion, and a value > 1 in the same element indicates same-context recursion. For example, starting from {0,0,0,0}, a normal-context trigger changes the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers the program, the array becomes {1,0,0,1}. When the NMI context checks the u32 against the expected mask for normal (0x00000001), it observes 0x01000001 and correctly reports recursion. Same-context recursion is detected analogously. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21bpf: move recursion detection logic to helpersPuranjay Mohan
BPF programs detect recursion by doing atomic inc/dec on a per-cpu active counter from the trampoline. Create two helpers for operations on this active counter, this makes it easy to changes the recursion detection logic in future. This commit makes no functional changes. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-19sched_ext: Fix some comments in ext.cZqiang
This commit update balance_scx() in the comments to balance_one(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-19sched/fair: Fix sched_avg foldPeter Zijlstra
After the robot reported a regression wrt commit: 089d84203ad4 ("sched/fair: Fold the sched_avg update"), Shrikanth noted that two spots missed a factor se_weight(). Fixes: 089d84203ad4 ("sched/fair: Fold the sched_avg update") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512181208.753b9f6e-lkp@intel.com Debugged-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251218102020.GO3707891@noisy.programming.kicks-ass.net
2025-12-19perf: Use EXPORT_SYMBOL_FOR_KVM() for the mediated APIsPeter Zijlstra
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251208115156.GE3707891@noisy.programming.kicks-ass.net
2025-12-19irqdomain: Fix up const problem in irq_domain_set_name()Greg Kroah-Hartman
In irq_domain_set_name() a const pointer is passed in, and then the const is "lost" when container_of() is called. Fix this up by properly preserving the const pointer attribute when container_of() is used to enforce the fact that this pointer should not have anything at it changed. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/2025121731-facing-unhitched-63ae@gregkh
2025-12-19Merge tag 'trace-v6.19-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Add Documentation/core-api/tracepoint.rst to TRACING in MAINTAINERS file Updates to the tracepoint.rst document should be reviewed by the tracing maintainers. - Fix warning triggered by perf attaching to synthetic events The synthetic events do not add a function to be registered when perf attaches to them. This causes a warning when perf registers a synthetic event and passes a NULL pointer to the tracepoint register function. Ideally synthetic events should be updated to work with perf, but as that's a feature and not a bug fix, simply now return -ENODEV when perf tries to register an event that has a NULL pointer for its function. This no longer causes a kernel warning and simply causes the perf code to fail with an error message. - Fix 32bit overflow in option flag test The option's flags changed from 32 bits in size to 64 bits in size. Fix one of the places that shift 1 by the option bit number to to be 1ULL. - Fix the output of printing the direct jmp functions The enabled_functions that shows how functions are being attached by ftrace wasn't updated to accommodate the new direct jmp trampolines that set the LSB of the pointer, and outputs garbage. Update the output to handle the direct jmp trampolines. * tag 'trace-v6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ftrace: Fix address for jmp mode in t_show() tracing: Fix UBSAN warning in __remove_instance() tracing: Do not register unsupported perf events MAINTAINERS: add tracepoint core-api doc files to TRACING
2025-12-19Merge tag 'net-6.19-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter and CAN. Current release - regressions: - netfilter: nf_conncount: fix leaked ct in error paths - sched: act_mirred: fix loop detection - sctp: fix potential deadlock in sctp_clone_sock() - can: fix build dependency - eth: mlx5e: do not update BQL of old txqs during channel reconfiguration Previous releases - regressions: - sched: ets: always remove class from active list before deleting it - inet: frags: flush pending skbs in fqdir_pre_exit() - netfilter: nf_nat: remove bogus direction check - mptcp: - schedule rtx timer only after pushing data - avoid deadlock on fallback while reinjecting - can: gs_usb: fix error handling - eth: - mlx5e: - avoid unregistering PSP twice - fix double unregister of HCA_PORTS component - bnxt_en: fix XDP_TX path - mlxsw: fix use-after-free when updating multicast route stats Previous releases - always broken: - ethtool: avoid overflowing userspace buffer on stats query - openvswitch: fix middle attribute validation in push_nsh() action - eth: - mlx5: fw_tracer, validate format string parameters - mlxsw: spectrum_router: fix neighbour use-after-free - ipvlan: ignore PACKET_LOOPBACK in handle_mode_l2() Misc: - Jozsef Kadlecsik retires from maintaining netfilter - tools: ynl: fix build on systems with old kernel headers" * tag 'net-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits) net: hns3: add VLAN id validation before using net: hns3: using the num_tqps to check whether tqp_index is out of range when vf get ring info from mbx net: hns3: using the num_tqps in the vf driver to apply for resources net: enetc: do not transmit redirected XDP frames when the link is down selftests/tc-testing: Test case exercising potential mirred redirect deadlock net/sched: act_mirred: fix loop detection sctp: Clear inet_opt in sctp_v6_copy_ip_options(). sctp: Fetch inet6_sk() after setting ->pinet6 in sctp_clone_sock(). net/handshake: duplicate handshake cancellations leak socket net/mlx5e: Don't include PSP in the hard MTU calculations net/mlx5e: Do not update BQL of old txqs during channel reconfiguration net/mlx5e: Trigger neighbor resolution for unresolved destinations net/mlx5e: Use ip6_dst_lookup instead of ipv6_dst_lookup_flow for MAC init net/mlx5: Serialize firmware reset with devlink net/mlx5: fw_tracer, Handle escaped percent properly net/mlx5: fw_tracer, Validate format string parameters net/mlx5: Drain firmware reset in shutdown callback net/mlx5: fw reset, clear reset requested on drain_fw_reset net: dsa: mxl-gsw1xx: manually clear RANEG bit net: dsa: mxl-gsw1xx: fix .shutdown driver operation ...
2025-12-18cpuset: remove v1-specific code from generate_sched_domainsChen Ridong
Following the introduction of cpuset1_generate_sched_domains() for v1 in the previous patch, v1-specific logic can now be removed from the generic generate_sched_domains(). This patch cleans up the v1-only code and ensures uf_node is only visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: separate generate_sched_domains for v1 and v2Chen Ridong
The generate_sched_domains() function currently handles both v1 and v2 logic. However, the underlying mechanisms for building scheduler domains differ significantly between the two versions. For cpuset v2, scheduler domains are straightforwardly derived from valid partitions, whereas cpuset v1 employs a more complex union-find algorithm to merge overlapping cpusets. Co-locating these implementations complicates maintenance. This patch, along with subsequent ones, aims to separate the v1 and v2 logic. For ease of review, this patch first copies the generate_sched_domains() function into cpuset-v1.c as cpuset1_generate_sched_domains() and removes v2-specific code. Common helpers and top_cpuset are declared in cpuset-internal.h. When operating in v1 mode, the code now calls cpuset1_generate_sched_domains(). Currently there is some code duplication, which will be largely eliminated once v1-specific code is removed from v2 in the following patch. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: move update_domain_attr_tree to cpuset_v1.cChen Ridong
Since relax_domain_level is only applicable to v1, move update_domain_attr_tree() to cpuset-v1.c, which solely updates relax_domain_level, Additionally, relax_domain_level is now initialized in cpuset1_inited. Accordingly, the initialization of relax_domain_level in top_cpuset is removed. The unnecessary remote_partition initialization in top_cpuset is also cleaned up. As a result, relax_domain_level can be defined in cpuset only when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add cpuset1_init helper for v1 initializationChen Ridong
This patch introduces the cpuset1_init helper in cpuset_v1.c to initialize v1-specific fields, including the fmeter and relax_domain_level members. The relax_domain_level related code will be moved to cpuset_v1.c in a subsequent patch. After this move, v1-specific members will only be visible when CONFIG_CPUSETS_V1=y. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add cpuset1_online_css helper for v1-specific operationsChen Ridong
This commit introduces the cpuset1_online_css helper to centralize v1-specific handling during cpuset online. It performs operations such as updating the CS_SPREAD_PAGE, CS_SPREAD_SLAB, and CGRP_CPUSET_CLONE_CHILDREN flags, which are unique to the cpuset v1 control group interface. The helper is now placed in cpuset-v1.c to maintain clear separation between v1 and v2 logic. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: add lockdep_assert_cpuset_lock_held helperChen Ridong
Add lockdep_assert_cpuset_lock_held() to allow other subsystems to verify that cpuset_mutex is held. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: fix warning when disabling remote partitionChen Ridong
A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: f62a5d39368e ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18test-ww_mutex: Allow test to be run (and re-run) from userlandJohn Stultz
In cases where the ww_mutex test was occasionally tripping on hard to find issues, leaving qemu in a reboot loop was my best way to reproduce problems. These reboots however wasted time when I just wanted to run the test-ww_mutex logic. So tweak the test-ww_mutex test so that it can be re-triggered via a sysfs file, so the test can be run repeatedly without doing module loads or restarting. This has been particularly valuable to stressing and finding issues with the proxy-exec series. To use, run as root: echo 1 > /sys/kernel/test_ww_mutex/run_tests Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-4-jstultz@google.com
2025-12-18test-ww_mutex: Move work to its own UNBOUND workqueueJohn Stultz
The test-ww_mutex test already allocates its own workqueue so be sure to use it for the mtx.work and abba.work rather then the default system workqueue. This resolves numerous messages of the sort: "workqueue: test_abba_work hogged CPU... consider switching to WQ_UNBOUND" "workqueue: test_mutex_work hogged CPU... consider switching to WQ_UNBOUND" Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-3-jstultz@google.com
2025-12-18test-ww_mutex: Extend ww_mutex tests to test both classes of ww_mutexesJohn Stultz
Currently the test-ww_mutex tool only utilizes the wait-die class of ww_mutexes, and thus isn't very helpful in exercising the wait-wound class of ww_mutexes. So extend the test to exercise both classes of ww_mutexes for all of the subtests. Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251205013515.759030-2-jstultz@google.com
2025-12-17ftrace: Fix address for jmp mode in t_show()Menglong Dong
The address from ftrace_find_rec_direct() is printed directly in t_show(). This can mislead symbol offsets if it has the "jmp" bit in the last bit. Fix this by printing the address that returned by ftrace_jmp_get(). Link: https://patch.msgid.link/20251217030053.80343-1-dongml2@chinatelecom.cn Fixes: 25e4e3565d45 ("ftrace: Introduce FTRACE_OPS_FL_JMP") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>