linux-stable.git - Linux kernel stable tree

Age	Commit message	Author
2026-05-26	Merge tag 'mm-hotfixes-stable-2026-05-25-16-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address post-7.1 issues or aren't considered suitable for backporting. All patches are singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "mm: introduce a new page type for page pool in page type" mm/vmalloc: do not trigger BUG() on BH disabled context MAINTAINERS, mailmap: change email for Eugen Hristev mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page kernel/fork: validate exit_signal in kernel_clone() mm: memcontrol: propagate NMI slab stats to memcg vmstats mm/damon/sysfs-schemes: delete tried region in regions_rmdirs() mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one zram: fix use-after-free in zram_writeback_endio memfd: deny writeable mappings when implying SEAL_WRITE ipc: limit next_id allocation to the valid ID range Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare" MAINTAINERS: .mailmap: update after GEHC spin-off
2026-05-26	genirq/proc: Speed up /proc/interrupts iteration	Thomas Gleixner
	Reading /proc/interrupts iterates over the interrupt number space one by one and looks up the descriptors one by one. That's just a waste of time. When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and cache the descriptor pointer efficiently for the sequence file operations. Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and leave the fs/proc/ variant for the legacy architectures which ignore generic code. This reduces the time wasted for looking up the next record significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194932.165280601@kernel.org
2026-05-26	genirq/proc: Runtime size the chip name	Thomas Gleixner
	The chip name column in the /proc/interrupts output is 8 characters and right aligned, which causes visual clutter due to the fixed length and the alignment. Many interrupt chips, e.g. PCI/MSI[X] have way longer names. Update the length when a chip is assigned to an interrupt and utilize this information for the output. Align it left so all chip names start at the beginning of the column. Update the GDB script as well and disentangle the header maze so it actually works with all .config combinations. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194932.085786035@kernel.org
2026-05-26	genirq: Expose irq_find_desc_at_or_after() in core code	Thomas Gleixner
	... in preparation for a smarter iterator for /proc/interrupts. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194932.005787611@kernel.org
2026-05-26	genirq: Add rcuref count to struct irq_desc	Thomas Gleixner
	Prepare for a smarter iterator for /proc/interrupts so that the next interrupt descriptor can be cached after lookup. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194931.917415190@kernel.org
2026-05-26	genirq/proc: Increase default interrupt number precision to four	Thomas Gleixner
	Quite some architectures have four character wide acronyms for architecture specific interrupts like IPI, NMI, etc. The default precision of printing the Linux device interrupt numbers is three, which causes quite some code to play games with adding or omitting space after the acronym and the colon in order to keep the per CPU numbers properly aligned. Increase the default number precision to four in the core code and get rid of the space games all over the place. At the same time align all architecture specific descriptor texts left so that they show up in the same column as the interrupt chip names, which makes the output more uniform across architectures. Fix up the GDB script to this new scheme as well. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260517194931.839482411@kernel.org
2026-05-26	genirq: Calculate precision only when required	Thomas Gleixner
	Calculating the precision of the interrupt number column on every initial show_interrupts() invocation is a pointless exercise as the underlying maximum number of interrupts rarely changes. Calculate it only when that number is modified and let show_interrupts() use the cached value. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Radu Rendec <radu@rendec.net> Link: https://patch.msgid.link/20260517194931.760664517@kernel.org
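The caching idea in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: the names `set_total_irqs()`, `irq_number_precision()` and `calc_precision()` are hypothetical, and the minimum precision of three and the cap of ten digits are choices made for the sketch, not values taken from the patch.

```c
#include <assert.h>

/* Hypothetical cached state: updated only when the IRQ count changes. */
static unsigned int total_irqs_sketch = 1;
static int cached_precision = 3;

/* Number of decimal digits needed for the column, at least 3, at most 10. */
static int calc_precision(unsigned int nr)
{
	unsigned long long limit = 1000;
	int prec = 3;

	while (prec < 10 && limit <= nr) {
		limit *= 10;
		prec++;
	}
	return prec;
}

/* Slow path: called when the maximum number of interrupts is modified. */
void set_total_irqs(unsigned int nr)
{
	assert(nr > 0);
	total_irqs_sketch = nr;
	cached_precision = calc_precision(nr);
}

/* Fast path: the show routine only reads the cached value. */
int irq_number_precision(void)
{
	return cached_precision;
}
```

The point of the split is that the loop runs once per change of the interrupt count rather than once per read of the file.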
2026-05-26	genirq: Cache the condition for /proc/interrupts exposure	Thomas Gleixner
	show_interrupts() evaluates a boatload of conditions to establish whether it should expose an interrupt in /proc/interrupts or not. That can be simplified by caching the condition in an internal status flag, which is updated when one of the relevant inputs changes. The irq_desc::kstat_irq check is dropped because visible interrupt descriptors always have a valid pointer. As a result the number of instructions and branches for reading /proc/interrupts is reduced significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Radu Rendec <radu@rendec.net> Link: https://patch.msgid.link/20260517194931.680943749@kernel.org
2026-05-26	genirq/manage: Make NMI cleanup RT safe	Thomas Gleixner
	Eventually blocking functions cannot be invoked with interrupts disabled and a raw spin lock held. Restructure the code so this happens outside of the descriptor lock held region. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Link: https://patch.msgid.link/20260517194931.601972758@kernel.org
2026-05-26	genirq: Expose nr_irqs in core code	Thomas Gleixner
	... to avoid function calls in the core code to retrieve the maximum number of interrupts. Rename it to 'total_nr_irqs' as 'nr_irqs' is too generic and fix up the 'nr_irqs' reference in the related GDB script as well. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Radu Rendec <radu@rendec.net> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260517194931.522168332@kernel.org
2026-05-26	genirq/proc: Utilize irq_desc::tot_count to avoid evaluation	Thomas Gleixner
	Interrupts which are not marked per CPU increment not only the per CPU statistics, but also the accumulation counter irq_desc::tot_count. Change the counter to type unsigned long so it does not produce sporadic zeros due to wrap arounds on 64-bit machines and do a quick check for non per CPU interrupts. If the counter is zero, then simply emit a full set of zero strings. That spares the evaluation of the per CPU counters completely for interrupts with zero events. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Radu Rendec <radu@rendec.net> Link: https://patch.msgid.link/20260517194931.115522199@kernel.org
2026-05-26	genirq/proc: Avoid formatting zero counts in /proc/interrupts	Thomas Gleixner
	A large portion of interrupt count entries are zero. There is no point in formatting the zero value as it is way cheaper to just emit a constant string. Collect the number of consecutive zero counts and emit them in one go before a non-zero count and at the end of the line. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Reviewed-by: Radu Rendec <radu@rendec.net> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260517194931.034728540@kernel.org
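A minimal userspace sketch of the zero-run technique described above, not the kernel implementation. The field layout (a count right-aligned in ten characters followed by a space) and the names `format_counts()` and `flush_zeros()` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed field layout: "%10lu " for every per-CPU count. */
static const char zero_field[] = "         0 ";
#define ZERO_FIELD_LEN (sizeof(zero_field) - 1)
/* Worst case for one field: 20 digits of a 64-bit value plus the space. */
#define MAX_FIELD_LEN 21

/* Copy 'run' constant zero fields into buf at offset off. */
static size_t flush_zeros(char *buf, size_t off, unsigned int run)
{
	while (run--) {
		memcpy(buf + off, zero_field, ZERO_FIELD_LEN);
		off += ZERO_FIELD_LEN;
	}
	return off;
}

/* Format n counts into buf; returns the string length. */
size_t format_counts(const unsigned long *counts, size_t n,
		     char *buf, size_t len)
{
	unsigned int zeros = 0;
	size_t off = 0;

	/* The caller must provide room for the worst case plus the NUL. */
	assert(len >= n * MAX_FIELD_LEN + 1);

	for (size_t i = 0; i < n; i++) {
		if (counts[i] == 0) {
			zeros++;	/* only count, defer the output */
			continue;
		}
		/* Emit the pending zero run, then format the real value. */
		off = flush_zeros(buf, off, zeros);
		zeros = 0;
		off += (size_t)snprintf(buf + off, len - off, "%10lu ",
					counts[i]);
	}
	/* Trailing zeros at the end of the line. */
	off = flush_zeros(buf, off, zeros);
	buf[off] = '\0';
	return off;
}
```

Only non-zero values go through the number formatter; every zero costs a fixed-size copy.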
2026-05-26	PM: hibernate: Use flexible array for CRC uncompressed buffers	Rosen Penev
	The CRC uncompressed buffer pointer array has the same lifetime as struct crc_data, but it is currently allocated separately. That adds another allocation failure path and a matching cleanup branch without providing any extra flexibility. Store the pointer array as a flexible array member and allocate it together with the crc_data using kzalloc_flex(). The array remains zero-initialized, while the allocation and error handling become simpler. Assisted-by: Codex:GPT-5.5 Signed-off-by: Rosen Penev <rosenp@gmail.com> Link: https://patch.msgid.link/20260510213948.41750-1-rosenp@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
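The single-allocation pattern can be sketched in userspace C as follows. This is an illustration only: `struct crc_data_sketch`, `crc_data_alloc()` and `crc_slot()` are hypothetical names, and `calloc()` stands in for the kernel allocator named in the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the structure described in the commit. */
struct crc_data_sketch {
	size_t nr;		/* number of buffer pointers */
	unsigned char *unc[];	/* flexible array member, same lifetime */
};

/* One zeroed allocation covers the header and the pointer array. */
struct crc_data_sketch *crc_data_alloc(size_t nr)
{
	struct crc_data_sketch *crc;

	/* Reject sizes whose byte count would overflow size_t. */
	if (nr > (SIZE_MAX - sizeof(*crc)) / sizeof(crc->unc[0]))
		return NULL;

	crc = calloc(1, sizeof(*crc) + nr * sizeof(crc->unc[0]));
	if (crc)
		crc->nr = nr;
	return crc;
}

/* Bounds-checked access to one slot of the pointer array. */
unsigned char **crc_slot(struct crc_data_sketch *crc, size_t i)
{
	assert(i < crc->nr);
	return &crc->unc[i];
}
```

With the array embedded, there is one failure path on allocation and one `free()` on teardown, which is the simplification the commit message describes.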
2026-05-26	sched: Remove sched_class::pick_next_task()	Peter Zijlstra
	The reason for pick_next_task_fair() is the put/set optimization that avoids touching the common ancestors. However, it is possible to implement this in the put_prev_task() and set_next_task() calls as used in put_prev_set_next_task(). Notably, put_prev_set_next_task() is the only site that: - calls put_prev_task() with a .next argument; - calls set_next_task() with .first = true. This means that put_prev_task() can determine the common hierarchy and stop there, and then set_next_task() can terminate where put_prev_task stopped. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260511120628.057634261@infradead.org
2026-05-26	sched/fair: Add newidle balance to pick_task_fair()	Peter Zijlstra
	With commit 50653216e4ff ("sched: Add support to pick functions to take rf") removing the balance callback, the pick_task() callback is in charge of newidle balancing. This means pick_task_fair() should do so too. This hasn't been a problem in practise because pick_next_task_fair() is used. However, since we'll be removing that one shortly, make sure pick_next_task() is up to scratch. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://patch.msgid.link/20260511120627.944705718@infradead.org
2026-05-26	sched/debug: Collapse subsequent CONFIG_SCHED_CLASS_EXT sections	Peter Zijlstra
	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.281160085@infradead.org
2026-05-26	sched: Use {READ,WRITE}_ONCE() for preempt_dynamic_mode	Peter Zijlstra
	Robots figured out you can read and write this concurrently and got 'upset'. Gemini even noted sched_dynamic_show() can generate 'confusing' output if it observed different values during the printing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.176946327@infradead.org
2026-05-26	sched/debug: Use char * instead of char (*)[]	Peter Zijlstra
	Some of the fancy AI robots are getting 'upset'. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260511120627.065013766@infradead.org
2026-05-26	sched/fair: Fix RCU usage in NOHZ exit path on CPU offline	Andrea Righi
	Commit c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") removed the rcu_read_lock()/unlock() pair from set_cpu_sd_state_busy() and set_cpu_sd_state_idle() on the assumption that all callers run in a safe context for rcu_dereference_all(): IRQs disabled or cpus_write_lock() held.
	That assumption is wrong for the CPU hotplug teardown path. When CPUs are taken offline, set_cpu_sd_state_busy() is invoked via:
	  cpuhp/N kthread
	    cpuhp_thread_fun()
	    cpuhp_invoke_callback()
	    sched_cpu_deactivate()
	    nohz_balance_exit_idle()
	    set_cpu_sd_state_busy()
	    rcu_dereference_all(per_cpu(sd_llc, cpu))
	The cpuhp kthread holds cpu_hotplug_lock (percpu-rwsem) but runs with preemption and IRQs enabled. As a result, lockdep correctly reports a suspicious RCU usage on CPU offline, e.g.:
	  # echo 0 > /sys/devices/system/cpu/cpu1/online
	  =============================
	  WARNING: suspicious RCU usage
	  -----------------------------
	  kernel/sched/fair.c:12793 suspicious rcu_dereference_check() usage!
	  ...
	  2 locks held by cpuhp/1/20:
	  #0: (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x42/0x1ae
	  #1: (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x72/0x1ae
	  Call Trace:
	  lockdep_rcu_suspicious
	  nohz_balance_exit_idle
	  sched_cpu_deactivate
	  cpuhp_invoke_callback
	  cpuhp_thread_fun
	  smpboot_thread_fn
	Fix this by adding RCU read lock coverage to the one caller that lacks it: nohz_balance_exit_idle() in the CPU hotplug teardown. The other callers (nohz_balancer_kick() and nohz_balance_enter_idle()) genuinely run with IRQs disabled, so they remain unchanged.
Fixes: c9d93a73ce87 ("sched/fair: Drop redundant RCU read lock in NOHZ kick path") Closes: https://lore.kernel.org/all/38fe0a1d-1a48-435a-910a-c278024d9ac9@samsung.com/ Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260522092523.2046095-1-arighi@nvidia.com
2026-05-26	PM: hibernate: make LZ4 available for hibernation compression	l1rox3
	Without this, CRYPTO_LZ4 had to be manually enabled in the config to use LZ4 for hibernation compression. Add the select so it gets pulled in automatically when hibernation is enabled, just like CRYPTO_LZO already does. Tested-by: l1rox3 <l1rox3.developer@gmail.com> Signed-off-by: l1rox3 <l1rox3.developer@gmail.com> Link: https://patch.msgid.link/20260520081254.13493-1-l1rox3.developer@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-05-26	exec_state: relocate dumpable information	Christian Brauner (Amutable)
	The dumpable flag captured at execve() is consulted by __ptrace_may_access() and several /proc owner / visibility checks. It lives on mm_struct today, which exit_mm() clears from the task long before the task itself is reaped. exec_state is anchored to the execve() that established the current privilege domain. CLONE_VM siblings refcount-share the parent's exec_state via copy_exec_state(); non-CLONE_VM clones allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns reference via task_exec_state_copy(). execve() allocates a fresh instance (via alloc_task_exec_state() in begin_new_exec()) and installs it under task_lock + exec_update_lock with task_exec_state_replace(). init_task uses a static instance. The dumpable mode now lives on task->exec_state->dumpable. task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit positions remain stable for the /proc/<pid>/coredump_filter ABI. The task->user_dumpable cache bit and its assignment in exit_mm() are removed; readers go through get_dumpable(task) directly. coredump_params gains a snapshot field cprm.dumpable, populated from get_dumpable(current) at vfs_coredump() entry, replacing the previous __get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and fs/pidfs.c. The user namespace recorded at execve() is consulted by __ptrace_may_access() and by /proc/PID/* owner derivation. Move the captured user_ns onto task_exec_state, which stays attached to the task past exit_mm() and across exit_files(). bprm grows a user_ns field staged in bprm_mm_init() with the caller's user_ns, narrowed by would_dump() to the closest privileged ancestor, and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns). free_bprm() releases the staging reference. mm_struct loses ->user_ns entirely. 
Initializers in init-mm, efi_mm, and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed; __mmdrop() drops the matching put_user_ns(). The kthread_use_mm() WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too. Reviewed-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26	ptrace: add ptracer_access_allowed()	Christian Brauner (Amutable)
	Add a helper that encapsulates all of the logic for checking ptrace access and remove open-coded versions in follow-up patches. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: David Hildenbrand (arm) <david@kernel.org> Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-3-69f895bc1385@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26	exec: introduce struct task_exec_state	Christian Brauner (Amutable)
	Introduce struct task_exec_state, a per-task RCU-protected structure that holds the dumpable mode and the user namespace and stays attached to the task for its full lifetime. task_exec_state_rcu() is the canonical reader: asserts RCU or task_lock is held, WARNs on a NULL state, returns the rcu_dereference()'d pointer. Reviewed-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-2-69f895bc1385@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26	sched/coredump: introduce enum task_dumpable	Christian Brauner (Amutable)
	Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with enum task_dumpable. Numeric values are preserved (kernel.suid_dumpable sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with no behavioral change. Subsequent commits relocate dumpability onto a per-task structure where the enum type will allow stronger type-checking on the new API. Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: David Hildenbrand (arm) <david@kernel.org> Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-1-69f895bc1385@kernel.org Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26	kho: fix order calculation for kho_unpreserve_pages()	Pratyush Yadav (Google)
	Commit 91e74fa8b1bc ("kho: make sure preservations do not span multiple NUMA nodes") made sure preservations from kho_preserve_pages() do not span multiple NUMA nodes. If they do, the order is reduced and tried again. The same logic was not implemented for kho_unpreserve_pages(). This can result in unpreserve calculating a different order than preserve, and thus not actually unpreserving the pages. Fix this by moving the order calculation logic to __kho_preserve_pages_order() and use it from both preserve and unpreserve paths. Move __kho_unpreserve() down to avoid having a forward declaration. Its users are further down in the file anyway. Also, it results in grouping for all the page-level preservation and unpreservation functions. This unfortunately makes the diff hard to read, but the main change in __kho_unpreserve() is to call __kho_preserve_pages_order() instead of open-coding the order calculation. Fixes: 91e74fa8b1bc ("kho: make sure preservations do not span multiple NUMA nodes") Cc: stable@vger.kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://patch.msgid.link/20260519133332.2498092-1-pratyush@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-05-25	sched_ext: Convert ops.set_cmask() to arena-resident cmask	Tejun Heo
	ops_cid.set_cmask() expects a cmask. The kernel couldn't write into the arena, so it translated cpumask -> cmask in kernel memory and passed the result as a trusted pointer. The BPF cmask helpers all operate on arena cmasks though, so the BPF side had to word-by-word probe-read the kernel cmask into an arena cmask via cmask_copy_from_kernel() before any helper could touch it. It works, but is clumsy. With direct kernel-side arena access now in place, build the cmask in the arena. The kernel writes to it through the kern_va side of the dual mapping. BPF directly dereferences it via an __arena pointer like any other arena struct. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
2026-05-25	sched_ext: Sub-allocator over kernel-claimed BPF arena pages	Tejun Heo
	Build a per-scheduler sub-allocator on top of pages claimed from the BPF arena registered in the previous patch. Subsequent kernel-managed arena-resident structures (e.g. per-CPU set_cmask cmask) carve their storage from this pool. scx_arena_pool_init() creates a gen_pool. scx_arena_alloc() returns the kernel VA. On exhaustion, the pool grows by claiming more pages via bpf_arena_alloc_pages_sleepable(). Chunks are added at the kernel-side mapping address. Callers translate to the BPF-arena form themselves if needed. Allocations sleep (GFP_KERNEL) - they may grow the pool through vzalloc and arena page allocation. All current consumers run from the enable path (after ops.init() and the kernel-side arena auto-discovery, before validate_ops()), where sleeping is fine. scx_arena_pool_destroy() walks each chunk, returns outstanding ranges to the gen_pool with gen_pool_free() and then calls gen_pool_destroy(). The underlying arena pages are released when the arena map itself is torn down, so the pool destroy doesn't free them explicitly. v2: Switch scx_arena_alloc() to a loop. (Andrea) Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-05-25	sched_ext: Require an arena for cid-form schedulers	Tejun Heo
	Upcoming patches will let the kernel place arena-resident scratch shared with the BPF program (e.g. per-CPU set_cmask cmask) so the BPF side can dereference it directly via __arena pointers, replacing the current cmask_copy_from_kernel() probe-read loop. That requires each cid-form scheduler to expose its arena to the kernel. Kernel- side accesses are recovered by the per-arena scratch-page mechanism. bpf_scx_reg_cid() walks the struct_ops member progs via bpf_struct_ops_for_each_prog() and reads each prog's arena via bpf_prog_arena(). The verifier enforces one arena per program, so each member prog contributes at most one arena. All non-NULL contributions must match and at least one member prog must use an arena. The map ref is held on scx_sched and dropped on sched destroy. cpu-form schedulers (bpf_scx_reg) are unchanged - no arena requirement. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-25	Merge branch 'arena_direct_access' of ↵	Tejun Heo
	git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next into for-7.2
2026-05-25	Merge branch 'arena_direct_access'	Alexei Starovoitov
	Tejun Heo says: ==================== This makes BPF arena memory directly dereferenceable from kernel code (struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page that an arch fault hook installs into empty PTEs on kernel-side faults, after KFENCE. The faulting instruction retries and the violation is reported through the program's BPF stream. v4: - Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in inline comments on both x86 and arm64. (Andrea) - Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a new include/linux/bpf_defs.h. (lkp) - Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on pool growth. (Andrea) - Picked up Reviewed-by tags from Emil and Andrea. v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org Motivation ---------- sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask . The kernel translates a kernel cpumask to a cmask, but it had no way to write into the arena, so the cmask lived in kernel memory and was passed as a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so the BPF side had to word-by-word probe-read the kernel cmask into an arena cmask via cmask_copy_from_kernel() before any helper could touch it. It works, but is clumsy. The shape isn't unique to set_cmask. Sub-scheduler support is on the way and more sched_ext callbacks will want to pass structured data to BPF. Anywhere a kfunc or struct_ops callback wants to hand a struct to a BPF program, arena residence is the natural answer. Approach -------- Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as today - PTEs are populated only for allocated pages. A new arch fault hook (bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64 __do_kernel_fault(), after KFENCE. 
When a kernel-side access faults inside an arena's kern_vm range, the helper walks the stack to find the BPF program responsible, range-checks the fault address against prog->aux->arena, and atomically installs the scratch page into the empty PTE via the new ptep_try_set() wrapper. The kernel instruction retries and reads/writes the scratch page. Free paths and map destruction treat scratch as non-owned. Real allocation refuses to overwrite scratch (apply_range_set_cb returns -EBUSY). A scratched address stays dead until map destroy, since its presence means the BPF program has already malfunctioned. The mechanism is default behavior - no UAPI flag. What this preserves ------------------- All the debugging properties of today's sparse-PTE design are preserved: BPF programs still fault on unmapped arena accesses. The fault semantics (instruction retry with rdst = 0) and the violation report through bpf_streams are unchanged for prog-side accesses. * The first kernel-side touch of an unmapped address is reported via bpf_streams the same way as a prog-side fault, with the stack walk attributing it to the originating prog. * User-side fault on a never-scratched address still lazy-allocates a real page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a scratched address SIGSEGVs. What changes for the kernel-side caller is just that an unmapped deref no longer oopses - it retries through the scratch page and emits a violation report. The same shape today's BPF instruction faults have. 
Patches 1-2 (atomic PTE install + arena scratch-page recovery) -------------------------------------------------------------- mm: Add ptep_try_set() for lockless empty-slot installs bpf: Recover arena kernel faults with scratch page Patches 3-5 (helpers used by struct_ops registration) ----------------------------------------------------- bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers bpf: Add bpf_struct_ops_for_each_prog() bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena() ==================== Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-25	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc5	Alexei Starovoitov
	Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-25	Merge tag 'v7.1-rc5' into driver-core-next	Danilo Krummrich
	We need the driver-core fixes in here as well to build on top of. Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-24	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Alexei Starovoitov: - Fix bpf_throw() and global subprog combination (Kumar Kartikeya Dwivedi) - Fix out of bounds access in BPF interpreter (Yazhou Tang) - Fix potential out of bounds access in inner per-cpu array map (Guannan Wang) - Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: libbpf: fix off-by-one in emit_signature_match jump offset bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature selftests/bpf: Cover global subprog exception leaks bpf: Check global subprog exception paths bpf: make bpf_session_is_return() reference optional bpf: Use array_map_meta_equal for percpu array inner map replacement selftests/bpf: Add test for large offset bpf-to-bpf call bpf: Fix s16 truncation for large bpf-to-bpf call offsets bpf: Fix out-of-bounds read in bpf_patch_call_args()
2026-05-24	rcu-tasks: Fix possible boot-time tests failed for the call_rcu_tasks()	Zqiang
	The following scenario can cause the call_rcu_tasks() boot-time tests to fail: CPU0 CPU1 rcu_init_tasks_generic() ->rcu_tasks_initiate_self_tests() ->call_rcu_tasks_trace(&tests[1].rh, test_rcu_tasks_callback) ->call_rcu_tasks_generic() ->havekthread = smp_load_acquire(&rtp->kthread_ptr) "The havekthread is false" .... rcu_tasks_kthread() ->smp_store_release(&rtp->kthread_ptr, current) ->rcu_tasks_one_gp() ->rcuwait_wait_event() ->rcu_tasks_need_gpcb() ->for (cpu = 0; cpu < dequeue_limit; cpu++) ->rcu_segcblist_n_cbs(&rtpcp->cblist) == 0 ->schedule() ->raw_spin_trylock_rcu_node() ->needwake = (func == wakeme_after_rcu) || (rcu_segcblist_n_cbs(&rtpcp->cblist) == rcu_task_lazy_lim) "the rcu_task_lazy_lim default value is 32, and the func pointer is test_rcu_tasks_callback, lead to needwake is false." ->if (havekthread && !needwake && !timer_pending(&rtpcp->lazy_timer)) "the havekthread is false, will not enter here." .... "the needwake is false lead to rtp_irq_work can not queue, even if the rtp->kthread_ptr already exists at this point." ->if (needwake && READ_ONCE(rtp->kthread_ptr)) ->irq_work_queue(&rtpcp->rtp_irq_work) In this scenario, if call_rcu_tasks() is not called again afterward, rcu_tasks_kthread never gets a chance to be woken up and test_rcu_tasks_callback() is never invoked, so the boot-time tests can fail. This commit therefore checks the havekthread variable: if it is false and rtpcp->cblist is empty, needwake is set to true, so that if rtp->kthread_ptr exists, rtpcp->rtp_irq_work can be queued to wake up rcu_tasks_kthread. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2026-05-24	rcu: Latch normal synchronize_rcu() path on flood	Uladzislau Rezki (Sony)
	Currently, rcu_normal_wake_from_gp is only enabled by default on small systems(<= 16 CPUs) or when a user explicitly set it enabled. Introduce an adaptive latching mechanism: * Track the number of in-flight synchronize_rcu() requests using a new rcu_sr_normal_count counter; * If the count reaches/exceeds RCU_SR_NORMAL_LATCH_THR(64), it sets the rcu_sr_normal_latched, reverting new requests onto the scaled wait_rcu_gp() path; * The latch is cleared only when the pending requests are fully drained(nr == 0); * Enables rcu_normal_wake_from_gp by default for all systems, relying on this dynamic throttling instead of static CPU limits. Testing(synthetic flood workload): * Kernel version: 6.19.0-rc6 * Number of CPUs: 1536 * 60K concurrent synchronize_rcu() calls Perf(cycles, system-wide): total cycles: 932020263832 rcu_sr_normal_add_req(): 2650282811 cycles(~0.28%) Perf report excerpt: 0.01% 0.01% sync_test/... [k] rcu_sr_normal_add_req Measured overhead of rcu_sr_normal_add_req() remained ~0.28% of total CPU cycles in this synthetic stress test. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Tested-by: Samir M <samir@linux.ibm.com> Suggested-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
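The latching scheme listed above can be sketched with C11 atomics. This is a simplified userspace model, not the kernel code: the function names are hypothetical, the assumption that only fast-path requests are counted is made for the sketch, and the window between reading the counter and updating the latch is ignored here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Threshold value quoted in the commit message. */
#define SR_NORMAL_LATCH_THR 64

static atomic_long sr_normal_count;	/* in-flight fast-path requests */
static atomic_bool sr_normal_latched;	/* true: divert new requests */

bool sr_normal_is_latched(void)
{
	return atomic_load(&sr_normal_latched);
}

/*
 * Returns true if the caller may use the fast path and must later call
 * sr_normal_exit(); returns false if it should use the fallback path.
 */
bool sr_normal_enter(void)
{
	long nr;

	if (sr_normal_is_latched())
		return false;

	nr = atomic_fetch_add(&sr_normal_count, 1) + 1;
	if (nr >= SR_NORMAL_LATCH_THR)
		atomic_store(&sr_normal_latched, true);
	return true;
}

/* Called when a fast-path request completes. */
void sr_normal_exit(void)
{
	long nr = atomic_fetch_sub(&sr_normal_count, 1) - 1;

	assert(nr >= 0);
	/* The latch is released only once everything has drained. */
	if (nr == 0)
		atomic_store(&sr_normal_latched, false);
}
```

Clearing the latch only at zero, rather than just below the threshold, gives the mechanism hysteresis, so a sustained flood does not flip it on and off around the threshold.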
2026-05-24	rcu: Simplify param_set_next_fqs_jiffies() by applying clamp_val()	Paul E. McKenney
	This commit replaces a nested ?: sequence with clamp_val(). This does not reduce the number of lines of code, but it does simplify the line that it modifies. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2026-05-24	rcu: Simplify rcu_do_batch() by applying clamp()	Paul E. McKenney
	This commit replaces a nested ?: sequence with clamp(). This does not reduce the number of lines of code, but it does simplify the line that it modifies. Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
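For readers unfamiliar with the pattern, the two clamp-related commits above replace code shaped like the first function below with something shaped like the second. This is a generic sketch: the helper names are hypothetical, and `clamp_long()` is a simplified stand-in for the kernel's type-checked clamp() and clamp_val() macros, not their definition.

```c
/* Nested ?: form: correct, but the bounds logic is hard to scan. */
static long clamp_nested(long val, long lo, long hi)
{
	return val < lo ? lo : (val > hi ? hi : val);
}

static long min_long(long a, long b) { return a < b ? a : b; }
static long max_long(long a, long b) { return a > b ? a : b; }

/* Clamp form: raise val to at least lo, then cap it at hi. */
static long clamp_long(long val, long lo, long hi)
{
	return min_long(max_long(val, lo), hi);
}
```

Both functions return the same result whenever `lo <= hi`; the second states the intent (a value bounded by a range) in a single call.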
2026-05-24	torture: Add torture_sched_set_normal() for user-specified nice values	Paul E. McKenney
	This new torture_sched_set_normal() function clamps the nice value at the MIN_NICE..MAX_NICE limits, splatting if these limits are exceeded. It then invokes sched_set_normal() to set the new value. This prevents more difficult-to-debug failures within the scheduler. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2026-05-24	rcutorture: Fully test lazy RCU	Paul E. McKenney
	Currently, rcutorture bypasses lazy RCU by using call_rcu_hurry(). This works, avoiding the dreaded rtort_pipe_count WARN(), but fails to fully test lazy RCU. The rtort_pipe_count WARN() splats because lazy RCU could delay the start of an RCU grace period for a full stutter period, which defaults to only three seconds. This commit therefore reverts the call_rcu_hurry() instances back to call_rcu(), but, in kernels built with CONFIG_RCU_LAZY=y, queues a workqueue handler just before the call to stutter_wait() in rcu_torture_writer(). This workqueue handler invokes rcu_barrier(), which motivates any lingering lazy callbacks, thus avoiding the splat. Reported-by: Saravana Kannan <saravanak@kernel.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
2026-05-23	bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()	Tejun Heo
	struct bpf_arena is opaque to callers outside arena.c. Add two helpers for struct_ops subsystems that need to reach into an arena: bpf_arena_map_kern_vm_start(struct bpf_map *map) returns @map's kern_vm_start. A sched_ext follow-up needs this to translate kern_va <-> uaddr. bpf_prog_arena(struct bpf_prog *prog) returns the bpf_map of the arena referenced by @prog (NULL if @prog references no arena). The verifier enforces at most one arena per program. Used by struct_ops callers that auto-discover an arena from a member prog and need to take a map reference. Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260522172219.1423324-6-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-23	bpf: Add bpf_struct_ops_for_each_prog()	Tejun Heo
	Add a helper that walks the member progs of the struct_ops map containing a given @kdata vmtable. struct_ops ->reg() callbacks (and similar) sometimes need to inspect the loaded BPF programs, e.g. to discover maps they reference via prog->aux->used_maps. The implementation mirrors bpf_struct_ops_id(): container_of @kdata to recover the bpf_struct_ops_map, then iterate st_map->links[i]->prog for i in [0, funcs_cnt). Same access pattern, no new locking - by the time ->reg() fires st_map is fully populated and stable. A sched_ext follow-up walks the member progs of a cid-form scheduler's struct_ops map, reads prog->aux->arena directly, and requires all member progs to reference exactly one arena, without requiring the BPF program to call a registration kfunc. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260522172219.1423324-5-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-23	bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers	Tejun Heo
	The existing kernel-side export of bpf_arena_alloc_pages is _non_sleepable only - it's used by the verifier to inline the kfunc when the call site is non-sleepable. There is no sleepable equivalent for kernel callers. The kfunc bpf_arena_alloc_pages itself is BPF-only. sched_ext needs sleepable kernel-side allocs for its arena pool init/grow paths. Add bpf_arena_alloc_pages_sleepable() mirroring the _non_sleepable wrapper but passing sleepable=true to arena_alloc_pages(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Link: https://lore.kernel.org/r/20260522172219.1423324-4-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-05-23	bpf: Recover arena kernel faults with scratch page	Kumar Kartikeya Dwivedi
	BPF arena usage is becoming more prevalent, but kernel <-> BPF communication over arena memory is awkward today. Data has to be staged through a trusted kernel pointer with extra code and copying on the BPF side. While reads through arena pointers can use a fault-safe helper, writes don't have a good solution. The in-line alternative would need instruction emulation or asm fixup labels. Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any handed-in arena pointer, without bounds checking. A per-arena scratch page is installed by the arch fault path into empty arena kernel PTEs - x86 from page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for translation faults, both after the existing exception-table and KFENCE handling. The faulting instruction retries and the access is also reported through the program's BPF stream, preserving error reporting. bpf_prog_find_from_stack() resolves the current BPF program (and its arena) from the kernel stack - no new bpf_run_ctx state is added. Recovery covers the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower half-guard is excluded because well-behaved kfuncs only access forward from arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst. The install is lock-free via ptep_try_set(). On race-loss the winning installer's PTE is already valid, so the access retry succeeds. The arena clear path uses ptep_get_and_clear() so installer and clearer race through atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped" entries just cause one extra re-fault, cheaper than a global IPI on every install. Scratch exists only to keep the kernel from oopsing on an in-line arena access. Its presence at a PTE means the BPF program has already malfunctioned, and the violation is reported through the program's BPF stream. 
The only requirement for behavior on a scratched PTE is that the kernel doesn't crash. In particular, any user-side access through such a PTE may segfault. The shared scratch page is freed once during map destruction. BPF instruction faults continue to use the existing JIT exception-table path. This patch changes only the kernel-text fault path. No UAPI flag is added. The new behavior is the default. v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David) v3: Stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL. (lkp) Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Cc: David Hildenbrand <david@kernel.org> Link: https://lore.kernel.org/r/20260522172219.1423324-3-tj@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
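The lock-free install described in this commit message ("on race-loss the winning installer's PTE is already valid") follows a common compare-and-exchange pattern. The following user-space model sketches that pattern with C11 atomics. It is an assumption-laden illustration: a page-table entry is modeled as a plain 64-bit word with 0 meaning "empty", and every name below is hypothetical rather than taken from the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PTE_NONE 0u	/* model: an empty entry is all zeroes */

/*
 * Model of a fault handler that installs a scratch entry only if the
 * slot is still empty. Returns true when the faulting access can be
 * retried, which holds whether this caller won or lost the race.
 */
static bool model_install_scratch(_Atomic uint64_t *ptep, uint64_t scratch)
{
	uint64_t expected = MODEL_PTE_NONE;

	if (atomic_compare_exchange_strong(ptep, &expected, scratch))
		return true;	/* this caller installed the scratch entry */

	/* Lost the race: expected now holds the winner's entry. */
	return expected != MODEL_PTE_NONE;
}

/*
 * Model of the clear path: fetch and clear in one atomic step so that
 * installer and clearer only ever race through atomic accessors.
 */
static uint64_t model_get_and_clear(_Atomic uint64_t *ptep)
{
	return atomic_exchange(ptep, MODEL_PTE_NONE);
}
```

A second installer in this model never overwrites an existing entry; it simply observes that the slot is populated and lets the access retry.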
2026-05-22	Merge tag 'sched_ext-for-7.1-rc4-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Spurious WARN in ops_dequeue() racing with concurrent dispatch - Self-deadlock between scheduler disable and a concurrent sub-sched enable * tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue() sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
2026-05-22	Merge tag 'cgroup-for-7.1-rc4-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "Two rstat fixes: - Out-of-bounds access in the css_rstat_updated() BPF kfunc when called with an unchecked user-supplied cpu - Over-strict NMI guard after the recent switch to try_cmpxchg left sparc and ppc64 unable to queue rstat updates from NMI" * tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: rstat: relax NMI guard after switch to try_cmpxchg cgroup/rstat: validate cpu before css_rstat_cpu() access
2026-05-22	signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()	Aleksandr Nogikh
	When a multi-threaded process receives a stop signal (e.g., SIGSTOP), do_signal_stop() sets JOBCTL_STOP_PENDING and JOBCTL_STOP_CONSUME on all threads and sets signal->group_stop_count to the number of threads. If one of the threads concurrently calls execve(), de_thread() invokes zap_other_threads() to kill all other threads. zap_other_threads() aborts the pending group stop by resetting signal->group_stop_count to 0 and clears the JOBCTL_PENDING_MASK for all other threads. However, it fails to clear the job control flags for the calling thread. When execve() completes, the calling thread returns to user mode and checks for pending signals. Seeing the stale JOBCTL_STOP_PENDING flag, it calls do_signal_stop(), which invokes task_participate_group_stop(). Since JOBCTL_STOP_CONSUME is still set, it attempts to decrement the already-zero signal->group_stop_count, triggering a warning: sig->group_stop_count == 0 WARNING: CPU: 1 PID: 6475 at kernel/signal.c:373 task_participate_group_stop+0x215/0x2d0 Call Trace: <TASK> do_signal_stop+0x3be/0x5c0 kernel/signal.c:2619 get_signal+0xa8c/0x1330 kernel/signal.c:2884 arch_do_signal_or_restart+0xbc/0x840 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop+0x8c/0x4d0 kernel/entry/common.c:98 do_syscall_64+0x33e/0xf80 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> Fix this race condition by clearing the JOBCTL_PENDING_MASK for the calling thread in zap_other_threads(), ensuring it does not retain any stale job control state after the thread group is destroyed. This aligns with other functions that tear down a thread group and abort group stops, such as zap_process() and complete_signal(), which correctly clear these flags for all threads including the current one. 
Fixes: 39efa3ef3a37 ("signal: Use GROUP_STOP_PENDING to stop once for a single group stop") Assisted-by: Gemini:gemini-3.1-pro-preview Gemini:gemini-3-flash-preview syzbot Reported-by: syzbot+b109633ea805cac54a61@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b109633ea805cac54a61 Link: https://syzkaller.appspot.com/ai_job?id=d70208cc-862b-4fe3-bf02-3031e10cd0b3 Signed-off-by: Aleksandr Nogikh <nogikh@google.com> Link: https://patch.msgid.link/20260521142240.2973022-1-nogikh@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
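The stale-flag sequence described in this commit message can be replayed in a small single-threaded model. This is a sketch only: the flag bit values, the fixed thread count, and all function names are hypothetical, and the model ignores locking and every other part of the real signal code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's actual definitions differ. */
#define M_JOBCTL_STOP_PENDING	(1u << 0)
#define M_JOBCTL_STOP_CONSUME	(1u << 1)
#define M_JOBCTL_PENDING_MASK	(M_JOBCTL_STOP_PENDING | M_JOBCTL_STOP_CONSUME)
#define M_NR_THREADS		3

struct group_model {
	unsigned int group_stop_count;
	unsigned int jobctl[M_NR_THREADS];
};

/* Model of a stop signal arriving: flag every thread, count them all. */
static void model_group_stop(struct group_model *g)
{
	for (size_t i = 0; i < M_NR_THREADS; i++)
		g->jobctl[i] |= M_JOBCTL_STOP_PENDING | M_JOBCTL_STOP_CONSUME;
	g->group_stop_count = M_NR_THREADS;
}

/*
 * Model of zap_other_threads(): abort the group stop. With
 * clear_caller == false the caller keeps its flags (the bug);
 * with clear_caller == true they are cleared too (the fix).
 */
static void model_zap_other_threads(struct group_model *g, size_t caller,
				    bool clear_caller)
{
	g->group_stop_count = 0;
	for (size_t i = 0; i < M_NR_THREADS; i++) {
		if (i == caller && !clear_caller)
			continue;
		g->jobctl[i] &= ~M_JOBCTL_PENDING_MASK;
	}
}

/* True if thread t would try to decrement an already-zero count. */
static bool model_would_warn(const struct group_model *g, size_t t)
{
	return (g->jobctl[t] & M_JOBCTL_STOP_PENDING) &&
	       (g->jobctl[t] & M_JOBCTL_STOP_CONSUME) &&
	       g->group_stop_count == 0;
}
```

Running the stop followed by the zap with `clear_caller` set to false leaves the calling thread in the warning state, while setting it to true does not.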
2026-05-22	Merge tag 'dma-mapping-7.1-2026-05-22' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: "Two minor updates for the DMA-mapping code, mainly fixing some rare corner cases (Petr Tesarik, Jianpeng Chang)" * tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma-mapping: move dma_map_resource() sanity check into debug code dma-direct: fix use of max_pfn
2026-05-22	Merge tag 'trace-v7.1-rc4' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Avoid NULL return from hist_field_name() The function hist_field_name() is directly passed to a strcat() which does not handle a NULL pointer. Return a zero length string when size is greater than the limit. This is used only to output already created histograms and no field currently is greater than the limit. But it should still not return NULL. - Do not call map->ops->elt_free() on allocation failure When elt_alloc() fails, it should not call the map->ops->elt_free() function if it exists, as that function may not be able to handle the free on allocation failures. The ->elt_free() should only be called when elt_alloc() succeeds. * tag 'trace-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Do not call map->ops->elt_free() if elt_alloc() fails tracing: Avoid NULL return from hist_field_name() on truncation
2026-05-22	dma-contiguous: add kconfig option to setup numa cma area if not configured ↵	Feng Tang
	explicitly There was a report that on a multi-NUMA-node arm64 server with the IOMMU disabled, dma_alloc_coherent() always returns memory from node 0, even for devices attached to other nodes, while the same API returns node-local DMA memory when the IOMMU is enabled. The reason is that, with the IOMMU disabled, dma_alloc_coherent() takes the direct path and calls dma_alloc_contiguous(). The system has no explicit CMA setting (such as per-NUMA CMA), only a default 64MB CMA reserved area (on node 0), from which the kernel tries to allocate first. Robin Murphy suggested setting up per-NUMA CMA or disabling CMA, which did solve the issue. There is still a concern that customers without much kernel knowledge could silently suffer from this, since some architectures enable a CMA area by default for most Linux distributions (not an issue for x86, though, which sets CONFIG_CMA_SIZE_MBYTES to 0 by default). One thought is to follow the current CMA reservation policy for platforms with 'CONFIG_DMA_NUMA_CMA=y': if NUMA CMA (either the 'numa cma' or the 'cma pernuma' method) is not explicitly configured and the platform really has multiple NUMA nodes, set it up according to the size of the default 'dma_contiguous_default_area'. This way, the default behavior of platforms with one NUMA node is unchanged (for example, embedded/small devices don't need to allocate extra memory), while general DMA locality is improved. Add a new bool kernel config CONFIG_CMA_SIZE_PERNUMA to control whether to enable it. Even when the config is enabled, the user can still disable it with a kernel command-line setting such as "numa_cma=0:0" or "cma_pernuma=0". 
Reported-by: Changrong Chen <chenchangrong.ccr@alibaba-inc.com> Suggested-by: Ying Huang <ying.huang@linux.alibaba.com> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Link: https://lore.kernel.org/r/20260512085509.83002-1-feng.tang@linux.alibaba.com Link: https://lore.kernel.org/all/20260520222742.GA1607511@ax162/ [mszyprow: squashed changes from both links, added __initdata attribute to the numa_cma_configured variable] Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2026-05-21	kernel/fork: validate exit_signal in kernel_clone()	Deepanshu Kartikey
	When a child process exits, it sends exit_signal to its parent via do_notify_parent(). The clone() syscall constructs exit_signal as: (lower_32_bits(clone_flags) & CSIGNAL) CSIGNAL is 0xff, so values in the range 65-255 are possible. However, valid_signal() only accepts signals up to _NSIG (64 on x86_64). A non-zero non-valid exit_signal acts the same as exit_signal == 0: the parent process is not signaled when the child terminates. The syzkaller reproducer triggers this by calling clone() with flags=0x80, resulting in exit_signal = (0x80 & CSIGNAL) = 128, which exceeds _NSIG and is not a valid signal. The v1 of this patch added the check only in the clone() syscall handler, which is incomplete. kernel_clone() has other callers such as sys_ia32_clone() which would remain unprotected. Move the check to kernel_clone() to cover all callers. Since the valid_signal() check is now in kernel_clone() and covers all callers including clone3(), the same check in copy_clone_args_from_user() becomes redundant and is removed. The higher 32bits check for clone3() is kept as it is clone3() specific. Note that this is a user-visible change: previously, passing an invalid exit_signal to clone() was silently accepted. The man page for clone() does not document any defined behavior for invalid exit_signal values, so rejecting them with -EINVAL is the correct behavior. It is unlikely that any sane application relies on passing an invalid exit_signal. 
[oleg@redhat.com: the comment above kernel_clone() should be updated] Link: https://lore.kernel.org/abwvgU17W8wuW2-J@redhat.com Link: https://lore.kernel.org/20260316151956.563558-1-kartikey406@gmail.com Fixes: 3f2c788a1314 ("fork: prevent accidental access to clone3 features") Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: syzbot+bbe6b99feefc3a0842de@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bbe6b99feefc3a0842de Tested-by: syzbot+bbe6b99feefc3a0842de@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/20260307064202.353405-1-kartikey406@gmail.com/T/ [v1] Link: https://lore.kernel.org/all/20260316104536.558108-1-kartikey406@gmail.com/T/ [v2] Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ben Segall <bsegall@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
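The arithmetic in this commit message can be checked with a short user-space sketch. The mask value (0xff) and the signal limit (64 on x86_64) come from the text above. The `MODEL_` constants and function names are hypothetical stand-ins, and the placement of the check only models what the commit describes, not the kernel source.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_CSIGNAL	0xffu	/* low byte of the clone flags */
#define MODEL_NSIG	64	/* _NSIG on x86_64, per the commit text */

/* Model of valid_signal(): signals up to the limit are accepted. */
static bool model_valid_signal(unsigned long sig)
{
	return sig <= MODEL_NSIG;
}

/* exit_signal as derived from the lower 32 bits of the clone flags. */
static int model_exit_signal(uint64_t clone_flags)
{
	return (int)((uint32_t)clone_flags & MODEL_CSIGNAL);
}

/* Model of the added check: reject an out-of-range exit_signal. */
static int model_check_exit_signal(uint64_t clone_flags)
{
	if (!model_valid_signal((unsigned long)model_exit_signal(clone_flags)))
		return -EINVAL;
	return 0;
}
```

With flags of 0x80, as in the syzkaller reproducer, the derived exit_signal is 128, which exceeds 64, so the model returns -EINVAL. Flags that carry an ordinary signal number in the low byte pass the check.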