linux.git - Linux kernel source tree

Age	Commit message	Author
2025-11-20	sched/mmcid: Prevent pointless work in mm_update_cpus_allowed()	Thomas Gleixner
	mm_update_cpus_allowed() is not required to be invoked for affinity changes due to migrate_disable() and migrate_enable(). migrate_disable() restricts the task temporarily to a CPU on which the task was already allowed to run, so nothing changes. migrate_enable() restores the actual task affinity mask. If that mask changed between migrate_disable() and migrate_enable() then that change was already accounted for. Move the invocation to the proper place to avoid that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.385208276@linutronix.de
2025-11-20	sched/mmcid: Move scheduler code out of global header	Thomas Gleixner
	This is only used in the scheduler core code, so there is no point to have it in a global header. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Link: https://patch.msgid.link/20251119172549.321259077@linutronix.de
2025-11-20	sched: Fixup whitespace damage	Thomas Gleixner
	With whitespace checks enabled in the editor this makes eyes bleed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.258651925@linutronix.de
2025-11-20	sched/mmcid: Use proper data structures	Thomas Gleixner
	Having a lot of CID functionality specific members in struct task_struct and struct mm_struct is not really making the code easier to read. Encapsulate the CID specific parts in data structures and keep them separate from the stuff they are embedded in. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.131573768@linutronix.de
2025-11-20	sched/mmcid: Revert the complex CID management	Thomas Gleixner
	The CID management is a complex beast, which affects both scheduling and task migration. The compaction mechanism forces random tasks of a process into task work on exit to user space, causing latency spikes. Revert to the initial simple bitmap allocation mechanics. They are known to have scalability issues, but reverting allows a replacement to be built up gradually in a reviewable way. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251119172549.068197830@linutronix.de
2025-11-20	perf: Fix 0 count issue of cpu-clock	Dapeng Mi
	Currently the cpu-clock event always returns a count of 0, e.g.:

	  perf stat -e cpu-clock -- sleep 1

	  Performance counter stats for 'sleep 1':
	                  0      cpu-clock       #    0.000 CPUs utilized
	        1.002308394 seconds time elapsed

	The root cause is that commit bc4394e5e79c ("perf: Fix the throttle error of some clock events") adds a PERF_EF_UPDATE flag check before calling cpu_clock_event_update() to update the count. However, the PERF_EF_UPDATE flag is never set when the cpu-clock event is stopped in counting mode (pmu->dev() -> cpu_clock_event_del() -> cpu_clock_event_stop()). As a result, the cpu-clock event count is never updated.

	To fix this issue, force the PERF_EF_UPDATE flag to be set for the cpu-clock event, just like task-clock does.

	Fixes: bc4394e5e79c ("perf: Fix the throttle error of some clock events") Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Link: https://patch.msgid.link/20251112080526.3971392-1-dapeng1.mi@linux.intel.com
2025-11-19	tick/sched: Fix bogus condition in report_idle_softirq()	Wen Yang
	In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from this form:

	  if (A) {
	          static int ratelimit;
	          if (ratelimit < 10 && !C && A&D) {
	                  pr_warn("NOHZ tick-stop error: ...");
	                  ratelimit++;
	          }
	          return false;
	  }

	to a new function:

	  static bool report_idle_softirq(void)
	  {
	          static int ratelimit;

	          if (likely(!A))
	                  return false;

	          if (ratelimit < 10)
	                  return false;
	          ...
	          pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n", pending);
	          ratelimit++;

	          return true;
	  }

	Commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") realized ratelimit was essentially set to zero instead of ten, and hence no softirq pending messages would ever be issued, but "fixed" it as:

	  -       if (ratelimit < 10)
	  +       if (ratelimit >= 10)
	                  return false;

	However, this fix introduced another issue: when ratelimit is greater than or equal to 10, the function returns false directly, even if A is true. In the original code, ratelimit was only used to control printing and did not affect the return value.

	Restore the original logic and restrict ratelimit to control the printk and not the return value.

	Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Fixes: a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") Signed-off-by: Wen Yang <wen.yang@linux.dev> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
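The restored behaviour described above can be illustrated with a small userspace sketch. This is not the kernel function: `pending_softirqs` is a hypothetical stand-in for condition "A", the elided "..." checks are left out, `fprintf()` replaces `pr_warn()`, and `ratelimit` is file-scope here only so that it can be inspected.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's pending-softirq check ("A"). */
static unsigned int pending_softirqs;

/* Number of warnings printed so far (function-local static in the kernel). */
static int ratelimit;

/*
 * Returns true whenever softirq work is pending. The rate limit only
 * decides whether a warning is printed; it never changes the result.
 */
static bool report_idle_softirq_sketch(void)
{
	unsigned int pending = pending_softirqs;

	if (!pending)
		return false;

	if (ratelimit < 10) {
		fprintf(stderr,
			"NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
			pending);
		ratelimit++;
	}

	return true;
}
```

With pending work, the sketch keeps returning true after the tenth call; only the warning stops.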
2025-11-19	smp: Introduce a helper function to check for pending IPIs	Ulf Hansson
	When governors used during cpuidle try to find the most optimal idle state for a CPU or a group of CPUs, they are known to quite often fail. One reason for this is, that they are not taking into account whether there has been an IPI scheduled for any of the CPUs that are affected by the selected idle state. To enable pending IPIs to be taken into account for cpuidle decisions, introduce a new helper function, cpus_peek_for_pending_ipi(). Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
2025-11-19	printk: Avoid scheduling irq_work on suspend	John Ogness
	Allowing irq_work to be scheduled while trying to suspend has shown to cause problems as some architectures interpret the pending interrupts as a reason to not suspend. This became a problem for printk() with the introduction of NBCON consoles. With every printk() call, NBCON console printing kthreads are woken by queueing irq_work. This means that irq_work continues to be queued due to printk() calls late in the suspend procedure. Avoid this problem by preventing printk() from queueing irq_work once console suspending has begun. This applies to triggering NBCON and legacy deferred printing as well as klogd waiters. Since triggering of NBCON threaded printing relies on irq_work, the pr_flush() within console_suspend_all() is used to perform the final flushing before suspending consoles and blocking irq_work queueing. NBCON consoles that are not suspended (due to the usage of the "no_console_suspend" boot argument) transition to atomic flushing. Introduce a new global variable @console_irqwork_blocked to flag when irq_work queueing is to be avoided. The flag is used by printk_get_console_flush_type() to avoid allowing deferred printing and switch NBCON consoles to atomic flushing. It is also used by vprintk_emit() to avoid klogd waking. Add WARN_ON_ONCE(console_irqwork_blocked) to the irq_work queuing functions to catch any code that attempts to queue printk irq_work during the suspending/resuming procedure. Cc: stable@vger.kernel.org # 6.13.x because no drivers in 6.12.x Fixes: 6b93bb41f6ea ("printk: Add non-BKL (nbcon) console basic infrastructure") Closes: https://lore.kernel.org/lkml/DB9PR04MB8429E7DDF2D93C2695DE401D92C4A@DB9PR04MB8429.eurprd04.prod.outlook.com Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Sherry Sun <sherry.sun@nxp.com> Link: https://patch.msgid.link/20251113160351.113031-3-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19	printk: Allow printk_trigger_flush() to flush all types	John Ogness
	Currently printk_trigger_flush() only triggers legacy offloaded flushing, even if that may not be the appropriate method to flush for currently registered consoles. (The function predates the NBCON consoles.) Since commit 6690d6b52726 ("printk: Add helper for flush type logic") there is printk_get_console_flush_type(), which also considers NBCON consoles and reports all the methods of flushing appropriate based on the system state and consoles available. Update printk_trigger_flush() to use printk_get_console_flush_type() to appropriately flush registered consoles. Suggested-by: Petr Mladek <pmladek@suse.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/stable/20251113160351.113031-2-john.ogness%40linutronix.de Tested-by: Sherry Sun <sherry.sun@nxp.com> Link: https://patch.msgid.link/20251113160351.113031-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19	ima: Access decompressed kernel module to verify appended signature	Coiby Xu
	Currently, when in-kernel module decompression (CONFIG_MODULE_DECOMPRESS) is enabled, IMA has no way to verify the appended module signature as it can't decompress the module. Define a new kernel_read_file_id enumerate READING_MODULE_COMPRESSED so IMA can calculate the compressed kernel module data hash on READING_MODULE_COMPRESSED and defer appraising/measuring it until on READING_MODULE when the module has been decompressed. Before enabling in-kernel module decompression, a kernel module in initramfs can still be loaded with ima_policy=secure_boot. So adjust the kernel module rule in secure_boot policy to allow either an IMA signature OR an appended signature i.e. to use "appraise func=MODULE_CHECK appraise_type=imasig|modsig". Reported-by: Karel Srot <ksrot@redhat.com> Suggested-by: Mimi Zohar <zohar@linux.ibm.com> Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Coiby Xu <coxu@redhat.com> Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2025-11-19	tracing: Switch to use %ptSp	Andy Shevchenko
	Use %ptSp instead of open coded variants to print content of struct timespec64 in human readable format. Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20251113150217.3030010-22-andriy.shevchenko@linux.intel.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-11-19	watch_queue: Use local kmap in post_one_notification()	Davidlohr Bueso
	Replace the now deprecated kmap_atomic() with kmap_local_page(). Optimize for the non-highmem cases and avoid disabling preemption and pagefaults. The caller's context is atomic anyway, but that is irrelevant to kmap. The memcpy itself does not require any such semantics and the mapping would hold valid across context switches anyway. Further, highmem is planned to be removed[1]. [1] https://lore.kernel.org/all/4ff89b72-03ff-4447-9d21-dd6a5fe1550f@app.fastmail.com/ Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Link: https://patch.msgid.link/20251118210706.1816303-1-dave@stgolabs.net Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-18	bpf: Replace bpf memory allocator with kmalloc_nolock() in local storage	Amery Hung
	Replace bpf memory allocator with kmalloc_nolock() to reduce memory wastage due to preallocation. In bpf_selem_free(), an selem now needs to wait for an RCU grace period before being freed when reuse_now == true. Therefore, rcu_barrier() should always be called in bpf_local_storage_map_free(). In bpf_local_storage_free(), since smap->storage_ma is no longer needed to return the memory, the function is now independent from smap. Remove the outdated comment in bpf_local_storage_alloc(). We already free selem after an RCU grace period in bpf_local_storage_update() when bpf_local_storage_alloc() failed the cmpxchg since commit c0d63f309186 ("bpf: Add bpf_selem_free()"). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-5-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18	bpf: Save memory alloction info in bpf_local_storage	Amery Hung
	Save the memory allocation method used for bpf_local_storage in the struct explicitly so that we don't need to go through the hassle to find out the info. When a later patch replaces BPF memory allocator with kmalloc_nolock(), bpf_local_storage_free() will no longer need smap->storage_ma to return the memory, and the dependency on smap in bpf_local_storage_free() can be removed completely. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18	bpf: Remove smap argument from bpf_selem_free()	Amery Hung
	Since selem already saves a pointer to smap, use it instead of an additional argument in bpf_selem_free(). This requires moving the SDATA(selem)->smap assignment from bpf_selem_link_map() to bpf_selem_alloc() since bpf_selem_free() may be called without the selem being linked to smap in bpf_local_storage_update(). Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18	bpf: Always charge/uncharge memory when allocating/unlinking storage elements	Amery Hung
	Since commit a96a44aba556 ("bpf: bpf_sk_storage: Fix invalid wait context lockdep report"), {charge,uncharge}_mem are always true when allocating a bpf_local_storage_elem or unlinking a bpf_local_storage_elem from local storage, so drop these arguments. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20251114201329.3275875-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-18	genirq: Use raw_spinlock_irq() in irq_set_affinity_notifier()	Chengkaitao
	Since irq_set_affinity_notifier() may sleep, interrupts are enabled. So raw_spinlock_irqsave() can be replaced with raw_spinlock_irq(). Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251118012754.61805-1-pilgrimtao@gmail.com
2025-11-17	bpf: Fix invalid prog->stats access when update_effective_progs fails	Pu Lehui
	Syzkaller triggers an invalid memory access issue following fault injection in update_effective_progs. The issue can be described as follows:

	  __cgroup_bpf_detach
	    update_effective_progs
	      compute_effective_progs
	        bpf_prog_array_alloc <-- fault inject
	    purge_effective_progs
	      /* change to dummy_bpf_prog */
	      array->items[index] = &dummy_bpf_prog.prog

	  ---softirq start---
	  __do_softirq
	    ...
	      __cgroup_bpf_run_filter_skb
	        __bpf_prog_run_save_cb
	          bpf_prog_run
	            stats = this_cpu_ptr(prog->stats)
	            /* invalid memory access */
	            flags = u64_stats_update_begin_irqsave(&stats->syncp)
	  ---softirq end---

	    static_branch_dec(&cgroup_bpf_enabled_key[atype])

	The reason is that fault injection caused update_effective_progs to fail and then changed the original prog into dummy_bpf_prog.prog in purge_effective_progs. Then a softirq came, and accessing the members of dummy_bpf_prog.prog in the softirq triggers invalid mem access.

	To fix it, skip updating stats when stats is NULL.

	Fixes: 492ecee892c2 ("bpf: enable program stats") Signed-off-by: Pu Lehui <pulehui@huawei.com> Link: https://lore.kernel.org/r/20251115102343.2200727-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
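The shape of the fix ("skip updating stats when stats is NULL") can be sketched in plain C. The types and the function name below are hypothetical simplifications: the kernel uses per-CPU statistics and u64_stats sequence counters, which this userspace sketch does not model.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct prog_stats_sketch {
	uint64_t cnt;	/* number of runs */
	uint64_t nsecs;	/* accumulated run time */
};

struct prog_sketch {
	struct prog_stats_sketch *stats;	/* NULL for a dummy program */
};

/* Account one run; returns 1 if stats were updated, 0 if skipped. */
static int prog_account_run(struct prog_sketch *prog, uint64_t duration_ns)
{
	struct prog_stats_sketch *stats = prog->stats;

	/* A dummy program has no stats area: skip instead of dereferencing. */
	if (!stats)
		return 0;

	stats->cnt++;
	stats->nsecs += duration_ns;
	return 1;
}
```

A program with a stats area is accounted as usual, while a placeholder program whose `stats` pointer is NULL is simply skipped.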
2025-11-17	PM: hibernate: Clean up kernel-doc comment style usage	Sunday Adelodun
	Several static functions in kernel/power/swap.c were described using the kernel-doc comment style (/** ... */) even though they are not exported or referenced by generated documentation. This led to kernel-doc warnings and stylistic inconsistencies. Convert these unnecessary kernel-doc blocks to regular C comments, remove comment blocks that are no longer useful, relocate comments to more appropriate positions where needed, and fix a few "Return:" descriptions that were either missing or incorrectly formatted. No functional changes. Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com> [ rjw: Subject adjustment, changelog edits, comment edits ] Link: https://patch.msgid.link/20251114220438.52448-1-adelodunolaoluwa@yahoo.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-11-17	Merge tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs	Linus Torvalds
	Pull vfs fixes from Christian Brauner:

	 - Fix uninitialized variable in statmount_string()
	 - Fix hostfs mounting when passing host root during boot
	 - Fix dynamic lookup to fail on cell lookup failure
	 - Fix missing file type when reading bfs inodes from disk
	 - Enforce checking of sb_min_blocksize() calls and update all callers accordingly
	 - Restore write access before closing files opened by open_exec() in binfmt_misc
	 - Always freeze efivarfs during suspend/hibernate cycles
	 - Fix statmount()'s and listmount()'s grab_requested_mnt_ns() helper to actually allow mount namespace file descriptor in addition to mount namespace ids
	 - Fix tmpfs remount when noswap is specified
	 - Switch Landlock to iput_not_last() to remove false-positives from might_sleep() annotations in iput()
	 - Remove dead node_to_mnt_ns() code
	 - Ensure that per-queue kobjects are successfully created

	* tag 'vfs-6.18-rc7.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
	  landlock: fix splats from iput() after it started calling might_sleep()
	  fs: add iput_not_last()
	  shmem: fix tmpfs reconfiguration (remount) when noswap is set
	  fs/namespace: correctly handle errors returned by grab_requested_mnt_ns
	  power: always freeze efivarfs
	  binfmt_misc: restore write access before closing files opened by open_exec()
	  block: add __must_check attribute to sb_min_blocksize()
	  virtio-fs: fix incorrect check for fsvq->kobj
	  xfs: check the return value of sb_min_blocksize() in xfs_fs_fill_super
	  isofs: check the return value of sb_min_blocksize() in isofs_fill_super
	  exfat: check return value of sb_min_blocksize in exfat_read_boot_sector
	  vfat: fix missing sb_min_blocksize() return value checks
	  mnt: Remove dead code which might prevent from building
	  bfs: Reconstruct file type when loading from disk
	  afs: Fix dynamic lookup to fail on cell lookup failure
	  hostfs: Fix only passing host root in boot stage with new mount
	  fs: Fix uninitialized 'offp' in statmount_string()
2025-11-17	Merge tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext	Linus Torvalds
	Pull sched_ext fixes from Tejun Heo:

	"Five fixes addressing PREEMPT_RT compatibility and locking issues. Three commits fix potential deadlocks and sleeps in atomic contexts on RT kernels by converting locks to raw spinlocks and ensuring IRQ work runs in hard-irq context. The remaining two fix unsafe locking in the debug dump path and a variable dereference typo"

	* tag 'sched_ext-for-6.18-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
	  sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work
	  sched_ext: Fix possible deadlock in the deferred_irq_workfn()
	  sched/ext: convert scx_tasks_lock to raw spinlock
	  sched_ext: Fix unsafe locking in the scx_dump_state()
	  sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()
2025-11-17	sched/fair: Proportional newidle balance	Peter Zijlstra
	Add a randomized algorithm that runs newidle balancing proportional to its success rate.

	This improves schbench significantly:

	  6.18-rc4:               2.22 Mrps/s
	  6.18-rc4+revert:        2.04 Mrps/s
	  6.18-rc4+revert+random: 2.18 Mrps/s

	Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%:

	  6.17:               -6%
	  6.17+revert:         0%
	  6.17+revert+random: -1%

	Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
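The commit message does not spell out the algorithm, so the following is only an illustration of the general idea of running an operation with a probability proportional to its recent success rate. All names, the caller-supplied random number, and the floor of one "virtual" success are assumptions made for this sketch; they are not taken from the kernel patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for recent newidle balance attempts. */
struct newidle_stats_sketch {
	uint32_t calls;		/* attempts recorded so far */
	uint32_t successes;	/* attempts that pulled a task */
};

/*
 * Decide whether to attempt a balance, given a caller-supplied random
 * number. The attempt runs with probability max(successes, 1) / calls,
 * so a run of failures lowers the rate but never disables it entirely.
 */
static bool should_try_newidle(const struct newidle_stats_sketch *st,
			       uint32_t rnd)
{
	uint32_t weight;

	if (st->calls == 0)
		return true;	/* no history yet: always try */

	weight = st->successes ? st->successes : 1;
	return (rnd % st->calls) < weight;
}

/* Record the outcome of one attempt. */
static void record_newidle(struct newidle_stats_sketch *st, bool pulled_task)
{
	st->calls++;
	if (pulled_task)
		st->successes++;
}
```

A practical version would also need to age or reset the counters so that old history does not dominate; that part is omitted here.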
2025-11-17	sched/fair: Small cleanup to update_newidle_cost()	Peter Zijlstra
	Simplify code by adding a few variables. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.655208666@infradead.org
2025-11-17	sched/fair: Small cleanup to sched_balance_newidle()	Peter Zijlstra
	Pull out the !sd check to simplify code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://patch.msgid.link/20251107161739.525916173@infradead.org
2025-11-17	sched/fair: Revert max_newidle_lb_cost bump	Peter Zijlstra
	Many people reported regressions on their database workloads due to: 155213a2aed4 ("sched/fair: Bump sd->max_newidle_lb_cost when newidle balance fails") For instance Adam Li reported a 6% regression on SpecJBB. Conversely this will regress schbench again; on my machine from 2.22 Mrps/s down to 2.04 Mrps/s. Reported-by: Joseph Salisbury <joseph.salisbury@oracle.com> Reported-by: Adam Li <adamli@os.amperecomputing.com> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/20250626144017.1510594-2-clm@fb.com Link: https://lkml.kernel.org/r/006c9df2-b691-47f1-82e6-e233c3f91faf@oracle.com Link: https://patch.msgid.link/20251107161739.406147760@infradead.org
2025-11-17	sched/fair: Reimplement NEXT_BUDDY to align with EEVDF goals	Mel Gorman
	Reimplement NEXT_BUDDY preemption to take into account the deadline and eligibility of the wakee with respect to the waker. In the event multiple buddies could be considered, the one with the earliest deadline is selected.

	Sync wakeups are treated differently to every other type of wakeup. The WF_SYNC assumption is that the waker promises to sleep in the very near future. This is violated in enough cases that WF_SYNC should be treated as a suggestion instead of a contract. If a waker does go to sleep almost immediately then the delay in wakeup is negligible. In other cases, it's throttled based on the accumulated runtime of the waker so there is a chance that some batched wakeups have been issued before preemption.

	For all other wakeups, preemption happens if the wakee has an earlier deadline than the waker and is eligible to run.

	While many workloads were tested, the two main targets were a modified dbench4 benchmark and hackbench because they are on opposite ends of the spectrum -- one prefers throughput by avoiding preemption and the other relies on preemption.

	First is the dbench throughput data; it is a poor metric, but it is the default metric. The test machine is a 2-socket machine and the backing filesystem is XFS as a lot of the IO work is dispatched to kernel threads. It's important to note that these results are not representative across all machines, especially Zen machines, as different bottlenecks are exposed on different machines and filesystems.
	dbench4 Throughput (misleading but traditional)
	                            6.18-rc1               6.18-rc1
	                             vanilla   sched-preemptnext-v5
	Hmean     1       1268.80 (   0.00%)     1269.74 (   0.07%)
	Hmean     4       3971.74 (   0.00%)     3950.59 (  -0.53%)
	Hmean     7       5548.23 (   0.00%)     5420.08 (  -2.31%)
	Hmean     12      7310.86 (   0.00%)     7165.57 (  -1.99%)
	Hmean     21      8874.53 (   0.00%)     9149.04 (   3.09%)
	Hmean     30      9361.93 (   0.00%)    10530.04 (  12.48%)
	Hmean     48      9540.14 (   0.00%)    11820.40 (  23.90%)
	Hmean     79      9208.74 (   0.00%)    12193.79 (  32.42%)
	Hmean     110     8573.12 (   0.00%)    11933.72 (  39.20%)
	Hmean     141     7791.33 (   0.00%)    11273.90 (  44.70%)
	Hmean     160     7666.60 (   0.00%)    10768.72 (  40.46%)

	As throughput is misleading, the benchmark is modified to use a short loadfile and report the completion time in milliseconds.

	dbench4 Loadfile Execution Time
	                            6.18-rc1               6.18-rc1
	                             vanilla   sched-preemptnext-v5
	Amean     1         14.62 (   0.00%)       14.69 (  -0.46%)
	Amean     4         18.76 (   0.00%)       18.85 (  -0.45%)
	Amean     7         23.71 (   0.00%)       24.38 (  -2.82%)
	Amean     12        31.25 (   0.00%)       31.87 (  -1.97%)
	Amean     21        45.12 (   0.00%)       43.69 (   3.16%)
	Amean     30        61.07 (   0.00%)       54.33 (  11.03%)
	Amean     48        95.91 (   0.00%)       77.22 (  19.49%)
	Amean     79       163.38 (   0.00%)      123.08 (  24.66%)
	Amean     110      243.91 (   0.00%)      175.11 (  28.21%)
	Amean     141      343.47 (   0.00%)      239.10 (  30.39%)
	Amean     160      401.15 (   0.00%)      283.73 (  29.27%)
	Stddev    1          0.52 (   0.00%)        0.51 (   2.45%)
	Stddev    4          1.36 (   0.00%)        1.30 (   4.04%)
	Stddev    7          1.88 (   0.00%)        1.87 (   0.72%)
	Stddev    12         3.06 (   0.00%)        2.45 (  19.83%)
	Stddev    21         5.78 (   0.00%)        3.87 (  33.06%)
	Stddev    30         9.85 (   0.00%)        5.25 (  46.76%)
	Stddev    48        22.31 (   0.00%)        8.64 (  61.27%)
	Stddev    79        35.96 (   0.00%)       18.07 (  49.76%)
	Stddev    110       59.04 (   0.00%)       30.93 (  47.61%)
	Stddev    141       85.38 (   0.00%)       40.93 (  52.06%)
	Stddev    160       96.38 (   0.00%)       39.72 (  58.79%)

	That is still looking good and the variance is reduced quite a bit. Finally, fairness is a concern so the next report tracks how many milliseconds it takes for all clients to complete a workfile.
	This one is tricky because dbench makes no effort to synchronise clients so the durations at benchmark start time differ substantially from typical runtimes. This problem could be mitigated by warming up the benchmark for a number of minutes but it's a matter of opinion whether that counts as an evasion of inconvenient results.

	dbench4 All Clients Loadfile Execution Time
	                            6.18-rc1               6.18-rc1
	                             vanilla   sched-preemptnext-v5
	Amean     1         15.06 (   0.00%)       15.07 (  -0.03%)
	Amean     4        603.81 (   0.00%)      524.29 (  13.17%)
	Amean     7        855.32 (   0.00%)     1331.07 ( -55.62%)
	Amean     12      1890.02 (   0.00%)     2323.97 ( -22.96%)
	Amean     21      3195.23 (   0.00%)     2009.29 (  37.12%)
	Amean     30     13919.53 (   0.00%)     4579.44 (  67.10%)
	Amean     48     25246.07 (   0.00%)     5705.46 (  77.40%)
	Amean     79     29701.84 (   0.00%)    15509.26 (  47.78%)
	Amean     110    22803.03 (   0.00%)    23782.08 (  -4.29%)
	Amean     141    36356.07 (   0.00%)    25074.20 (  31.03%)
	Amean     160    17046.71 (   0.00%)    13247.62 (  22.29%)
	Stddev    1          0.47 (   0.00%)        0.49 (  -3.74%)
	Stddev    4        395.24 (   0.00%)      254.18 (  35.69%)
	Stddev    7        467.24 (   0.00%)      764.42 ( -63.60%)
	Stddev    12      1071.43 (   0.00%)     1395.90 ( -30.28%)
	Stddev    21      1694.50 (   0.00%)     1204.89 (  28.89%)
	Stddev    30      7945.63 (   0.00%)     2552.59 (  67.87%)
	Stddev    48     14339.51 (   0.00%)     3227.55 (  77.49%)
	Stddev    79     16620.91 (   0.00%)     8422.15 (  49.33%)
	Stddev    110    12912.15 (   0.00%)    13560.95 (  -5.02%)
	Stddev    141    20700.13 (   0.00%)    14544.51 (  29.74%)
	Stddev    160     9079.16 (   0.00%)     7400.69 (  18.49%)

	This is more of a mixed bag but it at least shows that fairness is not crippled.

	The hackbench results are more neutral but this is still important. It's possible to boost the dbench figures by a large amount but only by crippling the performance of a workload like hackbench. The WF_SYNC behaviour is important for these workloads and is why the WF_SYNC changes are not a separate patch.
	hackbench-process-pipes
	                            6.18-rc1               6.18-rc1
	                             vanilla   sched-preemptnext-v5
	Amean     1        0.2657 (   0.00%)      0.2150 (  19.07%)
	Amean     4        0.6107 (   0.00%)      0.6060 (   0.76%)
	Amean     7        0.7923 (   0.00%)      0.7440 (   6.10%)
	Amean     12       1.1500 (   0.00%)      1.1263 (   2.06%)
	Amean     21       1.7950 (   0.00%)      1.7987 (  -0.20%)
	Amean     30       2.3207 (   0.00%)      2.5053 (  -7.96%)
	Amean     48       3.5023 (   0.00%)      3.9197 ( -11.92%)
	Amean     79       4.8093 (   0.00%)      5.2247 (  -8.64%)
	Amean     110      6.1160 (   0.00%)      6.6650 (  -8.98%)
	Amean     141      7.4763 (   0.00%)      7.8973 (  -5.63%)
	Amean     172      8.9560 (   0.00%)      9.3593 (  -4.50%)
	Amean     203     10.4783 (   0.00%)     10.8347 (  -3.40%)
	Amean     234     12.4977 (   0.00%)     13.0177 (  -4.16%)
	Amean     265     14.7003 (   0.00%)     15.5630 (  -5.87%)
	Amean     296     16.1007 (   0.00%)     17.4023 (  -8.08%)

	Processes using pipes are impacted but the variance (not presented) indicates it's close to noise and the results are not always reproducible. If executed across multiple reboots, it may show neutral or small gains so the worst measured results are presented.

	Hackbench using sockets is more reliably neutral as the wakeup mechanisms are different between sockets and pipes.

	hackbench-process-sockets
	                            6.18-rc1               6.18-rc1
	                             vanilla   sched-preemptnext-v2
	Amean     1        0.3073 (   0.00%)      0.3263 (  -6.18%)
	Amean     4        0.7863 (   0.00%)      0.7930 (  -0.85%)
	Amean     7        1.3670 (   0.00%)      1.3537 (   0.98%)
	Amean     12       2.1337 (   0.00%)      2.1903 (  -2.66%)
	Amean     21       3.4683 (   0.00%)      3.4940 (  -0.74%)
	Amean     30       4.7247 (   0.00%)      4.8853 (  -3.40%)
	Amean     48       7.6097 (   0.00%)      7.8197 (  -2.76%)
	Amean     79      14.7957 (   0.00%)     16.1000 (  -8.82%)
	Amean     110     21.3413 (   0.00%)     21.9997 (  -3.08%)
	Amean     141     29.0503 (   0.00%)     29.0353 (   0.05%)
	Amean     172     36.4660 (   0.00%)     36.1433 (   0.88%)
	Amean     203     39.7177 (   0.00%)     40.5910 (  -2.20%)
	Amean     234     42.1120 (   0.00%)     43.5527 (  -3.42%)
	Amean     265     45.7830 (   0.00%)     50.0560 (  -9.33%)
	Amean     296     50.7043 (   0.00%)     54.3657 (  -7.22%)

	As schbench has been mentioned in numerous bugs recently, the results are interesting.
	A test case that represents the default schbench behaviour is

	schbench Wakeup Latency (usec)
	                                       6.18.0-rc1             6.18.0-rc1
	                                          vanilla   sched-preemptnext-v5
	Amean     Wakeup-50th-80          7.17 (   0.00%)        6.00 (  16.28%)
	Amean     Wakeup-90th-80         46.56 (   0.00%)       19.78 (  57.52%)
	Amean     Wakeup-99th-80        119.61 (   0.00%)       89.94 (  24.80%)
	Amean     Wakeup-99.9th-80     3193.78 (   0.00%)      328.22 (  89.72%)

	schbench Requests Per Second (ops/sec)
	                                  6.18.0-rc1             6.18.0-rc1
	                                     vanilla   sched-preemptnext-v5
	Hmean     RPS-20th-80     8900.91 (   0.00%)     9176.78 (   3.10%)
	Hmean     RPS-50th-80     8987.41 (   0.00%)     9217.89 (   2.56%)
	Hmean     RPS-90th-80     9123.73 (   0.00%)     9273.25 (   1.64%)
	Hmean     RPS-max-80      9193.50 (   0.00%)     9301.47 (   1.17%)

	Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251112122521.1331238-3-mgorman@techsingularity.net
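The rule stated in this commit message for non-sync wakeups (preempt if the wakee has an earlier deadline than the waker and is eligible to run) can be written as a small predicate. This is a simplified sketch: the struct and function names are hypothetical, eligibility is modelled as "vruntime not ahead of the queue's average vruntime", and the WF_SYNC throttling and the kernel's wrap-safe comparisons are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a scheduling entity. */
struct entity_sketch {
	int64_t vruntime;	/* virtual runtime consumed so far */
	int64_t deadline;	/* virtual deadline */
};

/* Assumed eligibility test: the entity has not run ahead of the average. */
static bool entity_eligible_sketch(const struct entity_sketch *se,
				   int64_t avg_vruntime)
{
	return se->vruntime <= avg_vruntime;
}

/* Non-sync wakeup: preempt only for an eligible wakee with an earlier deadline. */
static bool wakee_should_preempt(const struct entity_sketch *waker,
				 const struct entity_sketch *wakee,
				 int64_t avg_vruntime)
{
	return entity_eligible_sketch(wakee, avg_vruntime) &&
	       wakee->deadline < waker->deadline;
}
```

Both conditions must hold: an eligible wakee with a later deadline does not preempt, and neither does an ineligible wakee with an earlier one.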
2025-11-17	sched/fair: Enable scheduler feature NEXT_BUDDY	Mel Gorman
	The NEXT_BUDDY feature reinforces wakeup preemption to encourage the last wakee to be scheduled sooner on the assumption that the waker/wakee share cache-hot data. In CFS, it was paired with LAST_BUDDY to switch back on the assumption that the pair of tasks still share data but also relied on START_DEBIT and the exact WAKEUP_PREEMPTION implementation to get good results.

	NEXT_BUDDY has been disabled since commit 0ec9fab3d186 ("sched: Improve latencies and throughput") and LAST_BUDDY was removed in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). The reasoning is not clear, but as vruntime spread is mentioned, the expectation is that NEXT_BUDDY had an impact on overall fairness. It was not noted why LAST_BUDDY was removed but it is assumed that it's very difficult to reason about what LAST_BUDDY's correct and effective behaviour should be while still respecting EEVDF's goals. Peter Zijlstra noted during review:

	  I think I was just struggling to make sense of things and figured less is more and axed it. I have vague memories trying to work through the dynamics of a wakeup-stack and the EEVDF latency requirements and getting a head-ache.

	NEXT_BUDDY is easier to reason about given that it's a point-in-time decision on the wakee's deadline and eligibility relative to the waker. Enable NEXT_BUDDY as a preparation path to document that the decision to ignore the current implementation is deliberate. While not presented, the results were at best neutral and often much more variable.

	Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251112122521.1331238-2-mgorman@techsingularity.net
2025-11-17	sched: Increase sched_tick_remote timeout	Phil Auld
	Increase the sched_tick_remote WARN_ON timeout to remove false positives due to temporarily busy HK cpus. The suggestion was 30 seconds to catch really stuck remote tick processing but not trigger it too easily. Suggested-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Phil Auld <pauld@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20250911161300.437944-1-pauld@redhat.com
2025-11-17	sched/fair: Have SD_SERIALIZE affect newidle balancing	Peter Zijlstra
	Also serialize the possibly much more frequent newidle balancing for the 'expensive' domains that have SD_SERIALIZE set. Initial benchmarking by K Prateek and Tim showed no negative effect. Split out from the larger patch moving sched_balance_running around for ease of bisect and such. Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Seconded-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/df068896-82f9-458d-8fff-5a2f654e8ffd@amd.com Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
2025-11-17	sched/fair: Skip sched_balance_running cmpxchg when balance is not due	Tim Chen
	The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing only one NUMA load balancing operation to run system-wide at a time. Currently, each sched group leader directly under the NUMA domain attempts to acquire the global sched_balance_running flag via cmpxchg() before checking whether load balancing is due or whether it is the designated load balancer for that NUMA domain. On systems with a large number of cores, this causes significant cache contention on the shared sched_balance_running flag. This patch reduces unnecessary cmpxchg() operations by first checking, via should_we_balance(), that the balancer is the designated leader for a NUMA domain and that the balance interval has expired, before trying to acquire sched_balance_running to load balance a NUMA domain. On a 2-socket Granite Rapids system with sub-NUMA clustering enabled, running an OLTP workload, 7.8% of total CPU cycles were previously spent in sched_balance_domains() contending on sched_balance_running before this change.
	         :   104   static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
	         :   105   {
	         :   106   return arch_cmpxchg(&v->counter, old, new);
	    0.00 :   ffffffff81326e6c:   xor    %eax,%eax
	    0.00 :   ffffffff81326e6e:   mov    $0x1,%ecx
	    0.00 :   ffffffff81326e73:   lock cmpxchg %ecx,0x2394195(%rip)   # ffffffff836bb010 <sched_balance_running>
	         :   110   sched_balance_domains():
	         :   12234 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
	   99.39 :   ffffffff81326e7b:   test   %eax,%eax
	    0.00 :   ffffffff81326e7d:   jne    ffffffff81326e99 <sched_balance_domains+0x209>
	         :   12238 if (time_after_eq(jiffies, sd->last_balance + interval)) {
	    0.00 :   ffffffff81326e7f:   mov    0x14e2b3a(%rip),%rax   # ffffffff828099c0 <jiffies_64>
	    0.00 :   ffffffff81326e86:   sub    0x48(%r14),%rax
	    0.00 :   ffffffff81326e8a:   cmp    %rdx,%rax
	After applying this fix, sched_balance_domains() is gone from the profile and there is a 5% throughput improvement.
[peterz: made it so that redo retains the 'lock' and split out the CPU_NEWLY_IDLE change to a separate patch] Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
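	The ordering described above can be sketched in userspace C11. This is only an illustration, not the kernel code: struct domain, balance_running, cmpxchg_attempts and try_balance() are hypothetical stand-ins, and the balance work itself is elided. The point it shows is that the local checks (designated leader, interval expired) run before the shared flag is touched, so a CPU that has nothing to do never issues the cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global sched_balance_running flag. */
static atomic_int balance_running;

/* Counts how many compare-exchange attempts were actually issued. */
static unsigned long cmpxchg_attempts;

/* Hypothetical, heavily simplified per-domain state. */
struct domain {
	unsigned long last_balance;	/* time of the last balance pass */
	unsigned long interval;		/* minimum time between passes */
	bool serialize;			/* models SD_SERIALIZE */
};

/*
 * Returns true if a balance pass ran. The cheap, CPU-local checks come
 * first, so the contended flag is only touched when a pass is due.
 */
static bool try_balance(struct domain *sd, unsigned long now, bool is_leader)
{
	int expected = 0;

	if (!is_leader)
		return false;
	if (now - sd->last_balance < sd->interval)
		return false;

	if (sd->serialize) {
		cmpxchg_attempts++;
		if (!atomic_compare_exchange_strong_explicit(&balance_running,
							      &expected, 1,
							      memory_order_acquire,
							      memory_order_relaxed))
			return false;	/* someone else is balancing */
	}

	sd->last_balance = now;		/* the real balance work is elided */

	if (sd->serialize)
		atomic_store_explicit(&balance_running, 0, memory_order_release);
	return true;
}
```

	Calling try_balance() before the interval has expired, or from a CPU that is not the leader, leaves cmpxchg_attempts unchanged, which is the contention saving the patch is after.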
2025-11-17	Merge back earlier material related to system sleep for 6.19	Rafael J. Wysocki

2025-11-17	sched_ext: Use kvfree_rcu() to release per-cpu ksyncs object	Zqiang
	The free_kick_syncs_rcu() RCU callback only invokes kvfree() to release the per-cpu ksyncs object, so free_kick_syncs() can use kvfree_rcu() instead of call_rcu() to release it. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-17	sched_ext: Use IRQ_WORK_INIT_HARD() to initialize rq->scx.kick_cpus_irq_work	Zqiang
	On PREEMPT_RT kernels, kick_cpus_irq_workfn() is invoked in the per-cpu irq_work/* task context, where there is no RCU read-side critical section to protect it. Therefore use IRQ_WORK_INIT_HARD() to initialize the per-cpu rq->scx.kick_cpus_irq_work in init_sched_ext_class(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-16	mm: make INVALID_PHYS_ADDR a generic macro	Anshuman Khandual
	INVALID_PHYS_ADDR has very similar definitions across the code base. Hence just move it into the header <linux/mm.h> for more generic usage. Also drop the now redundant definitions, which are no longer required. Link: https://lkml.kernel.org/r/20251021025638.2420216-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> [s390] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16	treewide: include linux/pgalloc.h instead of asm/pgalloc.h	Harry Yoo
	For now, including <asm/pgalloc.h> instead of <linux/pgalloc.h> is technically fine unless the .c file calls p*d_populate_kernel() helper functions. But it is a better practice to always include <linux/pgalloc.h>. Include <linux/pgalloc.h> instead of <asm/pgalloc.h> outside arch/. Link: https://lkml.kernel.org/r/20251024113047.119058-3-harry.yoo@oracle.com Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16	relay: update relay to use mmap_prepare	Lorenzo Stoakes
	It is relatively trivial to update this code to use the f_op->mmap_prepare hook in favour of the deprecated f_op->mmap hook, so do so. Link: https://lkml.kernel.org/r/7c9e82cdddf8b573ea3edb8cdb697363e3ccb5d7.1760959442.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Dave Young <dyoung@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicolas Pitre <nico@fluxnic.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Sumanth Korikkar <sumanthk@linux.ibm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16	memcg: net: track network throttling due to memcg memory pressure	Shakeel Butt
	The kernel can throttle network sockets if the memory cgroup associated with the corresponding socket is under memory pressure. The throttling actions include clamping the transmit window, failing to expand receive or send buffers, aggressively pruning the out-of-order receive queue, deferring FIN to a retransmitted packet, and more. Let's add a memcg metric to track such throttling actions. At the moment memcg memory pressure is defined through vmpressure; in future it may be defined using PSI, or we may add a more flexible way for users to define memory pressure, maybe through eBPF. However, the potential throttling actions will remain the same, so this newly introduced metric will continue to track throttling actions irrespective of how memcg memory pressure is defined. Link: https://lkml.kernel.org/r/20251016161035.86161-1-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Daniel Sedlak <daniel.sedlak@cdn77.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Neal Cardwell <ncardwell@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16	mm: consistently use current->mm in mm_get_unmapped_area()	Ryan Roberts
	mm_get_unmapped_area() is a wrapper around arch_get_unmapped_area() / arch_get_unmapped_area_topdown(), both of which search current->mm for some free space. Neither takes an mm_struct - they implicitly operate on current->mm. But the wrapper takes an mm_struct and uses it to decide whether to search bottom up or top down. All callers pass in current->mm for this, so everything is working consistently. But it feels like an accident waiting to happen; eventually someone will call that function with a different mm, expecting to find free space in it, but what gets returned is free space in the current mm. So let's simplify by removing the parameter and having the wrapper use current->mm to decide which end to start at. Now everything is consistent and self-documenting. Link: https://lkml.kernel.org/r/20251003155306.2147572-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-16	Merge tag 'mm-hotfixes-stable-2025-11-16-10-40' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
	Pull misc fixes from Andrew Morton: "7 hotfixes. 5 are cc:stable, 4 are against mm/ All are singletons - please see the respective changelogs for details"
	* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
	  mm, swap: fix potential UAF issue for VMA readahead
	  selftests/user_events: fix type cast for write_index packed member in perf_test
	  lib/test_kho: check if KHO is enabled
	  mm/huge_memory: fix folio split check for anon folios in swapcache
	  MAINTAINERS: update David Hildenbrand's email address
	  crash: fix crashkernel resource shrink
	  mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
2025-11-16	convert bpf	Al Viro
	object creation goes through the normal VFS paths or approximation thereof (user_path_create()/done_path_create() in case of bpf_obj_do_pin(), open-coded simple_{start,done}_creating() in bpf_iter_link_pin_kernel() at mount time), removals go entirely through the normal VFS paths (and ->unlink() is simple_unlink() there). Enough to have bpf_dentry_finalize() use d_make_persistent() instead of dget() and we are done. Convert bpf_iter_link_pin_kernel() to simple_{start,done}_creating(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-15	crash: fix crashkernel resource shrink	Sourabh Jain
	When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues:
	1. Invalid crashkernel resource objects
	2. Kernel crash if crashkernel shrinking is done twice
	For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as:
	  cat /proc/iomem | grep -i crash
	  af000000-beffffff : Crash kernel
	  433000000-43f7fffff : Crash kernel
	If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved:
	  af000000-beffffff : Crash kernel
	Instead, it should show 50MB:
	  af000000-b21fffff : Crash kernel
	Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86):
	  BUG: kernel NULL pointer dereference, address: 0000000000000038
	  PGD 0 P4D 0
	  Oops: 0000 [#1] PREEMPT SMP NOPTI
	  <snip...>
	  Call Trace:
	  <TASK>
	  ? __die_body.cold+0x19/0x27
	  ? page_fault_oops+0x15a/0x2f0
	  ? search_module_extables+0x19/0x60
	  ? search_bpf_extables+0x5f/0x80
	  ? exc_page_fault+0x7e/0x180
	  ? asm_exc_page_fault+0x26/0x30
	  ? __release_resource+0xd/0xb0
	  release_resource+0x26/0x40
	  __crash_shrink_memory+0xe5/0x110
	  crash_shrink_memory+0x12a/0x190
	  kexec_crash_size_store+0x41/0x80
	  kernfs_fop_write_iter+0x141/0x1f0
	  vfs_write+0x294/0x460
	  ksys_write+0x6d/0xf0
	  <snip...>
	This happens because __crash_shrink_memory() in kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory.
Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
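	The address ranges quoted in the example can be checked with a little arithmetic. The sketch below is not kernel code; region_end() and region_size() are hypothetical helpers that only reproduce the inclusive start-end convention used by /proc/iomem.

```c
#include <stdint.h>

/* Hypothetical helper: inclusive end address of `size` bytes at `start`. */
static uint64_t region_end(uint64_t start, uint64_t size)
{
	return start + size - 1;
}

/* Hypothetical helper: size in bytes of the inclusive range [start, end]. */
static uint64_t region_size(uint64_t start, uint64_t end)
{
	return end - start + 1;
}
```

	With these, a 50MB region (52428800 bytes) starting at 0xaf000000 ends at 0xb21fffff, the value the commit message says /proc/iomem should show, while the 256MB low reservation ends at 0xbeffffff and the high range 433000000-43f7fffff is 200MB.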
2025-11-15	Merge tag 'timers-urgent-2025-11-15' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a memory leak in the posix timer creation logic" * tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-timers: Plug potential memory leak in do_timer_create()
2025-11-14	bpf: don't skip other information if xlated_prog_insns is skipped	Altgelt, Max (Nextron)
	If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14	bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()	Puranjay Mohan
	Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit d869d56ca848 ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
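	Why an uninitialised key breaks the lookup can be sketched in plain C. This is an assumption-laden illustration rather than the verifier code: struct kfunc_desc is reduced to two fields, the imm derivation (address minus a base) is a made-up stand-in, and set_imm_and_sort() and find_addr() are hypothetical names. The idea it shows is the one from the commit message: compute the sort key for every table entry at the point of sorting, so that no entry depends on an earlier pass having visited it, and a binary search over the table stays valid.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced descriptor: the real table has more fields. */
struct kfunc_desc {
	unsigned long addr;	/* address of the kfunc */
	long imm;		/* sort/lookup key derived from addr */
};

static int cmp_imm(const void *a, const void *b)
{
	const struct kfunc_desc *d0 = a, *d1 = b;

	return (d0->imm > d1->imm) - (d0->imm < d1->imm);
}

/*
 * Fill in the key for every entry right before sorting, including
 * entries whose call instruction was removed as dead code.
 */
static void set_imm_and_sort(struct kfunc_desc *tab, size_t n,
			     unsigned long base)
{
	for (size_t i = 0; i < n; i++)
		tab[i].imm = (long)(tab[i].addr - base);	/* assumed encoding */

	qsort(tab, n, sizeof(*tab), cmp_imm);
}

/* Binary search by key, as a JIT might do; returns 0 if not found. */
static unsigned long find_addr(const struct kfunc_desc *tab, size_t n, long imm)
{
	struct kfunc_desc key = { .imm = imm };
	const struct kfunc_desc *d;

	d = bsearch(&key, tab, n, sizeof(*tab), cmp_imm);
	return d ? d->addr : 0;
}
```

	If some entries kept a stale or uninitialised imm, qsort() would still order the table by those values, but bsearch() for the real key could miss a valid entry, which matches the s390 JIT failure described above.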
2025-11-14	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+	Alexei Starovoitov
	Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14	Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf	Linus Torvalds
	Pull bpf fixes from Alexei Starovoitov:
	 - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks.
	 - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks.
	 - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman)
	 - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)
	 - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet)
	 - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)
	* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
	  selftests/bpf: Test widen_imprecise_scalars() with different stack depth
	  bpf: account for current allocated stack depth in widen_imprecise_scalars()
	  bpf: Add bpf_prog_run_data_pointers()
	  selftests/bpf: Add mptcp test with sockmap
	  mptcp: Fix proto fallback detection with BPF
	  mptcp: Disallow MPTCP subflows from sockmap
	  selftests/bpf: Add stacktrace ips test for raw_tp
	  selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
	  x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
	  Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
	  bpf: add _impl suffix for bpf_stream_vprintk() kfunc
	  bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
	  selftests/bpf: Add tests for livepatch + bpf trampoline
	  ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
	  ftrace: Fix BPF fexit with livepatch
2025-11-14	bpf: Handle return value of ftrace_set_filter_ip in register_fentry	Menglong Dong
	The error returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Fix it. Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14	bpf: Add missing checks to avoid verbose verifier log	Eduard Zingerman
	There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
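	The pattern being added can be sketched as follows. Nothing here is the verifier's API: struct vlog, the LOG_* flags, log_write(), report_state() and report_stats() are hypothetical names chosen for illustration, and the flag values are an assumption. The sketch shows a verbose message being gated on the log level so that a caller asking only for statistics does not need buffer space for output it never wanted.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical log-level flags, loosely modelled on the BPF log levels. */
#define LOG_LEVEL1	1u
#define LOG_LEVEL2	2u
#define LOG_STATS	4u
#define LOG_VERBOSE	(LOG_LEVEL1 | LOG_LEVEL2)

/* Hypothetical fixed-size log; must be zero-initialised before use. */
struct vlog {
	unsigned int level;
	char buf[256];
	size_t len;
};

/* Append a formatted message, truncating if the buffer is full. */
static void log_write(struct vlog *log, const char *fmt, ...)
{
	size_t room = sizeof(log->buf) - log->len;	/* includes the NUL */
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(log->buf + log->len, room, fmt, ap);
	va_end(ap);

	if (n > 0)
		log->len += (size_t)n < room - 1 ? (size_t)n : room - 1;
}

/* Verbose output: check the level before formatting anything. */
static void report_state(struct vlog *log, int insn_idx)
{
	if (!(log->level & LOG_VERBOSE))
		return;
	log_write(log, "insn %d: state\n", insn_idx);
}

/* Statistics output: only emitted when stats were requested. */
static void report_stats(struct vlog *log, int insns)
{
	if (log->level & LOG_STATS)
		log_write(log, "processed %d insns\n", insns);
}
```

	With level set to LOG_STATS only, report_state() leaves the buffer empty and report_stats() writes a single short line, so a small buffer is enough.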
2025-11-14	sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs	Tejun Heo
	With the buddy lockup detector, smp_processor_id() returns the detecting CPU, not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the locked CPU number from watchdog_hardlockup_check() as a parameter instead. Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and scx_rcu_cpu_stall() documenting their return value semantics. Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>