linux-stable.git - Linux kernel stable tree

Age	Commit message (Collapse)	Author
2026-05-28	raid6: remove __KERNEL__ ifdefs	Christoph Hellwig
	With the test code ported to kernel space, none of this is required. Link: https://lore.kernel.org/20260518051804.462141-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	raid6: turn the userspace test harness into a kunit test	Christoph Hellwig
	Patch series "cleanup the RAID6 P/Q library", v3. This series cleans up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries. This includes providing properly documented external interfaces, hiding the internals, using static_call instead of indirect calls and turning the user space test suite into an in-kernel kunit test which is also extended to improve coverage. Note that this changes registration so that non-priority algorithms are not registered, which greatly helps with the benchmark time at boot time. I'd like to encourage all architecture maintainers to see if they can further optimized this by registering as few as possible algorithms when there is a clear benefit in optimized or more unrolled implementations. This patch (of 18): Currently the raid6 code can be compiled as userspace code to run the test suite. Convert that to be a kunit case with minimal changes to avoid mutating global state so that we can drop this requirement. Note that this is not a good kunit test case yet and will need a lot more work, but that is deferred until the raid6 code is moved to it's new place, which is easier if the userspace makefile doesn't need adjustments for the new location first. Link: https://lore.kernel.org/20260518051804.462141-1-hch@lst.de Link: https://lore.kernel.org/20260518051804.462141-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	kcov: allow simultaneous KCOV_ENABLE/KCOV_REMOTE_ENABLE	Jann Horn
	Allow the same userspace thread to simultaneously collect normal coverage in syscall context (KCOV_ENABLE) and remote coverage of asynchronous work created by the thread (KCOV_REMOTE_ENABLE). With this, remote KCOV coverage becomes useful for generic fuzzing and not just fuzzing of specific data injection interfaces. This requires that the task_struct::kcov_* fields are separated into ones that are used by the task that generates coverage, and ones that are used by the task that requested remote coverage. To split this up: - Split task_struct::kcov into kcov and kcov_remote. kcov_task_exit() now has to clean up both separately. - Only use task_struct::kcov_mode on the task that generates coverage. - Only reset task_struct::kcov_handle on the task that requested remote coverage. After this change, fields used by the task that generates coverage are: - kcov_mode - kcov_size - kcov_area - kcov - kcov_sequence - kcov_softirq Fields used by the task that requested remote coverage are: - kcov_remote - kcov_handle [jannh@google.com: remove unused constant KCOV_MODE_REMOTE, per Dmitry] Link: https://lore.kernel.org/20260515-kcov-simultaneous-remote-v2-1-56fde1cfa509@google.com [jannh@google.com: update documentation on remote coverage collection] Link: https://lore.kernel.org/20260519-kcov-docs-v1-1-5bb22f4cb20c@google.com [jannh@google.com: move and reword sentence on simultaneous normal/remote collection Link: https://lore.kernel.org/20260520-kcov-docs-v2-1-819f78778763@google.com Link: https://lore.kernel.org/20260505-kcov-simultaneous-remote-v1-1-a670ba7cefd2@google.com Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	llist: make locking comments consistent	Philipp Stanner
	llist's locking requirement table has a legend which claims that all operations not needing a lock a marked with '-', whereas in truth for some table entries just a whitespace is used. Add the '-' to all appropriate places. Link: https://lore.kernel.org/20260507094918.23910-2-phasta@kernel.org Signed-off-by: Philipp Stanner <phasta@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	kcov: refactor common handle ID into kcov_common_handle_id	Jann Horn
	Store common handle IDs in "struct kcov_common_handle_id", which consumes no space in non-KCOV builds. This cleanup removes #ifdef boilerplate code from subsystems that integrate with KCOV (in particular in usbip_common.h and skbuff.h, see the diffstat). This should also make it easier to add KCOV remote coverage to more subsystems in the future. Link: https://lore.kernel.org/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Hongren (Zenithal) Zheng <i@zenithal.me> Cc: Jann Horn <jannh@google.com> Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Valentina Manea <valentina.manea.m@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	uaccess: minimize INLINE_COPY_USER-related ifdefery	Yury Norov
	Now that we've got the same config selecting inline vs outline copy_to_user() and copy_from_user(), we can simplify the corresponding logic in the uaccess.h. Link: https://lore.kernel.org/20260425020857.356850-4-ynorov@nvidia.com Fixes: 1f9a8286bc0c ("uaccess: always export _copy_[from\|to]_user with CONFIG_RUST") Signed-off-by: Yury Norov <ynorov@nvidia.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	uaccess: unify inline vs outline copy_{from,to}_user() selection	Yury Norov
	The kernel allows arches to select between inline and outline implementations of the copy_{from,to}_user() by defining individual INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER, correspondingly. However, all arches enable or disable them always together. Without the real use-case for one helper being inlined while the other outlined, having independent controls is excessive and error prone. Switch the codebase to the single unified INLINE_COPY_USER control. Link: https://lore.kernel.org/20260425020857.356850-3-ynorov@nvidia.com Signed-off-by: Yury Norov <ynorov@nvidia.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Viktor Malik <vmalik@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	init.h: discard exitcall symbols early	Arnd Bergmann
	Any __exitcall() and built-in module_exit() handler is marked as __used, which leads to the code being included in the object file and later discarded at link time. As far as I can tell, this was originally added at the same time as initcalls were marked the same way, to prevent them from getting dropped with gcc-3.4, but it was never actaully necessary to keep exit functions around. Mark them as __maybe_unused instead, which lets the compiler treat the exitcalls as entirely unused, and make better decisions about dropping specializing static functions called from these. Link: https://lore.kernel.org/all/acruxMNdnUlyRHiy@google.com/ Link: https://lore.kernel.org/20260331142846.3187706-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nicolas Schier <nsc@kernel.org> Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	highmem-internal.h: fix typo in the comment for kunmap_atomic()	Zhouyi Zhou
	Replace `PREEMP_RT` with `PREEMPT_RT` in the header comment to match the correct kernel configuration name. Link: https://lore.kernel.org/20260505021125.1941691-1-zhouzhouyi@gmail.com Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon/core: remove damon_set_region_biggest_system_ram_default()	SeongJae Park
	Now nobody is using damon_set_region_biggest_system_ram_default(). Remove it. Link: https://lore.kernel.org/20260429041232.90257-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon: introduce damon_set_region_system_rams_default()	SeongJae Park
	Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by default". DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of the system as the default monitoring target address range. The main intention behind the design is to minimize the overhead coming from monitoring of non-System RAM areas. This could result in an odd setup when there are multiple discrete System RAMs of considerable sizes. For example, there are System RAMs each having 500 GiB size. In this case, only the first 500 GiB will be set as the monitoring region by default. This is particularly common on NUMA systems. Hence the modules allow users to set the monitoring target address range using the module parameters if the default setup doesn't work for them. In other words, the current design trades ease of setup for lower overhead. However, because DAMON utilizes the sampling based access check and the adaptive regions adjustment mechanisms, the overhead from the monitoring of non-System RAM areas should be negligible in most setups. Meanwhile, the setup complexity is causing real headaches for users who need to run those modules on various types of systems. That is, the current tradeoff is not a good deal. Set the physical address range that can cover all System RAM areas of the system as the default monitoring regions for DAMON_RECLAIM and DAMON_LRU_SORT. Technically speaking, this is changing documented behavior. However, it makes no sense to believe there is a real use case that really depends on the old weird default behavior. If the old default behavior was working for them in the reasonable way, this change will only add a negligible amount of monitoring overhead. If it didn't work, the users may already be using manual monitoring regions setup, and they will not be affected by this change. Patches Sequence ================ Patch 1 introduces a new core function that will be used for the new default monitoring target region setup. Patch 2 and 3 update DAMON_RECLAIM and DAMON_LRU_SORT to use the new function instead of the old one, respectively. Patch 4 removes the old core function that was replaced by the new one, as there is no more user of it. Patch 5 updates DAMON_STAT to use the new one instead of its in-house nearly-duplicate self implementation of the functionality. Finally patches 6 and 7 update the DAMON_RECLAIM and DAMON_LRU_SORT user documentation for the new behaviors, respectively. This patch (of 7): damon_set_region_biggest_system_ram_default() sets the monitoring target region as the caller requested. If the caller didn't specify the region, it finds the biggest System RAM of the system and sets it as the target region. When there are more than one considerable size of System RAM resources in the system, the default target setup makes no sense. Introduce a variant, namely damon_set_region_system_rams_default(). It sets a physical address range that covers all System RAM resources as the default target region. Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	kasan: skip HW tagging for all kernel thread stacks	Muhammad Usama Anjum
	HW-tag KASAN never checks kernel stacks because stack pointers carry the match-all tag, so setting/poisoning tags is pure overhead. - Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that uses it skips tagging (fork path plus arch users) - Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap stacks. - When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags are enabled. Software KASAN is unchanged; this only affects tag-based KASAN. Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	vmalloc: add __GFP_SKIP_KASAN support	Muhammad Usama Anjum
	Patch series "kasan: hw_tags: Disable tagging for stack and page-tables", v4. Stacks and page tables are always accessed with the match-all tag, so assigning a new random tag every time at allocation and setting invalid tag at deallocation time, just adds overhead without improving the detection. With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL (match-all tag) is stored in the page flags while keeping the poison tag in the hardware. The benefit of it is that 256 tag setting instruction per 4 kB page aren't needed at allocation and deallocation time. Thus match-all pointers still work, while non-match tags (other than poison tag) still fault. __GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is unchanged. Benchmark: The benchmark has two modes. In thread mode, the child process forks and creates N threads. In pgtable mode, the parent maps and faults a specified memory size and then forks repeatedly with children exiting immediately. Thread benchmark: 2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster) The pgtable samples: - 2048 MB, 2000 iters 19.08 s → 17.62 s (~7.6% faster) This patch (of 3): For allocations that will be accessed only with match-all pointers (e.g., kernel stacks), setting tags is wasted work. If the caller already set __GFP_SKIP_KASAN, skip tag setting of vmalloc pages. Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs. So it wasn't being checked. Now its being checked and acted upon. Other KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in the page allocator, and in vmalloc too we ignore this flag for them. This is a preparatory patch for optimizing kernel stack allocations. Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon/core: introduce damon_ctx->paused	SeongJae Park
	Patch series "mm/damon: let DAMON be paused and resumed", v2. DAMON utilizes a few mechanisms that enhance itself over time. Adaptive regions adjustment, goal-based DAMOS quota auto-tuning and monitoring intervals auto-tuning like self-training mechanisms are such examples. It also adds access frequency stability information (age) to the monitoring results, which makes it enhanced over time. Sometimes users have to stop DAMON. In this case, DAMON internal state that enhanced over the time of the last execution simply goes away. Restarted DAMON have to train itself and enhance its output from the scratch. This makes DAMON less useful in such cases. Introducing three such use cases below. Investigation of DAMON. It is best to do the investigation online, especially when it is a production environment. DAMON therefore provides features for such online investigations, including DAMOS stats, monitoring result snapshot exposure, and multiple tracepoints. When those are insufficient, and there are additional clues that could be interfered by DAMON, users have to temporarily stop DAMON to collect the additional clues. It is not very useful since many of DAMON internal clues are gone when DAMON is stopped. The loss of the monitoring results that improved over time is also problematic, especially in production environments. Monitoring of workloads that have different user-known phases. For example, in Android, applications are known to have very different access patterns and behaviors when they are running on the foreground and the background. It can therefore be useful to separate monitoring of apps based on whether they are running on the foreground and on the background. Having two DAMON threads per application that paused and resumed for the apps foreground/background switches can be useful for the purpose. But such pause/resume of the execution is not supported. Tests of DAMON. A few DAMON selftests are using drgn to dump the internal DAMON status. The tests show if the dumped status is the same as what the test code expected. Because DAMON keeps running and modifying its internal status, there are chances of data races that can cause false test results. Stopping DAMON can avoid the race. But, since the internal state of DAMON is dropped, the test coverage will be limited. Let DAMON execution be paused and resumed without loss of the internal state, to overhaul the limitations. For this, introduce a new DAMON context parameter, namely 'pause'. API callers can update it while the context is running, using the online parameters update functions (damon_commit_ctx() and damon_call()). Once it is set, kdamond_fn() main loop will do only limited works excluding the monitoring and DAMOS works, while sleeping sampling intervals per the work. The limited works include handling of the online parameters update. Hence users can unset the 'pause' parameter again. Once it is unset, kdamond_fn() main loop will do all the work again (resumed). Under the paused state, it also does stop condition checks and handling of it, so that paused DAMON can also be stopped if needed. Expose the feature to the user space via DAMON sysfs interface. Also, update existing drgn-based tests to test and use the feature. Tests ===== I confirmed the feature functionality using real time tracing ('perf trace' or 'trace-cmd stream') of damon:damon_aggregated DAMON tracepoint. By pausing and resuming the DAMON execution, I was able to see the trace stops and continued as expected. Note that the pause feature support is added to DAMON user-space tool (damo) after v3.1.9. Users can use '--pause_ctx' command line option of damo for that, and I actually used it for my test. The extended drgn-based selftests are also testing a part of the functionality. Patches Sequence ================ Patch 1 introduces the new core API for the pause feature. Patch 2 extend DAMON sysfs interface for the new parameter. Patches 3-5 update design, usage and ABI documents for the new sysfs file, respectively. The following five patches are for tests. Patch 6 implements a new kunit test for the pause parameter online commitment. Patches 7 and 8 extend DAMON selftest helpers to support the new feature. Patch 9 extends selftest to test the commitment of the feature. Finally, patch 10 updates existing selftest to be safe from the race condition using the pause/resume feature. This patch (of 10): DAMON supports only start and stop of the execution. When it is stopped, its internal data that it self-trained goes away. It will be useful if the execution can be paused and resumed with the previous self-trained data. Introduce per-context API parameter, 'paused', for the purpose. The parameter can be set and unset while DAMON is running and paused, using the online parameters commit helper functions (damon_commit_ctx() and damon_call()). Once 'paused' is set, the kdamond_fn() main loop does only limited works with sampling interval sleep during the works. The limited works include the handling of the online parameters update, so that users can unset the 'pause' and resume the execution when they want. It also keep checking DAMON stop conditions and handling of it, so that DAMON can be stopped while paused if needed. Link: https://lore.kernel.org/20260427151231.113429-1-sj@kernel.org Link: https://lore.kernel.org/20260427151231.113429-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm: limit filemap_fault readahead to VMA boundaries	Frederick Mayle
	When a file mapping covers a strict subset of a file, an access to the mapping can trigger readahead of file pages outside the mapped region. Readahead is meant to prefetch pages likely to be accessed soon, but these pages aren't accessible via the same means, so it fair to say we don't have a good indicator they'll be accessed soon. Take an ELF file for example: an access to the end of a program's read-only segment isn't a sign that nearby file contents will be accessed next (they are likely to be mapped discontiguously, or not at all). The pressure from loading these pages into the cache can evict more useful pages. To improve the behavior, make three changes: * Introduce a new readahead_control field, max_index, as a hard limit on the readahead. The existing file_ra_state->size can't be used as a limit, it is more of a hint and can be increased by various heuristics. * Set readahead_control->max_index to the end of the VMA in all of the readahead paths that can be triggered from a fault on a file mapping (both "sync" and "async" readahead). * Limit the read-around range start to the VMA's start. Note that these changes only affect readahead triggered in the context of a fault, they do not affect readahead triggered by read syscalls. If a user mixes the two types of accesses, the behavior is expected to be the following: if a fault causes readahead and places a PG_readahead marker and then a read(2) syscall hits the PG_readahead marker, the resulting async readahead will not be limited to the VMA end. Conversely, if a read(2) syscall places a PG_readahead marker and then a fault hits the marker, the async readahead will be limited to the VMA end. There is an edge case that the above motivation glosses over: A single file mapping might be backed by multiple VMAs. For example, a whole file could be mapped RW, then part of the mapping made RO using mprotect. This patch would hurt performance of a sequential faulted read of such a mapping, the degree depending on how fragmented the VMAs are. A usage pattern like that is likely rare and already suffering from sub-optimal performance because, e.g., the fragmented VMAs limit the fault-around, so each VMA boundary in a sequential faulted read would cause a minor fault. Still, this patch would make it worse. See a previous discussion of this topic at [1]. Tested by mapping and reading a small subset of a large file, then using the cachestat syscall to verify the number of cached pages didn't exceed the mapping size. In practical scenarios, the effect depends on the specific file and usage. Sometimes there is no effect at all, but, for some ELF files in Android, we see ~20% fewer pages pulled into the cache. A comprehensive performance evaluation hasn't been done, but, in addition to the anecdontal memory savings mentioned above, a benchmark was run with fio 3.38, showing neutral looking results: /data/local/tmp/fio --version fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \ --offset=1G --size=1G --filesize=3G --numjobs=1 \ --filename=testfile.bin Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472) After: 4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850) +1.7% Same, with --ioengine=mmap --rw=randread Before: 445.6 MiB/s (avg of 446, 447, 442, 452, 441) After: 447.0 MiB/s (avg of 447, 446, 446, 451, 445) +0.3% Same, with --ioengine=psync --rw=read Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057) After: 3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094) -0.06% Same, with --ioengine=psync --rw=randread Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221) After: 2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251) +0.2% Link: https://lore.kernel.org/20260427030148.653228-1-fmayle@google.com Link: https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/ [1] Signed-off-by: Frederick Mayle <fmayle@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Kalesh Singh <kaleshsingh@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm: remove page_mapped()	David Hildenbrand (Arm)
	Let's replace the last user of page_mapped() by folio_mapped() so we can get rid of page_mapped(). Replace the remaining occurrences of page_mapped() in rmap documentation by folio_mapped(). Link: https://lore.kernel.org/20260427-page_mapped-v1-3-e89c3592c74c@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Harry Yoo <harry@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rich Felker <dalias@libc.org> Cc: Rik van Riel <riel@surriel.com> Cc: Song Liu <song@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon: support MADV_COLLAPSE via DAMOS_COLLAPSE scheme action	Asier Gutierrez
	This patch set introces a new action: DAMOS_COLLAPSE. For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged should be working, since it relies on hugepage_madvise to add a new slot. This slot should be picked up by khugepaged and eventually collapse (or not, if we are using DAMOS_NOHUGEPAGE) the pages. If THP is not enabled, khugepaged will not be working, and therefore no collapse will happen. DAMOS_COLLAPSE eventually calls madvise_collapse, which will collapse the address range synchronously. In cases where there is a large VMA (databases, for example), DAMOS_COLLAPSE allows us to collapse only the hot region, and not the entire VMA. This new action may be required to support autotuning with hugepage as a goal[1]. ========= Benchmarks: ========= MySQL ===== Tests were performed in an ARM physical server with MariaDB 10.5 and sysbench. Read only benchmark was perform with gaussian row hitting, which follows a normal distribution. T n, D h: THP set to never, DAMON action set to hugepage T m, D h: THP set to madvise, DAMON action set to hugepage T n, D c: THP set to never, DAMON action set to collapse Memory consumption. Lower is better. +------------------+----------+----------+----------+ \| \| T n, D h \| T m, D h \| T n, D c \| +------------------+----------+----------+----------+ \| Total memory use \| 2.13 \| 2.20 \| 2.20 \| \| Huge pages \| 0 \| 1.3 \| 1.27 \| +------------------+----------+----------+----------+ Performance in TPS (Transactions Per Second). Higher is better. T n, D h: 18225.58 T m, D h 18252.93 T n, D c: 18270.21 Performance counter I got the number of L1 D/I TLB accesses and the number a D/I TLB accesses that triggered a page walk. I divided the second by the first to get the percentage of page walkes per TLB access. The lower the better. +---------------+--------------+--------------+--------------+ \| \| T n, D h \| T m, D h \| T n, D c \| +---------------+--------------+--------------+--------------+ \| L1 DTLB \| 127248242753 \| 125431020479 \| 125327001821 \| \| L1 ITLB \| 80332558619 \| 79346759071 \| 79298139590 \| \| DTLB walk \| 75011087 \| 52800418 \| 55895794 \| \| ITLB walk \| 71577076 \| 71505137 \| 67262140 \| \| DTLB % misses \| 0.058948623 \| 0.042095183 \| 0.044599961 \| \| ITLB % misses \| 0.089100954 \| 0.090117275 \| 0.084821839 \| +---------------+--------------+--------------+--------------+ Masim ===== I used masim with the "demo" configuration, but changing the times to 100 seconds for the initial phase and 50 seconds for the rest of the phases. Memory consumption: +------------------+----------+----------+----------+ \| \| T n, D h \| T m, D h \| T n, D c \| +------------------+----------+----------+----------+ \| Total memory use \| 2.38 GB \| 2.36 GB \| 2.37 GB \| \| Huge pages \| 0 \| 190 MB \| 188 MB \| +------------------+----------+----------+----------+ Performance: THP never, DAMOS_HUGEPAGE initial phase: 40,491 accesses/msec, 100001 msecs run low phase 0: 39,658 accesses/msec, 50002 msecs run high phase 0: 41,678 accesses/msec, 50000 msecs run low phase 1: 39,625 accesses/msec, 50003 msecs run high phase 1: 41,658 accesses/msec, 50002 msecs run low phase 2: 39,642 accesses/msec, 50002 msecs run high phase 2: 41,640 accesses/msec, 50001 msecs run THP madvise, DAMOS_HUGEPAGE initial phase: 51,977 accesses/msec, 100000 msecs run low phase 0: 86,953 accesses/msec, 50000 msecs run high phase 0: 94,812 accesses/msec, 50000 msecs run low phase 1: 101,017 accesses/msec, 50000 msecs run high phase 1: 94,841 accesses/msec, 50000 msecs run low phase 2: 100,993 accesses/msec, 50000 msecs run high phase 2: 94,791 accesses/msec, 50001 msecs run THP never, DAMOS_COLLAPSE initial phase: 93,678 accesses/msec, 100001 msecs run low phase 0: 101,475 accesses/msec, 50000 msecs run high phase 0: 98,589 accesses/msec, 50000 msecs run low phase 1: 101,531 accesses/msec, 50001 msecs run high phase 1: 98,506 accesses/msec, 50001 msecs run low phase 2: 101,458 accesses/msec, 50001 msecs run high phase 2: 98,555 accesses/msec, 50000 msecs run Memory consumption dynamic (how quickly collapses occur): It shows in seconds how many huge pages are allocated. +----+----------+----------+ \| \| T m, D h \| T n, D c \| +----+----------+----------+ \| 5 \| 32 \| 188 \| \| 10 \| 48 \| 188 \| \| 15 \| 64 \| 188 \| \| 20 \| 96 \| 188 \| \| 30 \| 112 \| 188 \| \| 35 \| 144 \| 188 \| \| 40 \| 160 \| 188 \| \| 45 \| 190 \| 188 \| \| 50 \| 190 \| 188 \| \| 55 \| 190 \| 188 \| \| 60 \| 190 \| 188 \| +----+----------+----------+ ========= - We can see that DAMOS "hugepage" action works only when THP is set to madvise. "collapse" action works even when THP is set to never. - Performance for "collapse" action is slightly lower than "hugepage" action and THP madvise. This is due to the fact that collapases occur synchronously. With "hugepage" they may occur during page faults. - Memory consumption is slighly lower for "collapse" than "hugepage" with THP madvise. This is due to the khugepage collapses all VMAs, while "collapse" action only collapses the VMAs in the hot region. - There is an improvement in TLB utilization when collapse through "hugepage" or "collapse" actions are triggered. The amount of TLB misses is lower. - "collapse" action is performance synchronously, which means that page collapses happen earlier and more rapidly. This can be useful or not, depending on the scenario. - "hugepage" action may trigger a VMA split in some scenarios, since it needs to change the flag of the VMA to THP enabled. This may lead to additional overhead. Collapse action just adds a new option to chose the correct system balance. Link: https://lore.kernel.org/20260426231619.107231-5-sj@kernel.org Link: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/ [1] Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Cheng-Han Wu <hank20010209@gmail.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Liew Rui Yan <aethernet65535@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/sparse-vmemmap: pass @pgmap argument to memory deactivation paths	Muchun Song
	Currently, the memory hot-remove call chain -- arch_remove_memory(), __remove_pages(), sparse_remove_section() and section_deactivate() -- does not carry the struct dev_pagemap pointer. This prevents the lower levels from knowing whether the section was originally populated with vmemmap optimizations (e.g., DAX with vmemmap optimization enabled). Without this information, we cannot call vmemmap_can_optimize() to determine if the vmemmap pages were optimized. As a result, the vmemmap page accounting during teardown will mistakenly assume a non-optimized allocation, leading to incorrect memmap statistics. To lay the groundwork for fixing the vmemmap page accounting, we need to pass the @pgmap pointer down to the deactivation location. Plumb the @pgmap argument through the APIs of arch_remove_memory(), __remove_pages() and sparse_remove_section(), mirroring the corresponding *_activate() paths. Link: https://lore.kernel.org/20260428081855.1249045-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/vmpressure: skip socket pressure for costly order reclaim	JP Kobryn (Meta)
	When reclaim is triggered by high order allocations on a fragmented system, vmpressure() can report poor reclaim efficiency even though the system has plenty of free memory. This is because many pages are scanned, but few are found to actually reclaim - the pages are actively in use and don't need to be freed. The resulting scan:reclaim ratio causes vmpressure() to assert socket pressure, throttling TCP throughput unnecessarily. Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on compaction to succeed, so poor reclaim efficiency at these orders does not necessarily indicate memory pressure. The kernel already treats this order as the boundary where reclaim is no longer expected to succeed and compaction may take over. Make vmpressure() order-aware through an additional parameter sourced from scan_control at existing call sites. Socket pressure is now only asserted when order <= PAGE_ALLOC_COSTLY_ORDER. Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always uses order 0, which passes the filter unconditionally. Similarly, vmpressure_prio() now passes order 0 internally when calling vmpressure(), ensuring critical pressure from low reclaim priority is not suppressed by the order filter. The patch was motivated by a case of impacted net throughput in production. On one affected host, the memory state at the time showed ~15GB available, zero cgroup pressure, and the following buddyinfo state: Order FreePages 0: 133,970 1: 29,230 2: 17,351 3: 18,984 7+: 0 Using bpf, it was found that 94% of vmpressure calls on this host were from order-7 kswapd reclaim. TCP minimum recv window is rcv_ssthresh:19712. Before patch: 723 out of 3,843 (19%) TCP connections stuck at minimum recv window After live-patching and ~30min elapsed: 0 out of 3,470 TCP connections stuck at minimum recv window Link: https://lore.kernel.org/20260406195014.112521-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <qi.zheng@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/sparse: remove sparse buffer pre-allocation mechanism	Muchun Song
	Commit 9bdac9142407 ("sparsemem: Put mem map for one node together.") introduced a mechanism to pre-allocate a large memory block to hold all memmaps for a NUMA node upfront. However, the original commit message did not clearly state the actual benefits or the necessity of explicitly pre-allocating a single chunk for all memmap areas of a given node. One of the concerns about removing this pre-allocation is that the subsequent per-section memmap allocations could become scattered around, and might turn too many memory blocks/sections into an "un-offlinable" state. However, tests show that even without the explicit node-wide pre-allocation, memblock still allocates memory closely and back-to-back. When tracing vmemmap_set_pmd allocations, the physical chunks allocated by memblock are strictly adjacent to each other in a single contiguous physical range (mapped top-down). Because they are packed tightly together naturally, they will at most consume or pollute the exact same number of memory blocks as the explicit pre-allocation did. Another concern is the boot performance impact of calling memmap_alloc() multiple times compared to one large node-wide allocation. Tests on a 256GB VM showed that memmap allocation time increased from 199,555 ns to 741,292 ns. Even though it is 3.7x slower, on a 1TB machine, the entire memory allocation time would only take a few milliseconds. This boot performance difference is completely negligible. Since no negative impact on memory offlining behavior or noticeable boot performance regression was found, this patch proposes removing the explicit node-wide memmap pre-allocation mechanism to reduce the maintenance burden. Link: https://lore.kernel.org/20260410092419.2446420-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon/core: introduce failed region quota charge ratio	SeongJae Park
	DAMOS quota is charged to all DAMOS action application attempted memory, regardless of how much of the memory the action was successful and failed. This makes understanding quota behavior without DAMOS stat but only with end level metrics (e.g., increased amount of free memory for DAMOS_PAGEOUT action) difficult. Also, charging action-failed memory same as action-successful memory is somewhat unfair, as successful action application will induce more overhead in most cases. Introduce DAMON core API for setting the charge ratio for such action-failed memory. It allows API callers to specify the ratio in a flexible way, by setting the numerator and the denominator. Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/damon: add node_eligible_mem_bp goal metric	Ravi Jonnalagadda
	Background and Motivation ========================= In heterogeneous memory systems, controlling memory distribution across NUMA nodes is essential for performance optimization. This patch enables system-wide page distribution with target-state goals such as "maintain 60% of scheme-eligible memory on DRAM" using PA-mode DAMON schemes. Rather than using absolute thresholds, this metric tracks the ratio of memory that matches each scheme's access pattern filters on a target node, enabling the quota system to automatically adjust migration aggressiveness to maintain the desired distribution. What This Metric Measures ========================= node_eligible_mem_bp: scheme_eligible_bytes_on_node / total_scheme_eligible_bytes * 10000 Two-Scheme Setup for Hot Page Distribution ========================================== For maintaining 60% of hot memory on DRAM (node 0) and 40% on CXL (node 1): PULL scheme: migrate_hot to node 0 goal: node_eligible_mem_bp, nid=0, target=6000 addr filter: node 1 address range (only migrate FROM CXL) "Move hot pages to DRAM if less than 60% of hot data is in DRAM" PUSH scheme: migrate_hot to node 1 goal: node_eligible_mem_bp, nid=1, target=4000 addr filter: node 0 address range (only migrate FROM DRAM) "Move hot pages to CXL if less than 40% of hot data is in CXL" Each scheme independently measures its own eligible memory and adjusts its quota to achieve its target ratio. The schemes work in concert through DAMON's unified monitoring context, with the quota autotuner balancing their relative aggressiveness. Implementation Details ====================== The implementation adds a new quota goal metric type DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP to the existing DAMOS quota goal framework. When this metric is configured for a scheme: 1. During each quota adjustment cycle, damos_get_node_eligible_mem_bp() is called to calculate the current memory distribution. 2. The function iterates through all regions that match the scheme's access pattern (via __damos_valid_target()) and calculates: - Total eligible bytes across all nodes - Eligible bytes specifically on the target node (goal->nid) 3. For each eligible region, damos_calc_eligible_bytes() walks through the physical address range, using damon_get_folio() to look up each folio and determine its NUMA node via folio_nid(). 4. Large folios are handled by calculating the exact overlap between the region boundaries and folio boundaries, ensuring accurate byte counts even when regions partially span folios. 5. The ratio (node_eligible / total_eligible * 10000) is returned as basis points, which the quota autotuner uses to adjust the scheme's effective quota size (esz). The implementation requires CONFIG_DAMON_PADDR since damon_get_folio() is only available for physical address space monitoring. Testing Results =============== Functionally tested on a two-node heterogeneous memory system with DRAM (node 0) and CXL memory (node 1). A PUSH+PULL scheme configuration using migrate_hot actions was used to reach a target hot memory ratio between the two tiers. With the TEMPORAL tuner, the system converges quickly to the target distribution. The tuner drives esz to maximum when under goal and to zero once the goal is met, forming a simple on/off feedback loop that stabilizes at the desired ratio. With the CONSIST tuner, the scheme still converges but more slowly, as it migrates and then throttles itself based on quota feedback. The time to reach the goal varies depending on workload intensity. Note: This metric works with both TEMPORAL and CONSIST goal tuners. Link: https://lore.kernel.org/20260428030520.701-1-ravis.opensrc@gmail.com Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	vmalloc: optimize vfree with free_pages_bulk()	Ryan Roberts
	Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it must immediately split_page() to order-0 so that it remains compatible with users that want to access the underlying struct page. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") recently made it much more likely for vmalloc to allocate high order pages which are subsequently split to order-0. Unfortunately this had the side effect of causing performance regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko benchmarks). See Closes: tag. This happens because the high order pages must be gotten from the buddy but then because they are split to order-0, when they are freed they are freed to the order-0 pcp. Previously allocation was for order-0 pages so they were recycled from the pcp. It would be preferable if when vmalloc allocates an (e.g.) order-3 page that it also frees that order-3 page to the order-3 pcp, then the regression could be removed. So let's do exactly that; update stats separately first as coalescing is hard to do correctly without complexity. Use free_pages_bulk() which uses the new __free_contig_range() API to batch-free contiguous ranges of pfns. This not only removes the regression, but significantly improves performance of vfree beyond the baseline. A selection of test_vmalloc benchmarks running on arm64 server class system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") was added in v6.19-rc1 where we see regressions. Then with this change performance is much better. (>0 is faster, <0 is slower, (R)/(I) = statistically significant Regression/Improvement): +-----------------+----------------------------------------------------------+-------------------+--------------------+ \| Benchmark \| Result Class \| mm-new \| this series \| +=================+==========================================================+===================+====================+ \| micromm/vmalloc \| fix_align_alloc_test: p:1, h:0, l:500000 (usec) \| 1331843.33 \| (I) 67.17% \| \| \| fix_size_alloc_test: p:1, h:0, l:500000 (usec) \| 415907.33 \| -5.14% \| \| \| fix_size_alloc_test: p:4, h:0, l:500000 (usec) \| 755448.00 \| (I) 53.55% \| \| \| fix_size_alloc_test: p:16, h:0, l:500000 (usec) \| 1591331.33 \| (I) 57.26% \| \| \| fix_size_alloc_test: p:16, h:1, l:500000 (usec) \| 1594345.67 \| (I) 68.46% \| \| \| fix_size_alloc_test: p:64, h:0, l:100000 (usec) \| 1071826.00 \| (I) 79.27% \| \| \| fix_size_alloc_test: p:64, h:1, l:100000 (usec) \| 1018385.00 \| (I) 84.17% \| \| \| fix_size_alloc_test: p:256, h:0, l:100000 (usec) \| 3970899.67 \| (I) 77.01% \| \| \| fix_size_alloc_test: p:256, h:1, l:100000 (usec) \| 3821788.67 \| (I) 89.44% \| \| \| fix_size_alloc_test: p:512, h:0, l:100000 (usec) \| 7795968.00 \| (I) 82.67% \| \| \| fix_size_alloc_test: p:512, h:1, l:100000 (usec) \| 6530169.67 \| (I) 118.09% \| \| \| full_fit_alloc_test: p:1, h:0, l:500000 (usec) \| 626808.33 \| -0.98% \| \| \| kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) \| 532145.67 \| -1.68% \| \| \| kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) \| 537032.67 \| -0.96% \| \| \| long_busy_list_alloc_test: p:1, h:0, l:500000 (usec) \| 8805069.00 \| (I) 74.58% \| \| \| pcpu_alloc_test: p:1, h:0, l:500000 (usec) \| 500824.67 \| 4.35% \| \| \| random_size_align_alloc_test: p:1, h:0, l:500000 (usec) \| 1637554.67 \| (I) 76.99% \| \| \| random_size_alloc_test: p:1, h:0, l:500000 (usec) \| 4556288.67 \| (I) 72.23% \| \| \| vm_map_ram_test: p:1, h:0, l:500000 (usec) \| 107371.00 \| -0.70% \| +-----------------+----------------------------------------------------------+-------------------+--------------------+ Link: https://lore.kernel.org/20260401101634.2868165-3-usama.anjum@arm.com Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Sterba <dsterba@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/page_alloc: optimize free_contig_range()	Ryan Roberts
	Patch series "mm: Free contiguous order-0 pages efficiently", v6. A recent change to vmalloc caused some performance benchmark regressions (see [1]). I'm attempting to fix that (and at the same time significantly improve beyond the baseline) by freeing a contiguous set of order-0 pages as a batch. At the same time I observed that free_contig_range() was essentially doing the same thing as vfree() so I've fixed it there too. While at it, optimize the __free_contig_frozen_range() as well. Check that the contiguous range falls in the same section. If they aren't enabled, the if conditions get optimized out by the compiler as memdesc_section() returns 0. See num_pages_contiguous() for more details about it. This patch (of 3): Decompose the range of order-0 pages to be freed into the set of largest possible power-of-2 size and aligned chunks and free them to the pcp or buddy. This improves on the previous approach which freed each order-0 page individually in a loop. Testing shows performance to be improved by more than 10x in some cases. Since each page is order-0, we must decrement each page's reference count individually and only consider the page for freeing as part of a high order chunk if the reference count goes to zero. Additionally free_pages_prepare() must be called for each individual order-0 page too, so that the struct page state and global accounting state can be appropriately managed. But once this is done, the resulting high order chunks can be freed as a unit to the pcp or buddy. This significantly speeds up the free operation but also has the side benefit that high order blocks are added to the pcp instead of each page ending up on the pcp order-0 list; memory remains more readily available in high orders. vmalloc will shortly become a user of this new optimized free_contig_range() since it aggressively allocates high order non-compound pages, but then calls split_page() to end up with contiguous order-0 pages. These can now be freed much more efficiently. The execution time of the following function was measured in a server class arm64 machine: static int page_alloc_high_order_test(void) { unsigned int order = HPAGE_PMD_ORDER; struct page *page; int i; for (i = 0; i < 100000; i++) { page = alloc_pages(GFP_KERNEL, order); if (!page) return -1; split_page(page, order); free_contig_range(page_to_pfn(page), 1UL << order); } return 0; } Execution time before: 4097358 usec Execution time after: 729831 usec Perf trace before: 99.63% 0.00% kthreadd [kernel.kallsyms] [.] kthread \| ---kthread 0xffffb33c12a26af8 \| \|--98.13%--0xffffb33c12a26060 \| \| \| \|--97.37%--free_contig_range \| \| \| \| \| \|--94.93%--___free_pages \| \| \| \| \| \| \| \|--55.42%--__free_frozen_pages \| \| \| \| \| \| \| \| \| --43.20%--free_frozen_page_commit \| \| \| \| \| \| \| \| \| --35.37%--_raw_spin_unlock_irqrestore \| \| \| \| \| \| \| \|--11.53%--_raw_spin_trylock \| \| \| \| \| \| \| \|--8.19%--__preempt_count_dec_and_test \| \| \| \| \| \| \| \|--5.64%--_raw_spin_unlock \| \| \| \| \| \| \| \|--2.37%--__get_pfnblock_flags_mask.isra.0 \| \| \| \| \| \| \| --1.07%--free_frozen_page_commit \| \| \| \| \| --1.54%--__free_frozen_pages \| \| \| --0.77%--___free_pages \| --0.98%--0xffffb33c12a26078 alloc_pages_noprof Perf trace after: 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range \| \|--5.52%--__free_contig_range \| \| \| \|--5.00%--free_prepared_contig_range \| \| \| \| \| \|--1.43%--__free_frozen_pages \| \| \| \| \| \| \| --0.51%--free_frozen_page_commit \| \| \| \| \| \|--1.08%--_raw_spin_trylock \| \| \| \| \| --0.89%--_raw_spin_unlock \| \| \| --0.52%--free_pages_prepare \| --2.90%--ret_from_fork kthread 0xffffae1c12abeaf8 0xffffae1c12abe7a0 \| --2.69%--vfree __free_contig_range Link: https://lore.kernel.org/20260401101634.2868165-1-usama.anjum@arm.com Link: https://lore.kernel.org/20260401101634.2868165-2-usama.anjum@arm.com Link: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com [1] Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com> Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Sterba <dsterba@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm: convert vmemmap_p?d_populate() to static functions	Chengkaitao
	Since the vmemmap_p?d_populate functions are unused outside the mm subsystem, we can remove their external declarations and convert them to static functions. Link: https://lore.kernel.org/20260423101441.7089-1-kaitao.cheng@linux.dev Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Acked-by: David Hildenbrand (arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison	Wupeng Ma
	Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock when racing with a concurrent unmap: thread#0 thread#1 -------- -------- madvise(folio, MADV_HWPOISON) -> poisons the folio successfully madvise(folio, MADV_HWPOISON) unmap(folio) try_memory_failure_hugetlb get_huge_page_for_hwpoison spin_lock_irq(&hugetlb_lock) <- held __get_huge_page_for_hwpoison hugetlb_update_hwpoison() -> MF_HUGETLB_FOLIO_PRE_POISONED goto out: folio_put() refcount: 1 -> 0 free_huge_folio() spin_lock_irqsave(&hugetlb_lock) -> AA DEADLOCK! The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop the GUP reference while the hugetlb_lock is still held by the hugetlb.c wrapper get_huge_page_for_hwpoison(). If concurrent unmap has released the page table mapping reference, folio_put() drops the folio refcount to zero, triggering free_huge_folio() which attempts to re-acquire the non-recursive hugetlb_lock. Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper into get_huge_page_for_hwpoison(). Place spin_unlock_irq() before the folio_put() at the out: label so the folio is always released outside the lock. [akpm@linux-foundation.org: fix race, rename label per Miaohe] Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28	ring-buffer: Add persistent ring buffer invalid-page inject test	Masami Hiramatsu (Google)
	Add a self-corrupting test for the persistent ring buffer. This will inject an erroneous value to some sub-buffer pages (where the index is even or multiples of 5) in the persistent ring buffer when the kernel panics, and checks whether the number of detected invalid pages and the total entry_bytes are the same as the recorded values after reboot. This ensures that the kernel can correctly recover a partially corrupted persistent ring buffer after a reboot or panic. The test only runs on the persistent ring buffer whose name is "ptracingtest". The user has to fill it with events before a kernel panic. To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT and add the following kernel cmdline: reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace panic=1 Run the following commands after the 1st boot: cd /sys/kernel/tracing/instances/ptracingtest echo 1 > tracing_on echo 1 > events/enable sleep 3 echo c > /proc/sysrq-trigger After panic message, the kernel will reboot and run the verification on the persistent ring buffer, e.g. Ring buffer meta [2] invalid buffer page detected Ring buffer meta [2] is from previous boot! (318 pages discarded) Ring buffer testing [2] invalid pages: PASSED (318/318) Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476) Link: https://patch.msgid.link/20260522171051.260140328@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2026-05-28	spmi: use kzalloc_flex in main allocation	Rosen Penev
	Add a flexible array member to avoid indexing past the struct. Signed-off-by: Rosen Penev <rosenp@gmail.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2026-05-28	spmi: clean up kernel-doc in spmi.h	Randy Dunlap
	Fix all kernel-doc warnings in spmi.h: Warning: include/linux/spmi.h:114 function parameter 'ctrl' not described in 'spmi_controller_put' Warning: include/linux/spmi.h:144 struct member 'shutdown' not described in 'spmi_driver' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
2026-05-28	crypto: api - Remove per-tfm refcount	Eric Biggers
	This reverts commit ae131f4970f0 ("crypto: api - Add crypto_tfm_get"). The refcount in struct crypto_tfm was added solely to support crypto_clone_tfm(). Before then it was a simple non-refcounted object. Since crypto_clone_tfm() has been removed, remove the refcount as well. Note that this eliminates an expensive atomic operation from every tfm freeing operation. So this revert doesn't just remove unused code, but it also fixes a performance regression. Signed-off-by: Eric Biggers <ebiggers@kernel.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://patch.msgid.link/20260522053028.91165-5-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-29	Merge tag 'dd-lifetimes-7.2-rc1' of ↵	Danilo Krummrich
	git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core into drm-rust-next Higher-Ranked Lifetime Types for Rust device drivers Replace drvdata() with registration data on the auxiliary bus. Private data is now scoped to the registration object, removing the ordering constraints and lifetime complications that came with drvdata(). Add Higher-Ranked Lifetime Types (HRT) so driver structs can borrow device resources like pci::Bar and IoMem directly, tied to the device binding scope. This removes the need for Devres indirection and ARef<Device> in most driver code. This is a stable tag for other trees to merge. Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-29	Merge patch series "rust: device: Higher-Ranked Lifetime Types for device ↵	Danilo Krummrich
	drivers" Danilo Krummrich <dakr@kernel.org> says: Currently, Rust device drivers access device resources such as PCI BAR mappings and I/O memory regions through Devres<T>. Devres::access() provides zero-overhead access by taking a &Device<Bound> reference as proof that the device is still bound. Since a &Device<Bound> is available in almost all contexts by design, Devres is mostly a type-system level proof that the resource is valid, but it can also be used from scopes without this guarantee through its try_access() accessor. This works well in general, but has a few limitations: - Every access to a device resource goes through Devres::access(), which despite zero cost, adds boilerplate to every access site. - Destructors do not receive a &Device<Bound>, so they must use try_access(), which can fail. In practice the access succeeds if teardown ordering is correct, but the type system can't express this, forcing drivers to handle a failure path that should never be taken. - Sharing a resource across components (e.g. passing a BAR to a sub-component) requires Arc<Devres<T>>. - Device references must be stored as ARef<Device> rather than plain &Device borrows. These limitations stem from the driver's bus device private data being 'static -- the driver struct cannot borrow from the device reference it receives in probe(), even though it structurally cannot outlive the device binding. This series introduces Higher-Ranked Lifetime Types (HRT) for Rust device drivers. An HRT is a type that is generic over a lifetime -- it does not have a fixed lifetime, but can be instantiated with any lifetime chosen by the caller. Bus driver traits use a Generic Associated Type (GAT) type Data<'bound> to introduce the lifetime on the private data, rather than parameterizing the Driver trait itself. This avoids a driver trait global lifetime and avoids the need for ForLt for bus device private data, making the bus implementations much simpler. ForLt is only needed for auxiliary registration data, where the lifetime is not introduced by a trait callback but must be threaded through Registration. With HRT, driver structs carry a lifetime parameter tied to the device binding scope -- the interval of a bus device being bound to a driver. Device resources like pci::Bar<'bound> and IoMem<'bound> are handed out with this lifetime, so the compiler enforces at build time that they do not escape the binding scope. Before: struct MyDriver { pdev: ARef<pci::Device>, bar: Devres<pci::Bar<BAR_SIZE>>, } let io = self.bar.access(dev)?; io.read32(OFFSET); After: struct MyDriver<'bound> { pdev: &'bound pci::Device, bar: pci::Bar<'bound, BAR_SIZE>, } self.bar.read32(OFFSET); Lifetime-parameterized device resources can be put into a Devres at any point via Bar::into_devres() / IoMem::into_devres(), providing the exact same semantics as before. This is useful for resources shared across subsystem boundaries where revocation is needed. This also synergizes with the upcoming self-referential initialization support in pin-init, which allows one field of the driver struct to borrow another during initialization without unsafe code. The same pattern is applied to auxiliary device registration data as a first example beyond bus device private data. Registration<F: ForLt> can hold lifetime-parameterized data tied to the parent driver's binding scope. Since the auxiliary bus guarantees that the parent remains bound while the auxiliary device is registered, the registration data can safely borrow the parent's device resources. More generally, binding resource lifetimes to a registration scope applies to every registration that is scoped to a driver binding -- auxiliary devices, class devices, IRQ handlers, workqueues. A follow-up series extends this to class device registrations, starting with DRM, so that class device callbacks (IOCTLs, etc.) can safely access device resources through the separate registration data bound to the registration's lifetime without Devres indirection. Thanks to Gary for coming up with the ForLt implementation; thanks to Alice for the early discussions around lifetime-parameterized private data that helped shape the direction of this work. Link: https://patch.msgid.link/20260525202921.124698-1-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org>
2026-05-28	net: remove SIOCSHWTSTAMP and SIOCGHWTSTAMP from ndo_eth_ioctl comment	Xuan Zhuo
	Since commit 4ee58e1e5680 ("net: promote SIOCSHWTSTAMP and SIOCGHWTSTAMP ioctls to dedicated handlers"), SIOCSHWTSTAMP and SIOCGHWTSTAMP are no longer dispatched through dev_eth_ioctl() / ndo_eth_ioctl(). They are now handled by their own dedicated functions dev_set_hwtstamp() and dev_get_hwtstamp() in the ioctl path. However, the comment describing ndo_eth_ioctl in netdevice.h still lists these two ioctls, which is misleading for driver developers who may incorrectly assume they need to handle hardware timestamping commands in their ndo_eth_ioctl implementation. Remove the stale references from the comment to accurately reflect that ndo_eth_ioctl only handles SIOCGMIIPHY, SIOCGMIIREG and SIOCSMIIREG. Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260527120936.24169-1-xuanzhuo@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-28	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net	Jakub Kicinski
	Cross-merge networking fixes after downstream PR (net-7.1-rc6). Conflicts: drivers/net/phy/air_en8811h.c d895767c33781 ("net: phy: air_en8811h: add AN8811HB MCU assert/deassert support") dddfadd75197e ("net: phy: Add Airoha phy library for shared code") 5226bb6634cdf ("net: phy: air_phy_lib: Factorize BuckPBus register accessors") e08f0ea6daf2e ("net: phy: Rename Airoha common BuckPBus register accessors") net/sched/sch_netem.c a2f6ed7b4873 ("net/sched: netem: add per-impairment extended statistics") 9552b11e3eda ("net/sched: fix packet loop on netem when duplicate is on") Adjacent changes: drivers/dpll/zl3073x/core.c c1224569cef0 ("dpll: zl3073x: make frequency monitor a per-device attribute") 54e65df8cf18 ("dpll: zl3073x: report FFO as DPLL vs input reference offset") net/iucv/af_iucv.c 347fdd4df85f ("af_iucv: convert to getsockopt_iter") 3589d20a666c ("net/iucv: fix locking in .getsockopt") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-28	Merge tag 'net-7.1-rc6' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "This is again significantly bigger than the same point into the previous cycle, but at least smaller than last week. I'm not aware of any pending regression for the current cycle. Including fixes from netfilter. Current release - regressions: - netfilter: walk fib6_siblings under RCU Previous releases - regressions: - netlink: fix sending unassigned nsid after assigned one - bridge: fix sleep in atomic context in netlink path - sched: fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop - ipv4: fix net->ipv4.sysctl_local_reserved_ports UaF - eth: tun: free page on short-frame rejection in tun_xdp_one() Previous releases - always broken: - skbuff: fix missing zerocopy reference in pskb_carve helpers - handshake: drain pending requests at net namespace exit - ethtool: - rss: avoid modifying the RSS context response - module: avoid leaking a netdev ref on module flash errors - coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES - netfilter: fix dst corruption in same register operation - nfc: hci: fix out-of-bounds read in HCP header parsing - ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo() - eth: - vti: use ip6_tnl.net in vti6_changelink(). - vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu()" * tag 'net-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) dpll: zl3073x: make frequency monitor a per-device attribute dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work dpll: export __dpll_device_change_ntf() for use under dpll_lock net/handshake: Drain pending requests at net namespace exit net/handshake: Verify file-reference balance in submit paths net/handshake: Close the submit-side sock_hold race net/handshake: hand off the pinned file reference to accept_doit net/handshake: Take a long-lived file reference at submit net/handshake: Pass negative errno through handshake_complete() nvme-tcp: store negative errno in queue->tls_err net/handshake: Use spin_lock_bh for hn_lock net: skbuff: fix missing zerocopy reference in pskb_carve helpers net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path net: hibmcge: disable Relaxed Ordering to fix RX packet corruption selftests/tc-testing: Add netem test case exercising loops selftests/tc-testing: Add mirred test cases exercising loops net/sched: act_mirred: Fix return code in early mirred redirect error paths net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop net/sched: fix packet loop on netem when duplicate is on ...
2026-05-28	bitops: Define generic___bitrev8/16/32 for reuse	Jinjie Ruan
	Define generic___bitrev8/16/32 using the implementation in <linux/bitrev.h>, so they can be reused in <asm/bitrev.h>, such as RISCV. Reviewed-by: Yury Norov <ynorov@nvidia.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Yury Norov <ynorov@nvidia.com>
2026-05-28	bitmap: fix find helper documentation	Yury Norov
	find_nth_and_bit() and find_nth_and_andnot_bit() may return a value greater than @size when the requested bit does not exist, matching find_nth_bit(). Document that correctly. All current users are safe against the '>=' vs '==' conditions. Also fix the for_each_clear_bitrange_from() parameter descriptions so they describe clear ranges instead of set ranges. Signed-off-by: Yury Norov <ynorov@nvidia.com>
2026-05-28	bitmap: drop bitmap_print_to_pagebuf()	Yury Norov
	Now that all users of bitmap_print_to_pagebuf() are switched to the alternatives, drop the function. Signed-off-by: Yury Norov <ynorov@nvidia.com>
2026-05-28	cpumask: switch cpumap_print_to_pagebuf() to using scnprintf()	Yury Norov
	In preparation for removing bitmap_print_to_pagebuf(), switch cpumap_print_to_pagebuf() to using scnprintf("%*pbl"). Signed-off-by: Yury Norov <ynorov@nvidia.com>
2026-05-28	block: export passthrough stats enabled	Keith Busch
	A user can enable io accounting for passthrough requests, so export the helper that checks if the request should be tracked. This will enable stacking drivers to to report iostats for passthrough workloads. Since the stacking request_queue may not be the one providing the request, the API has to add a parameter for the caller to specify which one to check. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260528010041.1533124-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28	gpu: host1x: Allow entries in BO caches to be freed	Mikko Perttunen
	When a buffer object is pinned via host1x_bo_pin() with a cache, the resulting mapping is kept in the cache so it can be reused on subsequent pins. Each mapping held a reference to the underlying host1x_bo (taken in tegra_bo_pin / gather_bo_pin), so as long as a mapping was cached, the bo itself could not be freed. However, the only way to remove the cached mapping was through the free path of the buffer object. This meant that if a bo got cached, it could never get freed again. Resolve the circularity by holding a weak reference to the bo from the cache side. This is done by having the .pin callbacks not bump the bo's refcount -- instead the common Host1x bo code does so, except for the cache reference. Also move the remove-cache-mapping-on-free code into a common function inside Host1x code. This is only called from the TegraDRM GEM buffers since those are the only ones that can be cached at the moment. Reported-by: Aaron Kling <webgeek1234@gmail.com> Fixes: 1f39b1dfa53c ("drm/tegra: Implement buffer object cache") Signed-off-by: Mikko Perttunen <mperttunen@nvidia.com> Tested-by: Aaron Kling <webgeek1234@gmail.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Link: https://patch.msgid.link/20260515-host1x-bocache-leak-v1-1-a0375f68aeab@nvidia.com
2026-05-28	block: add a bio_endio_status helper	Christoph Hellwig
	Add a helper that sets bi_status and call bio_endio() as that is a very common pattern and convert the core block code over to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Md Haris Iqbal <haris.iqbal@linux.dev> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Link: https://patch.msgid.link/20260528084632.2505277-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28	bvec: make the bvec_iter helpers inline functions	Christoph Hellwig
	The macros are impossible to follow due to the lack of visual type information and all the braces. Replace them with inline helpers to improve on that. Because the calling conventions are a bit problematic with a lot of passing structures by value, all the helpers are marked as __always_inline so that they are force inlined. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260527151043.2349900-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28	block: mark biovec_init_pool static	Christoph Hellwig
	Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://patch.msgid.link/20260527150646.2349405-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28	Disable -Wattribute-alias for clang-23 and newer	Nathan Chancellor
	Clang recently added support for -Wattribute-alias [1], which results in the same warnings that necessitated commit bee20031772a ("disable -Wattribute-alias warning for SYSCALL_DEFINEx()") for GCC. kernel/time/itimer.c:325:1: error: alias and aliasee have different types 'long (unsigned int)' and 'long (typeof (__builtin_choose_expr((__builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0LL)) \|\| __builtin_types_compatible_p(typeof ((unsigned int)0), typeof (0ULL))), 0LL, 0L)))' (aka 'long (long)') [-Werror,-Wattribute-alias] 325 \| SYSCALL_DEFINE1(alarm, unsigned int, seconds) \| ^ include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1' 225 \| #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__) \| ^ include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx' 236 \| __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) \| ^ include/linux/syscalls.h:251:18: note: expanded from macro '__SYSCALL_DEFINEx' 251 \| __attribute__((alias(__stringify(__se_sys##name)))); \ \| ^ kernel/time/itimer.c:325:1: note: aliasee is declared here include/linux/syscalls.h:225:36: note: expanded from macro 'SYSCALL_DEFINE1' 225 \| #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__) \| ^ include/linux/syscalls.h:236:2: note: expanded from macro 'SYSCALL_DEFINEx' 236 \| __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) \| ^ include/linux/syscalls.h:255:18: note: expanded from macro '__SYSCALL_DEFINEx' 255 \| asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \ \| ^ <scratch space>:16:1: note: expanded from here 16 \| __se_sys_alarm \| ^ Disable the warnings in the same way for clang-23 and newer. Disable the warning about unknown warning options to avoid breaking the build for versions of clang-23 that do not have -Wattribute-alias, such as ones deployed by vendors like Android or CI systems or when bisecting LLVM between llvmorg-23-init and release/23.x. Cc: stable@vger.kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/2163 Link: https://github.com/llvm/llvm-project/commit/40da6920a0d71d49dfa2392b09153600b0759f5e [1] Link: https://patch.msgid.link/20260515-syscall-disable-attribute-alias-for-clang-v1-1-9a9d95d41df6@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org>
2026-05-28	dpll: export __dpll_device_change_ntf() for use under dpll_lock	Ivan Vecera
	Export __dpll_device_change_ntf() so that drivers can send device change notifications from within device callbacks, which are already called under dpll_lock. Using dpll_device_change_ntf() in that context would deadlock. Add lockdep_assert_held() to catch misuse without the lock held. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20260526074525.1451008-2-ivecera@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-28	net: Introduce skb tc depth field to track packet loops	Jamal Hadi Salim
	Add a 2-bit per-skb tc depth field to track packet loops across the stack. The previous per-CPU loop counters like MIRRED_NEST_LIMIT assume a single call stack and lose state in two cases: 1) When a packet is queued and reprocessed later (e.g., egress->ingress via backlog), the per-cpu state is gone by the time it is dequeued. 2) With XPS/RPS a packet may arrive on one CPU and be processed on another. A per-skb field solves both by travelling with the packet itself. The field fits in existing padding, using 2 bits that were previously a hole: pahole before(-) and after (+) diff looks like: __u8 slow_gro:1; /* 132: 3 1 / __u8 csum_not_inet:1; / 132: 4 1 / __u8 unreadable:1; / 132: 5 1 / + __u8 tc_depth:2; / 132: 6 1 / - / XXX 2 bits hole, try to pack / / XXX 1 byte hole, try to pack / __u16 tc_index; / 134 2 */ There used to be a ttl field which was removed as part of tc_verd in commit aec745e2c520 ("net-tc: remove unused tc_verd fields"). It was already unused by that time, due to remove earlier in commit c19ae86a510c ("tc: remove unused redirect ttl"). The first user of this field is netem, which increments tc_depth on duplicated packets before re-enqueueing them at the root qdisc. On re-entry, netem skips duplication for any skb with tc_depth already set, bounding recursion to a single level regardless of tree topology. The other user is mirred which increments it on each pass and limits to depth to MIRRED_DEFER_LIMIT (3). The new field was called ttl in earlier versions of this patch but renamed to tc_depth to avoid confusion with IP ttl. Note (looking at you Sashiko! Dont ignore me and continue bringing this up): 1. Since both mirred and netem utilize the same 2-bit tc_depth field it is possible when netem and mirred are used together that netem qdisc to skip the duplication step. This is a known trade-off, as a 2-bit field cannot independently track both features' recursion depths and it is not considered sane to have a setup that addresses both features on at the same time. 2. skb_scrub_packet does not clear tc_depth. This means a packet's loop history is preserved even across namespaces. While this might be restrictive for some topologies, it is also design intent to provide robustness against loops across namespaces. Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20260525122556.973584-2-jhs@mojatatu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-28	Merge branch 'locking/context' into locking/core	Peter Zijlstra

2026-05-28	seqlock: Allow UBSAN_ALIGNMENT to fail optimizing	Heiko Carstens
	With gcc-15 and gcc-16 with UBSAN_ALIGNMENT enabled the compiler fails to inline and optimize __scoped_seqlock_bug() away on s390: s390x-16.1.0-ld: kernel/sched/build_policy.o: in function `__scoped_seqlock_next': /.../seqlock.h:1286:(.text+0x22030): undefined reference to `__scoped_seqlock_bug' Fix this by adding UBSAN_ALIGNMENT to the list of config options where a not inlined empty __scoped_seqlock_bug() is allowed. Closes: https://lore.kernel.org/r/20260515092057.810542-1-arnd@kernel.org/ Reported-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260519110315.1385307-1-hca@linux.ibm.com
2026-05-28	compiler-context-analysis: Bump required Clang version to 23	Marco Elver
	Clang 23 introduces several major improvements: 1. Support for multiple arguments in the `guarded_by` and `pt_guarded_by` attributes [1]. This allows defining variables protected by multiple context locks, where read access requires holding at least one lock (shared or exclusive), and write access requires holding all of them exclusively. 2. Function pointer support [2]. We can now add attributes to function pointers just like we do on normal functions. 3. A fix to use arrays of locks [3]. Each index is now correctly treated as a separate lock instance. 4. A fix for implicit member access in attributes [4]. This allows to use __guarded_by(&foo->lock) correctly. Overall that makes it worthwhile bumping the compiler version instead of trying to make both Clang 22 and later work while supporting these new features. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://github.com/llvm/llvm-project/pull/186838 [1] Link: https://github.com/llvm/llvm-project/pull/191187 [2] Link: https://github.com/llvm/llvm-project/pull/148551 [3] Link: https://github.com/llvm/llvm-project/pull/194457 [4] Link: https://patch.msgid.link/20260515124426.2227783-1-elver@google.com