2026-06-02mm/memcg, swap: tidy up cgroup v1 memsw swap helpersKairui Song
The cgroup v1 swap helpers always operate on swap cache folios whose swap entry is stable: the folio is locked and in the swap cache. There is no need to pass the swap entry or page count as separate parameters when they can be derived from the folio itself. Simplify the redundant parameters and add sanity checks to document the required preconditions. Also rename memcg1_swapout to __memcg1_swapout to indicate it requires special calling context: the folio must be isolated and dying, and the call must be made with interrupts disabled. No functional change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: unify large folio allocationKairui Song
Now that direct large order allocation is supported in the swap cache, both anon and shmem can use it instead of implementing their own methods. This unifies the fallback and swap cache check, which also reduces the TOCTOU race window of swap cache state: previously, high order swapin required checking swap cache states first, then allocating and falling back separately. Now all these steps happen in the same compact loop. Order fallback and statistics are also unified; callers just need to check and pass the acceptable order bitmask. There is basically no behavior change. This only makes things more unified and prepares for later commits. Cgroup and zero map checks can also be moved into the compact loop, further reducing race windows and redundancy. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: add support for stable large allocation in swap cache directlyKairui Song
To make it possible to allocate large folios directly in swap cache, provide a new infrastructure helper to handle the swap cache status check, allocation, and order fallback in the swap cache layer. The new helper replaces the existing swap_cache_alloc_folio. Based on this, all the separate swap folio allocation previously done by anon / shmem is converted to use this helper directly, unifying folio allocation for anon, shmem, and readahead. This slightly consolidates how allocation is synchronized, making it more stable and less prone to errors. The slot-count and cache-conflict check is now always performed with the cluster lock held before allocation, and repeated under the same lock right before cache insertion. Compared to the previous anon and shmem mTHP allocation implementation, this double check produces a stable result and avoids the false-negative conflict checks that the lockless path can return, so large allocations no longer have to be unwound because the range turned out to be occupied. It also aborts early for already-freed slots, which helps ordinary swapin and especially readahead, with only a marginal increase in cluster-lock contention (the lock is very lightly contended and stays local in the first place). Hence, callers of swap_cache_alloc_folio() no longer need to check the swap slot count or swap cache status themselves. And now whoever first successfully allocates a folio in the swap cache will be the one who charges it and performs the swap-in. The race window of swapping is also reduced since the loop is much more compact.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/huge_memory: move THP gfp limit helper into headerKairui Song
Shmem has some special requirements for THP GFP and has to limit it in certain zones or provide a more lenient fallback. We'll use this helper for generic swap THP allocation, which needs to support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is basically a no-op. But it's necessary for certain shmem users, mostly drivers. No feature change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: move common swap cache operations into standalone helpersKairui Song
Move a few swap cache checking, adding, and deletion operations into standalone helpers to be used later. And while at it, add proper kernel doc. No feature or behavior change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: simplify swap cache allocation helperKairui Song
Patch series "mm, swap: swap table phase IV: unify allocation", v5.

This series unifies the allocation and charging of anon and shmem swap-in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves the performance.

The static metadata overhead is now close to zero, and workload performance is slightly improved. For example, mounting a 1TB swap device saves about 512MB of memory:

Before:
    free -m
                  total   used   free  shared  buff/cache  available
    Mem:           1464    805    346       1         382        658
    Swap:       1048575      0 1048575

After:
    free -m
                  total   used   free  shared  buff/cache  available
    Mem:           1464    277    899       1         356       1187
    Swap:       1048575      0 1048575

Memory usage is ~512M lower, and we now have a close to 0 static overhead. It was about 2 bytes per slot before, now roughly 0.09375 bytes per slot (48 bytes ci info per cluster, which is 512 slots).

Performance test is also looking good, testing Redis in a 2G VM using 6G ZRAM as swap:

    valkey-server --maxmemory 2560M
    redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

    Before: 3385017.283654 RPS
    After:  3433309.307292 RPS (1.42% better)

Testing with build kernel under global pressure on a 48c96t system, limiting the total memory to 8G, using 12G ZRAM, 24 test runs, enabling THP:

    make -j96, using defconfig

    Before: user time 2904.59s system time 4773.99s
    After:  user time 2909.38s system time 4641.55s (2.77% better)

Testing with usemem on a 32c machine using 48G brd ramdisk and 16G RAM, 12 test runs:

    usemem --init-time -O -y -x -n 48 1G

    Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us
    After:  Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us

Seems similar, or slightly better.

This series also reduces memory thrashing. I no longer see any "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF" messages, which were shown several times during stress testing before this series when under great pressure:

    Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18
    After:  grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0

This patch (of 12):

Instead of trying to return the existing folio if the entry is already cached in swap_cache_alloc_folio, simply return an error pointer if the allocation failed, and drop the output argument that indicates what kind of folio is actually returned. And add a proper wrapper swap_cache_read_folio that decouples and handles the actual requirement: read in the folio, or return the already read folio in cache. This is what async swapin and readahead actually required. As for zswap swap out, the caller just needs to abort if the allocation fails because the entry is gone or already cached, so removing the output argument simplifies the return handling, making it cleaner. No feature change.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com Link: https://lore.kernel.org/20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
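The per-slot figures quoted in the cover letter above can be checked with a little arithmetic. The following is only a sketch: it assumes 4 KiB pages (the message does not state the page size), and takes the 2 bytes per slot, 48 bytes per cluster, and 512 slots per cluster from the text. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so one swap slot covers 4096 bytes. */
#define SLOT_BYTES           4096ULL
#define SLOTS_PER_CLUSTER    512ULL  /* from the cover letter */
#define CI_BYTES_PER_CLUSTER 48ULL   /* from the cover letter */

/* Number of swap slots on a device of the given size. */
static uint64_t swap_slots(uint64_t device_bytes)
{
	assert(device_bytes % SLOT_BYTES == 0);
	return device_bytes / SLOT_BYTES;
}

/* Static metadata before the series: about 2 bytes per slot. */
static uint64_t old_static_bytes(uint64_t device_bytes)
{
	return swap_slots(device_bytes) * 2;
}

/* Cluster info after the series: 48 bytes per 512-slot cluster. */
static uint64_t new_ci_bytes(uint64_t device_bytes)
{
	return swap_slots(device_bytes) / SLOTS_PER_CLUSTER * CI_BYTES_PER_CLUSTER;
}
```

Under these assumptions a 1 TiB device has 2^28 slots, so the old layout needs 2^29 bytes (512 MiB), which matches the reported saving, and 48 / 512 gives the quoted 0.09375 bytes per slot.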
2026-06-02mm: swap_cgroup: fix NULL deref in lookup_swap_cgroup_id on swapless hostJose Fernandez (Anthropic)
lookup_swap_cgroup_id() passes swap_cgroup_ctrl[type].map to __swap_cgroup_id_lookup() without checking that the type was ever registered via swap_cgroup_swapon(). On a swapless host every ctrl->map is NULL, so __swap_cgroup_id_lookup() dereferences NULL + a scaled swp_offset(). Since commit bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()"), zap_pte_range() -> swap_pte_batch() calls lookup_swap_cgroup_id() on any non-present, non-none PTE that decodes as a real swap entry, without first validating it against swap_info[]. A single PTE corrupted into a type-0 swap entry takes the host down at process exit. We hit this in production on a swapless 6.12.58 host: ~1s of "get_swap_device: Bad swap file entry 3f800204222bb" (do_swap_page() being correctly defensive about the same entry) followed by BUG: unable to handle page fault for address: 000003f800204220 RIP: 0010:lookup_swap_cgroup_id+0x2b/0x60 Call Trace: swap_pte_batch+0xbf/0x230 zap_pte_range+0x4c8/0x780 unmap_page_range+0x190/0x3e0 exit_mmap+0xd9/0x3c0 do_exit+0x20c/0x4b0 syzbot has reported the identical stack. The source of the PTE corruption is a separate bug; this change makes the teardown path as robust as the fault path already is. Every other caller of lookup_swap_cgroup_id() is downstream of a get_swap_device() that has already validated the entry, so the new branch is cold. 
Link: https://lore.kernel.org/20260504-swap-cgroup-fix-7-0-v1-1-f53ff41ee553@linux.dev Fixes: bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Reported-by: syzbot+e12bd9ca48157add237a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/69859728.050a0220.3b3015.0033.GAE@google.com Assisted-by: Claude:unspecified Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
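The shape of the fix described above can be sketched in user space as follows. This is a simplified model, not the kernel code: `ctrl_model`, `ctrl_table`, `register_model()` and `lookup_id_model()` are hypothetical stand-ins, the table size is arbitrary, and both the offset bound check and the choice of 0 as the "no owner" return value are assumptions of this sketch rather than details given in the commit message.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TYPES_MODEL 4 /* hypothetical table size */

/* Simplified stand-in for the per-type swap cgroup control structure. */
struct ctrl_model {
	unsigned short *map; /* NULL until the type is registered at swapon */
	size_t nr_slots;
};

static struct ctrl_model ctrl_table[MAX_TYPES_MODEL];

/* Models swapon registration: attach an id map to a swap type. */
static void register_model(unsigned int type, unsigned short *map, size_t n)
{
	assert(type < MAX_TYPES_MODEL);
	ctrl_table[type].map = map;
	ctrl_table[type].nr_slots = n;
}

/*
 * Return the recorded id, or 0 if the type was never registered.
 * Without the NULL check, a bogus entry on a swapless host would
 * index from a NULL map, as in the reported crash.
 */
static unsigned short lookup_id_model(unsigned int type, size_t offset)
{
	const struct ctrl_model *ctrl;

	if (type >= MAX_TYPES_MODEL)
		return 0;
	ctrl = &ctrl_table[type];
	if (!ctrl->map || offset >= ctrl->nr_slots) /* the new cold branch */
		return 0;
	return ctrl->map[offset];
}
```

Because every other caller has already validated the entry, the extra branch should only ever be taken for corrupted entries like the one in the report.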
2026-06-02mm/page_alloc: document that alloc_pages_nolock() uses RCUBrendan Jackman
The allocator interacts with cgroups which rely on RCU. RCU does not work everywhere, so the "any context" claim is slightly overstated here. This should already be enforced by objtool, since this function is not marked noinstr the x86 build should fail if you call it from a place where RCU is not watching. But, expecting readers to make that connection for themselves seems a bit cruel (I don't think there is even any documentation of what noinstr means at all, let alone the connection with RCU). Note this is not claiming that any cgroup code called from the allocator would actually break if this restriction was violated, it could very well be that there's no real way for the allocator to act on a cgroup that can disappear concurrently. But, since it's likely nobody has verified this one way or another, better to just be safe and declare that RCU is required. Allocating from an RCU-unsafe context seems a bit crazy anyway. Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: drop a misleading __always_inlineBrendan Jackman
get_pfnblock_migratetype() is called from outside page_alloc.c, so it cannot always be inlined. Remove the annotation to avoid misleading readers. At least in my minimal config, with GCC, this doesn't change mm/page_alloc.o at all. Link: https://lore.kernel.org/all/20260517-b4-drop-always-inline-v1-1-97b90930e8b8@google.com/ Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Vlastimil Babka <vbabka@kernel.org> Link: https://lore.kernel.org/all/016c8bef-57ef-44ef-bf60-86dbfd368dcd@kernel.org/ Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: remove ifdefs from pindex helpersBrendan Jackman
The ifdefs are not technically needed here, everything used here is always defined. Switching to IS_ENABLED() makes the code a bit less tiresome to read. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-4-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm: rejig pageblock mask definitionsBrendan Jackman
- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global namespace" too much. - This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well, that global mask only exists for quite a specific purpose, and is quite a weird thing to have a name for anyway. So drop it and take advantage of the newly-defined PAGEBLOCK_ISO_MASK. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/page_alloc: don't overload migratetype in find_suitable_fallback()Brendan Jackman
This function currently returns a signed integer that encodes status in-band, as negative numbers, along with a migratetype. Switch to a more explicit/verbose style that encodes the status and migratetype separately. In the spirit of making things more explicit, also create an enum to avoid using magic integer literals with special meanings. This enables documenting the values at their definition instead of in one of the callers. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-2-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
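The general pattern, returning a status enum and passing the type back through a separate output parameter instead of overloading negative integers, might look like the sketch below. All names here (`fallback_status`, `model_type`, `find_fallback_model()`) are hypothetical; the commit message does not give the real enum or its values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values, documented at their definition. */
enum fallback_status {
	FALLBACK_FOUND, /* *out_type holds a usable fallback type */
	FALLBACK_NONE,  /* no other type has free blocks */
};

/* Hypothetical stand-in for migratetypes. */
enum model_type { TYPE_A, TYPE_B, TYPE_C, NR_MODEL_TYPES };

/*
 * Look for any type other than @start that has free blocks.
 * The status is the return value; the type is reported separately,
 * so callers never have to decode a negative "type".
 */
static enum fallback_status
find_fallback_model(const bool has_free[NR_MODEL_TYPES],
		    enum model_type start, enum model_type *out_type)
{
	assert(out_type != NULL);
	for (int t = 0; t < NR_MODEL_TYPES; t++) {
		if (t == (int)start)
			continue;
		if (has_free[t]) {
			*out_type = (enum model_type)t;
			return FALLBACK_FOUND;
		}
	}
	return FALLBACK_NONE;
}
```

A caller can then switch on the status and only read the output type in the `FALLBACK_FOUND` case.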
2026-06-02mm: introduce for_each_free_list()Brendan Jackman
Patch series "mm: misc cleanups from __GFP_UNMAPPED series". In v2 of the __GFP_UNMAPPED series [0], we realised that some of the patches could potentially be merged as independent cleanups. These are all independent of one another; if you think some are useful cleanups and others are pointless churn, it should be fine to just pick whatever subset you prefer. No functional change intended. This patch (of 4): There are a couple of places that iterate over the freelists with awareness of the data structures' layout. Ideally, code outside of mm should not be aware of the page allocator's freelists at all. This patch does not hide them completely; it is just a meek incremental step in that direction: provide a macro to iterate over them without needing to be aware of the actual struct fields. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ [0] Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
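To illustrate the idea of an iteration macro that hides the struct layout, here is a small user-space sketch. The data structures and the macro signature are invented for this example (`zone_model`, `for_each_free_list_model`); the real `for_each_free_list()` may take different arguments and walks the kernel's own structures.

```c
#include <assert.h>

#define NR_ORDERS_MODEL 3 /* hypothetical */
#define NR_TYPES_MODEL  2 /* hypothetical */

struct free_list_model { unsigned long nr_blocks; };
struct area_model { struct free_list_model free_list[NR_TYPES_MODEL]; };
struct zone_model { struct area_model area[NR_ORDERS_MODEL]; };

/*
 * Visit every (order, type) free list of @zone. The caller only sees
 * @list, @order and @type, never the field names used to reach them.
 */
#define for_each_free_list_model(list, zone, order, type)                  \
	for ((order) = 0; (order) < NR_ORDERS_MODEL; (order)++)            \
		for ((type) = 0;                                           \
		     (type) < NR_TYPES_MODEL &&                            \
		     ((list) = &(zone)->area[(order)].free_list[(type)], 1); \
		     (type)++)

/* Example user: total free pages, counting each block as 2^order pages. */
static unsigned long total_free_model(struct zone_model *zone)
{
	struct free_list_model *list;
	unsigned int order, type;
	unsigned long pages = 0;

	assert(zone);
	for_each_free_list_model(list, zone, order, type)
		pages += list->nr_blocks << order;
	return pages;
}
```

If the layout of `zone_model` later changes, only the macro needs updating, which is the incremental benefit the patch describes.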
2026-06-02mm/filemap: fix page_cache_prev_miss() when no hole is foundTal Zussman
page_cache_prev_miss() is documented to return a value outside the searched range when no gap is found. However, the no-gap-found path returns xas.xa_index, which after a successful loop is the first index in the range. As such, that index is misreported as a gap. The sole caller, page_cache_sync_ra(), uses the return value to estimate the cached run preceding a sequential read. In some cases, the buggy return value can undercount the contiguous range by one, shrinking the readahead window or pushing borderline requests into the small-random-read branch. Fix this by returning the start of the range - 1 when no hole is found. Update page_cache_next_miss() for clarity as well. Both helpers were previously fixed together in commit 9425c591e06a ("page cache: fix page_cache_next/prev_miss off by one"), but the fix was reverted because it caused a hugetlb performance regression. hugetlb no longer uses these functions and next_miss was subsequently refixed in commit 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole found") and commit bbcaee20e03e ("readahead: fix return value of page_cache_next_miss() when no hole is found"), but prev_miss was not addressed. This was found by pointing Claude Opus 4.7 at mm/filemap.c. Link: https://lore.kernel.org/20260512-prev_miss_fix-v2-1-4af8e5c1ae62@columbia.edu Fixes: 0d3f92966629 ("page cache: Convert hole search to XArray") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
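The off-by-one described above can be reproduced with a plain array standing in for the page cache. This is a model only: the real function walks an XArray and has to handle index wraparound, which this sketch sidesteps by assuming `index >= max_scan`. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Scan backwards from @index over at most @max_scan entries and return
 * the first index that is not cached. If the whole range is cached,
 * return the index just before the range, which lies outside it.
 */
static size_t prev_miss_model(const bool *present, size_t index, size_t max_scan)
{
	assert(index >= max_scan); /* model assumption: no wraparound */
	for (size_t i = 0; i < max_scan; i++) {
		if (!present[index - i])
			return index - i;
	}
	return index - max_scan; /* start of range minus one */
}

/* The behaviour described as buggy: report the last index visited. */
static size_t prev_miss_buggy_model(const bool *present, size_t index, size_t max_scan)
{
	size_t cur = index;

	assert(index >= max_scan);
	for (size_t i = 0; i < max_scan; i++) {
		cur = index - i;
		if (!present[cur])
			break;
	}
	return cur; /* with no hole, this is the first index in the range */
}
```

With every page cached and a scan of 4 entries ending at index 8, the range is [5, 8]: the fixed model returns 4, while the buggy model returns 5 and so misreports a cached page as a gap.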
2026-06-02tools/mm/page-types: fix kpageflags option argument in getopt_longYe Liu
The --kpageflags option requires an argument to specify the kpageflags file path, but has_arg was set to 0 (no_argument) in the long options table. Change it to 1 (required_argument) so getopt_long correctly parses the argument. Link: https://lore.kernel.org/20260513022120.58033-4-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
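A minimal illustration of why `has_arg` matters, using the standard `getopt_long()` interface. The helper name and the short code `'k'` are chosen for this sketch and are not taken from page-types.c.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Return the path given to --kpageflags, or NULL if the option is absent. */
static const char *parse_kpageflags_model(int argc, char **argv)
{
	static const struct option opts[] = {
		/* required_argument (1): the next word is consumed as optarg */
		{ "kpageflags", required_argument, NULL, 'k' },
		{ NULL, 0, NULL, 0 },
	};
	const char *path = NULL;
	int c;

	assert(argc >= 1 && argv != NULL);
	optind = 1; /* restart scanning in case getopt ran earlier */
	while ((c = getopt_long(argc, argv, "", opts, NULL)) != -1) {
		if (c == 'k')
			path = optarg;
	}
	return path;
}
```

With `no_argument` in the table instead, `optarg` would not be set for this option and the file path would be left behind as an ordinary non-option word.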
2026-06-02tools/mm/page-types: fix ternary operator precedence in sigbus handlerYe Liu
The ternary operator (?:) has lower precedence than addition (+), so the expression `off + sigbus_addr ? sigbus_addr - ptr : 0` was parsed as `(off + sigbus_addr) ? (sigbus_addr - ptr) : 0` rather than the intended `off + (sigbus_addr ? sigbus_addr - ptr : 0)`. Add explicit parentheses to ensure the correct evaluation order. Link: https://lore.kernel.org/20260513022120.58033-3-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
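The two parses can be compared directly. The sketch below uses plain integers in place of the pointers in the original expression, and writes the buggy form with the parentheses the compiler effectively applies, so the difference is visible and no precedence warning is triggered. The function names are illustrative.

```c
/* How the unparenthesized expression was parsed: off only affects the test. */
static long offset_as_parsed(long off, long addr, long base)
{
	return (off + addr) ? (addr - base) : 0;
}

/* The intended meaning: add off to the conditional value. */
static long offset_intended(long off, long addr, long base)
{
	return off + (addr ? addr - base : 0);
}
```

For `off = 100`, `addr = 40`, `base = 10`, the parsed form yields 30 while the intended form yields 130; for `addr = 0` they yield -10 and 100 respectively.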
2026-06-02tools/mm/page-types: fix typo in madvise() error messageYe Liu
Patch series "tools/mm/page-types: Fix misc bugs". This series fixes three issues in tools/mm/page-types.c: 1. Fix two typos in madvise() error messages ("madvice" -> "madvise") 2. Fix operator precedence bug in the sigbus handler where the ternary operator binds looser than addition, producing incorrect offset calculation when sigbus_addr is non-NULL 3. Fix --kpageflags option declaration in getopt_long: has_arg should be 1 (required_argument) since the option requires a file path This patch (of 3): Two error messages incorrectly spelled the madvise() function name as "madvice". Fix the typo in both occurrences. Link: https://lore.kernel.org/20260513022120.58033-1-ye.liu@linux.dev Link: https://lore.kernel.org/20260513022120.58033-2-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/shrinker: simplify shrinker_memcg_alloc() using guard()wangxuewen
Use guard(mutex) to automatically handle shrinker_mutex locking and unlocking in shrinker_memcg_alloc(). This removes the explicit mutex_unlock() call, the goto-based error path, and the redundant ret variable, resulting in cleaner and more concise code. Link: https://lore.kernel.org/20260513075214.2655710-1-18810879172@163.com Signed-off-by: wangxuewen <wangxuewen@kylinos.cn> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Xuewen Wang <wangxuewen@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
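The scope-based unlocking that `guard(mutex)` provides can be approximated in user space with the GCC/Clang `cleanup` variable attribute. The sketch below is only an analogy: `lock_model`, `GUARD_MODEL()` and `alloc_model()` are hypothetical, and a flag stands in for a real mutex to keep the example free of threading dependencies.

```c
#include <assert.h>

/* A flag standing in for a mutex. */
struct lock_model { int held; };

static void lock_model_acquire(struct lock_model *l)
{
	assert(!l->held);
	l->held = 1;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void lock_model_release(struct lock_model **l)
{
	(*l)->held = 0;
}

/* Take the lock now; release it when the enclosing scope is left. */
#define GUARD_MODEL(l)                                                      \
	struct lock_model *guard_                                           \
		__attribute__((cleanup(lock_model_release), unused)) =     \
		(lock_model_acquire(l), (l))

static struct lock_model model_lock;

/* Early returns need no explicit unlock and no goto-based error path. */
static int alloc_model(int *slot, int should_fail)
{
	GUARD_MODEL(&model_lock);

	if (should_fail)
		return -1; /* lock released automatically here */
	*slot = 1;
	return 0;      /* and here */
}
```

This is why the converted function can drop its `ret` variable and unlock label: every exit from the scope releases the lock.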
2026-06-02userfaultfd: ensure mremap_userfaultfd_fail() releases mmap_changingMike Rapoport (Microsoft)
Sashiko says: mremap_userfaultfd_prep() increments ctx->mmap_changing to stall concurrent operations, but mremap_userfaultfd_fail() does not decrement it before dropping the context reference. If an mremap operation fails, ctx->mmap_changing remains elevated. This will cause subsequent userfaultfd operations like a UFFDIO_COPY to fail with -EAGAIN. Decrement ctx->mmap_changing in mremap_userfaultfd_fail(). Link: https://sashiko.dev/#/patchset/20260430113512.115938-1-rppt@kernel.org Link: https://lore.kernel.org/20260513081416.495963-1-rppt@kernel.org Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02lib/test_hmm: use kvfree() to free kvcalloc() allocationsHao Ge
Coccinelle scripts/coccinelle/api/kfree_mismatch.cocci reports the following warnings: lib/test_hmm.c:1256:15-16: WARNING kvmalloc is used to allocate this memory at line 1191 lib/test_hmm.c:1257:15-16: WARNING kvmalloc is used to allocate this memory at line 1196 Fix this by replacing kfree() with kvfree() to correctly handle the vmalloc() fallback path of kvcalloc(). Link: https://lore.kernel.org/20260513082525.154036-1-hao.ge@linux.dev Fixes: 775465fd26a3 ("lib/test_hmm: add zone device private THP test infrastructure") Signed-off-by: Hao Ge <hao.ge@linux.dev> Acked-by: Balbir Singh <balbirs@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm, swap: avoid leaving unused extend table after alloc raceKairui Song
Allocating an extend table requires dropping the ci lock first. While the lock is dropped, a concurrent put can decrease the slot's swap count to a value that is no longer maxed out, so the extend table is no longer required. The current allocation path still attaches the new extend table to the cluster anyway, leaving it unused. The next maxed out count on the same cluster may still reuse the table and free it properly, but swapoff could indeed leak it. To eliminate the waste, re-check under the ci lock that the extend table is still needed before publishing it, and free the local allocation otherwise. Also close the check window by ensuring every count decrement that brings a slot below SWP_TB_COUNT_MAX - 1 runs swap_extend_table_try_free(), not just the MAX to MAX - 1 transition. With this, a freshly published extend table that becomes redundant due to a racing put is freed on the very next decrement, restoring the invariant that an empty cluster never has a non-NULL ci->extend_table. The added overhead is negligible. [kasong@tencent.com: v2] Link: https://lore.kernel.org/20260515-swap-extend-table-fix-v2-1-833d72ad53e5@tencent.com Link: https://lore.kernel.org/20260513-swap-extend-table-fix-v1-1-a71dea851fb3@tencent.com Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/ Tested-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/readahead: no PG_readahead on EOFFrederick Mayle
When readahead pulls in all the remaining pages for a file, setting the readahead bit is counter productive. The async readahead it would trigger would almost certainly be a no-op. Additionally, for mmap'd file IO, the readahead bit limits the fault around [1], causing an extra minor fault when the page is accessed. This was discovered when looking at /sys/kernel/tracing/events/readahead traces for a simple program. With the patch applied, fewer page_cache_ra_unbounded calls are observed. [1] do_fault_around calls filemap_map_pages, which finds eligible pages by calling next_uptodate_folio [2]. next_uptodate_folio skips pages with PG_readahead set [3]. Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2] Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3] Link: https://lore.kernel.org/20260508181237.670645-1-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/mmu_notifier: fix a begin vs. start typo in the invalidate range commentTakahiro Itazuri
Fix a goof in the block comment for invalidate_range_{start,end}() where start() is incorrectly referred to as begin(). No functional change intended. [seanjc@google.com: split to separate patch, write changelog] Link: https://lore.kernel.org/20260513163546.1176742-1-seanjc@google.com Signed-off-by: Takahiro Itazuri <itazur@amazon.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/hugetlb_cma: restrict hugetlb_cma parameter to gigantic-page alignmentSang-Heon Jeon
The existing hugetlb_cma parameter handling rejects sizes smaller than one gigantic page, but rounds up larger sizes that are not a multiple of it. The two behaviors are inconsistent and neither is documented. To remove the existing inconsistent and undefined behavior, restrict the hugetlb_cma parameter to only accept multiples of the gigantic page size. After this restriction, the redundant round_up() in the allocation loop can be removed. The new restriction is also documented in kernel-parameters.txt. Also include other minor readability changes with no functional change. Link: https://lore.kernel.org/20260503084225.415980-1-ekffu200098@gmail.com Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/mseal: use min/max in mseal_applyThorsten Blum
Use the type-checked min()/max() macros instead of MIN()/MAX(), which are supposed to be used "for obvious constants only". Link: https://lore.kernel.org/20260503115915.18680-3-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02selftests/mm: ksm-functional-tests: fix partial write handlingVineet Agarwal
Update write() checks to properly detect and handle partial writes. Previously, the write() calls used <= 0 to detect failure. This condition is never true for partial writes (ret > 0 but ret < len), so partial writes were silently treated as success. Fix this by verifying that write() returns the full expected length and treating any mismatch as failure. Link: https://lore.kernel.org/20260504081638.683223-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
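A minimal sketch of the stricter check, assuming a hypothetical helper named write_fully(); the selftest itself may structure the check differently. The point is that success requires the return value to equal the requested length, so both errors and short writes are reported as failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns true only if write() reported exactly len bytes written.
 * A negative return (error) or a short write both count as failure.
 */
static bool write_fully(int fd, const void *buf, size_t len)
{
	ssize_t ret;

	assert(buf != NULL);
	ret = write(fd, buf, len);
	return ret >= 0 && (size_t)ret == len;
}
```

A caller that previously tested `write(...) <= 0` would instead test `!write_fully(...)`.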
2026-06-02lib/test_meminit: use && for boolsAlexander Potapenko
As pointed out by Dan Carpenter, test_kmemcache() was using a bitwise AND on two bools instead of a boolean AND. Fix this for the sake of code cleanliness. Link: https://lore.kernel.org/20260504100637.1535762-1-glider@google.com Fixes: 5015a300a522 ("lib: introduce test_meminit module") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/kernel-janitors/afOcIan1ap9kD26M@stanley.mountain/ Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/readahead: simplify page_cache_ra_unbounded loop counter resetFrederick Mayle
Minor cleanup, no behavior change intended. `read_pages` ensures that `ractl->_nr_pages` is zero before it returns, so the `ractl->_nr_pages` term in these expressions contributes nothing. This seems to have been true since the statements were introduced in commit f615bd5c4725f ("mm/readahead: Handle ractl nr_pages being modified"). The new expression has an intuitive explanation. When filesystems perform readahead, they increment `ractl->_index` by the number of pages processed, so, after `read_pages` returns, `ractl->_index` points to the first page after those already processed. `index` points to the first page considered in the loop. So, `ractl->_index - index` is the number of pages processed by the loop so far. Link: https://lore.kernel.org/20260512203154.754075-3-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
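The arithmetic can be checked with a small userspace model. Everything here is hypothetical (struct ra_model, model_read_pages(), processed_old(), processed_new()); it only assumes what the message states: the read step advances the index by the number of pages processed and leaves the page count at zero.

```c
#include <assert.h>

/* Hypothetical model of the two readahead control fields involved. */
struct ra_model {
	unsigned long index;   /* models ractl->_index */
	unsigned int nr_pages; /* models ractl->_nr_pages */
};

/* Models the documented invariant: nr_pages is zero on return. */
static void model_read_pages(struct ra_model *ra)
{
	ra->index += ra->nr_pages;
	ra->nr_pages = 0;
	assert(ra->nr_pages == 0);
}

/* Old form: carries a term that is always zero after the read step. */
static unsigned long processed_old(const struct ra_model *ra,
				   unsigned long start)
{
	return ra->index + ra->nr_pages - start;
}

/* New form: pages processed since the loop started at 'start'. */
static unsigned long processed_new(const struct ra_model *ra,
				   unsigned long start)
{
	return ra->index - start;
}
```

After model_read_pages() both expressions agree, which is why the extra term can be dropped.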
2026-06-02mm/readahead: add kerneldoc for read_pagesFrederick Mayle
Patch series "mm: document read_pages and simplify usage". Add a kerneldoc for read_pages() to formalize an invariant and then use it to simplify the callers in page_cache_ra_unbounded(). This patch (of 2): Formalize one of the invariants provided by the current implementation so that callers can depend on it, as discussed in [1]. Link: https://lore.kernel.org/all/20260501061146.6e61392d125cf1847d7cc181@linux-foundation.org/ [1] Link: https://lore.kernel.org/20260512203154.754075-2-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02maple_tree: document that "last" in mtree_insert_range() is inclusiveSteven Rostedt
The kernel doc of mtree_insert_range() does not state if the address represented by the "last" parameter is inclusive or exclusive. This can lead to bugs by code that assumes it is exclusive. Explicitly state that the parameter is inclusive. Link: https://lore.kernel.org/20260512175623.4c5ca8d2@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: "Liam R. Howlett" <liam@infradead.org> Acked-by: SeongJae Park <sj@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
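To illustrate the off-by-one this documentation guards against, here is a hypothetical userspace sketch (struct incl_range, from_half_open() and incl_contains() are illustrative names, not maple tree API): a caller holding a half-open [start, end) range has to pass end - 1 as an inclusive "last".

```c
#include <assert.h>
#include <stdbool.h>

/* Range with an inclusive upper bound, like "first".."last". */
struct incl_range {
	unsigned long first;
	unsigned long last;
};

/* Convert a half-open [start, end) range to an inclusive one. */
static struct incl_range from_half_open(unsigned long start,
					unsigned long end)
{
	struct incl_range r;

	assert(end > start);
	r.first = start;
	r.last = end - 1;
	return r;
}

static bool incl_contains(struct incl_range r, unsigned long x)
{
	return x >= r.first && x <= r.last;
}
```

Passing the exclusive end unchanged would make the range cover one index too many.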
2026-06-02mm/shrinker: avoid out-of-bounds read in set_shrinker_bit()David Carlier
set_shrinker_bit() reads info->unit[shrinker_id_to_index(shrinker_id)] before checking shrinker_id against info->map_nr_max, so an id past the currently visible map_nr_max reads past the unit[] array before the WARN_ON_ONCE() catches it. Determined from code inspection. Move the load into the bounded branch. Link: https://lore.kernel.org/20260510183700.102475-1-devnexen@gmail.com Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Qi Zheng <qi.zheng@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
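The shape of the fix, checking the bound before loading from the array, can be sketched as follows. This is a hypothetical userspace model (struct unit_model, struct info_model, BITS_PER_UNIT and model_set_bit() are illustrative names), not the shrinker code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_UNIT 32 /* hypothetical number of ids per unit */

struct unit_model {
	unsigned long map;
};

struct info_model {
	int map_nr_max;           /* number of valid ids */
	struct unit_model **unit; /* one entry per BITS_PER_UNIT ids */
};

/* Returns false for an out-of-range id without touching unit[]. */
static bool model_set_bit(struct info_model *info, int id)
{
	struct unit_model *u;

	assert(info != NULL);
	if (id < 0 || id >= info->map_nr_max)
		return false; /* the kernel warns once here */

	/* The load happens only inside the bounded branch. */
	u = info->unit[id / BITS_PER_UNIT];
	u->map |= 1UL << (id % BITS_PER_UNIT);
	return true;
}
```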
2026-06-02mm/khugepaged: fix inconsistent MMF_VM_HUGEPAGE flag due to allocation ↵Ye Liu
failure order __khugepaged_enter() sets MMF_VM_HUGEPAGE before allocating the corresponding mm_slot. If mm_slot_alloc() fails, the function returns with the flag set but without inserting the mm into the khugepaged tracking structures, leaving the mm in an inconsistent state where future registration attempts are skipped. Fix this by reordering: allocate the mm_slot first, then check and set the flag. If the flag is already set, free the allocated slot and return. This ensures the flag is only set when the mm is successfully registered in the khugepaged tracking structures. Link: https://lore.kernel.org/20260511025408.54035-1-ye.liu@linux.dev Fixes: 16618670276a ("mm: khugepaged: avoid pointless allocation for "struct mm_slot"") Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: David Hildenbrand <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
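The reordering can be sketched in a simplified userspace model. The names struct mm_model, struct slot_model and model_khugepaged_enter() are hypothetical, and the fail_alloc parameter only exists to simulate an allocation failure; the real function uses atomic flag operations and tracking lists that are omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct slot_model {
	int unused;
};

struct mm_model {
	bool hugepage_flag;      /* models MMF_VM_HUGEPAGE */
	struct slot_model *slot; /* models the registered mm_slot */
};

/* Allocate first, then check and set the flag. */
static int model_khugepaged_enter(struct mm_model *mm, bool fail_alloc)
{
	struct slot_model *slot;

	assert(mm != NULL);
	slot = fail_alloc ? NULL : malloc(sizeof(*slot));
	if (!slot)
		return -ENOMEM; /* flag is still clear, retry is possible */

	if (mm->hugepage_flag) {
		free(slot); /* already registered, drop the new slot */
		return 0;
	}

	mm->hugepage_flag = true;
	mm->slot = slot;
	return 0;
}
```

With this ordering a failed allocation leaves the flag clear, so a later call can still register the mm.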
2026-06-02mm/percpu-internal.h: optimise pcpu_chunk struct to save memoryzenghongling
Using pahole, we can see that there are some padding holes in the current pcpu_chunk structure. Adjusting the layout of pcpu_chunk can reduce these holes, decreasing its size from 192 bytes to 128 bytes and eliminating a wasted cache line.

With allmodconfig (CONFIG_PERCPU_STATS + NEED_PCPUOBJ_EXT)
Before: /* size: 256, cachelines: 4, members: 19 */
After: /* size: 192, cachelines: 3, members: 19 */

with NEED_PCPUOBJ_EXT

Before:
struct pcpu_chunk {
    struct list_head list; /* 0 16 */
    int free_bytes; /* 16 4 */
    struct pcpu_block_md chunk_md; /* 20 32 */
    /* XXX 4 bytes hole, try to pack */
    long unsigned int * bound_map; /* 56 8 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    void * base_addr __attribute__((__aligned__(64))); /* 64 8 */
    long unsigned int * alloc_map; /* 72 8 */
    struct pcpu_block_md * md_blocks; /* 80 8 */
    void * data; /* 88 8 */
    bool immutable; /* 96 1 */
    bool isolated; /* 97 1 */
    /* XXX 2 bytes hole, try to pack */
    int start_offset; /* 100 4 */
    int end_offset; /* 104 4 */
    /* XXX 4 bytes hole, try to pack */
    struct obj_cgroup * * obj_cgroups; /* 112 8 */
    int nr_pages; /* 120 4 */
    int nr_populated; /* 124 4 */
    /* --- cacheline 2 boundary (128 bytes) --- */
    int nr_empty_pop_pages; /* 128 4 */
    /* XXX 4 bytes hole, try to pack */
    long unsigned int populated[]; /* 136 0 */
    /* size: 192, cachelines: 3, members: 17 */
    /* sum members: 122, holes: 4, sum holes: 14 */
    /* padding: 56 */
    /* forced alignments: 1 */
} __attribute__((__aligned__(64)));

After:
struct pcpu_chunk {
    struct list_head list; /* 0 16 */
    int free_bytes; /* 16 4 */
    struct pcpu_block_md chunk_md; /* 20 32 */
    /* XXX 4 bytes hole, try to pack */
    long unsigned int * bound_map; /* 56 8 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    void * base_addr __attribute__((__aligned__(64))); /* 64 8 */
    long unsigned int * alloc_map; /* 72 8 */
    struct pcpu_block_md * md_blocks; /* 80 8 */
    void * data; /* 88 8 */
    bool immutable; /* 96 1 */
    bool isolated; /* 97 1 */
    /* XXX 2 bytes hole, try to pack */
    int start_offset; /* 100 4 */
    int end_offset; /* 104 4 */
    int nr_pages; /* 108 4 */
    int nr_populated; /* 112 4 */
    int nr_empty_pop_pages; /* 116 4 */
    struct obj_cgroup * * obj_cgroups; /* 120 8 */
    /* --- cacheline 2 boundary (128 bytes) --- */
    long unsigned int populated[]; /* 128 0 */
    /* size: 128, cachelines: 2, members: 17 */
    /* sum members: 122, holes: 2, sum holes: 6 */
    /* forced alignments: 1 */
} __attribute__((__aligned__(64)));

Link: https://lore.kernel.org/20260511070309.44044-1-zenghongling@kylinos.cn Signed-off-by: zenghongling <zenghongling@kylinos.cn> Suggested-by: Dennis Zhou <dennis@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
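The effect of grouping same-sized members can be reproduced with a reduced, hypothetical struct pair that mimics only the tail of the layout above (struct tail_before, struct tail_after and layout_saving() are illustrative names). Exact sizes depend on the ABI, so the sketch only relies on the reordered layout not being larger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pointer placed between ints: padding is needed around it on LP64. */
struct tail_before {
	bool immutable;
	bool isolated;
	int start_offset;
	int end_offset;
	void *obj_cgroups;
	int nr_pages;
	int nr_populated;
	int nr_empty_pop_pages;
	unsigned long populated[];
};

/* Ints grouped together, pointer moved after them. */
struct tail_after {
	bool immutable;
	bool isolated;
	int start_offset;
	int end_offset;
	int nr_pages;
	int nr_populated;
	int nr_empty_pop_pages;
	void *obj_cgroups;
	unsigned long populated[];
};

/* Bytes saved by the reordering on the current ABI (may be zero). */
static size_t layout_saving(void)
{
	assert(sizeof(struct tail_before) >= sizeof(struct tail_after));
	return sizeof(struct tail_before) - sizeof(struct tail_after);
}
```

On a typical 64-bit ABI the second layout removes the hole before the pointer and the one before the flexible array, which is the same kind of saving pahole reports for the full structure.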
2026-06-02mm/damon/reclaim: validate min_region_size to be power of 2Liew Rui Yan
Problem ======= When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond. Reproduction ============ 1. Enable DAMON_RECLAIM 2. Set addr_unit=3 3. Commit inputs via 'commit_inputs' 4. Observe kdamond termination Solution ======== Add an early validation in damon_reclaim_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set. Link: https://lore.kernel.org/20260501013750.71704-3-aethernet65535@gmail.com Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
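The "validate before any state change" idea can be sketched in userspace C. The names is_pow2(), struct params_model and apply_min_region_sz() are hypothetical, and the example values assume, purely for illustration, that the size is obtained by dividing 4096 by addr_unit; the actual derivation in DAMON is not shown in this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* True for 1, 2, 4, 8, ...; false for 0 and all other values. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

struct params_model {
	unsigned long min_region_sz;
	int generation; /* counts successful parameter updates */
};

/* Reject bad input up front so the live state is never touched. */
static int apply_min_region_sz(struct params_model *live,
			       unsigned long new_sz)
{
	assert(live != NULL);
	if (!is_pow2(new_sz))
		return -EINVAL;

	live->min_region_sz = new_sz;
	live->generation++;
	return 0;
}
```

Under the stated assumption, addr_unit=3 would give 4096 / 3 = 1365, which is rejected before anything is modified.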
2026-06-02mm/damon/lru_sort: validate min_region_size to be power of 2Liew Rui Yan
Patch series "mm/damon: validate min_region_size to be power of 2", v5. Problem ======= When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT or DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond. Solution ======== Add an early validation in damon_lru_sort_apply_parameters() and damon_reclaim_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set. Patch 1 fixes the issue for DAMON_LRU_SORT. Patch 2 fixes the issue for DAMON_RECLAIM. This patch (of 2): Problem ======= When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond. Reproduction ============ 1. Enable DAMON_LRU_SORT 2. Set addr_unit=3 3. Commit inputs via 'commit_inputs' 4. Observe kdamond termination Solution ======== Add an early validation in damon_lru_sort_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set. 
Link: https://lore.kernel.org/20260501013750.71704-1-aethernet65535@gmail.com Link: https://lore.kernel.org/20260501013750.71704-2-aethernet65535@gmail.com Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/damon/sysfs-schemes: fix double increment of nr_regionsVineet Agarwal
damos_sysfs_populate_region_dir() increments sysfs_regions->nr_regions twice when adding a new region: once explicitly before kobject_init_and_add(), and once again through the post-increment used for the kobject name. As a result, nr_regions no longer matches the actual number of live regions, and region directory names skip numbers (1, 3, 5, ...). Use the already incremented value for naming instead of incrementing nr_regions a second time. Link: https://lore.kernel.org/20260512041157.109845-1-agarwal.vineet2006@gmail.com Fixes: 66178e4ec30a ("mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}") Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
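The double increment is easy to reproduce in isolation. The sketch below is a hypothetical model (struct regions_model, add_region_buggy() and add_region_fixed() are illustrative names); each function returns the number that would be used for the directory name.

```c
#include <assert.h>
#include <stddef.h>

struct regions_model {
	int nr_regions;
};

/* Buggy shape: explicit increment plus a post-increment for the name. */
static int add_region_buggy(struct regions_model *r)
{
	assert(r != NULL);
	r->nr_regions++;
	return r->nr_regions++; /* second, unintended increment */
}

/* Fixed shape: name the entry with the already incremented value. */
static int add_region_fixed(struct regions_model *r)
{
	assert(r != NULL);
	r->nr_regions++;
	return r->nr_regions;
}
```

Two calls to the buggy variant yield names 1 and 3 and leave the counter at 4, matching the skipped numbers described above; the fixed variant yields 1 and 2 with the counter at 2.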
2026-06-02drivers/base/memory: make memory block get/put explicitMuchun Song
Rename the memory block lookup helper to make the acquired reference explicit, add memory_block_put() to wrap put_device(), remove find_memory_block(), and use memory_block_get() as the single block-id based lookup interface. This makes it clearer to callers that a successful lookup holds a reference that must be dropped, reducing the chance of forgetting the matching put and leaking the memory block device reference. Link: https://lore.kernel.org/linux-mm/7887915D-E598-42B3-9AFE-BFFBACE8DE2D@linux.dev/#t Link: https://lore.kernel.org/20260512072635.3969576-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> #s390 Cc: Richard Cheng <icheng@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Doug Anderson <dianders@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02selftests/mm: check file initialization writes in split_huge_page_testVineet Agarwal
create_pagecache_thp_and_fd() fills the backing file for the pagecache THP tests using repeated write() calls, but the return value is never checked. If a write fails or completes only partially, the test may continue with an incompletely initialized file and produce misleading results. Check the result of write() and fail the test if the expected number of bytes was not written. [akpm@linux-foundation.org: remove unneeded local, per David] Link: https://lore.kernel.org/da82de92-29d8-457c-9f65-40fc4900b922@kernel.org Link: https://lore.kernel.org/20260512074924.27721-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Vineet Agarwal <agarwal.vineet2006@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02powerpc/mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODEDavid Hildenbrand (Arm)
register_page_bootmem_info_node() essentially only calls register_page_bootmem_memmap(). However, on powerpc that function is a nop. So there is no benefit in using CONFIG_HAVE_BOOTMEM_INFO_NODE anymore; let's just drop it. We can stop including bootmem_info.h. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-8-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02s390/mm: use free_reserved_page() in vmem_free_pages()David Hildenbrand (Arm)
We never select CONFIG_HAVE_BOOTMEM_INFO_NODE on s390. Therefore, free_bootmem_page() nowadays always translates to free_reserved_page(). Let's use free_reserved_page() to replace the free_bootmem_page() loop. We can stop including bootmem_info.h. Likely, vmemmap freeing code could be factored out into the core in the future. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-7-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/bootmem_info: stop marking mem_section_usage as MIX_SECTION_INFODavid Hildenbrand (Arm)
We never free the ms->usage data for boot memory sections (see section_deactivate()). And to identify whether ms->usage was allocated from memblock, we simply identify it by looking at PG_reserved. Consequently, there is no need to mark ms->usage as MIX_SECTION_INFO. Let's just stop doing that. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-6-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/bootmem_info: stop marking the pgdat as NODE_INFODavid Hildenbrand (Arm)
We removed the last user of NODE_INFO in commit 119c31caa59e ("mm/sparse: remove !CONFIG_SPARSEMEM_VMEMMAP leftovers for CONFIG_MEMORY_HOTPLUG"). But it was never really used, apart from safety checks, ever since it was introduced in commit 04753278769f ("memory hotplug: register section/node id to free"), where we had the comment: 5) The node information like pgdat has similar issues. But, this will be able to be solved too by this. (Not implemented yet, but, remembering node id in the pages.) Of course, that never happened, and we are not planning on freeing the node data (pgdat/pglist_data) during memory hotunplug. So let's just stop marking the pgdat as NODE_INFO. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-5-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/bootmem_info: remove call to kmemleak_free_part_phys()David Hildenbrand (Arm)
The call to kmemleak_free_part_phys() was added in 2022 in commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem"). In 2025, commit b2aad24b5333 ("mm/memmap: prevent double scanning of memmap by kmemleak") started to use MEMBLOCK_ALLOC_NOLEAKTRACE when allocating the memmap to skip the kmemleak_alloc_phys() in the buddy. So remove the call to kmemleak_free_part_phys(). If this were still required for other purposes, either free_reserved_page() or selected users should take care of it. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-4-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/bootmem_info: stop using PG_privateDavid Hildenbrand (Arm)
Nobody checks PG_private for these pages, and we can happily use set_page_private() without setting PG_private. So let's just stop setting/clearing PG_private. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02mm/bootmem_info: drop initialization of page->lruDavid Hildenbrand (Arm)
In the past, we used to store the type in page->lru.next, introduced by commit 5f24ce5fd34c ("thp: remove PG_buddy"). The location changed over the years; ever since commit 0386aaa6e9c8 ("bootmem: stop using page->index"), we store it alongside the info in page->private. Consequently, there is no need to reset page->lru anymore. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-2-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02sparc/mm: remove register_page_bootmem_info()David Hildenbrand (Arm)
Patch series "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)". We want to remove CONFIG_HAVE_BOOTMEM_INFO_NODE. As a first step, let's limit the remaining harm to x86 and core code, removing sparc, ppc and s390 leftovers, starting the stepwise removal by removing and simplifying some code. Once a related x86 vmemmap fix [1] is in, we can merge part 2 that will remove CONFIG_HAVE_BOOTMEM_INFO_NODE entirely. Tested on x86-64 with hugetlb vmemmap optimization in combination with KMEMLEAK, making sure that the problem reported in dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem") does not reappear -- hoping I managed to trigger the original problem. This patch (of 8): sparc does not select CONFIG_HAVE_BOOTMEM_INFO_NODE, therefore, register_page_bootmem_info_node() is a nop. Let's just get rid of register_page_bootmem_info(). Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-0-3fb0be6fc688@kernel.org Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-1-3fb0be6fc688@kernel.org Link: https://lore.kernel.org/r/20260429-vmemmap-v2-1-8dfcacffd877@kernel.org [1] Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02 selftests/mm: fix mmap() return value check in run_migration_benchmark Hongfu Li
mmap() returns MAP_FAILED on error, not NULL. The current check uses !buffer->ptr, which evaluates to false when mmap() fails (since MAP_FAILED is (void *)-1, not 0), so the error path is never taken. Link: https://lore.kernel.org/20260512101305.139509-1-lihongfu@kylinos.cn Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
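The failure mode described above can be illustrated with a small userspace sketch. This is not the selftest code: `alloc_buffer()` and `free_buffer()` are hypothetical helper names chosen for illustration, and the mapping flags are an assumption. The sketch only shows the point of the fix, namely that the result of mmap() has to be compared against MAP_FAILED rather than tested for NULL.

```c
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the selftest): map an anonymous buffer and
 * translate an mmap() failure into NULL, so callers can use a plain NULL test.
 */
static void *alloc_buffer(size_t len)
{
	void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* mmap() signals failure with MAP_FAILED, i.e. (void *)-1, never NULL. */
	if (ptr == MAP_FAILED)
		return NULL;
	return ptr;
}

/* Hypothetical counterpart: release a buffer obtained from alloc_buffer(). */
static void free_buffer(void *ptr, size_t len)
{
	if (ptr) {
		int rc = munmap(ptr, len);

		/* Unmapping a valid mapping of the same length should succeed. */
		assert(rc == 0);
		(void)rc;
	}
}
```

A caller that wrote `if (!ptr)` directly on the mmap() result would treat a failed mapping as valid, because `(void *)-1` is non-NULL. On Linux, a zero-length request is one easy way to provoke the failure, so `alloc_buffer(0)` returns NULL in this sketch.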
2026-06-02 mm/memory_hotplug: factor out altmap freeing checks Muchun Song
Use a small helper to centralize altmap freeing after verifying that all vmemmap pages were released. This keeps the check consistent between the normal teardown path and the memory hotplug error paths. Link: https://lore.kernel.org/20260511084307.1827127-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02 proc/meminfo: expose per-node balloon pages in node meminfo Hao Ge
Commit 835de37603ef ("meminfo: add a per node counter for balloon drivers") added NR_BALLOON_PAGES and exposed it in /proc/meminfo. However, the per-node view at /sys/devices/system/node/nodeX/meminfo was not updated, even though the counter is already tracked per-node. Add it to node_read_meminfo() so users can see balloon usage per NUMA node without having to parse the raw vmstat file. Link: https://lore.kernel.org/20260509005631.17183-1-hao.ge@linux.dev Signed-off-by: Hao Ge <hao.ge@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02 mm/damon: replace damon_rand() with a per-ctx lockless PRNG Jiayuan Chen
damon_rand() on the sampling_addr hot path called get_random_u32_below(), which takes a local_lock_irqsave() around a per-CPU batched entropy pool and periodically refills it with ChaCha20. At elevated nr_regions counts (20k+), the lock_acquire / local_lock pair plus __get_random_u32_below() dominate kdamond perf profiles. Replace the helper with a lockless lfsr113 generator (struct rnd_state) held per damon_ctx and seeded from get_random_u64() in damon_new_ctx(). kdamond is the single consumer of a given ctx, so no synchronization is required. Range mapping uses traditional reciprocal multiplication, similar to get_random_u32_below(); for spans larger than U32_MAX (only reachable on 64-bit) the slow path combines two u32 outputs and uses mul_u64_u64_shr() at 64-bit width. On 32-bit the slow path is dead code and gets eliminated by the compiler. The new helper takes a ctx parameter; damon_split_regions_of() and the kunit tests that call it directly are updated accordingly. lfsr113 is a linear PRNG and MUST NOT be used for anything security-sensitive. DAMON's sampling_addr is not exposed to userspace and is only consumed as a probe point for PTE accessed-bit sampling, so a non-cryptographic PRNG is appropriate here. Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU usage reduced from ~72% to ~50% of one core. Link: https://lore.kernel.org/20260505145212.108644-1-jiayuan.chen@linux.dev Link: https://lore.kernel.org/damon/20260426173346.86238-1-sj@kernel.org/T/#m4f1fd74112728f83a41511e394e8c3fef703039c Link: https://lore.kernel.org/20260509011816.85145-1-sj@kernel.org Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shu Anzai <shu17az@gmail.com> Cc: Quanmin Yan <yanquanmin1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
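The general shape of a per-context lockless generator with multiply-shift range mapping can be sketched in userspace as follows. This is not the DAMON patch: `struct sample_ctx` and the `sample_*` functions are hypothetical names, xorshift32 is used as a stand-in generator instead of lfsr113, and only the 32-bit span case is covered (the 64-bit slow path mentioned in the commit message is omitted). Like the generator in the patch, this one is not suitable for anything security-sensitive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-context PRNG state, owned by a single consumer thread. */
struct sample_ctx {
	uint32_t state;
};

/* Seed the context; xorshift32 cannot leave the all-zero state. */
static void sample_ctx_seed(struct sample_ctx *ctx, uint32_t seed)
{
	assert(seed != 0);
	ctx->state = seed;
}

/*
 * One xorshift32 step (stand-in generator, NOT lfsr113). No locking is
 * needed as long as exactly one thread ever touches a given context.
 */
static uint32_t sample_ctx_next(struct sample_ctx *ctx)
{
	uint32_t x = ctx->state;

	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	ctx->state = x;
	return x;
}

/*
 * Map a 32-bit random value into [l, r) with a multiply and a shift:
 * (rnd * span) >> 32 is always in [0, span), and no division is needed.
 * There is no rejection step here, so the result is slightly biased
 * whenever span does not divide 2^32.
 */
static uint32_t sample_rand_range(struct sample_ctx *ctx, uint32_t l, uint32_t r)
{
	uint32_t span;

	assert(l < r);
	span = r - l;
	return l + (uint32_t)(((uint64_t)sample_ctx_next(ctx) * span) >> 32);
}
```

Because the state lives in the context rather than in a shared or per-CPU pool, two contexts seeded identically produce identical sequences and never contend with each other, which is the property that lets the hot path skip the lock.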