2026-01-20  selftests/mm: fix comment for check_test_requirements  (Chunyu Hu)
The test supports arm64 as well, so the comment is incorrect. There's also a check for arm64 in va_high_addr_switch.c. Link: https://lkml.kernel.org/r/20251221040025.3159990-5-chuhu@redhat.com Fixes: 983e760bcdb6 ("selftest/mm: va_high_addr_switch: add ppc64 support check") Fixes: f556acc2facd ("selftests/mm: skip test for non-LPA2 and non-LVA systems") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  selftests/mm: va_high_addr_switch return fail when either test failed  (Chunyu Hu)
When the first test fails but the hugetlb test passes, the overall result is reported as pass, while a fail is expected. Fix this by returning fail if either result is not KSFT_PASS. Link: https://lkml.kernel.org/r/20251221040025.3159990-4-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
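A minimal sketch of the described fix; the run_*() helpers are hypothetical stand-ins for the test's two phases, and the KSFT_* codes come from kselftest.h:

    int ret = run_normal_tests();          /* hypothetical helper */
    int hugetlb_ret = run_hugetlb_tests(); /* hypothetical helper */

    /* report failure unless both sub-tests passed */
    return (ret == KSFT_PASS && hugetlb_ret == KSFT_PASS) ?
            KSFT_PASS : KSFT_FAIL;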
2026-01-20  selftests/mm: remove arm64 nr_hugepages setup for va_high_addr_switch test  (Chunyu Hu)
arm64 and x86_64 have the same nr_hugepages requirement for running the va_high_addr_switch test. Since commit d9d957bd7b61 ("selftests/mm: alloc hugepages in va_high_addr_switch test"), the setup can be done in va_high_addr_switch.sh. So remove the duplicated setup. Link: https://lkml.kernel.org/r/20251221040025.3159990-3-chuhu@redhat.com Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: Luiz Capitulino <luizcap@redhat.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  selftests/mm: allocate 6 hugepages in va_high_addr_switch.sh  (Chunyu Hu)
The va_high_addr_switch test requires 6 hugepages, not 5. When running the test directly via ./va_high_addr_switch.sh, it hits an mmap 'FAIL' caused by not enough hugepages:

mmap(addr_switch_hint - hugepagesize, 2*hugepagesize, MAP_HUGETLB): 0x7f330f800000 - OK
mmap(addr_switch_hint , 2*hugepagesize, MAP_FIXED | MAP_HUGETLB): 0xffffffffffffffff - FAILED

The failure can't be hit when running the tests via 'run_vmtests.sh -t hugevm', because nr_hugepages is set to 128 at the beginning of run_vmtests.sh, and va_high_addr_switch.sh skips the nr_hugepages setup since enough pages are already reserved.

Link: https://lkml.kernel.org/r/20251221040025.3159990-2-chuhu@redhat.com Fixes: d9d957bd7b61 ("selftests/mm: alloc hugepages in va_high_addr_switch test") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
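A hedged sketch of the implied setup logic; the sysctl path is the default hugepage pool, and the skip-if-enough behavior follows the changelog rather than the script verbatim:

    needed=6
    nr=$(cat /proc/sys/vm/nr_hugepages)
    if [ "$nr" -lt "$needed" ]; then
            echo "$needed" > /proc/sys/vm/nr_hugepages
    fi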
2026-01-20  selftests/mm: fix va_high_addr_switch.sh return value  (Chunyu Hu)
Patch series "Fix va_high_addr_switch.sh test failure - again", v2.

The series addresses several issues that exist in the va_high_addr_switch test:

1) the test return value is ignored in va_high_addr_switch.sh.
2) the va_high_addr_switch test requires 6 hugepages, not 5.
3) the return value of the first test in va_high_addr_switch.c can be overridden by the second test.
4) the nr_hugepages setup in run_vmtests.sh for arm64 can be done in va_high_addr_switch.sh too.
5) update a comment for check_test_requirements.

This patch (of 5):

The return value should be the return value of va_high_addr_switch, otherwise a test failure would be silently ignored.

Link: https://lkml.kernel.org/r/20251221040025.3159990-1-chuhu@redhat.com Fixes: d9d957bd7b61 ("selftests/mm: alloc hugepages in va_high_addr_switch test") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Reviewed-by: Luiz Capitulino <luizcap@redhat.com> Cc: Luiz Capitulino <luizcap@redhat.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  selftests/mm/charge_reserved_hugetlb.sh: add waits with timeout helper  (Li Wang)
The hugetlb cgroup usage wait loops in charge_reserved_hugetlb.sh were unbounded and could hang forever if the expected cgroup file value never appears (e.g. due to write_to_hugetlbfs failing with "Error mapping the file").

=== Error log ===

# uname -r
6.12.0-xxx.el10.aarch64+64k
# ls /sys/kernel/mm/hugepages/hugepages-*
hugepages-16777216kB/ hugepages-2048kB/ hugepages-524288kB/
#./charge_reserved_hugetlb.sh -cgroup-v2
# -----------------------------------------
...
# nr hugepages = 10
# writing cgroup limit: 5368709120
# writing reseravation limit: 5368709120
...
# write_to_hugetlbfs: Error mapping the file: Cannot allocate memory
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
...

Introduce a small helper, wait_for_file_value(), and use it for:

- waiting for reservation usage to drop to 0,
- waiting for reservation usage to reach a given size,
- waiting for fault usage to reach a given size.

This makes the waits consistent and adds a hard timeout (60 tries with 1s sleep) so the test fails instead of stalling indefinitely.

Link: https://lkml.kernel.org/r/20251221122639.3168038-4-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
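A minimal sketch of such a helper, assuming the 60-try/1s parameters from the changelog; the real helper's argument order and messages may differ:

    wait_for_file_value() {
            local file="$1" expected="$2"
            local tries=60

            while [ "$tries" -gt 0 ]; do
                    if [ "$(cat "$file")" = "$expected" ]; then
                            return 0
                    fi
                    sleep 1
                    tries=$((tries - 1))
            done
            echo "Timed out waiting for $file to reach $expected."
            return 1
    }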
2026-01-20  selftests/mm/charge_reserved_hugetlb: drop mount size for hugetlbfs  (Li Wang)
charge_reserved_hugetlb.sh mounts a hugetlbfs instance at /mnt/huge with a fixed size of 256M. On systems with large base hugepages (e.g. 512MB), this is smaller than a single hugepage, so the hugetlbfs mount ends up with zero capacity (often visible as size=0 in mount output). As a result, write_to_hugetlbfs fails with ENOMEM and the test can hang waiting for progress.

=== Error log ===

# uname -r
6.12.0-xxx.el10.aarch64+64k
#./charge_reserved_hugetlb.sh -cgroup-v2
# -----------------------------------------
...
# nr hugepages = 10
# writing cgroup limit: 5368709120
# writing reseravation limit: 5368709120
...
# write_to_hugetlbfs: Error mapping the file: Cannot allocate memory
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
# Waiting for hugetlb memory reservation to reach size 2684354560.
# 0
...
# mount |grep /mnt/huge
none on /mnt/huge type hugetlbfs (rw,relatime,seclabel,pagesize=512M,size=0)
# grep -i huge /proc/meminfo
...
HugePages_Total:      10
HugePages_Free:       10
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:     524288 kB
Hugetlb:         5242880 kB

Drop the 'size=256M' mount argument, so the filesystem capacity is sufficient regardless of the HugeTLB page size.

Link: https://lkml.kernel.org/r/20251221122639.3168038-3-liwang@redhat.com Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests") Signed-off-by: Li Wang <liwang@redhat.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
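In mount terms, the change amounts to the following (illustrative invocations, not the script's exact lines):

    # before: capacity rounds down to zero once 256M < one hugepage
    mount -t hugetlbfs -o size=256M none /mnt/huge
    # after: capacity is bounded only by the hugepage pool
    mount -t hugetlbfs none /mnt/huge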
2026-01-20  selftests/mm/write_to_hugetlbfs: parse -s as size_t  (Li Wang)
Patch series "selftests/mm: hugetlb cgroup charging: robustness fixes", v3.

This series fixes a few issues in the hugetlb cgroup charging selftests (write_to_hugetlbfs.c + charge_reserved_hugetlb.sh) that show up on systems with large hugepages (e.g. 512MB) and when failures cause the test to wait indefinitely.

On an aarch64 64k page kernel with 512MB hugepages, the test consistently fails in write_to_hugetlbfs with ENOMEM and then hangs waiting for the expected usage values. The root cause is that charge_reserved_hugetlb.sh mounts hugetlbfs with a fixed size=256M, which is smaller than a single hugepage, resulting in a mount with size=0 capacity. In addition, write_to_hugetlbfs previously parsed -s via atoi() into an int, which can overflow and print negative sizes.

Reproducer / environment:

- Kernel: 6.12.0-xxx.el10.aarch64+64k
- Hugepagesize: 524288 kB (512MB)
- ./charge_reserved_hugetlb.sh -cgroup-v2
- Observed mount: pagesize=512M,size=0 before this series

After applying the series, the test completes successfully on the above setup.

This patch (of 3):

write_to_hugetlbfs currently parses the -s size argument with atoi() into an int. This silently accepts malformed input, cannot report overflow, and can truncate large sizes.

=== Error log ===

# uname -r
6.12.0-xxx.el10.aarch64+64k
# ls /sys/kernel/mm/hugepages/hugepages-*
hugepages-16777216kB/ hugepages-2048kB/ hugepages-524288kB/
#./charge_reserved_hugetlb.sh -cgroup-v2
# -----------------------------------------
...
# nr hugepages = 10
# writing cgroup limit: 5368709120
# writing reseravation limit: 5368709120
...
# Writing to this path: /mnt/huge/test
# Writing this size: -1610612736 <--------

Switch the size variable to size_t and parse -s with sscanf("%zu", ...). Also print the size using %zu. This avoids incorrect behavior with large -s values and makes the utility more robust.

Link: https://lkml.kernel.org/r/20251221122639.3168038-1-liwang@redhat.com Link: https://lkml.kernel.org/r/20251221122639.3168038-2-liwang@redhat.com Signed-off-by: Li Wang <liwang@redhat.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Cc: David Hildenbrand <david@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  page_alloc: allow migration of smaller hugepages during contig_alloc  (Gregory Price)
We presently skip regions with hugepages entirely when trying to do contiguous page allocation. This will cause otherwise-movable 2MB HugeTLB pages to be considered unmovable, and makes 1GB gigantic page allocation less reliable on systems utilizing both.

Commit 4d73ba5fa710 ("mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages") skipped all HugePage containing regions because it can cause significant delays in 1G allocation (as HugeTLB migrations may fail for a number of reasons). Instead, if hugepage migration is enabled, consider regions with hugepages smaller than the target contiguous allocation request as valid targets for allocation.

We optimize for the existing behavior by searching for non-hugetlb regions in a first pass, then retrying the search to include hugetlb only on failure. This allows the existing fast-path to remain the default case with a slow-path fallback to increase reliability. We only fall back to the slow path if a hugetlb region was detected, and we do a full re-scan because the zones/blocks may have changed during the first pass (and it's not worth further complexity).

isolate_migratepages_block() has similar hugetlb filter logic, and the hugetlb code does a migratable check in folio_isolate_hugetlb() during isolation. The code servicing the allocation and migration already supports this exact use case.

To test, allocate a bunch of 2MB HugeTLB pages (in this case 48GB) and then attempt to allocate some 1G HugeTLB pages (in this case 4GB) (scale to your machine's memory capacity):

echo 24576 > .../hugepages-2048kB/nr_hugepages
echo 4 > .../hugepages-1048576kB/nr_hugepages

Prior to this patch, the 1GB page reservation can fail if no contiguous 1GB pages remain. After this patch, the kernel will try to move 2MB pages and successfully allocate the 1GB pages (assuming overall sufficient memory is available). Also tested this while a program had the 2MB reservations mapped, and the 1GB reservation still succeeds.

folio_alloc_gigantic() is the primary user of alloc_contig_pages(); other users are debug or init-time allocations and largely unaffected:

- ppc/memtrace is a debugfs interface
- x86/tdx memory allocation occurs once on module-init
- kfence/core happens once on module (late) init
- THP uses it in debug_vm_pgtable_alloc_huge_page at __init time

Link: https://lkml.kernel.org/r/20251221124656.2362540-1-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/linux-mm/6fe3562d-49b2-4975-aa86-e139c535ad00@redhat.com/ Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
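Schematically, the two-pass search looks like the fragment below; the helper and flag names are illustrative, not the kernel's actual interface:

    /* pass 1 (fast path): as before, skip any range containing
     * hugetlb folios */
    page = find_contig_range(zone, nr_pages, /* skip_hugetlb */ true);

    /* pass 2 (slow path): only if pass 1 failed and a hugetlb
     * region was seen; accept hugetlb folios smaller than the
     * request and rely on migration to move them out */
    if (!page && saw_hugetlb)
            page = find_contig_range(zone, nr_pages, /* skip_hugetlb */ false);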
2026-01-20  mm, hugetlb: implement movable_gigantic_pages sysctl  (Gregory Price)
This reintroduces a concept removed by commit d6cb41cc44c6 ("mm, hugetlb: remove hugepages_treat_as_movable sysctl").

This sysctl provides flexibility between ZONE_MOVABLE use cases:

1) onlining memory in ZONE_MOVABLE to maintain hotplug compatibility
2) onlining memory in ZONE_MOVABLE to make hugepage allocation reliable

When ZONE_MOVABLE is used to make huge page allocation more reliable, disallowing gigantic page memory in this region is pointless. If hotplug is not a requirement, we can loosen the restrictions to allow 1GB gigantic pages in ZONE_MOVABLE.

Since 1GB pages can be difficult to migrate / have impacts on compaction / defragmentation, we don't enable this by default. Notably, 1GB pages can only be migrated if another 1GB page is available - so hot-unplug will fail if such a page cannot be found. However, since there are scenarios where gigantic pages are migratable, we should allow use of these on movable regions. When no valid 1GB page is available for migration, hot-unplug will retry indefinitely (or until interrupted). For example:

echo 0 > node0/hugepages/..-1GB/nr_hugepages  # clear node0 1GB pages
echo 1 > node1/hugepages/..-1GB/nr_hugepages  # reserve node1 1GB page
./alloc_huge_node1 &                          # allocate a 1GB page on node1
./node1_offline &                             # attempt to offline all node1 memory
echo 1 > node0/hugepages/..-1GB/nr_hugepages  # reserve node0 1GB page

In this example, node1_offline will block indefinitely until the final step, when a node0 1GB page is made available.

Note: Boot-time CMA is not possible for driver-managed hotplug memory, as CMA requires the memory to be registered as SystemRAM at boot time. Additionally, 1GB huge pages are not supported by THP.

Link: https://lkml.kernel.org/r/20251221125603.2364174-1-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Suggested-by: David Rientjes <rientjes@google.com> Link: https://lore.kernel.org/all/20180201193132.Hk7vI_xaU%25akpm@linux-foundation.org/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
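Illustrative usage, assuming the knob lands at the conventional path under /proc/sys/vm/ (the changelog does not spell the path out):

    # default: gigantic pages are kept out of ZONE_MOVABLE
    cat /proc/sys/vm/movable_gigantic_pages
    0
    # opt in when hotplug compatibility is not a requirement
    echo 1 > /proc/sys/vm/movable_gigantic_pages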
2026-01-20  mm: cleanup vma_iter_bulk_alloc  (Wentao Guan)
Commit d24062914837 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()") removed the only user of vma_iter_bulk_alloc(), and mas_expected_entries() has been removed since commit e3852a1213ffc ("maple_tree: Drop bulk insert support"). Also clean up the mas_expected_entries() reference in maple_tree.h. No functional change. Link: https://lkml.kernel.org/r/20251106110929.3522073-1-guanwentao@uniontech.com Signed-off-by: Wentao Guan <guanwentao@uniontech.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Cheng Nie <niecheng1@uniontech.com> Cc: Guan Wentao <guanwentao@uniontech.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm: clarify GFP_ATOMIC/GFP_NOWAIT doc-comment  (Brendan Jackman)
The current description of contexts where it's invalid to make GFP_ATOMIC and GFP_NOWAIT calls is rather vague. Replace this with a direct description of the actual contexts of concern and refer to the RT docs where this is explained more discursively. While rejigging this prose, also move the documentation of GFP_NOWAIT to the GFP_NOWAIT section. Link: https://lore.kernel.org/all/d912480a-5229-4efe-9336-b31acded30f5@suse.cz/ Link: https://lkml.kernel.org/r/20251219-b4-gfp_atomic-comment-v2-1-4c4ce274c2b6@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/gup: remove no longer used gup_fast_undo_dev_pagemap  (Kairui Song)
This helper is no longer used after commit fd2825b0760a ("mm/gup: remove pXX_devmap usage from get_user_pages()"). Link: https://lkml.kernel.org/r/20251219-gup-cleanup-v1-1-348a70d9eecb@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations  (Vlastimil Babka)
Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations"), THP page fault allocations have settled on the following scheme (from the commit log):

1. local node only THP allocation with no reclaim, just compaction.
2. for madvised VMA's or when synchronous compaction is enabled always - THP allocation from any node with effort determined by global defrag setting and VMA madvise
3. fallback to base pages on any node

Recent customer reports however revealed we have a gap in step 1 above. What we have seen is excessive reclaim due to THP page faults on a NUMA node that's close to its high watermark, while other nodes have plenty of free memory.

The problem with step 1 is that it promises no reclaim after the compaction attempt, however reclaim is only avoided for certain compaction outcomes (deferred, or skipped due to insufficient free base pages), and not e.g. when compaction is actually performed but fails (we did see compact_fail vmstat counter increasing). THP page faults can therefore exhibit a zone_reclaim_mode-like behavior, which is not the intention.

Thus add a check for __GFP_THISNODE that corresponds to this exact situation and prevents continuing with reclaim/compaction once the initial compaction attempt isn't successful in allocating the page.

Note that commit cc638f329ef6 has not introduced this over-reclaim possibility; it appears to exist in some form since commit 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations"). Followup commits b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") and cc638f329ef6 have moved in the right direction, but left the abovementioned gap.

Link: https://lkml.kernel.org/r/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
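Conceptually, the added bail-out looks like the fragment below; placement and surrounding variable names are simplified from the real page allocator slowpath:

    /* after the initial no-reclaim compaction attempt for a costly
     * (THP) allocation: a __GFP_THISNODE request promised "local
     * node only, no reclaim", so don't fall through to the
     * reclaim/compaction retries when compaction ran but failed */
    if (costly_order && (gfp_mask & __GFP_THISNODE))
            goto nopage;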
2026-01-20  mm/hugetlb_cgroup: fix -Wformat-truncation warning  (Xiu Jianfeng)
A false-positive compile warning with -Wformat-truncation was introduced by commit 47179fe03588 ("mm/hugetlb_cgroup: prepare cftypes based on template") on arch s390. Suppress it by replacing snprintf() with scnprintf().

mm/hugetlb_cgroup.c: In function 'hugetlb_cgroup_file_init':
mm/hugetlb_cgroup.c:829:44: warning: '%s' directive output may be truncated writing up to 1623 bytes into a region of size between 32 and 63 [-Wformat-truncation=]
  829 |         snprintf(cft->name, MAX_CFTYPE_NAME, "%s.%s", buf, tmpl->name);
      |                                              ^~

Link: https://lkml.kernel.org/r/20251222072359.3626182-1-xiujianfeng@huaweicloud.com Fixes: 47179fe03588 ("mm/hugetlb_cgroup: prepare cftypes based on template") Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512212332.9lFRbgdS-lkp@intel.com/ Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
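The fix itself is a one-line substitution; scnprintf() returns the number of characters actually written rather than the would-be length, which satisfies gcc's truncation heuristic:

    scnprintf(cft->name, MAX_CFTYPE_NAME, "%s.%s", buf, tmpl->name);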
2026-01-20  mm: fix minor spelling mistakes in comments  (Kevin Lourenco)
Correct several typos in comments across files in mm/ [akpm@linux-foundation.org: also fix comment grammar, per SeongJae] Link: https://lkml.kernel.org/r/20251218150906.25042-1-klourencodev@gmail.com Signed-off-by: Kevin Lourenco <klourencodev@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon: fix typos in comments  (Kevin Lourenco)
Correct minor spelling mistakes in several files under mm/damon. No functional changes. Link: https://lkml.kernel.org/r/20251217181216.47576-1-klourencodev@gmail.com Signed-off-by: Kevin Lourenco <klourencodev@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  zram: remove KMSG_COMPONENT macro  (Heiko Carstens)
The KMSG_COMPONENT macro is a leftover of the s390 specific "kernel message catalog" from 2008 [1] which never made it upstream. The macro was added to s390 code to allow for an out-of-tree patch which used this to generate unique message ids. That out-of-tree patch doesn't exist anymore either. The pattern of how KMSG_COMPONENT is used was, for whatever reason, partially copied to non-s390-specific code as well. Remove the macro in order to get rid of a pointless indirection. Link: https://lkml.kernel.org/r/20251126143602.2207435-1-hca@linux.ibm.com Link: https://lwn.net/Articles/292650/ [1] Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/mm_init: replace simple_strtoul with kstrtobool in set_hashdist  (Thorsten Blum)
Use bool for 'hashdist' and replace simple_strtoul() with kstrtobool() for parsing the 'hashdist=' boot parameter. Unlike simple_strtoul(), which returns an unsigned long, kstrtobool() converts the string directly to bool and avoids implicit casting. Check the return value of kstrtobool() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. The current code silently sets 'hashdist = 0' if parsing fails, instead of leaving the default value (HASHDIST_DEFAULT) unchanged. Additionally, kstrtobool() accepts common boolean strings such as "on" and "off". Link: https://lkml.kernel.org/r/20251217110214.50807-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
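A sketch of the resulting parser, following the changelog (the declaration context in mm_init.c is simplified):

    bool hashdist = HASHDIST_DEFAULT;

    static int __init set_hashdist(char *str)
    {
            if (!str)
                    return 0;
            /* accepts 0/1, y/n, on/off; an invalid value is
             * rejected, leaving HASHDIST_DEFAULT in place */
            return kstrtobool(str, &hashdist) == 0;
    }
    __setup("hashdist=", set_hashdist);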
2026-01-20  lib/test_vmalloc.c: minor fixes to test_vmalloc.c  (Audra Mitchell)
If PAGE_SIZE is larger than 4k and if you have a system with a large number of CPUs, this test can require a very large amount of memory leading to oom-killer firing. Given the type of allocation, the kernel won't have anything to kill, causing the system to stall. Add a parameter to the test_vmalloc driver to represent the number of times a percpu object will be allocated. Calculate this in test_vmalloc.sh to be 90% of available memory or the current default of 35000, whichever is smaller. Link: https://lkml.kernel.org/r/20251201181848.1216197-1-audra@redhat.com Signed-off-by: Audra Mitchell <audra@redhat.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rafael Aquini <raquini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
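The 90%-of-memory cap could be computed along these lines in test_vmalloc.sh; the module parameter name and the per-allocation cost are hypothetical, since the changelog does not spell them out:

    default_nr=35000
    obj_kb=4    # hypothetical per-allocation cost, in KB
    avail_kb=$(awk '/MemAvailable:/ { print $2 }' /proc/meminfo)
    nr=$((avail_kb * 90 / 100 / obj_kb))
    if [ "$nr" -gt "$default_nr" ]; then
            nr=$default_nr
    fi
    modprobe test_vmalloc nr_percpu_objs="$nr"  # parameter name assumed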
2026-01-20  maple_tree: remove struct maple_alloc  (Sidhartha Kumar)
struct maple_alloc is deprecated after the maple tree conversion to sheaves, remove the references from the header file. Link: https://lkml.kernel.org/r/20251203224511.469978-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/block/fs: remove laptop_mode  (Johannes Weiner)
Laptop mode was introduced to save battery, by delaying and consolidating writes and thereby maximize the time rotating hard drives wouldn't have to spin. Luckily, rotating hard drives, with their high spin-up times and power draw, are a thing of the past for battery-powered devices. Reclaim has also since changed to not write single filesystem pages anymore, and regular filesystem writeback is lumpy by design. The juice doesn't appear worth the squeeze anymore. The footprint of the feature is small, but nevertheless it's a complicating factor in mm, block, filesystems. Developers don't think about it, and it likely hasn't been tested with new reclaim and writeback changes in years. Let's sunset it. Keep the sysctl with a deprecation warning around for a few more cycles, but remove all functionality behind it. [akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst] Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  zram: drop pp_in_progress  (Sergey Senozhatsky)
pp_in_progress makes sure that only one post-processing operation (writeback or recompression) is active at any given time. Functionality-wise it basically shadows zram init_lock when init_lock is acquired in writer mode. Switch recompress_store() and writeback_store() to take zram init_lock in writer mode, like all store() sysfs handlers should do, so that we can drop pp_in_progress. Recompression and writeback can be somewhat slow, so holding init_lock in writer mode can block zram attrs reads, but in reality the only zram attrs reads that take place are mm_stat reads, and usually it's the same process that reads mm_stat and does recompression or writeback. Link: https://lkml.kernel.org/r/20251216071342.687993-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
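Shape of the change, schematically (handler internals elided; dev_to_zram() and init_lock are existing zram constructs):

    static ssize_t writeback_store(struct device *dev,
                    struct device_attribute *attr,
                    const char *buf, size_t len)
    {
            struct zram *zram = dev_to_zram(dev);
            ssize_t ret = len;

            /* writer-mode init_lock now provides the mutual
             * exclusion that pp_in_progress used to */
            down_write(&zram->init_lock);
            /* ... parse buf and perform the writeback ... */
            up_write(&zram->init_lock);
            return ret;
    }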
2026-01-20  mm/damon/stat: deduplicate intervals_goal setup in damon_stat_build_ctx()  (JaeJoon Jung)
The damon_stat_build_ctx() function sets the values of the intervals_goal structure members. These values are applied to damon_ctx in damon_set_attrs(). However, this resets values that were already applied previously to the same values. Remove this code, as it constitutes duplicate execution. Link: https://patch.msgid.link/20251206011716.7185-1-rgbi3307@gmail.com Link: https://lkml.kernel.org/r/20251216073440.40891-1-sj@kernel.org Signed-off-by: JaeJoon Jung <rgbi3307@gmail.com> Reviewed-by: Enze Li <lienze@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/core: add trace point for damos stat per apply interval  (SeongJae Park)
DAMON users can read DAMOS stats via the DAMON sysfs interface. It enables efficient, simple and flexible usage of the stats. Especially for systems not having advanced tools like perf or bpftrace, that can be useful. But if the advanced tools are available, exposing the stats via a tracepoint can avoid unnecessary reinvention of the wheel. Add a new tracepoint for DAMOS stats, namely damos_stat_after_apply_interval. The tracepoint is triggered for each scheme's apply interval and exposes the whole stat values. If the user needs sub-apply-interval information by any chance, the damos_before_apply tracepoint could be used. Link: https://lkml.kernel.org/r/20251216080128.42991-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/ABI/damon: update for max_nr_snapshots  (SeongJae Park)
Update DAMON ABI document for the newly added DAMON sysfs interface file, max_nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/admin-guide/mm/damon/usage: update for max_nr_snapshots  (SeongJae Park)
Update DAMON usage document for the newly added DAMON sysfs interface file, max_nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/mm/damon/design: update for max_nr_snapshots  (SeongJae Park)
Update DAMON design document for the newly added snapshot level DAMOS deactivation feature, max_nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/sysfs-schemes: implement max_nr_snapshots file  (SeongJae Park)
Add a new DAMON sysfs file for setting and getting the newly introduced per-DAMON-snapshot level DAMOS deactivation control parameter, max_nr_snapshots. The file has the same name as the parameter and is placed under the damos stat directory. Link: https://lkml.kernel.org/r/20251216080128.42991-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/core: implement max_nr_snapshots  (SeongJae Park)
There are DAMOS use cases that require user-space centric control of its activation and deactivation. Having the control plane in user space, or using DAMOS as a way to collect monitoring results, are such examples. DAMON parameters online commit, DAMOS quotas and watermarks can be useful for this purpose. However, those features work only at the sub-DAMON-snapshot level. In some use cases, DAMON-snapshot level control is required. For example, in the DAMOS-based monitoring results collection use case, the user online-installs a DAMOS scheme with DAMOS_STAT action, waits for it to be applied to the whole regions of a single DAMON snapshot, retrieves the stats and tried regions information, and online-uninstalls the scheme. It is efficient to ensure the lifetime of the scheme is exactly one snapshot's consumption, no more and no less. To support such use cases, introduce a new DAMOS core API per-scheme parameter, namely max_nr_snapshots. As the name implies, it is the upper limit of nr_snapshots, which is a DAMOS stat that represents the number of DAMON snapshots that the scheme has fully applied. If the limit is set to a non-zero value and nr_snapshots reaches or exceeds the limit, the scheme is deactivated. Link: https://lkml.kernel.org/r/20251216080128.42991-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
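The deactivation rule reduces to a check along these lines; the field names follow the changelog, while damos_deactivate() is a hypothetical stand-in for whatever the core actually does:

    /* scheme has consumed its snapshot budget: deactivate it */
    if (s->max_nr_snapshots &&
        s->stat.nr_snapshots >= s->max_nr_snapshots)
            damos_deactivate(s);    /* hypothetical helper */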
2026-01-20  mm/damon: update damos kerneldoc for stat field  (SeongJae Park)
Commit 0e92c2ee9f45 ("mm/damon/schemes: account scheme actions that successfully applied") has replaced ->stat_count and ->stat_sz of 'struct damos' with ->stat. The commit mistakenly did not update the related kernel doc comment, though. Update the comment. Link: https://lkml.kernel.org/r/20251216080128.42991-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/ABI/damon: update for nr_snapshots damos stat  (SeongJae Park)
Update DAMON ABI document for the newly added damos stat, nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/admin-guide/mm/damon/usage: update for nr_snapshots damos stat  (SeongJae Park)
Update DAMON usage document for the newly added damos stat, nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  Docs/mm/damon/design: update for nr_snapshots damos stat  (SeongJae Park)
Update DAMON design document for the newly added damos stat, nr_snapshots. Link: https://lkml.kernel.org/r/20251216080128.42991-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/sysfs-schemes: introduce nr_snapshots damos stat file  (SeongJae Park)
Introduce a new DAMON sysfs interface file for exposing the newly added DAMOS stat, nr_snapshots. The file has the same name as the stat (nr_snapshots) and is placed under the damos stat sysfs directory. Link: https://lkml.kernel.org/r/20251216080128.42991-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/core: introduce nr_snapshots damos stat  (SeongJae Park)
Patch series "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats".

Introduce three changes for improving DAMOS stat's provided information, deterministic control, and reading usability.

DAMOS provides stats that are important for understanding its behavior. It lacks information about how many DAMON-generated monitoring output snapshots it has worked on. Add a new stat, nr_snapshots, to show the information.

Users can control DAMOS schemes in multiple ways. Using the online parameters commit feature, they can install and uninstall DAMOS schemes whenever they want while keeping DAMON running. DAMOS quotas and watermarks can be used for manually or automatically turning on/off or adjusting the aggressiveness of the scheme. DAMOS filters can be used for applying the scheme to specific memory entities based on their types and locations. Some users want their DAMOS scheme to be applied to only a specific number of DAMON snapshots, for more deterministic control. One example use case is tracepoint based snapshot reading. Add a new knob, max_nr_snapshots, to support this. If the nr_snapshots stat becomes equal to or greater than the value of this parameter, the scheme is deactivated.

Users can read DAMOS stats via DAMON's sysfs interface. For deep level investigations on environments having advanced tools like perf and bpftrace, exposing the stats via a tracepoint can be useful. Implement a new tracepoint, namely damon:damos_stat_after_apply_interval.

The first five patches (patches 1-5) of this series implement the new stat, nr_snapshots, on the core layer (patch 1), expose it via the DAMON sysfs user interface (patch 2), and update documents (patches 3-5). The following six patches (patches 6-11) are for the new stat based DAMOS deactivation (max_nr_snapshots). The first one (patch 6) of this group updates a kernel-doc comment before making further changes. Then an implementation of it on the core layer (patch 7), an introduction of a new DAMON sysfs interface file for users of the feature (patch 8), and three updates of the documents (patches 9-11) follow. The final one (patch 12) introduces the new tracepoint that exposes the DAMOS stat values for each scheme apply interval.

This patch (of 12):

DAMON generates monitoring results snapshots for every sampling interval. DAMOS applies given schemes to the regions of the snapshots, for every apply interval of the scheme. DAMOS stats inform how many memory entities a given scheme has tried to be applied to, and has actually been applied to, at the region and byte level. In some use cases, including user-space oriented tuning and investigations, it is useful to know this at the DAMON-snapshot level. Introduce a new stat, namely nr_snapshots, for DAMON core API callers.

[sj@kernel.org: fix wrong list_is_last() call in damons_is_last_region()]
  Link: https://lkml.kernel.org/r/20260114152049.99727-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20251216080128.42991-1-sj@kernel.org Link: https://lkml.kernel.org/r/20251216080128.42991-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  tools/mm/slabinfo: fix --partial long option mapping  (Kaushlendra Kumar)
The long option "--partial" was incorrectly mapped to lowercase 'p' in the opts[] array, but the getopt string and switch case handle uppercase 'P'. This mismatch caused --partial to be rejected. Fix the long_options mapping to use 'P' so --partial works correctly alongside the existing -P short option. Link: https://lkml.kernel.org/r/20251208105240.2719773-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com> Reviewed-by: SeongJae Park <sj@kernel.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  tools/mm/thp_swap_allocator_test: fix small folio alignment  (Kaushlendra Kumar)
Use ALIGNMENT_SMALLFOLIO instead of ALIGNMENT_MTHP when allocating small folios to ensure correct memory alignment for the test case. Before: test allocates small folios with 64KB alignment (ALIGNMENT_MTHP) when only 4KB alignment (ALIGNMENT_SMALLFOLIO) is needed. This wastes address space and may cause allocation failures on systems with fragmented memory. Worst-case impact: this only affects thp_swap_allocator_test tool behavior. Link: https://lkml.kernel.org/r/20251209031745.2723120-1-kaushlendra.kumar@intel.com Signed-off-by: Kaushlendra Kumar <kaushlendra.kumar@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/core: fix wasteful CPU calls by skipping non-existent targets  (Enze Li)
Currently, DAMON does not proactively clean up invalid monitoring targets during its runtime. When some monitored processes exit, DAMON continues to make the following unnecessary function calls:

damon_for_each_target
  damon_for_each_region
    damon_do_apply_schemes
      damos_apply_scheme
        damon_va_apply_scheme
          damos_madvise
            damon_get_mm

It is only in the damon_get_mm() function that it may finally discover the target no longer exists, which wastes CPU resources.

A simple idea is to check for the existence of monitoring targets within the kdamond_need_stop() function and promptly clean up non-existent targets. However, SJ pointed out that this approach is problematic because the online commit logic incorrectly uses list indices to update the monitoring state. This can lead to data loss if the target list is changed concurrently.

Meanwhile, SJ suggests checking for target existence at the damon_for_each_target level, and if a target does not exist, simply skip it and proceed to the next one.

Link: https://lkml.kernel.org/r/20251210052508.264433-1-lienze@kylinos.cn Signed-off-by: Enze Li <lienze@kylinos.cn> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
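A hedged sketch of the iteration-level skip, using DAMON's per-operations-set validity callback; the exact placement and checks in core may differ:

    struct damon_target *t;
    struct damon_region *r;

    damon_for_each_target(t, ctx) {
            /* skip targets whose process has already exited */
            if (ctx->ops.target_valid && !ctx->ops.target_valid(t))
                    continue;
            damon_for_each_region(r, t)
                    damon_do_apply_schemes(ctx, t, r);
    }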
2026-01-20  mm: memcontrol: rename mem_cgroup_from_slab_obj()  (Johannes Weiner)
In addition to slab objects, this function is used for resolving non-slab kernel pointers. This has caused confusion in recent refactoring work. Rename it to mem_cgroup_from_virt(), sticking with terminology established by the virt_to_<foo>() converters. Link: https://lore.kernel.org/linux-mm/20251113161424.GB3465062@cmpxchg.org/ Link: https://lkml.kernel.org/r/20251210154301.720133-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  memcg: remove mem_cgroup_size()  (Chen Ridong)
The mem_cgroup_size helper is used only in apply_proportional_protection to read the current memory usage. Its semantics are unclear and inconsistent with other sites, which directly call page_counter_read for the same purpose. Remove this helper and get its usage via mem_cgroup_protection for clarity. Additionally, rename the local variable 'cgroup_size' to 'usage' to better reflect its meaning. No functional changes intended. Link: https://lkml.kernel.org/r/20251211013019.2080004-3-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michal Koutný <mkoutny@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lu Jialin <lujialin4@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  memcg: move mem_cgroup_usage to memcontrol-v1.c  (Chen Ridong)
Patch series "memcg cleanups", v3.

Two code moves/removals with no behavior change.

This patch (of 2):

Currently, mem_cgroup_usage is only used for v1, so just move it to memcontrol-v1.c.

Link: https://lkml.kernel.org/r/20251211013019.2080004-1-chenridong@huaweicloud.com Link: https://lkml.kernel.org/r/20251211013019.2080004-2-chenridong@huaweicloud.com Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Koutný <mkoutny@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lu Jialin <lujialin4@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm: zswap: delete unused acomp->is_sleepable  (Johannes Weiner)
This hasn't been used since 7d4c9629b74f ("mm: zswap: use object read/write APIs instead of object mapping APIs"). Drop it. Link: https://lkml.kernel.org/r/20251211025645.820517-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm/damon/sysfs-schemes: remove outdated TODO in target_nid_store()  (Swaraj Gaikwad)
The TODO comment in target_nid_store() suggested adding range validation for target_nid. As discussed in [1], the current behavior of accepting any integer value is intentional. DAMON sysfs aims to remain flexible, including supporting users who prepare node IDs before future NUMA hotplug events. Because this behavior matches the broader design philosophy of the DAMON sysfs interface, the TODO comment is now misleading. This patch removes the comment without introducing any behavioral change. No functional changes. Link: https://lkml.kernel.org/r/20251211032722.4928-2-swarajgaikwad1925@gmail.com Link: https://lore.kernel.org/lkml/20251210150930.57679-1-sj@kernel.org/ [1] Signed-off-by: Swaraj Gaikwad <swarajgaikwad1925@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20  mm: folio_zero_user: cache neighbouring pages  (Ankur Arora)
folio_zero_user() does straight zeroing without caring about temporal locality for caches. This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order algorithm into a separate function") where we cleared a page at a time, converging to the faulting page from the left and the right.

To retain limited temporal locality, split the clearing in three parts: the faulting page and its immediate neighbourhood, and the regions on its left and right. We clear the local neighbourhood last to maximize chances of it sticking around in the cache.

Performance
===

AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads, memory=2.2 TB, L1d=16K/thread, L2=512K/thread, L3=2MB/thread)

vm-scalability/anon-w-seq-hugetlb: this workload runs with 384 processes (one for each CPU), each zeroing anonymously mapped hugetlb memory which is then accessed sequentially.

                       stime                  utime

 discontiguous-page    1739.93 ( +- 6.15% )   1016.61 ( +- 4.75% )
 contiguous-page       1853.70 ( +- 2.51% )   1187.13 ( +- 3.50% )
 batched-pages         1756.75 ( +- 2.98% )   1133.32 ( +- 4.89% )
 neighbourhood-last    1725.18 ( +- 4.59% )   1123.78 ( +- 7.38% )

Both stime and utime respond largely as expected. There is a fair amount of run to run variation, but the general trend is that the stime drops and utime increases. There are a few oddities, like contiguous-page performing very differently from batched-pages. As such this is likely an uncommon pattern where we saturate the memory bandwidth (since all CPUs are running the test) and at the same time are cache constrained because we access the entire region.

Kernel make (make -j 12 bzImage):

                       stime                  utime

 discontiguous-page    199.29 ( +- 0.63% )    1431.67 ( +- .04% )
 contiguous-page       193.76 ( +- 0.58% )    1433.60 ( +- .05% )
 batched-pages         193.92 ( +- 0.76% )    1431.04 ( +- .08% )
 neighbourhood-last    194.46 ( +- 0.68% )    1431.51 ( +- .06% )

For make, the utime stays relatively flat with a fairly small (-2.4%) improvement in the stime.

Link: https://lkml.kernel.org/r/20260107072009.1615991-9-ankur.a.arora@oracle.com Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Tested-by: Raghavendra K T <raghavendra.kt@amd.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
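Schematically, with hypothetical helper names:

    /* clear left and right of the faulting neighbourhood first, and
     * the neighbourhood itself last, so the lines the faulting
     * thread touches next were written most recently and are most
     * likely still cached */
    clear_pages_range(start, fault_start);      /* left region */
    clear_pages_range(fault_end, end);          /* right region */
    clear_pages_range(fault_start, fault_end);  /* neighbourhood, last */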
2026-01-20mm: folio_zero_user: clear page rangesAnkur Arora
Use batch clearing in clear_contig_highpages() instead of clearing a single page at a time. Exposing larger ranges enables the processor to optimize based on extent. To do this we just switch to using clear_user_highpages() which would in turn use clear_user_pages() or clear_pages(). Batched clearing, when running under non-preemptible models, however, has latency considerations. In particular, we need periodic invocations of cond_resched() to keep to reasonable preemption latencies. This is a problem because the clearing primitives do not, or might not be able to, call cond_resched() to check if preemption is needed. So, limit the worst case preemption latency by doing the clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH pages. (Preemptible models already define away most of cond_resched(), so the batch size is ignored when running under those.) PROCESS_PAGES_NON_PREEMPT_BATCH: for architectures with "fast" clear-pages (ones that define clear_pages()), we define it as 32MB worth of pages. This is meant to be large enough to allow the processor to optimize the operation and yet small enough that we see reasonable preemption latency for when this optimization is not possible (ex. slow microarchitectures, memory bandwidth saturation.) This specific value also allows for a cacheline allocation elision optimization (which might help unrelated applications by not evicting potentially useful cache lines) that kicks in recent generations of AMD Zen processors at around LLC-size (32MB is a typical size). At the same time 32MB is small enough that even with poor clearing bandwidth (say ~10GBps), time to clear 32MB should be well below the scheduler's default warning threshold (sysctl_resched_latency_warn_ms=100). "Slow" architectures (don't have clear_pages()) will continue to use the base value (single page). Performance == Testing a demand fault workload shows a decent improvement in bandwidth with pg-sz=1GB. Bandwidth with pg-sz=2MB stays flat. $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5 contiguous-pages batched-pages (GBps +- %stdev) (GBps +- %stdev) pg-sz=2MB 23.58 +- 1.95% 25.34 +- 1.18% + 7.50% preempt=* pg-sz=1GB 25.09 +- 0.79% 39.22 +- 2.32% + 56.31% preempt=none|voluntary pg-sz=1GB 25.71 +- 0.03% 52.73 +- 0.20% [#] +110.16% preempt=full|lazy [#] We perform much better with preempt=full|lazy because, not needing explicit invocations of cond_resched() we can clear the full extent (pg-sz=1GB) as a single unit which the processor can optimize for. (Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13); region-size=64GB, local node; 2.56 GHz, boost=0.) Analysis == pg-sz=1GB: the improvement we see falls in two buckets depending on the batch size in use. For batch-size=32MB the number of cachelines allocated (L1-dcache-loads) -- which stay relatively flat for smaller batches, start to drop off because cacheline allocation elision kicks in. And as can be seen below, at batch-size=1GB, we stop allocating cachelines almost entirely. (Not visible here but from testing with intermediate sizes, the allocation change kicks in only at batch-size=32MB and ramps up from there.) 
  contiguous-pages
     6,949,417,798      L1-dcache-loads        #  883.599 M/sec                      ( +-  0.01% )  (35.75%)
     3,226,709,573      L1-dcache-load-misses  #  46.43% of all L1-dcache accesses   ( +-  0.05% )  (35.75%)

  batched,32MB
     2,290,365,772      L1-dcache-loads        #  471.171 M/sec                      ( +-  0.36% )  (35.72%)
     1,144,426,272      L1-dcache-load-misses  #  49.97% of all L1-dcache accesses   ( +-  0.58% )  (35.70%)

  batched,1GB
        63,914,157      L1-dcache-loads        #   17.464 M/sec                      ( +-  8.08% )  (35.73%)
        22,074,367      L1-dcache-load-misses  #  34.54% of all L1-dcache accesses   ( +- 16.70% )  (35.70%)

The dropoff is also visible in L2 prefetch hits (miss numbers are on similar lines):

  contiguous-pages
     3,464,861,312      l2_pf_hit_l2.all       #  437.722 M/sec                      ( +-  0.74% )  (15.69%)

  batched,32MB
       883,750,087      l2_pf_hit_l2.all       #  181.223 M/sec                      ( +-  1.18% )  (15.71%)

  batched,1GB
         8,967,943      l2_pf_hit_l2.all       #    2.450 M/sec                      ( +- 17.92% )  (15.77%)

This largely decouples the frontend from the backend, since the clearing operation no longer needs to wait on loads from memory (we still need cacheline ownership, but that's a shorter path). This is most visible if we rerun the test above with boost=1 (3.66 GHz):

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

             contiguous-pages     batched-pages
             (GBps +- %stdev)     (GBps +- %stdev)

  pg-sz=2MB  26.08 +- 1.72%       26.13 +- 0.92%               -     preempt=*
  pg-sz=1GB  26.99 +- 0.62%       48.85 +- 2.19%          + 80.99%   preempt=none|voluntary
  pg-sz=1GB  27.69 +- 0.18%       75.18 +- 0.25%          +171.50%   preempt=full|lazy

Comparing these batched-pages numbers with the boost=0 ones: for a clock-speed gain of 42%, we gain 24.5% for batch-size=32MB and 42.5% for batch-size=1GB. In comparison, the baseline contiguous-pages case and both pg-sz=2MB cases are largely backend bound and so gain no more than ~10%.

Other platforms tested, Intel Icelakex (Oracle X9) and ARM64 Neoverse-N1 (Ampere Altra), both show an improvement of ~35% for pg-sz=2MB|1GB. The first goes from around 8GBps to 11GBps, and the second from 32GBps to 44GBps.

[ankur.a.arora@oracle.com: move the unit computation and make it a const]
Link: https://lkml.kernel.org/r/20260108060406.1693853-1-ankur.a.arora@oracle.com
Link: https://lkml.kernel.org/r/20260107072009.1615991-8-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
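A minimal sketch of the batching loop referenced above, assuming the clear_user_highpages() primitive from later in this series; the function shape and the min_t() unit computation are illustrative, not the exact patch:

    /*
     * Clear a contiguous page range in units of at most
     * PROCESS_PAGES_NON_PREEMPT_BATCH pages, invoking cond_resched()
     * between units to bound preemption latency under non-preemptible
     * models. (Preemptible models define cond_resched() away; the
     * actual patch clears the whole extent as one unit there.)
     */
    static void clear_contig_highpages(struct page *page, unsigned long addr,
                                       unsigned int npages)
    {
            unsigned int i, unit;

            for (i = 0; i < npages; i += unit) {
                    unit = min_t(unsigned int, npages - i,
                                 PROCESS_PAGES_NON_PREEMPT_BATCH);
                    clear_user_highpages(page + i, addr + i * PAGE_SIZE, unit);
                    cond_resched();
            }
    }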
2026-01-20mm: folio_zero_user: clear pages sequentiallyAnkur Arora
process_huge_pages(), used to clear hugepages, is optimized for cache locality. In particular, it processes a hugepage in 4KB page units and in a difficult-to-predict order: clearing pages in the periphery in a backwards or forwards direction, then converging inwards to the faulting page (or the page specified via base_addr). This helps maximize temporal locality at time of access.

However, while it keeps stores inside a 4KB page sequential, pages are ordered semi-randomly in a way that is not easy for the processor to predict. This limits the clearing bandwidth to what can be extracted from a single 4KB page. Consider the baseline bandwidth:

  $ perf bench mem mmap -p 2MB -f populate -s 64GB -l 3
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated mmap())
  # Copying 64GB bytes ...

      11.791097 GB/sec

(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13); region-size=64GB, local node; 2.56 GHz, boost=0.)

11.79 GBps amounts to around 323ns/4KB. With a memory access latency of ~100ns, that doesn't leave much room for help from, say, hardware prefetchers.

(Note that since this is a purely write workload, it's reasonable to assume that the processor does not need to prefetch any cachelines. However, for a processor to skip the prefetch, it would need to look at the access pattern and see that full cachelines were being written. This might be easily visible if clear_page() were using, say, x86 string instructions; less so if it were using a store loop. In any case, the existence of such predictors, or of appropriately helpful threshold values, is implementation specific. Additionally, even when the processor can skip the prefetch, coherence protocols still need to establish exclusive ownership, necessitating communication with remote caches.)

With that, the change is quite straightforward. Instead of clearing pages discontiguously, clear contiguously: switch to a loop around clear_user_highpage() (a sketch follows this entry).

Performance
==

Testing a demand fault workload shows a decent improvement in bandwidth with pg-sz=2MB. Performance of pg-sz=1GB does not change because it has always used straight clearing.

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

             discontiguous-pages   contiguous-pages
             (baseline)
             (GBps +- %stdev)      (GBps +- %stdev)

  pg-sz=2MB  11.76 +- 1.10%        23.58 +- 1.95%        +100.51%
  pg-sz=1GB  24.85 +- 2.41%        25.40 +- 1.33%            -

Analysis (pg-sz=2MB)
==

At the L1 data cache level, nothing changes. The processor continues to access the same number of cachelines, allocating and missing them as it writes to them.
  discontiguous-pages
     7,394,341,051      L1-dcache-loads        #  445.172 M/sec                      ( +-  0.04% )  (35.73%)
     3,292,247,227      L1-dcache-load-misses  #  44.52% of all L1-dcache accesses   ( +-  0.01% )  (35.73%)

  contiguous-pages
     7,205,105,282      L1-dcache-loads        #  861.895 M/sec                      ( +-  0.02% )  (35.75%)
     3,241,584,535      L1-dcache-load-misses  #  44.99% of all L1-dcache accesses   ( +-  0.00% )  (35.74%)

The L2 prefetcher, however, is now able to prefetch ~22% more cachelines (the L2 prefetch miss rate also goes up significantly, showing that we are backend limited):

  discontiguous-pages
     2,835,860,245      l2_pf_hit_l2.all       #  170.242 M/sec                      ( +-  0.12% )  (15.65%)

  contiguous-pages
     3,472,055,269      l2_pf_hit_l2.all       #  411.319 M/sec                      ( +-  0.62% )  (15.67%)

That still leaves a large gap between the ~22% improvement in prefetch and the ~100% improvement in bandwidth, but better prefetching seems to streamline the traffic well enough that most of the data starts coming from the L2, leading to substantially fewer cache misses at the LLC:

  discontiguous-pages
     8,493,499,137      cache-references       #  511.416 M/sec                      ( +-  0.15% )  (50.01%)
       930,501,344      cache-misses           #  10.96% of all cache refs           ( +-  0.52% )  (50.01%)

  contiguous-pages
     9,421,926,416      cache-references       #    1.120 G/sec                      ( +-  0.09% )  (50.02%)
        68,787,247      cache-misses           #   0.73% of all cache refs           ( +-  0.15% )  (50.03%)

In addition, there are a few minor frontend optimizations: clear_pages() on x86 is now fully inlined, so we don't have a CALL/RET pair (which isn't free when using the RETHUNK speculative-execution mitigation, as on my test system). The loop in clear_contig_highpages() is also easier to predict (especially when handling faults) compared to that in process_huge_pages().

  discontiguous-pages    980,014,411      branches        #  59.005 M/sec           (31.26%)
  discontiguous-pages    180,897,177      branch-misses   #  18.46% of all branches (31.26%)
  contiguous-pages       515,630,550      branches        #  62.654 M/sec           (31.27%)
  contiguous-pages        78,039,496      branch-misses   #  15.13% of all branches (31.28%)

Note that although clearing contiguously is easier for the processor to optimize, it does not, sadly, mean that the processor will necessarily take advantage of it. For instance, this change does not result in any improvement in my tests on Intel Icelakex (Oracle X9), or on ARM64 Neoverse-N1 (Ampere Altra).

Link: https://lkml.kernel.org/r/20260107072009.1615991-7-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
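A minimal sketch of the sequential approach, assuming an illustrative signature (the "clear page ranges" entry above, which is the later patch in the series, batches this same loop):

    /*
     * Clear a contiguous range of pages front to back, one page at a
     * time, so that stores stay sequential across the whole extent
     * and are easy for the hardware prefetchers to follow.
     */
    static void clear_contig_highpages(struct page *page, unsigned long addr,
                                       unsigned int npages)
    {
            unsigned int i;

            for (i = 0; i < npages; i++)
                    clear_user_highpage(page + i, addr + i * PAGE_SIZE);
    }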
2026-01-20x86/clear_page: introduce clear_pages()Ankur Arora
Performance when clearing with string instructions (x86-64-stosq and similar) can vary significantly based on the chunk size used.

  $ perf bench mem memset -k 4KB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      13.748208 GB/sec

  $ perf bench mem memset -k 2MB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      15.067900 GB/sec

  $ perf bench mem memset -k 1GB -s 4GB -f x86-64-stosq
  # Running 'mem/memset' benchmark:
  # function 'x86-64-stosq' (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 4GB bytes ...

      38.104311 GB/sec

(All on AMD Milan.)

With a change in chunk size from 4KB to 1GB, performance goes from 13.7 GB/sec to 38.1 GB/sec. For a chunk size of 2MB the change isn't quite as drastic, but it is still worth adding a clear_page() variant that can handle contiguous page extents (a sketch follows this entry).

Link: https://lkml.kernel.org/r/20260107072009.1615991-6-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
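A minimal sketch of what a range-clearing primitive built on "REP; STOSQ" could look like; the exact kernel implementation differs, this only shows why exposing the full extent helps (the CPU sees the whole count in %rcx and can pick an optimized path, e.g. non-allocating stores for large extents):

    static inline void clear_pages(void *addr, unsigned int npages)
    {
            /* Count in 8-byte quadwords for STOSQ. */
            unsigned long qwords = (unsigned long)npages * PAGE_SIZE / 8;

            /* Store %rax (zero) %rcx times, starting at (%rdi). */
            asm volatile("rep stosq"
                         : "+c" (qwords), "+D" (addr)
                         : "a" (0UL)
                         : "memory");
    }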
2026-01-20x86/mm: simplify clear_page_*Ankur Arora
clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS" variations. Inlining them gets rid of an unnecessary CALL/RET pair (which isn't free when using RETHUNK speculative-execution mitigations).

Fix up and rename clear_page_orig() to adapt to the changed calling convention (a sketch of the inlined form follows this entry). Also add a comment from Dave Hansen detailing the various clearing mechanisms used in clear_page().

Link: https://lkml.kernel.org/r/20260107072009.1615991-5-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
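A hedged sketch of the effect of the inlining: the "REP; STOS" sequence is emitted at the call site instead of sitting behind a CALL. This shows only the ERMS-style path under an illustrative name; the real clear_page() selects between implementations via alternatives, which this sketch omits:

    static inline void clear_page_inline(void *page)
    {
            unsigned long bytes = PAGE_SIZE;

            /* "REP; STOSB" writes %al (zero) %rcx times to (%rdi). */
            asm volatile("rep stosb"
                         : "+c" (bytes), "+D" (page)
                         : "a" (0UL)
                         : "memory");
    }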
2026-01-20highmem: introduce clear_user_highpages()Ankur Arora
Define clear_user_highpages(), which uses the range-clearing primitive clear_user_pages(). We can safely use this when CONFIG_HIGHMEM is disabled and the architecture does not provide clear_user_highpage().

The first condition is needed to ensure that contiguous page ranges stay contiguous, which precludes intermediate mappings via HIGHMEM. The second, because if the architecture has clear_user_highpage(), it likely needs flushing magic when clearing the page, magic that we aren't privy to.

In both of those cases, just fall back to a loop around clear_user_highpage() (a sketch follows this entry).

Link: https://lkml.kernel.org/r/20260107072009.1615991-4-ankur.a.arora@oracle.com
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Li Zhe <lizhe.67@bytedance.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
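A minimal sketch of the fallback structure described above. The argument order of clear_user_pages() and the exact preprocessor test are illustrative assumptions, not the patch's actual definitions:

    static inline void clear_user_highpages(struct page *page,
                                            unsigned long vaddr,
                                            unsigned int npages)
    {
    #if !defined(CONFIG_HIGHMEM) && !defined(clear_user_highpage)
            /* Pages are direct-mapped and contiguous: clear as one range. */
            clear_user_pages(page_address(page), vaddr, page, npages);
    #else
            unsigned int i;

            /* HIGHMEM or arch-specific flushing: clear page by page. */
            for (i = 0; i < npages; i++)
                    clear_user_highpage(page + i, vaddr + i * PAGE_SIZE);
    #endif
    }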