linux.git/mm/Kconfig, branch v6.16

Merge tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

2025-06-07T16:56:18+00:00

Pull LoongArch updates from Huacai Chen:

 - Adjust the 'make install' operation

 - Support SCHED_MC (Multi-core scheduler)

 - Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS

 - Enable HAVE_ARCH_STACKLEAK

 - Increase max supported CPUs up to 2048

 - Introduce the numa_memblks conversion

 - Add PWM controller nodes in dts

 - Some bug fixes and other small changes

* tag 'loongarch-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  platform/loongarch: laptop: Unregister generic_sub_drivers on exit
  platform/loongarch: laptop: Add backlight power control support
  platform/loongarch: laptop: Get brightness setting from EC on probe
  LoongArch: dts: Add PWM support to Loongson-2K2000
  LoongArch: dts: Add PWM support to Loongson-2K1000
  LoongArch: dts: Add PWM support to Loongson-2K0500
  LoongArch: vDSO: Correctly use asm parameters in syscall wrappers
  LoongArch: Fix panic caused by NULL-PMD in huge_pte_offset()
  LoongArch: Preserve firmware configuration when desired
  LoongArch: Avoid using $r0/$r1 as "mask" for csrxchg
  LoongArch: Introduce the numa_memblks conversion
  LoongArch: Increase max supported CPUs up to 2048
  LoongArch: Enable HAVE_ARCH_STACKLEAK
  LoongArch: Enable ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
  LoongArch: Add SCHED_MC (Multi-core scheduler) support
  LoongArch: Add some annotations in archhelp
  LoongArch: Using generic scripts/install.sh in `make install`
  LoongArch: Add a default install.sh

mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order

2025-06-01T05:46:13+00:00

Problem: On large page size configurations (16KiB, 64KiB), the CMA
alignment requirement (CMA_MIN_ALIGNMENT_BYTES) increases considerably,
and this causes the CMA reservations to be larger than necessary.  This
means that the system will have fewer available MIGRATE_UNMOVABLE and
MIGRATE_RECLAIMABLE page blocks since MIGRATE_CMA can't fall back to them.

The CMA_MIN_ALIGNMENT_BYTES increases because it depends on MAX_PAGE_ORDER
which depends on ARCH_FORCE_MAX_ORDER.  The value of ARCH_FORCE_MAX_ORDER
increases on 16k and 64k kernels.

For example, in ARM, the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER default value is used
- CONFIG_TRANSPARENT_HUGEPAGE is set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order | CMA_MIN_ALIGNMENT_BYTES
-----------------------------------------------------------------------
   4KiB   |      10        |       9         |  4KiB * (2 ^  9) =   2MiB
  16KiB   |      11        |      11         | 16KiB * (2 ^ 11) =  32MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB

There are some extreme cases for the CMA alignment requirement when:

- CONFIG_ARCH_FORCE_MAX_ORDER maximum value is set
- CONFIG_TRANSPARENT_HUGEPAGE is NOT set
- CONFIG_HUGETLB_PAGE is NOT set:

PAGE_SIZE | MAX_PAGE_ORDER | pageblock_order |  CMA_MIN_ALIGNMENT_BYTES
------------------------------------------------------------------------
   4KiB   |      15        |      15         |  4KiB * (2 ^ 15) = 128MiB
  16KiB   |      13        |      13         | 16KiB * (2 ^ 13) = 128MiB
  64KiB   |      13        |      13         | 64KiB * (2 ^ 13) = 512MiB
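
The figures in both tables follow from CMA_MIN_ALIGNMENT_BYTES =
PAGE_SIZE * 2^pageblock_order.  A minimal Python sketch that reproduces
them is shown here; the helper name cma_min_alignment_bytes() is
illustrative only and is not a kernel symbol:

```python
KIB = 1024
MIB = 1024 * KIB


def cma_min_alignment_bytes(page_size: int, pageblock_order: int) -> int:
    """Alignment implied by the tables: PAGE_SIZE * 2^pageblock_order."""
    return page_size << pageblock_order


# (page size, pageblock_order) pairs copied from the two tables above.
default_cases = [(4 * KIB, 9), (16 * KIB, 11), (64 * KIB, 13)]
extreme_cases = [(4 * KIB, 15), (16 * KIB, 13), (64 * KIB, 13)]

for page_size, order in default_cases + extreme_cases:
    alignment = cma_min_alignment_bytes(page_size, order)
    print(f"{page_size // KIB:>2}KiB pages, order {order:>2}: "
          f"{alignment // MIB} MiB")
```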

This affects the CMA reservations for the drivers. If a driver in a
4KiB kernel needs 4MiB of CMA memory, in a 16KiB kernel, the minimal
reservation has to be 32MiB due to the alignment requirements:

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x400000>; /* 4 MiB */
        ...
    };
};

reserved-memory {
    ...
    cma_test_reserve: cma_test_reserve {
        compatible = "shared-dma-pool";
        size = <0x0 0x2000000>; /* 32 MiB */
        ...
    };
};
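
As a rough illustration of the two reservations above, and assuming that
the reservation size has to be a multiple of the CMA alignment, the
minimal size can be sketched by rounding the requested size up.  The
helper name min_cma_reservation() is hypothetical; the 2MiB and 32MiB
alignments are taken from the first table:

```python
MIB = 1024 * 1024


def min_cma_reservation(requested_bytes: int, alignment_bytes: int) -> int:
    """Round the requested size up to the next multiple of the alignment."""
    blocks = -(-requested_bytes // alignment_bytes)  # ceiling division
    return blocks * alignment_bytes


# A driver that needs 4MiB of CMA memory:
on_4k = min_cma_reservation(4 * MIB, 2 * MIB)     # 4KiB kernel, 2MiB alignment
on_16k = min_cma_reservation(4 * MIB, 32 * MIB)   # 16KiB kernel, 32MiB alignment
print(on_4k // MIB, "MiB on 4KiB pages;", on_16k // MIB, "MiB on 16KiB pages")
```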

Solution: Add a new config option, CONFIG_PAGE_BLOCK_ORDER, that allows
setting the page block order on all architectures.  The maximum page
block order will be given by ARCH_FORCE_MAX_ORDER.

By default, CONFIG_PAGE_BLOCK_ORDER will have the same value as
ARCH_FORCE_MAX_ORDER.  This makes sure that current kernel
configurations won't be affected by this change.  It is an opt-in change.

This patch allows large page size kernels (16KiB, 64KiB) to have the same
CMA alignment requirements as 4KiB kernels by setting a lower
pageblock_order.

Tests:

- Verified that HugeTLB pages work when pageblock_order is 1, 7, 10 on
  4k and 16k kernels.

- Verified that Transparent Huge Pages work when pageblock_order is 1,
  7, 10 on 4k and 16k kernels.

- Verified that dma-buf heaps allocations work when pageblock_order is
  1, 7, 10 on 4k and 16k kernels.

Benchmarks:

The benchmarks compare 16KiB kernels with pageblock_order 10 and 7.
pageblock_order 7 was chosen because it makes the minimum CMA alignment
requirement the same as on 4KiB kernels (2MiB).
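
The choice of order 7 can be checked by inverting the alignment formula:
find n such that PAGE_SIZE * 2^n equals the 2MiB target.  This is only a
sketch; pageblock_order_for() is a hypothetical helper, not something the
patch adds:

```python
KIB = 1024
MIB = 1024 * KIB


def pageblock_order_for(page_size: int, target_alignment: int) -> int:
    """Return the order n such that page_size * 2^n == target_alignment."""
    pages, remainder = divmod(target_alignment, page_size)
    # The target must be page_size times a power of two.
    if pages < 1 or remainder or pages & (pages - 1):
        raise ValueError("target must be page_size times a power of two")
    return pages.bit_length() - 1


for page_size in (4 * KIB, 16 * KIB, 64 * KIB):
    order = pageblock_order_for(page_size, 2 * MIB)
    print(f"{page_size // KIB}KiB pages: pageblock_order {order}")
```

For 16KiB pages this yields order 7, the value used in the benchmarks,
and for 4KiB pages it yields order 9, as in the first table.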

- Perform 100K dma-buf heaps (/dev/dma_heap/system) allocations of
  SZ_8M, SZ_4M, SZ_2M, SZ_1M, SZ_64, SZ_8, SZ_4.  Use simpleperf
  (https://developer.android.com/ndk/guides/simpleperf) to measure the #
  of instructions and page-faults on 16k kernels.  The benchmark was
  executed 10 times.  The averages are below:

           # instructions         |    # page-faults
    order 10     |  order 7       | order 10 | order 7
--------------------------------------------------------
 13,891,765,770  | 11,425,777,314 |    220   |   217
 14,456,293,487  | 12,660,819,302 |    224   |   219
 13,924,261,018  | 13,243,970,736 |    217   |   221
 13,910,886,504  | 13,845,519,630 |    217   |   221
 14,388,071,190  | 13,498,583,098 |    223   |   224
 13,656,442,167  | 12,915,831,681 |    216   |   218
 13,300,268,343  | 12,930,484,776 |    222   |   218
 13,625,470,223  | 14,234,092,777 |    219   |   218
 13,508,964,965  | 13,432,689,094 |    225   |   219
 13,368,950,667  | 13,683,587,37  |    219   |   225
--------------------------------------------------------
 13,803,137,433  | 13,131,974,268 |    220   |   220    Averages

There were 4.86% fewer instructions with order 7 than with order 10:

     13,131,974,268 - 13,803,137,433 = -671,163,165 (-4.86%)

The average number of page faults was the same with order 7 and order 10.

These results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.

- Run speedometer 3.1 (https://browserbench.org/Speedometer3.1/) 5 times
  on the 16k kernels with pageblock_order 7 and 10.

order 10 | order 7  | order 7 - order 10 | (order 7 - order 10) %
-------------------------------------------------------------------
  15.8   |  16.4    |         0.6        |     3.80%
  16.4   |  16.2    |        -0.2        |    -1.22%
  16.6   |  16.3    |        -0.3        |    -1.81%
  16.8   |  16.3    |        -0.5        |    -2.98%
  16.6   |  16.8    |         0.2        |     1.20%
-------------------------------------------------------------------
  16.44  |  16.4    |        -0.04       |    -0.24%   Averages

The results didn't show any significant regression when the
pageblock_order is set to 7 on 16kb kernels.
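
The Speedometer averages and the relative difference can be recomputed
from the five runs in the table.  A small sketch, with the scores copied
from the table and variable names chosen here for illustration:

```python
# Speedometer 3.1 scores from the table above (higher is better).
order10 = [15.8, 16.4, 16.6, 16.8, 16.6]
order7 = [16.4, 16.2, 16.3, 16.3, 16.8]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


avg_order10 = mean(order10)
avg_order7 = mean(order7)
delta = avg_order7 - avg_order10
delta_percent = 100.0 * delta / avg_order10

print(f"order 10 average: {avg_order10:.2f}")
print(f"order 7 average:  {avg_order7:.2f}")
print(f"difference:       {delta:.2f} ({delta_percent:.2f}%)")
```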

Link: https://lkml.kernel.org/r/20250521215807.1860663-1-jyescas@google.com
Signed-off-by: Juan Yescas 
Acked-by: Zi Yan 
Reviewed-by: Vlastimil Babka 
Cc: Liam R. Howlett 
Cc: Lorenzo Stoakes 
Cc: David Hildenbrand 
Cc: Mike Rapoport 
Cc: Suren Baghdasaryan 
Cc: Minchan Kim 
Signed-off-by: Andrew Morton

LoongArch: Introduce the numa_memblks conversion

2025-05-30T13:45:43+00:00

Commit 87482708210ff3333a ("mm: introduce numa_memblks") has moved
numa_memblks from x86 to the generic code, but LoongArch was left out
of this conversion.

This patch introduces the generic numa_memblks for LoongArch.

In detail:
1. Enable NUMA_MEMBLKS (but disable NUMA_EMU) in Kconfig;
2. Use generic definition for numa_memblk and numa_meminfo;
3. Use generic implementation for numa_add_memblk() and its friends;
4. Use generic implementation for numa_set_distance() and its friends;
5. Use generic implementation for memory_add_physaddr_to_nid() and its
   friends.

Note: NUMA_EMU is disabled because it needs more effort and there is no
obvious demand for it now.

Tested-by: Binbin Zhou 
Signed-off-by: Yuquan Wang 
Signed-off-by: Huacai Chen

mm: khugepaged: decouple SHMEM and file folios' collapse

2025-05-22T21:55:38+00:00

Originally, file page collapse was intended for tmpfs/shmem, merging pages
into THPs in the background.  Now, however, large folios are supported not
only by tmpfs/shmem but also by some other file systems (such as XFS,
erofs ...).  Therefore, it is time to decouple the support of file folio
collapse from SHMEM.

Link: https://lkml.kernel.org/r/ce5c2314e0368cf34bda26f9bacf01c982d4da17.1747119309.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang 
Acked-by: David Hildenbrand 
Acked-by: Zi Yan 
Cc: Dev Jain 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Mariano Pache 
Cc: Michal Hocko 
Cc: Mike Rapoport 
Cc: Ryan Roberts 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

memblock: add support for scratch memory

2025-05-13T06:50:39+00:00

With KHO (Kexec HandOver), we need a way to ensure that the new kernel
does not allocate memory on top of any memory regions that the previous
kernel was handing over.  But to know where those are, we need to include
them in the memblock.reserved array which may not be big enough to hold
all ranges that need to be persisted across kexec.  To resize the array,
we need to allocate memory.  That brings us into a catch-22 situation.

The solution is to limit memblock allocations to the scratch regions:
safe regions to operate in the case when there is memory that should
remain intact across kexec.

KHO provides several "scratch regions" as part of its metadata.  These
scratch regions are contiguous memory blocks that are known not to contain
any memory that should be persisted across kexec.  These regions should be
large enough to accommodate all memblock allocations done by the kexeced
kernel.

We introduce a new memblock_set_scratch_only() function that allows KHO to
indicate that any memblock allocation must happen from the scratch
regions.

Later, we may want to perform another KHO kexec.  For that, we reuse the
same scratch regions.  To ensure that no data that is eventually handed
over gets allocated inside a scratch region, we flip the semantics of the
scratch region with memblock_clear_scratch_only(): after that call, no
allocations may happen from scratch memblock regions.  We will lift that
restriction in the next patch.
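
The allocation policy described above can be pictured with a toy model.
This is only a sketch of the behaviour stated in this message, not the
memblock implementation: Region, ToyMemblock and alloc() are hypothetical
names, and only set_scratch_only()/clear_scratch_only() mirror the
memblock_set_scratch_only()/memblock_clear_scratch_only() calls named
here:

```python
from dataclasses import dataclass


@dataclass
class Region:
    base: int
    size: int
    scratch: bool   # True for a KHO scratch region
    used: int = 0   # bytes already handed out (simple bump allocation)


class ToyMemblock:
    def __init__(self, regions):
        self.regions = regions
        self.scratch_only = False

    def set_scratch_only(self):
        # From now on, allocate from scratch regions only.
        self.scratch_only = True

    def clear_scratch_only(self):
        # Flip the semantics: scratch regions are now off limits.
        self.scratch_only = False

    def alloc(self, size):
        for region in self.regions:
            if region.scratch != self.scratch_only:
                continue  # region not allowed in the current mode
            if region.size - region.used >= size:
                addr = region.base + region.used
                region.used += size
                return addr
        return None  # no suitable region


mb = ToyMemblock([Region(0x1000, 0x1000, scratch=False),
                  Region(0x8000, 0x1000, scratch=True)])
mb.set_scratch_only()
early = mb.alloc(0x100)   # served from the scratch region
mb.clear_scratch_only()
late = mb.alloc(0x100)    # served from the non-scratch region
print(hex(early), hex(late))
```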

Link: https://lkml.kernel.org/r/20250509074635.3187114-3-changyuanl@google.com
Signed-off-by: Alexander Graf 
Co-developed-by: Mike Rapoport (Microsoft) 
Signed-off-by: Mike Rapoport (Microsoft) 
Signed-off-by: Changyuan Lyu 
Cc: Andy Lutomirski 
Cc: Anthony Yznaga 
Cc: Arnd Bergmann 
Cc: Ashish Kalra 
Cc: Ben Herrenschmidt 
Cc: Borislav Petkov
Cc: Catalin Marinas 
Cc: Dave Hansen 
Cc: David Woodhouse 
Cc: Eric Biederman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: James Gowans 
Cc: Jason Gunthorpe 
Cc: Jonathan Corbet 
Cc: Krzysztof Kozlowski 
Cc: Mark Rutland
Cc: Paolo Bonzini 
Cc: Pasha Tatashin 
Cc: Peter Zijlstra 
Cc: Pratyush Yadav 
Cc: Rob Herring 
Cc: Saravana Kannan 
Cc: Stanislav Kinsburskii 
Cc: Steven Rostedt 
Cc: Thomas Gleixner
Cc: Thomas Lendacky 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

Disable SLUB_TINY for build testing

2025-04-06T17:00:04+00:00

... and don't error out so hard on missing module descriptions.

Before commit 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()")
we used to warn about missing module descriptions, but only when
building with extra warnings (i.e. 'W=1').

After that commit the warning became an unconditional hard error.

And it turns out not all modules have been converted despite the claims
to the contrary.  As reported by Damian Tometzki, the slub KUnit test
didn't have a module description, and apparently nobody ever really
noticed.

The reason nobody noticed seems to be that the slub KUnit tests get
disabled by SLUB_TINY, which also ends up disabling a lot of other code,
both in tests and in slub itself.  And so anybody doing full build tests
didn't actually see this failure.

So let's disable SLUB_TINY for build-only tests, since it clearly ends
up limiting build coverage.  Also turn the missing module descriptions
error back into a warning, but let's keep it around for non-'W=1'
builds.

Reported-by: Damian Tometzki 
Link: https://lore.kernel.org/all/01070196099fd059-e8463438-7b1b-4ec8-816d-173874be9966-000000@eu-central-1.amazonses.com/
Cc: Masahiro Yamada 
Cc: Jeff Johnson 
Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()")
Signed-off-by: Linus Torvalds

Merge tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2025-04-01T16:29:18+00:00

Pull MM updates from Andrew Morton:

- The series "Enable strict percpu address space checks" from Uros
Bizjak uses x86 named address space qualifiers to provide
compile-time checking of percpu area accesses.

This has caused a small amount of fallout - two or three issues were
reported. In all cases the calling code was found to be incorrect.

- The series "Some cleanup for memcg" from Chen Ridong implements some
relatively minor cleanups for the memcontrol code.

- The series "mm: fixes for device-exclusive entries (hmm)" from David
Hildenbrand fixes a boatload of issues which David found when using
device-exclusive PTE entries when THP is enabled. More work is
needed, but this makes things better - our own HMM selftests now
succeed.

- The series "mm: zswap: remove z3fold and zbud" from Yosry Ahmed
removes the z3fold and zbud implementations. They have been deprecated
for half a year and nobody has complained.

- The series "mm: further simplify VMA merge operation" from Lorenzo
Stoakes implements numerous simplifications in this area. No runtime
effects are anticipated.

- The series "mm/madvise: remove redundant mmap_lock operations from
process_madvise()" from SeongJae Park rationalizes the locking in the
madvise() implementation. Performance gains of 20-25% were observed
in one MADV_DONTNEED microbenchmark.

- The series "Tiny cleanup and improvements about SWAP code" from
Baoquan He contains a number of touchups to issues which Baoquan
noticed when working on the swap code.

- The series "mm: kmemleak: Usability improvements" from Catalin
Marinas implements a couple of improvements to the kmemleak
user-visible output.

- The series "mm/damon/paddr: fix large folios access and schemes
handling" from Usama Arif provides a couple of fixes for DAMON's
handling of large folios.

- The series "mm/damon/core: fix wrong and/or useless damos_walk()
behaviors" from SeongJae Park fixes a few issues with the accuracy of
kdamond's walking of DAMON regions.

- The series "expose mapping wrprotect, fix fb_defio use" from Lorenzo
Stoakes changes the interaction between framebuffer deferred-io and
core MM. No functional changes are anticipated - this is preparatory
work for the future removal of page structure fields.

- The series "mm/damon: add support for hugepage_size DAMOS filter"
from Usama Arif adds a DAMOS filter which permits the filtering by
huge page sizes.

- The series "mm: permit guard regions for file-backed/shmem mappings"
from Lorenzo Stoakes extends the guard region feature from its
present "anon mappings only" state. The feature now covers shmem and
file-backed mappings.

- The series "mm: batched unmap lazyfree large folios during
reclamation" from Barry Song cleans up and speeds up the unmapping
for pte-mapped large folios.

- The series "reimplement per-vma lock as a refcount" from Suren
Baghdasaryan puts the vm_lock back into the vma. Our reasons for
pulling it out were largely bogus and that change made the code more
messy. This patchset provides small (0-10%) improvements on one
microbenchmark.

- The series "Docs/mm/damon: misc DAMOS filters documentation fixes and
improves" from SeongJae Park does some maintenance work on the DAMON
docs.

- The series "hugetlb/CMA improvements for large systems" from Frank
van der Linden addresses a pile of issues which have been observed
when using CMA on large machines.

- The series "mm/damon: introduce DAMOS filter type for unmapped pages"
from SeongJae Park enables users of DAMON/DAMOS to filter by the
page's mapped/unmapped status.

- The series "zsmalloc/zram: there be preemption" from Sergey
Senozhatsky teaches zram to run its compression and decompression
operations preemptibly.

- The series "selftests/mm: Some cleanups from trying to run them" from
Brendan Jackman fixes a pile of unrelated issues which Brendan
encountered while running our selftests.

- The series "fs/proc/task_mmu: add guard region bit to pagemap" from
Lorenzo Stoakes permits userspace to use /proc/pid/pagemap to
determine whether a particular page is a guard page.

- The series "mm, swap: remove swap slot cache" from Kairui Song
removes the swap slot cache from the allocation path - it simply
wasn't being effective.

- The series "mm: cleanups for device-exclusive entries (hmm)" from
David Hildenbrand implements a number of unrelated cleanups in this
code.

- The series "mm: Rework generic PTDUMP configs" from Anshuman Khandual
implements a number of preparatory cleanups to the GENERIC_PTDUMP
Kconfig logic.

- The series "mm/damon: auto-tune aggregation interval" from SeongJae
Park implements a feedback-driven automatic tuning feature for
DAMON's aggregation interval.

- The series "Fix lazy mmu mode" from Ryan Roberts fixes some issues in
powerpc, sparc and x86 lazy MMU implementations. Ryan did this in
preparation for implementing lazy mmu mode for arm64 to optimize
vmalloc.

- The series "mm/page_alloc: Some clarifications for migratetype
fallback" from Brendan Jackman reworks some commentary to make the
code easier to follow.

- The series "page_counter cleanup and size reduction" from Shakeel
Butt cleans up the page_counter code and fixes a size increase which
we accidentally added late last year.

- The series "Add a command line option that enables control of how
many threads should be used to allocate huge pages" from Thomas
Prescher does that. It allows the careful operator to significantly
reduce boot time by tuning the parallelization of huge page
initialization.

- The series "Fix calculations in trace_balance_dirty_pages() for cgwb"
from Tang Yizhou fixes the tracing output from the dirty page
balancing code.

- The series "mm/damon: make allow filters after reject filters useful
and intuitive" from SeongJae Park improves the handling of allow and
reject filters. Behaviour is made more consistent and the documentation
is updated accordingly.

- The series "Switch zswap to object read/write APIs" from Yosry Ahmed
updates zswap to the new object read/write APIs and thus permits the
removal of some legacy code from zpool and zsmalloc.

- The series "Some trivial cleanups for shmem" from Baolin Wang does as
it claims.

- The series "fs/dax: Fix ZONE_DEVICE page reference counts" from
Alistair Popple regularizes the weird ZONE_DEVICE page refcount
handling in DAX, permitting the removal of a number of special-case
checks.

- The series "refactor mremap and fix bug" from Lorenzo Stoakes is a
preparatory refactoring and cleanup of the mremap() code.

- The series "mm: MM owner tracking for large folios (!hugetlb) +
CONFIG_NO_PAGE_MAPCOUNT" from David Hildenbrand reworks the manner in
which we determine whether a large folio is known to be mapped
exclusively into a single MM.

- The series "mm/damon: add sysfs dirs for managing DAMOS filters based
on handling layers" from SeongJae Park adds a couple of new sysfs
directories to ease the management of DAMON/DAMOS filters.

- The series "arch, mm: reduce code duplication in mem_init()" from
Mike Rapoport consolidates many per-arch implementations of
mem_init() into generic code, where that is practical.

- The series "mm/damon/sysfs: commit parameters online via
damon_call()" from SeongJae Park continues the cleaning up of sysfs
access to DAMON internal data.

- The series "mm: page_ext: Introduce new iteration API" from Luiz
Capitulino reworks the page_ext initialization to fix a boot-time
crash which was observed with an unusual combination of compile and
cmdline options.

- The series "Buddy allocator like (or non-uniform) folio split" from
Zi Yan reworks the code to split a folio into smaller folios. The
main benefit is lessened memory consumption: fewer post-split folios
are generated.

- The series "Minimize xa_node allocation during xarry split" from Zi
Yan reduces the number of xarray xa_nodes which are generated during
an xarray split.

- The series "drivers/base/memory: Two cleanups" from Gavin Shan
performs some maintenance work on the drivers/base/memory code.

- The series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages" from Martin Liu adds some more tracepoints to the
page allocator code.

- The series "mm/madvise: cleanup requests validations and
classifications" from SeongJae Park cleans up some warts which
SeongJae observed during his earlier madvise work.

- The series "mm/hwpoison: Fix regressions in memory failure handling"
from Shuai Xue addresses two quite serious regressions which Shuai
has observed in the memory-failure implementation.

- The series "mm: reliable huge page allocator" from Johannes Weiner
makes huge page allocations cheaper and more reliable by reducing
fragmentation.

- The series "Minor memcg cleanups & prep for memdescs" from Matthew
Wilcox is preparatory work for the future implementation of memdescs.

- The series "track memory used by balloon drivers" from Nico Pache
introduces a way to track memory used by our various balloon drivers.

- The series "mm/damon: introduce DAMOS filter type for active pages"
from Nhat Pham permits users to filter for active/inactive pages,
separately for file and anon pages.

- The series "Adding Proactive Memory Reclaim Statistics" from Hao Jia
separates the proactive reclaim statistics from the direct reclaim
statistics.

- The series "mm/vmscan: don't try to reclaim hwpoison folio" from
Jinjiang Tu fixes our handling of hwpoisoned pages within the reclaim
code.

* tag 'mm-stable-2025-03-30-16-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (431 commits)
mm/page_alloc: remove unnecessary __maybe_unused in order_to_pindex()
x86/mm: restore early initialization of high_memory for 32-bits
mm/vmscan: don't try to reclaim hwpoison folio
mm/hwpoison: introduce folio_contain_hwpoisoned_page() helper
cgroup: docs: add pswpin and pswpout items in cgroup v2 doc
mm: vmscan: split proactive reclaim statistics from direct reclaim statistics
selftests/mm: speed up split_huge_page_test
selftests/mm: uffd-unit-tests support for hugepages > 2M
docs/mm/damon/design: document active DAMOS filter type
mm/damon: implement a new DAMOS filter type for active pages
fs/dax: don't disassociate zero page entries
MM documentation: add "Unaccepted" meminfo entry
selftests/mm: add commentary about 9pfs bugs
fork: use __vmalloc_node() for stack allocation
docs/mm: Physical Memory: Populate the "Zones" section
xen: balloon: update the NR_BALLOON_PAGES state
hv_balloon: update the NR_BALLOON_PAGES state
balloon_compaction: update the NR_BALLOON_PAGES state
meminfo: add a per node counter for balloon drivers
mm: remove references to folio in __memcg_kmem_uncharge_page()
...

mm: CONFIG_NO_PAGE_MAPCOUNT to prepare for not maintain per-page mapcounts in large folios

2025-03-18T05:06:46+00:00

We're close to the finishing line: let's introduce a new
CONFIG_NO_PAGE_MAPCOUNT config option where we will incrementally remove
any dependencies on per-page mapcounts in large folios.  Once that's done,
we'll stop maintaining the per-page mapcounts with this config option
enabled.

CONFIG_NO_PAGE_MAPCOUNT will be EXPERIMENTAL for now, as we'll have to
learn about the real-world impact of some of its implications.

As writing "!CONFIG_NO_PAGE_MAPCOUNT" is really nasty, let's introduce a
helper config option "CONFIG_PAGE_MAPCOUNT" that expresses the negation.

Link: https://lkml.kernel.org/r/20250303163014.1128035-16-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Dave Hansen 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kirill A. Shutemov 
Cc: Lance Yang 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox (Oracle)
Cc: Michal Koutný
Cc: Muchun Song
Cc: Tejun Heo
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Zefan Li 
Signed-off-by: Andrew Morton

mm/rmap: basic MM owner tracking for large folios (!hugetlb)

2025-03-18T05:06:46+00:00

For small folios, we traditionally use the mapcount to decide whether it
was "certainly mapped exclusively" by a single MM (mapcount == 1) or
whether it "maybe mapped shared" by multiple MMs (mapcount > 1).  For
PMD-sized folios that were PMD-mapped, we were able to use a similar
mechanism (single PMD mapping), but for PTE-mapped folios and in the
future folios that span multiple PMDs, this does not work.

So we need a different mechanism to handle large folios.  Let's add a new
mechanism to detect whether a large folio is "certainly mapped
exclusively", or whether it is "maybe mapped shared".

We'll use this information next to optimize CoW reuse for PTE-mapped
anonymous THP, and to convert folio_likely_mapped_shared() to
folio_maybe_mapped_shared(), independent of per-page mapcounts.

For each large folio, we'll have two slots, whereby a slot stores:
 (1) an MM id: unique id assigned to each MM
 (2) a per-MM mapcount

If a slot is unoccupied, it can be taken by the next MM that maps a folio
page.

In addition, we'll remember the current state -- "mapped exclusively" vs. 
"maybe mapped shared" -- and use a bit spinlock to sync on updates and to
reduce the total number of atomic accesses on updates.  In the future, it
might be possible to squeeze a proper spinlock into "struct folio".  For
now, keep it simple, as we require the whole thing with THP only, which is
incompatible with RT.

As we have to squeeze this information into the "struct folio" of even
folios of order-1 (2 pages), and we generally want to reduce the required
metadata, we'll assign each MM a unique ID that can fit into an int.  In
total, we can squeeze everything into 4x int (2x long) on 64bit.

32bit support is a bit challenging, because we only have 2x long == 2x int
in order-1 folios.  But we can make it work for now, because we neither
expect many MMs nor very large folios on 32bit.

We will reliably detect folios as "mapped exclusively" vs.  "mapped
shared" as long as only two MMs map pages of a folio at one point in time
-- for example with fork() and short-lived child processes, or with apps
that hand over state from one instance to another.

As soon as three MMs are involved at the same time, we might detect "maybe
mapped shared" although the folio is "mapped exclusively".

Example 1:

(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(3) App1 unmaps all folio pages

 -> We will detect "mapped exclusively".

Example 2:

(1) App1 faults in a (shmem/file-backed) folio page -> Tracked as MM0
(2) App2 faults in a folio page -> Tracked as MM1
(3) App3 faults in a folio page -> No slot available, tracked as "unknown"
(4) App1 and App2 unmap all folio pages

 -> We will detect "maybe mapped shared".
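
Both examples can be replayed with a toy model of the two-slot scheme.
This is a simplified sketch, not the kernel code: ToyLargeFolio and its
methods are hypothetical names, locking and the ID encoding are left out,
and the state is recomputed from the slots instead of being kept in a
flag bit:

```python
class ToyLargeFolio:
    """Toy model: two (MM id, per-MM mapcount) slots plus a total mapcount."""

    NUM_SLOTS = 2

    def __init__(self):
        self.mapcount = 0   # total number of page mappings of this folio
        self.slots = {}     # MM id -> per-MM mapcount, at most NUM_SLOTS

    def map_page(self, mm_id):
        self.mapcount += 1
        if mm_id in self.slots:
            self.slots[mm_id] += 1
        elif len(self.slots) < self.NUM_SLOTS:
            self.slots[mm_id] = 1   # take an unoccupied slot
        # Otherwise no slot is available: the mapping is tracked as "unknown".

    def unmap_page(self, mm_id):
        self.mapcount -= 1
        if mm_id in self.slots:
            self.slots[mm_id] -= 1
            if self.slots[mm_id] == 0:
                del self.slots[mm_id]   # slot becomes unoccupied again

    def maybe_mapped_shared(self):
        if self.mapcount == 0:
            return False
        # "Mapped exclusively" only if one tracked MM owns every mapping.
        return all(count != self.mapcount for count in self.slots.values())


def example_1():
    folio = ToyLargeFolio()
    folio.map_page("App1")     # tracked in the first slot
    folio.map_page("App2")     # tracked in the second slot
    folio.unmap_page("App1")
    return folio.maybe_mapped_shared()


def example_2():
    folio = ToyLargeFolio()
    folio.map_page("App1")
    folio.map_page("App2")
    folio.map_page("App3")     # no slot available
    folio.unmap_page("App1")
    folio.unmap_page("App2")
    return folio.maybe_mapped_shared()


print("Example 1 maybe mapped shared:", example_1())
print("Example 2 maybe mapped shared:", example_2())
```

In this model Example 1 ends up "mapped exclusively", while Example 2
stays "maybe mapped shared" because the only remaining mapping belongs to
an MM that never got a slot.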

Make use of __always_inline to keep possible performance degradation when
(un)mapping large folios to a minimum.

Note: by squeezing the two flags into the "unsigned long" that stores the
MM ids, we can use non-atomic __bit_spin_unlock() and non-atomic
setting/clearing of the "maybe mapped shared" bit, effectively not adding
any new atomics on the hot path when updating the large mapcount + new
metadata, which further helps reduce the runtime overhead in
micro-benchmarks.

Link: https://lkml.kernel.org/r/20250303163014.1128035-13-david@redhat.com
Signed-off-by: David Hildenbrand 
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Dave Hansen 
Cc: Ingo Molnar 
Cc: Jann Horn 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kirill A. Shutemov 
Cc: Lance Yang 
Cc: Liam Howlett 
Cc: Lorenzo Stoakes 
Cc: Matthew Wilcox (Oracle)
Cc: Michal Koutný
Cc: Muchun Song
Cc: Tejun Heo
Cc: Thomas Gleixner 
Cc: Vlastimil Babka 
Cc: Zefan Li 
Signed-off-by: Andrew Morton

mm/sparse: allow for alternate vmemmap section init at boot

2025-03-17T05:06:27+00:00

Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized. 
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.

This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP.  Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).

Refactor some code and export a few convenience functions for external
use.

In sparse_init_nid, skip any sections that are already initialized, e.g. 
they have been initialized by sparse_vmemmap_init_nid_early already.

The hugetlb code to use these functions will be added in a later commit.

Export section_map_size, as any alternate memmap init code will want to
use it.

The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set.  In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.

The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h). 
Architectures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.

Fortunately, as of right now, all sparse vmemmap using architectures do
have room.

Link: https://lkml.kernel.org/r/20250228182928.2645936-11-fvdl@google.com
Signed-off-by: Frank van der Linden 
Cc: Johannes Weiner 
Cc: Alexander Gordeev 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Dan Carpenter 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Heiko Carstens 
Cc: Joao Martins 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Muchun Song 
Cc: Oscar Salvador 
Cc: Peter Zijlstra 
Cc: Roman Gushchin (Cruise) 
Cc: Usama Arif 
Cc: Vasily Gorbik 
Cc: Yu Zhao 
Cc: Zi Yan 
Signed-off-by: Andrew Morton