linux.git/mm/sparse-vmemmap.c, branch v6.16

mm/hugetlb: do pre-HVO for bootmem allocated pages

2025-03-17T05:06:29+00:00

For large systems, the overhead of vmemmap pages for hugetlb is
substantial.  It's about 1.5% of memory, which is about 45G for a 3T
system.  If you want to configure most of that system for hugetlb (e.g. 
to use as backing memory for VMs), there is a chance of running out of
memory on boot, even though you know that the 45G will become available
later.
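
As a rough check of the figures above, the overhead can be sketched as
follows.  The 64-byte page structure and 4 KiB base page are assumptions
typical of x86-64 and are not stated in this commit; the function name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, typical for x86-64; not taken from the commit. */
#define ASSUMED_PAGE_SIZE        4096ULL
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL

static_assert(ASSUMED_PAGE_SIZE % ASSUMED_STRUCT_PAGE_SIZE == 0,
	      "page structures must pack evenly into a page");

/* Bytes of page structures needed to describe mem_bytes of memory. */
static uint64_t vmemmap_overhead_bytes(uint64_t mem_bytes)
{
	return mem_bytes / ASSUMED_PAGE_SIZE * ASSUMED_STRUCT_PAGE_SIZE;
}
```

Under these assumptions the ratio is 64/4096, or about 1.56%, and a 3 TiB
system needs 48 GiB of page structures, in line with the approximate
numbers quoted above.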

To avoid this scenario, and since it's a waste to first allocate and then
free that 45G during boot, do pre-HVO for hugetlb bootmem allocated pages
('gigantic' pages).

pre-HVO is done by adding functions that are called from
sparse_init_nid_early and sparse_init_nid_late.  The first is called
before memmap allocation, so it takes care of allocating memmap HVO-style.
The second verifies that all bootmem pages look good; specifically, it
checks that they do not intersect with multiple zones.  This can only be
done from the sparse_init_nid_late path, when zones have been initialized.

The hugetlb page size must be aligned to the section size, and aligned to
the size of memory described by the number of page structures contained in
one PMD (since pre-HVO is not prepared to split PMDs).  This should be
true for most 'gigantic' pages; it is for 1G pages on x86, where both of
these alignment requirements are 128M.
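
The 128M figure can be reproduced with a short sketch.  All constants
below are assumed x86-64 values (4 KiB pages, 2 MiB PMD mappings, 64-byte
page structures, 128 MiB sections) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 values; not taken from the commit. */
#define ASSUMED_PAGE_SIZE        4096ULL
#define ASSUMED_PMD_SIZE         (2ULL << 20)
#define ASSUMED_STRUCT_PAGE_SIZE 64ULL
#define ASSUMED_SECTION_SIZE     (128ULL << 20)

static_assert(ASSUMED_PMD_SIZE % ASSUMED_STRUCT_PAGE_SIZE == 0,
	      "page structures must pack evenly into a PMD mapping");

/* Memory described by the page structures contained in one PMD. */
static uint64_t mem_per_vmemmap_pmd(void)
{
	return ASSUMED_PMD_SIZE / ASSUMED_STRUCT_PAGE_SIZE * ASSUMED_PAGE_SIZE;
}

/* Returns 1 if a huge page size meets both alignment requirements. */
static int prehvo_size_ok(uint64_t hpage_size)
{
	return hpage_size % ASSUMED_SECTION_SIZE == 0 &&
	       hpage_size % mem_per_vmemmap_pmd() == 0;
}
```

With these values one PMD holds 32768 page structures, which describe
128 MiB of memory, so a 1 GiB page qualifies and a 2 MiB page does not.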

This will only have an effect if hugetlb_bootmem_alloc was called early in
boot.  If not, it won't do anything, and HVO for bootmem hugetlb pages
works as before.

Link: https://lkml.kernel.org/r/20250228182928.2645936-20-fvdl@google.com
Signed-off-by: Frank van der Linden 
Cc: Alexander Gordeev 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Dan Carpenter 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Heiko Carstens 
Cc: Joao Martins 
Cc: Johannes Weiner 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Muchun Song 
Cc: Oscar Salvador 
Cc: Peter Zijlstra 
Cc: Roman Gushchin (Cruise) 
Cc: Usama Arif 
Cc: Vasily Gorbik 
Cc: Yu Zhao 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

mm/sparse: add vmemmap_*_hvo functions

2025-03-17T05:06:28+00:00

Add a few functions to enable early HVO:

vmemmap_populate_hvo
vmemmap_undo_hvo
vmemmap_wrprotect_hvo

The populate and undo functions are expected to be used in early init,
from the sparse_init_nid_early() function.  The wrprotect function is to
be used, potentially, later.

To implement these functions, mostly re-use the existing compound pages
vmemmap logic used by DAX.  vmemmap_populate_address has its argument
changed a bit in this commit: the page structure passed in to be reused in
the mapping is replaced by a PFN and a flag.  The flag indicates whether
an extra ref should be taken on the vmemmap page containing the head page
structure.  Taking the ref is appropriate for DAX / ZONE_DEVICE, but
not for HugeTLB HVO.

The HugeTLB vmemmap optimization maps tail page structure pages read-only.
The vmemmap_wrprotect_hvo function that does this is implemented
separately, because it cannot be guaranteed that reserved page structures
will not be write accessed during memory initialization.  Even with
CONFIG_DEFERRED_STRUCT_PAGE_INIT, they might still be written to (if they
are at the bottom of a zone).  So, vmemmap_populate_hvo leaves the tail
page structure pages RW initially, and then later during initialization,
after memmap init is fully done, vmemmap_wrprotect_hvo must be called to
finish the job.
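
The two-phase approach can be illustrated with a toy userspace model.
This is only a sketch: struct toy_pte and both functions are hypothetical
stand-ins, not the kernel's types or the functions added here, and the
model assumes the head occupies exactly one vmemmap page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one vmemmap PTE; not a kernel type. */
struct toy_pte {
	unsigned long pfn;	/* backing page frame */
	bool writable;
};

/* Phase 1: map every slot to the head's frame, all still writable. */
static void toy_populate_hvo(struct toy_pte *ptes, size_t n,
			     unsigned long head_pfn)
{
	assert(ptes != NULL);
	for (size_t i = 0; i < n; i++) {
		ptes[i].pfn = head_pfn;
		ptes[i].writable = true;
	}
}

/* Phase 2: after memmap init is done, make the tail slots read-only. */
static void toy_wrprotect_hvo(struct toy_pte *ptes, size_t n)
{
	assert(ptes != NULL);
	for (size_t i = 1; i < n; i++)
		ptes[i].writable = false;
}
```

Keeping the second phase separate lets early initialization code write to
the tail slots without faulting, which is the concern described above.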

Subsequent commits will use these functions for early HugeTLB HVO.

Link: https://lkml.kernel.org/r/20250228182928.2645936-15-fvdl@google.com
Signed-off-by: Frank van der Linden 
Cc: Alexander Gordeev 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Dan Carpenter 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Heiko Carstens 
Cc: Joao Martins 
Cc: Johannes Weiner 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Muchun Song 
Cc: Oscar Salvador 
Cc: Peter Zijlstra 
Cc: Roman Gushchin (Cruise) 
Cc: Usama Arif 
Cc: Vasily Gorbik 
Cc: Yu Zhao 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

mm/sparse: allow for alternate vmemmap section init at boot

2025-03-17T05:06:27+00:00

Add functions that are called just before the per-section memmap is
initialized and just before the memmap page structures are initialized. 
They are called sparse_vmemmap_init_nid_early and
sparse_vmemmap_init_nid_late, respectively.

This allows for mm subsystems to add calls to initialize memmap and page
structures in a specific way, if using SPARSEMEM_VMEMMAP.  Specifically,
hugetlb can pre-HVO bootmem allocated pages that way, so that no time and
resources are wasted on allocating vmemmap pages, only to free them later
(and possibly unnecessarily running the system out of memory in the
process).

Refactor some code and export a few convenience functions for external
use.

In sparse_init_nid, skip any sections that are already initialized, e.g. 
they have been initialized by sparse_vmemmap_init_nid_early already.

The hugetlb code to use these functions will be added in a later commit.

Export section_map_size, as any alternate memmap init code will want to
use it.

The internal config option to enable this is SPARSEMEM_VMEMMAP_PREINIT,
which is selected if an architecture-specific option,
ARCH_WANT_HUGETLB_VMEMMAP_PREINIT, is set.  In the future, if other
subsystems want to do preinit too, they can do it in a similar fashion.

The internal config option is there because a section flag is used, and
the number of flags available is architecture-dependent (see mmzone.h). 
Architectures can decide if there is room for the flag when enabling
options that select SPARSEMEM_VMEMMAP_PREINIT.

Fortunately, as of right now, all sparse vmemmap using architectures do
have room.

Link: https://lkml.kernel.org/r/20250228182928.2645936-11-fvdl@google.com
Signed-off-by: Frank van der Linden 
Cc: Johannes Weiner 
Cc: Alexander Gordeev 
Cc: Andy Lutomirski 
Cc: Arnd Bergmann 
Cc: Dan Carpenter 
Cc: Dave Hansen 
Cc: David Hildenbrand 
Cc: Heiko Carstens 
Cc: Joao Martins 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Muchun Song 
Cc: Oscar Salvador 
Cc: Peter Zijlstra 
Cc: Roman Gushchin (Cruise) 
Cc: Usama Arif 
Cc: Vasily Gorbik 
Cc: Yu Zhao 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

mm/memmap: prevent double scanning of memmap by kmemleak

2025-01-26T04:22:30+00:00

kmemleak explicitly scans the mem_map through the valid struct page
objects.  However, memmap_alloc() was also adding this memory to the gray
object list, causing it to be scanned twice.  Remove memmap_alloc() from
the scan list and add a comment to clarify the behavior.

Link: https://lore.kernel.org/lkml/CAOm6qn=FVeTpH54wGDFMHuCOeYtvoTx30ktnv9-w3Nh8RMofEA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20250106021126.1678334-1-guoweikang.kernel@gmail.com
Signed-off-by: Guo Weikang 
Reviewed-by: Catalin Marinas 
Cc: Mike Rapoport (Microsoft) 
Signed-off-by: Andrew Morton

mm: define general function pXd_init()

2024-11-12T01:22:27+00:00

pud_init(), pmd_init() and kernel_pte_init() are defined in duplicate in
kasan.c and sparse-vmemmap.c as weak functions.  Move them to the generic
header file pgtable.h, where architectures can redefine them.
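
A common way to express such an overridable default in a header is a
macro guard around an empty static inline function.  The sketch below
shows that general pattern; the exact guard used in pgtable.h is an
assumption here, not something this commit message states.

```c
/*
 * An architecture that needs a real implementation would declare its
 * own function and a same-named macro before this point, for example:
 *
 *   void pmd_init(void *addr);
 *   #define pmd_init pmd_init
 */

#ifndef pmd_init
/* Generic fallback: nothing to initialize. */
static inline void pmd_init(void *addr)
{
	(void)addr;
}
#endif
```

Compared with weak symbols, this lets the compiler drop the call
entirely on architectures that use the empty default.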

Link: https://lkml.kernel.org/r/20241104070712.52902-1-maobibo@loongson.cn
Signed-off-by: Bibo Mao 
Reviewed-by: Huacai Chen 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Dmitry Vyukov 
Cc: Thomas Bogendoerfer 
Cc: Vincenzo Frascino 
Cc: WANG Xuerui 
Signed-off-by: Andrew Morton

LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space

2024-10-21T14:11:19+00:00

There are two pages in one TLB entry on LoongArch systems. For kernel
space, both pte entries (buddies) must have the PAGE_GLOBAL bit set;
otherwise the hardware treats the entry as a non-global TLB entry. A
non-global TLB entry for kernel space can cause problems, such as
local_flush_tlb_kernel_range() failing to flush it, since that function
is supposed to flush only TLB entries with the global bit set.

Kernel address space areas include percpu, vmalloc, vmemmap, fixmap and
kasan areas. For these areas, both consecutive page table entries should
have the PAGE_GLOBAL bit set. So set_pte() and pte_clear() check and set
the pte buddy entry in addition to their own pte entry. However, setting
both pte entries is not an atomic operation, and this causes problems
with the test_vmalloc test case.

So the function kernel_pte_init() is added to initialize a pte table when
it is created for kernel address space, and the default initial pte value
is PAGE_GLOBAL rather than zero. Then set_pte() and pte_clear() only need
to update their own pte entry and no longer touch the pte buddy entry.
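
The idea can be sketched in userspace as follows.  The table size, the
bit position of the global flag and the toy_ function names are all
hypothetical; the real LoongArch definitions differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTRS_PER_PTE 512		/* assumed entries per table */
#define TOY_PAGE_GLOBAL  (1ULL << 6)	/* hypothetical bit position */

/* Give every entry of a new kernel pte table the global bit up front. */
static void toy_kernel_pte_init(uint64_t *table)
{
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++)
		table[i] = TOY_PAGE_GLOBAL;
}

/* Clearing an entry keeps its global bit and never touches the buddy. */
static void toy_pte_clear(uint64_t *table, size_t idx)
{
	assert(idx < TOY_PTRS_PER_PTE);
	table[idx] &= TOY_PAGE_GLOBAL;
}
```

Because each entry already carries the global bit, an update is a single
store to one entry, so there is no window in which the two buddies
disagree.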

Signed-off-by: Bibo Mao 
Signed-off-by: Huacai Chen

mm: don't account memmap per-node

2024-08-16T05:16:14+00:00

Fix invalid access to pgdat during hot-remove operation:
ndctl users reported a GPF when trying to destroy a namespace:
$ ndctl destroy-namespace all -r all -f
 Segmentation fault
 dmesg:
 Oops: general protection fault, probably for
 non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN
 PTI
 KASAN: probably user-memory-access in range
 [0x000000000002b280-0x000000000002b287]
 CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS
 2.20.1 09/13/2023
 RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:
$ modprobe -r cxl-test
 dmesg
 BUG: unable to handle page fault for address: 0000000000004200
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
 Tainted: [O]=OOT_MODULE, [N]=TEST
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
 RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed the accounting is
done based on the assumption that memmap is allocated from the same node
as the hot-plugged/hot-removed memory, which is not always the case.

In addition, there are challenges with keeping the node id of the memory
that is being removed until the time when memmap accounting is actually
performed, since this is done after remove_pfn_range_from_zone() and
also after remove_memory_block_devices().  This means that we can neither
use pgdat nor walk through memblocks to get the nid.

Given all of that, account the memmap overhead system wide instead.

For this we are going to use global atomic counters.  Since memmap size
is rarely modified, and normally only either during early boot when there
is only one CPU or under a hotplug global mutex lock, there is no need
for per-cpu optimizations.

Also, while we are here, rename nr_memmap to nr_memmap_pages, and
nr_memmap_boot to nr_memmap_boot_pages, to make it self-explanatory that
the units are page counts.
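
A minimal userspace sketch of system-wide accounting with C11 atomics is
shown below.  The counter names mirror the renamed items above, but the
toy_ helper functions are hypothetical and this is not the kernel code.

```c
#include <stdatomic.h>

/* System-wide counters, in pages; no per-node or per-cpu state. */
static atomic_long nr_memmap_pages;
static atomic_long nr_memmap_boot_pages;

/* Account memmap pages allocated (+) or freed (-) at runtime. */
static void toy_memmap_pages_add(long delta)
{
	atomic_fetch_add(&nr_memmap_pages, delta);
}

/* Account memmap pages allocated (+) or freed (-) by the boot allocator. */
static void toy_memmap_boot_pages_add(long delta)
{
	atomic_fetch_add(&nr_memmap_boot_pages, delta);
}
```

Since no node id is needed to update a global counter, the hot-remove
path no longer has to dereference a pgdat that may already be gone.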

[pasha.tatashin@soleen.com: address a few nits from David Hildenbrand]
  Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com
Link: https://lkml.kernel.org/r/20240808213437.682006-4-pasha.tatashin@soleen.com
Fixes: 15995a352474 ("mm: report per-page metadata information")
Signed-off-by: Pasha Tatashin 
Reported-by: Yi Zhang 
Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com
Reported-by: Alison Schofield 
Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t
Tested-by: Dan Williams 
Tested-by: Alison Schofield 
Acked-by: David Hildenbrand 
Acked-by: David Rientjes 
Tested-by: Yi Zhang 
Cc: Domenico Cerasuolo 
Cc: Fan Ni 
Cc: Joel Granados 
Cc: Johannes Weiner 
Cc: Li Zhijian 
Cc: Matthew Wilcox (Oracle) 
Cc: Mike Rapoport 
Cc: Muchun Song 
Cc: Nhat Pham 
Cc: Sourav Panda 
Cc: Vlastimil Babka 
Cc: Yosry Ahmed 
Signed-off-by: Andrew Morton

mm: report per-page metadata information

2024-07-04T02:30:09+00:00

Today, we do not have any observability of per-page metadata and how much
it takes away from the machine capacity.  Thus, we want to describe the
amount of memory that is going towards per-page metadata, which can vary
depending on build configuration, machine architecture, and system use.

This patch adds 2 fields to /proc/vmstat that can be used as shown below:

Accounting per-page metadata allocated by boot-allocator:
	/proc/vmstat:nr_memmap_boot * PAGE_SIZE

Accounting per-page metadata allocated by buddy-allocator:
	/proc/vmstat:nr_memmap * PAGE_SIZE

Accounting total per-page metadata allocated on the machine:
	(/proc/vmstat:nr_memmap_boot +
	 /proc/vmstat:nr_memmap) * PAGE_SIZE
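
The formulas above can be applied to /proc/vmstat-style text with a small
parser.  This is a sketch only: the helper names are hypothetical, the
caller is assumed to have read the file into a string, and the field
names are the ones introduced by this patch (a later commit in this log
renames them).

```c
#include <stdio.h>
#include <string.h>

/* Return the value of `key` in "name value" lines, or -1 if absent. */
static long long vmstat_value(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		long long v;

		/* Require a space after the key so prefixes do not match. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen, "%lld", &v) == 1)
			return v;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Total per-page metadata: (nr_memmap_boot + nr_memmap) * page_size. */
static long long memmap_total_bytes(const char *text, long long page_size)
{
	long long boot = vmstat_value(text, "nr_memmap_boot");
	long long buddy = vmstat_value(text, "nr_memmap");

	if (boot < 0 || buddy < 0)
		return -1;
	return (boot + buddy) * page_size;
}
```

The page size is passed in by the caller; on a live system it could be
obtained with sysconf(_SC_PAGESIZE).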

Utility for userspace:

Observability: Describe the amount of memory overhead that is going to
per-page metadata on the system at any given time since this overhead is
not currently observable.

Debugging: Tracking the changes or absolute value in struct pages can help
detect anomalies as they can be correlated with other metrics in the
machine (e.g., memtotal, number of huge pages, etc).

page_ext overheads: Some kernel features such as page_owner and
page_table_check that use page_ext can be optionally enabled via kernel
parameters.  Having the total per-page metadata information helps users
precisely measure impact.  Furthermore, page-metadata metrics will reflect
the amount of struct pages relinquished (or overhead reduced) when
hugetlbfs pages are reserved, which will vary depending on whether hugetlb
vmemmap optimization is enabled or not.

For background and results see:
lore.kernel.org/all/20240220214558.3377482-1-souravpanda@google.com

Link: https://lkml.kernel.org/r/20240605222751.1406125-1-souravpanda@google.com
Signed-off-by: Sourav Panda 
Acked-by: David Rientjes 
Reviewed-by: Pasha Tatashin 
Cc: Alexey Dobriyan 
Cc: Bjorn Helgaas 
Cc: Chen Linxuan 
Cc: David Hildenbrand 
Cc: Greg Kroah-Hartman 
Cc: Ivan Babrou 
Cc: Johannes Weiner 
Cc: Jonathan Corbet 
Cc: Kefeng Wang 
Cc: Kirill A. Shutemov 
Cc: Liam R. Howlett 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Muchun Song 
Cc: "Rafael J. Wysocki" 
Cc: Randy Dunlap 
Cc: Shakeel Butt 
Cc: Suren Baghdasaryan 
Cc: Tomas Mudrunka 
Cc: Vlastimil Babka 
Cc: Wei Xu 
Cc: Yang Yang 
Cc: Yosry Ahmed 
Signed-off-by: Andrew Morton

mm/vmemmap: allow architectures to override how vmemmap optimization works

2023-08-18T17:12:53+00:00

Architectures like powerpc would like to use different page table
allocators and mapping mechanisms to implement vmemmap optimization.
Similar to vmemmap_populate, allow architectures to implement
vmemmap_populate_compound_pages.

Link: https://lkml.kernel.org/r/20230724190759.483013-5-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V 
Cc: Catalin Marinas 
Cc: Christophe Leroy 
Cc: Dan Williams 
Cc: Joao Martins 
Cc: Michael Ellerman 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Nicholas Piggin 
Cc: Oscar Salvador 
Cc: Will Deacon 
Signed-off-by: Andrew Morton

mm: ptep_get() conversion

2023-06-19T23:19:25+00:00

Convert all instances of direct pte_t* dereferencing to instead use
ptep_get() helper.  This means that by default, the accesses change from a
C dereference to a READ_ONCE().  This is technically the correct thing to
do since where pgtables are modified by HW (for access/dirty) they are
volatile and therefore we should always ensure READ_ONCE() semantics.

But more importantly, by always using the helper, it can be overridden by
the architecture to fully encapsulate the contents of the pte.  Arch code
is deliberately not converted, as the arch code knows best.  It is
intended that arch code (arm64) will override the default with its own
implementation that can (e.g.) hide certain bits from the core code, or
determine young/dirty status by mixing in state from another source.

Conversion was done using Coccinelle:

----

// $ make coccicheck \
//          COCCI=ptepget.cocci \
//          SPFLAGS="--include-headers" \
//          MODE=patch

virtual patch

@ depends on patch @
pte_t *v;
@@

- *v
+ ptep_get(v)

----

Then reviewed and hand-edited to avoid multiple unnecessary calls to
ptep_get(), instead opting to store the result of a single call in a
variable, where it is correct to do so.  This aims to negate any cost of
READ_ONCE() and will benefit arch-overrides that may be more complex.
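
The hand-edited pattern, one read stored in a local variable, can be
sketched in userspace as follows.  toy_pte_t, the bit positions and the
function names are hypothetical, and the volatile load only approximates
READ_ONCE().

```c
#include <stdint.h>

typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_PRESENT (1ULL << 0)	/* hypothetical bit positions */
#define TOY_PTE_DIRTY   (1ULL << 6)

/* Single volatile load of the entry, approximating READ_ONCE(). */
static inline toy_pte_t toy_ptep_get(const toy_pte_t *ptep)
{
	toy_pte_t pte;

	pte.val = *(const volatile uint64_t *)&ptep->val;
	return pte;
}

/* Read the entry once, then test several bits on the local copy. */
static int toy_pte_present_and_dirty(const toy_pte_t *ptep)
{
	toy_pte_t pte = toy_ptep_get(ptep);

	return (pte.val & TOY_PTE_PRESENT) && (pte.val & TOY_PTE_DIRTY);
}
```

Testing both bits on the local copy also guarantees that the two checks
see the same snapshot of the entry, which two separate reads would not.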

Included is a fix for an issue in an earlier version of this patch that
was pointed out by kernel test robot.  The issue arose because config
MMU=n elides definition of the ptep helper functions, including
ptep_get().  HUGETLB_PAGE=n configs still define a simple
huge_ptep_clear_flush() for linking purposes, which dereferences the ptep.
So when both configs are disabled, this caused a build error because
ptep_get() is not defined.  Fix by continuing to do a direct dereference
when MMU=n.  This is safe because for this config the arch code cannot be
trying to virtualize the ptes because none of the ptep helpers are
defined.

Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com
Reported-by: kernel test robot 
Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/
Signed-off-by: Ryan Roberts 
Cc: Adrian Hunter 
Cc: Alexander Potapenko 
Cc: Alexander Shishkin 
Cc: Alex Williamson 
Cc: Al Viro 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Christian Brauner 
Cc: Christoph Hellwig 
Cc: Daniel Vetter 
Cc: Dave Airlie 
Cc: Dimitri Sivanich 
Cc: Dmitry Vyukov 
Cc: Ian Rogers 
Cc: Jason Gunthorpe 
Cc: Jérôme Glisse 
Cc: Jiri Olsa 
Cc: Johannes Weiner 
Cc: Kirill A. Shutemov 
Cc: Lorenzo Stoakes 
Cc: Mark Rutland 
Cc: Matthew Wilcox 
Cc: Miaohe Lin 
Cc: Michal Hocko 
Cc: Mike Kravetz 
Cc: Mike Rapoport (IBM) 
Cc: Muchun Song 
Cc: Namhyung Kim 
Cc: Naoya Horiguchi 
Cc: Oleksandr Tyshchenko 
Cc: Pavel Tatashin 
Cc: Roman Gushchin 
Cc: SeongJae Park 
Cc: Shakeel Butt 
Cc: Uladzislau Rezki (Sony) 
Cc: Vincenzo Frascino 
Cc: Yu Zhao 
Signed-off-by: Andrew Morton