linux.git/Documentation/admin-guide/sysctl/vm.rst, branch v6.15

mm: page_alloc: defrag_mode

2025-03-18T05:07:07+00:00

The page allocator groups requests by migratetype to stave off
fragmentation.  However, in practice this is routinely defeated by the
fact that it gives up *before* invoking reclaim and compaction - which may
well produce suitable pages.  As a result, fragmentation of physical
memory is a common ongoing process in many load scenarios.

Fragmentation deteriorates compaction's ability to produce huge pages. 
Depending on the lifetime of the fragmenting allocations, those effects
can be long-lasting or even permanent, requiring drastic measures like
forcible idle states or even reboots as the only reliable ways to recover
the address space for THP production.

In a kernel build test with supplemental THP pressure, the THP allocation
rate steadily declines over 15 runs:

    thp_fault_alloc
    61988
    56474
    57258
    50187
    52388
    55409
    52925
    47648
    43669
    40621
    36077
    41721
    36685
    34641
    33215

This is a hurdle in adopting THP in any environment where hosts are shared
between multiple overlapping workloads (cloud environments), and rarely
experience true idle periods.  To make THP a reliable and predictable
optimization, there needs to be a stronger guarantee to avoid such
fragmentation.

Introduce defrag_mode.  When enabled, reclaim/compaction is invoked to its
full extent *before* falling back.  Specifically, ALLOC_NOFRAGMENT is
enforced on the allocator fastpath and the reclaiming slowpath.

For now, fallbacks are permitted to avert OOMs.  There is a plan to add
defrag_mode=2 to prefer OOMs over fragmentation, but this requires
additional prep work in compaction and the reserve management to make it
ready for all possible allocation contexts.
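
A minimal sketch of how an administrator might check the setting from userspace. It assumes the knob is exposed as /proc/sys/vm/defrag_mode (the path is inferred from this file being the vm sysctl documentation), and the helper names are hypothetical:

```python
from pathlib import Path

# Assumed sysctl location; not stated in the commit message itself.
DEFRAG_MODE_PATH = Path("/proc/sys/vm/defrag_mode")


def read_int_sysctl(path):
    """Return the integer stored in a sysctl file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # Missing file (older kernel), no permission, or non-numeric content.
        return None


def describe_defrag_mode(value):
    """Map a raw defrag_mode value to a short description."""
    if value is None:
        return "defrag_mode not available on this kernel"
    if value == 0:
        return "disabled: fallbacks may happen before reclaim/compaction"
    if value == 1:
        return "enabled: reclaim/compaction runs before falling back"
    return f"unrecognized defrag_mode value {value}"


if __name__ == "__main__":
    print(describe_defrag_mode(read_int_sysctl(DEFRAG_MODE_PATH)))
```

Only the values 0 and 1 are described because defrag_mode=2 is mentioned above as a plan, not as something this patch implements.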

The following test results are from a kernel build with periodic bursts of
THP allocations, over 15 runs:

                                        vanilla    defrag_mode=1
@claimer[unmovable]:                        189              103
@claimer[movable]:                           92              103
@claimer[reclaimable]:                      207               61
@pollute[unmovable from movable]:            25                0
@pollute[unmovable from reclaimable]:        28                0
@pollute[movable from unmovable]:         38835                0
@pollute[movable from reclaimable]:      147136                0
@pollute[reclaimable from unmovable]:       178                0
@pollute[reclaimable from movable]:          33                0
@steal[unmovable from movable]:              11                0
@steal[unmovable from reclaimable]:           5                0
@steal[reclaimable from unmovable]:         107                0
@steal[reclaimable from movable]:            90                0
@steal[movable from reclaimable]:           354                0
@steal[movable from unmovable]:             130                0

Both types of polluting fallbacks are eliminated in this workload.

Interestingly, whole block conversions are reduced as well.  This is
because once a block is claimed for a type, its empty space remains
available for future allocations, instead of being padded with fallbacks;
this allows the native type to group up instead of spreading out to new
blocks.  The assumption in the allocator has been that pollution from
movable allocations is less harmful than from other types, since they can
be reclaimed or migrated out should the space be needed.  However, since
fallbacks occur *before* reclaim/compaction is invoked, movable pollution
will still cause non-movable allocations to spread out and claim more
blocks.
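
The counters in the table above can be tallied per event kind to make the comparison easier to read. The snippet below is only an illustrative sketch; the counts are copied from the table, while the dictionary layout and function name are hypothetical:

```python
# Counts copied from the table above as (vanilla, defrag_mode=1).
EVENTS = {
    "claimer[unmovable]": (189, 103),
    "claimer[movable]": (92, 103),
    "claimer[reclaimable]": (207, 61),
    "pollute[unmovable from movable]": (25, 0),
    "pollute[unmovable from reclaimable]": (28, 0),
    "pollute[movable from unmovable]": (38835, 0),
    "pollute[movable from reclaimable]": (147136, 0),
    "pollute[reclaimable from unmovable]": (178, 0),
    "pollute[reclaimable from movable]": (33, 0),
    "steal[unmovable from movable]": (11, 0),
    "steal[unmovable from reclaimable]": (5, 0),
    "steal[reclaimable from unmovable]": (107, 0),
    "steal[reclaimable from movable]": (90, 0),
    "steal[movable from reclaimable]": (354, 0),
    "steal[movable from unmovable]": (130, 0),
}


def totals_by_kind(events):
    """Sum (vanilla, defrag) counts per event kind: claimer, pollute, steal."""
    totals = {}
    for name, (vanilla, defrag) in events.items():
        kind = name.split("[", 1)[0]
        v, d = totals.get(kind, (0, 0))
        totals[kind] = (v + vanilla, d + defrag)
    return totals


if __name__ == "__main__":
    for kind, (vanilla, defrag) in totals_by_kind(EVENTS).items():
        print(f"{kind:8s} vanilla={vanilla:7d} defrag_mode=1={defrag:7d}")
```

The claimer totals correspond to the whole block conversions discussed above, and the pollute and steal totals to the two types of polluting fallbacks.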

Without fragmentation, THP rates hold steady with defrag_mode=1:

    thp_fault_alloc
    32478
    20725
    45045
    32130
    14018
    21711
    40791
    29134
    34458
    45381
    28305
    17265
    22584
    28454
    30850
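
One way to put a number on "declines" versus "holds steady" is to fit a straight line through each series. The following is a sketch under that assumption; a least-squares fit is an illustrative choice, not the method used to produce the results above:

```python
# thp_fault_alloc per run, copied from the two listings in this message.
VANILLA = [61988, 56474, 57258, 50187, 52388, 55409, 52925, 47648,
           43669, 40621, 36077, 41721, 36685, 34641, 33215]
DEFRAG_MODE_1 = [32478, 20725, 45045, 32130, 14018, 21711, 40791, 29134,
                 34458, 45381, 28305, 17265, 22584, 28454, 30850]


def least_squares_slope(values):
    """Slope of the ordinary least-squares line through (run index, value)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


if __name__ == "__main__":
    print("vanilla slope per run:      ", round(least_squares_slope(VANILLA)))
    print("defrag_mode=1 slope per run:", round(least_squares_slope(DEFRAG_MODE_1)))
```

On these numbers the fitted slope is roughly -1990 allocations per run for the vanilla kernel and roughly -190 for defrag_mode=1, the latter being small compared to the run-to-run variation.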

While the downward trend is eliminated, the keen reader will of course
notice that the baseline rate is much smaller than the vanilla kernel's to
begin with.  This is due to deficiencies in how reclaim and compaction are
currently driven: ALLOC_NOFRAGMENT increases the extent to which smaller
allocations are competing with THPs for pageblocks, while making no effort
themselves to reclaim or compact beyond their own request size.  This
effect already exists with the current usage of ALLOC_NOFRAGMENT, but is
amplified by defrag_mode insisting on whole block stealing much more
strongly.

Subsequent patches will address defrag_mode reclaim strategy to raise the
THP success baseline above the vanilla kernel.

Link: https://lkml.kernel.org/r/20250313210647.1314586-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner 
Cc: Mel Gorman 
Cc: Vlastimil Babka 
Cc: Zi Yan 
Signed-off-by: Andrew Morton

docs: mm: add enable_soft_offline sysctl

2024-07-05T01:06:00+00:00

Add the documentation for soft offline behaviors / costs, and what the new
enable_soft_offline sysctl is for.

[jiaqiyan@google.com: fix kerneldoc warnings]
  Link: https://lkml.kernel.org/r/CACw3F52=GxTCDw-PqFh3-GDM-fo3GbhGdu0hedxYXOTT4TQSTg@mail.gmail.com
[jiaqiyan@google.com: there are more blank lines needed]
  Link: https://lkml.kernel.org/r/CACw3F52_obAB742XeDRNun4BHBYtrxtbvp5NkUincXdaob0j1g@mail.gmail.com
Link: https://lkml.kernel.org/r/20240626050818.2277273-5-jiaqiyan@google.com
Signed-off-by: Jiaqi Yan 
Acked-by: Oscar Salvador 
Acked-by: Miaohe Lin 
Acked-by: David Rientjes 
Cc: Frank van der Linden 
Cc: Jane Chu 
Cc: Jonathan Corbet 
Cc: Lance Yang 
Cc: Muchun Song 
Cc: Naoya Horiguchi 
Cc: Randy Dunlap 
Cc: Shuah Khan 
Signed-off-by: Andrew Morton

lib: add allocation tagging support for memory allocation profiling

2024-04-26T03:55:52+00:00

Introduce CONFIG_MEM_ALLOC_PROFILING which provides definitions to easily
instrument memory allocators.  It registers an "alloc_tags" codetag type
with /proc/allocinfo interface to output allocation tag information when
the feature is enabled.

CONFIG_MEM_ALLOC_PROFILING_DEBUG is provided for debugging the memory
allocation profiling instrumentation.

Memory allocation profiling can be enabled or disabled at runtime using
/proc/sys/vm/mem_profiling sysctl when CONFIG_MEM_ALLOC_PROFILING_DEBUG=n.
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT enables memory allocation
profiling by default.
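
A rough sketch of how the /proc/allocinfo output might be consumed from userspace. The line layout assumed here (a size column, a call-count column, then the tag text) and the sample data are assumptions for illustration only; check the output of the running kernel before relying on them:

```python
def parse_allocinfo(text):
    """Parse allocinfo-style text into (size_bytes, calls, tag) tuples.

    Lines that do not start with two integers (headers, comments, blanks)
    are skipped.
    """
    records = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        records.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return records


def top_by_size(records, limit=5):
    """Return the records with the largest size column first."""
    return sorted(records, key=lambda r: r[0], reverse=True)[:limit]


# Hypothetical sample; not captured from a real system.
SAMPLE = """\
allocinfo - version: 1.0
#     <size>  <calls> <tag info>
       12288        3 mm/example_a.c:10 func:example_alloc_a
        4096        1 mm/example_b.c:20 func:example_alloc_b
           0        0 mm/example_c.c:30 func:example_alloc_c
"""

if __name__ == "__main__":
    for size, calls, tag in top_by_size(parse_allocinfo(SAMPLE)):
        print(size, calls, tag)
```

On a live system the text would come from reading /proc/allocinfo, which requires the feature to be built in and enabled.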

[surenb@google.com: Documentation/filesystems/proc.rst: fix allocinfo title]
  Link: https://lkml.kernel.org/r/20240326073813.727090-1-surenb@google.com
[surenb@google.com: do limited memory accounting for modules with ARCH_NEEDS_WEAK_PER_CPU]
  Link: https://lkml.kernel.org/r/20240402180933.1663992-2-surenb@google.com
[klarasmodin@gmail.com: explicitly include irqflags.h in alloc_tag.h]
  Link: https://lkml.kernel.org/r/20240407133252.173636-1-klarasmodin@gmail.com
[surenb@google.com: fix alloc_tag_init() to prevent passing NULL to PTR_ERR()]
  Link: https://lkml.kernel.org/r/20240417003349.2520094-1-surenb@google.com
Link: https://lkml.kernel.org/r/20240321163705.3067592-14-surenb@google.com
Signed-off-by: Suren Baghdasaryan 
Co-developed-by: Kent Overstreet 
Signed-off-by: Kent Overstreet 
Signed-off-by: Klara Modin 
Tested-by: Kees Cook 
Cc: Alexander Viro 
Cc: Alex Gaynor 
Cc: Alice Ryhl 
Cc: Andreas Hindborg 
Cc: Benno Lossin 
Cc: "Björn Roy Baron" 
Cc: Boqun Feng 
Cc: Christoph Lameter 
Cc: Dennis Zhou 
Cc: Gary Guo 
Cc: Miguel Ojeda 
Cc: Pasha Tatashin 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Vlastimil Babka 
Cc: Wedson Almeida Filho 
Signed-off-by: Andrew Morton

docs: mm: fix vm overcommit documentation for OVERCOMMIT_GUESS

2023-10-10T19:35:55+00:00

Commit 8c7829b04c52 "mm: fix false-positive OVERCOMMIT_GUESS failures"
changed the behavior of the default OVERCOMMIT_GUESS setting.
Reflect the change in the documentation as well, namely in the files:
    Documentation/admin-guide/sysctl/vm.rst
    Documentation/mm/overcommit-accounting.rst

Reported-by: Jozef Bacik 
Signed-off-by: Vratislav Bendel 
Acked-by: Mike Rapoport 
Signed-off-by: Jonathan Corbet 
Link: https://lore.kernel.org/r/20220829124638.63748-1-vbendel@redhat.com

Documentation: admin-guide: correct spelling

2023-02-02T18:04:42+00:00

Correct spelling problems for Documentation/admin-guide/ as reported
by codespell.

Signed-off-by: Randy Dunlap 
Reviewed-by: Mukesh Ojha 
Cc: Tejun Heo 
Cc: Zefan Li 
Cc: Johannes Weiner 
Cc: cgroups@vger.kernel.org
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Mauro Carvalho Chehab 
Cc: linux-media@vger.kernel.org
Cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/20230129231053.20863-2-rdunlap@infradead.org
Signed-off-by: Jonathan Corbet

userfaultfd: update documentation to describe /dev/userfaultfd

2022-09-12T03:25:49+00:00

Explain the different ways to create a new userfaultfd, and how access
control works for each way.

[axelrasmussen@google.com: improve wording in documentation, per Mike]
  Link: https://lkml.kernel.org/r/20220819205201.658693-5-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20220808175614.3885028-5-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen 
Acked-by: Peter Xu 
Reviewed-by: Shuah Khan 
Cc: Al Viro 
Cc: Dave Hansen 
Cc: Dmitry V. Levin 
Cc: Gleb Fotengauer-Malinovskiy 
Cc: Hugh Dickins 
Cc: Jan Kara 
Cc: Jonathan Corbet 
Cc: Mel Gorman 
Cc: Mike Kravetz 
Cc: Mike Rapoport 
Cc: Nadav Amit 
Cc: Suren Baghdasaryan 
Cc: Vlastimil Babka 
Cc: Zhang Yi 
Cc: Mike Rapoport 
Signed-off-by: Andrew Morton

mm: hugetlb_vmemmap: introduce the name HVO

2022-08-09T01:06:42+00:00

It is inconvenient to mention the feature of optimizing vmemmap pages
associated with HugeTLB pages when communicating with others, since it
was given no specific or abbreviated name when it was first introduced.
Let us call it HVO (HugeTLB Vmemmap Optimization) from now on.

This commit also updates the documentation of "hugetlb_free_vmemmap" in
the way discussed in thread [1].

Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Reviewed-by: Oscar Salvador 
Reviewed-by: Mike Kravetz 
Cc: Anshuman Khandual 
Cc: Catalin Marinas 
Cc: David Hildenbrand 
Cc: Jonathan Corbet 
Cc: Will Deacon 
Cc: Xiongchun Duan 
Signed-off-by: Andrew Morton

mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with memmap_on_memory

2022-07-04T01:08:49+00:00

For now, the hugetlb_free_vmemmap feature is not compatible with the
memory_hotplug.memmap_on_memory feature, and hugetlb_free_vmemmap takes
precedence over memory_hotplug.memmap_on_memory.  However, some users want
memory_hotplug.memmap_on_memory to take precedence over
hugetlb_free_vmemmap, since memmap_on_memory makes memory hotplug more
likely to succeed in close-to-OOM situations.  So always letting
hugetlb_free_vmemmap take precedence is neither wise nor elegant.

The proper approach is to have hugetlb_vmemmap.c check whether the
section to which the HugeTLB pages belong can be optimized.  If the
section's vmemmap pages are allocated from the added memory block itself,
hugetlb_free_vmemmap should refuse to optimize the vmemmap; otherwise, it
should do the optimization.  Then both kernel parameters are compatible.
So this patch introduces VmemmapSelfHosted to mark any non-optimizable
vmemmap pages.  hugetlb_vmemmap can use this flag to detect whether a
vmemmap page can be optimized.

[songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive]
  Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Co-developed-by: Oscar Salvador 
Signed-off-by: Oscar Salvador 
Acked-by: David Hildenbrand 
Cc: Jonathan Corbet 
Cc: Mike Kravetz 
Cc: Paul E. McKenney 
Cc: Xiongchun Duan 
Signed-off-by: Andrew Morton

docs: rename Documentation/vm to Documentation/mm

2022-06-27T19:52:53+00:00

Rename Documentation/vm to Documentation/mm so that it is consistent with
the mm code directory and with Documentation/admin-guide/mm, and is not
confused with virtual machines.

Signed-off-by: Mike Rapoport 
Suggested-by: Matthew Wilcox 
Tested-by: Ira Weiny 
Acked-by: Jonathan Corbet 
Acked-by: Wu XiangCheng

mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl

2022-05-13T23:48:56+00:00

We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
reboot the server to enable or disable the feature of optimizing vmemmap
pages associated with HugeTLB pages.  However, rebooting usually takes a
long time.  So add a sysctl to enable or disable the feature at runtime
without rebooting.  Why do we need this?  There are 3 use cases.

1) The feature of minimizing the overhead of struct page associated with
   each HugeTLB page is disabled by default unless
   "hugetlb_free_vmemmap=on" is passed on the boot cmdline.  When we
   (ByteDance) deliver servers to users who want to enable this feature,
   they have to configure grub (change the boot cmdline) and reboot the
   servers, and rebooting usually takes a long time (we have thousands
   of servers).  It's a very bad experience for the users.  So we need an
   approach to enable this feature after the servers have booted.  This is
   a use case in our practical environment.

2) In some use cases, HugeTLB pages are allocated 'on the fly'
   instead of being pulled from the HugeTLB pool; those workloads would be
   affected with this feature enabled.  Those workloads can be
   identified by the fact that they never explicitly allocate
   huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages'
   and then let the pages be allocated from the buddy allocator at fault
   time.  Commit 099730d67417 confirms that this is a real use case.
   For those workloads, page faults could be ~2x slower than before.
   We suspect those users want to disable this feature if the system has
   enabled it and they don't think the memory savings are enough to make
   up for the performance drop.

3) A workload which wants vmemmap pages to be optimized and a
   workload which wants to set 'nr_overcommit_hugepages' and does not want
   the extra overhead at fault time when the overcommitted pages are
   allocated from the buddy allocator may be deployed on the same server.
   The user could enable this feature and set 'nr_hugepages' and
   'nr_overcommit_hugepages', then disable the feature.  In this case, the
   overcommitted HugeTLB pages will not encounter the extra overhead at
   fault time.
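
The ordering described in use case 3 can be sketched as follows. This is an illustration only: the function names are hypothetical, the sysctl files are assumed to live under /proc/sys/vm and to accept 1 and 0 for hugetlb_optimize_vmemmap, and writing them on a real system requires root:

```python
from pathlib import Path


def write_sysctl(vm_dir, name, value):
    """Write one value to a sysctl file below vm_dir."""
    Path(vm_dir, name).write_text(f"{value}\n")


def preallocate_optimized_then_disable(vm_dir, nr_hugepages, nr_overcommit):
    """Apply the sequence from use case 3 and return the steps taken.

    1. Enable the optimization so the preallocated pool gets optimized.
    2. Size the pool and the overcommit limit.
    3. Disable the optimization so later overcommitted pages, allocated
       at fault time, do not pay the extra overhead.
    """
    steps = [
        ("hugetlb_optimize_vmemmap", 1),
        ("nr_hugepages", nr_hugepages),
        ("nr_overcommit_hugepages", nr_overcommit),
        ("hugetlb_optimize_vmemmap", 0),
    ]
    for name, value in steps:
        write_sysctl(vm_dir, name, value)
    return steps


if __name__ == "__main__":
    import tempfile

    # Dry run against a scratch directory instead of /proc/sys/vm.
    with tempfile.TemporaryDirectory() as scratch:
        for name, value in preallocate_optimized_then_disable(scratch, 64, 32):
            print(f"{name} <- {value}")
```

Passing the directory as a parameter keeps the sketch testable without touching the running system.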

Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song 
Reviewed-by: Mike Kravetz 
Cc: Jonathan Corbet 
Cc: Luis Chamberlain 
Cc: Kees Cook 
Cc: Iurii Zaikin 
Cc: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Masahiro Yamada 
Cc: Xiongchun Duan 
Signed-off-by: Andrew Morton